diff --git a/dip2 b/dip2
index 5009412..c73b8c0 100644
--- a/dip2
+++ b/dip2
@@ -1,210 +1,3 @@
- - - -Dive Into Python - - -

Dive Into Python

-

20 May 2004 -

This book lives at http://diveintopython3.org/. If you're reading it somewhere else, you may not have the latest version. -

-

Table of Contents -

- -

Chapter 1. Installing Python

Welcome to Python. Let's dive in. In this chapter, you'll install the version of Python that's right for you. @@ -538,24 +331,6 @@ hello world -

2.3. Documenting Functions

-

You can document a Python function by giving it a docstring. -

Example 2.2. Defining the buildConnectionString Function's docstring


-def buildConnectionString(params):
-    """Build a connection string from a dictionary of parameters.
-
-    Returns string."""

Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including - carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining - a docstring. - - -
Note: Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl. -

Everything between the triple quotes is the function's docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically -need to give your function a docstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function. - - -
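As a minimal sketch of the runtime availability described above, the snippet below reads a function's docstring back through its `__doc__` attribute. The function name `build_connection_string` and its body are illustrative assumptions, and the code is written for Python 3 rather than the Python 2 used elsewhere in this text.

```python
def build_connection_string(params):
    """Build a connection string from a dictionary of parameters.

    Returns string."""
    # Hypothetical body: join key=value pairs with semicolons, in sorted key order.
    return ";".join("%s=%s" % (k, v) for k, v in sorted(params.items()))


# The docstring is stored on the function object itself.
doc = build_connection_string.__doc__
first_line = doc.splitlines()[0]
print(first_line)
```

Because the docstring is an ordinary string attribute, tools such as `help()` and IDE tooltips can display it without parsing the source file.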
Note: Many Python IDEs use the docstring to provide context-sensitive documentation, so that when you type a function name, its docstring appears as a tooltip. This can be incredibly helpful, but it's only as good as the docstrings you write. - @@ -1930,238 +1705,20 @@ exceptions, errors occur immediately, and you can handle them in a standard way
  • Python Reference Manual discusses the inner workings of the try...except block. -

    6.2. Working with File Objects

    -

    Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file. -

    Example 6.3. Opening a File

    >>> f = open("/music/_singles/kairo.mp3", "rb") 
    ->>> f       
    -<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
    ->>> f.mode  
    -'rb'
    ->>> f.name  
    -'/music/_singles/kairo.mp3'
    -
      -
    1. The open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, - is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. - (print open.__doc__ displays a great explanation of all the possible modes.) -
    2. The open function returns an object (by now, this should not surprise you). A file object has several useful attributes. -
    3. The mode attribute of a file object tells you in which mode the file was opened. -
    4. The name attribute of a file object tells you the name of the file that the file object has open. -

      6.2.1. Reading Files

      -

      After you open a file, the first thing you'll want to do is read from it, as shown in the next example. -

      Example 6.4. Reading a File

      ->>> f
      -<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
      ->>> f.tell()              
      -0
      ->>> f.seek(-128, 2)       
      ->>> f.tell()              
      -7542909
      ->>> tagData = f.read(128) 
      ->>> tagData
      -'TAGKAIRO****THE BEST GOA         ***DJ MARY-JANE***            
      -Rave Mix    2000http://mp3.com/DJMARYJANE     \037'
      ->>> f.tell()              
      -7543037
      -
        -
      1. A file object maintains state about the file it has open. The tell method of a file object tells you your current position in the open file. Since you haven't done anything with this file - yet, the current position is 0, which is the beginning of the file. -
      2. The seek method of a file object moves to another position in the open file. The second parameter specifies what the first one means; -0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file. -
      3. The tell method confirms that the current file position has moved. -
      4. The read method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional - parameter specifies the maximum number of bytes to read. If no parameter is specified, read will read until the end of the file. (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data - is assigned to the tagData variable, and the current position is updated based on how many bytes were read. -
      5. The tell method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position - has been incremented by 128. -
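The position bookkeeping walked through above can be reproduced without an MP3 on hand. This is a sketch under stated assumptions: it builds a small temporary stand-in file (1000 filler bytes plus a made-up 128-byte tag), uses Python 3, and spells the whence value 2 with its named constant `os.SEEK_END`.

```python
import os
import tempfile

# Build a stand-in file: 1000 filler bytes followed by a 128-byte tag.
# (Hypothetical data; a real MP3 would carry audio before the tag.)
tag = b"TAG" + b"x" * 125
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"\0" * 1000 + tag)

f = open(path, "rb")
try:
    start = f.tell()             # position right after opening
    f.seek(-128, os.SEEK_END)    # os.SEEK_END is the named form of whence=2
    before = f.tell()            # now 128 bytes before the end
    tag_data = f.read(128)       # read the final 128 bytes
    after = f.tell()             # advanced by the number of bytes read
finally:
    f.close()
    os.remove(path)

print(start, before, after)
```

The difference between `after` and `before` is exactly the number of bytes read, which is the arithmetic the last list item asks you to check.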

        6.2.2. Closing Files

        -

        Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's - important to close files as soon as you're finished with them. -

        Example 6.5. Closing a File

        ->>> f
        -<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
        ->>> f.closed       
        -False
        ->>> f.close()      
        ->>> f
        -<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
        ->>> f.closed       
        -True
        ->>> f.seek(0)      
        -Traceback (innermost last):
        -  File "<interactive input>", line 1, in ?
        -ValueError: I/O operation on closed file
        ->>> f.tell()
        -Traceback (innermost last):
        -  File "<interactive input>", line 1, in ?
        -ValueError: I/O operation on closed file
        ->>> f.read()
        -Traceback (innermost last):
        -  File "<interactive input>", line 1, in ?
        -ValueError: I/O operation on closed file
        ->>> f.close()      
        -
          -
        1. The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False). -
        2. To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) - that the system hadn't gotten around to actually writing yet, and releases the system resources. -
        3. The closed attribute confirms that the file is closed. -
        4. Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; - they all raise an exception. -
        5. Calling close on a file object whose file is already closed does not raise an exception; it fails silently. -
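The closing behaviour listed above can be checked with a short sketch. It assumes Python 3 and an empty temporary file standing in for the MP3; the variable names are illustrative.

```python
import os
import tempfile

# Create an empty temporary file to open (a stand-in for a real data file).
fd, path = tempfile.mkstemp()
os.close(fd)

f = open(path, "rb")
closed_before = f.closed     # the file is still open here
f.close()
closed_after = f.closed      # the object still exists, but its file is closed

try:
    f.read()                 # any operation on a closed file raises ValueError
    error = None
except ValueError as exc:
    error = exc

f.close()                    # closing a second time raises nothing
os.remove(path)
```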

          6.2.3. Handling I/O Errors

          -

          Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle - errors. -

          Example 6.6. File Objects in MP3FileInfo

          
          -        try:              
          -            fsock = open(filename, "rb", 0) 
          -            try:         
          -                fsock.seek(-128, 2)         
          -                tagdata = fsock.read(128)   
          -            finally:      
          -                fsock.close()              
          -            .
          -            .
          -            .
          -        except IOError:   
          -            pass         
          -
            -
          1. Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...except block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.) -
          2. The open function may raise an IOError. (Maybe the file doesn't exist.) -
          3. The seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.) -
          4. The read method may raise an IOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.) -
          5. This is new: a try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before. -
          6. At last, you handle your IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the - next line of code after the try...except block. -
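The nested try...finally inside try...except pattern can be exercised in isolation. The helper `read_tag` below is a hypothetical wrapper, not part of fileinfo.py; it returns None where the original simply passes, and it assumes Python 3, where IOError is an alias of OSError.

```python
import os
import tempfile


def read_tag(filename):
    """Return the last 128 bytes of filename, or None on an I/O error."""
    try:
        fsock = open(filename, "rb")
        try:
            fsock.seek(-128, 2)      # may fail if the file is under 128 bytes
            return fsock.read(128)
        finally:
            fsock.close()            # runs whether or not seek/read raised
    except IOError:                  # in Python 3, IOError is an alias of OSError
        return None


with tempfile.TemporaryDirectory() as tmp:
    big = os.path.join(tmp, "big.bin")
    small = os.path.join(tmp, "small.bin")
    with open(big, "wb") as out:
        out.write(b"a" * 100 + b"b" * 128)
    with open(small, "wb") as out:
        out.write(b"tiny")

    from_big = read_tag(big)                                    # last 128 bytes
    from_small = read_tag(small)                                # seek fails
    from_missing = read_tag(os.path.join(tmp, "missing.bin"))   # open fails
```

The three calls cover a successful read, a failure inside the inner try block, and a failure before the inner block is ever entered.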

            6.2.4. Writing to Files

            -

            As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes: -

            -
              -
            • "Append" mode will add data to the end of the file. -
            • "Write" mode will overwrite the file. -
            -

            Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly - "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open - it and start writing. -

            Example 6.7. Writing to Files

            ->>> logfile = open('test.log', 'w') 
            ->>> logfile.write('test succeeded') 
            ->>> logfile.close()
            ->>> print file('test.log').read()   
            -test succeeded
            ->>> logfile = open('test.log', 'a') 
            ->>> logfile.write('line 2')
            ->>> logfile.close()
            ->>> print file('test.log').read()   
            -test succeededline 2
            -
            -
              -
            1. You start boldly by creating the new file test.log, or overwriting the existing one, and opening it for writing. (The second parameter "w" means open the file for writing.) Yes, that's all as dangerous as it sounds. I hope you didn't care about the previous - contents of that file, because it's gone now. -
            2. You can add data to the newly opened file with the write method of the file object returned by open. -
            3. file is a synonym for open. This one-liner opens the file, reads its contents, and prints them. -
            4. You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it. (The "a" parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening - the file for appending will create the file if necessary. But appending will never harm the existing contents of the file. -
            5. As you can see, both the original line you wrote and the second line you appended are now in test.log. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the - file doesn't include them. You can write a line break with the "\n" (newline) character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line. -
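The difference between the two modes can be seen side by side in a small sketch. It assumes Python 3, writes into a temporary directory rather than the current one, and uses the `with` statement (not covered in this text) to close each file automatically.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.log")

    with open(path, "w") as logfile:    # "w" creates the file or truncates it
        logfile.write("test succeeded")
    with open(path) as logfile:
        after_write = logfile.read()

    with open(path, "a") as logfile:    # "a" adds to the end of the file
        logfile.write("line 2")
    with open(path) as logfile:
        after_append = logfile.read()

    with open(path, "w") as logfile:    # reopening with "w" discards the old contents
        logfile.write("fresh\n")        # an explicit "\n" ends the line
    with open(path) as logfile:
        after_rewrite = logfile.read()
```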
              -

              Further Reading on File Handling

              - -

              6.3. Iterating with for Loops

              -

              Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often. -

              Most other languages don't have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters -or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work. -

              Example 6.8. Introducing the for Loop

              >>> li = ['a', 'b', 'e']
              ->>> for s in li:         
              -...    print s          
              -a
              -b
              -e
              ->>> print "\n".join(li)  
              -a
              -b
              -e
              -
                -
              1. The syntax for a for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element. -
              2. Like an if statement or any other indented block, a for loop can have any number of lines of code in it. -
              3. This is the reason you haven't seen the for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension. -

                Doing a “normal” (by Visual Basic standards) counter for loop is also simple. -

                Example 6.9. Simple Counters

                ->>> for i in range(5):             
                -...    print i
                -0
                -1
                -2
                -3
                -4
                ->>> li = ['a', 'b', 'c', 'd', 'e']
                ->>> for i in range(len(li)):       
                -...    print li[i]
                -a
                -b
                -c
                -d
                -e
                -
                -
                  -
                1. As you saw in Example 3.20, “Assigning Consecutive Values”, range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress -occasionally) useful to have a counter loop. -
                2. Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example. -
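To make the comparison concrete, this sketch runs the index-based loop and the direct loop over the same list and collects what each one visits. It assumes Python 3, where range returns a lazy range object rather than a list, though looping over it works the same way.

```python
li = ["a", "b", "c", "d", "e"]

# Counter-style loop: index into the list by position.
by_index = []
for i in range(len(li)):
    by_index.append(li[i])

# Direct loop: let the for statement hand you each element.
direct = []
for item in li:
    direct.append(item)
```

Both loops visit the same elements in the same order; the direct form just has fewer moving parts.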

                  for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary. -

                  Example 6.10. Iterating Through a Dictionary

                  ->>> import os
                  ->>> for k, v in os.environ.items():       
                  -...    print "%s=%s" % (k, v)
                  -USERPROFILE=C:\Documents and Settings\mpilgrim
                  -OS=Windows_NT
                  -COMPUTERNAME=MPILGRIM
                  -USERNAME=mpilgrim
                   
                  -[...snip...]
                  ->>> print "\n".join(["%s=%s" % (k, v)
                  -...    for k, v in os.environ.items()]) 
                  -USERPROFILE=C:\Documents and Settings\mpilgrim
                  -OS=Windows_NT
                  -COMPUTERNAME=MPILGRIM
                  -USERNAME=mpilgrim
                   
                  -[...snip...]
                  -
                    -
                  1. os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables - accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In classic Mac OS (before Mac OS X), there is no concept of environment variables, so this dictionary is empty. -
                  2. os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT. -
                  3. With multi-variable assignment and list comprehensions, you can replace the entire for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it - because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string. - Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many. -
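The equivalence of the two forms can be checked on a small fixed dictionary. This is a sketch assuming Python 3.7 or later (where dictionaries keep insertion order); `env` is a hypothetical stand-in for os.environ so the result is predictable.

```python
# A hypothetical stand-in for os.environ, so the output is deterministic.
env = {"OS": "Windows_NT", "USERNAME": "mpilgrim"}

# Explicit loop: unpack each (key, value) tuple into k and v.
loop_lines = []
for k, v in env.items():
    loop_lines.append("%s=%s" % (k, v))
loop_output = "\n".join(loop_lines)

# Single statement: map the dictionary into a list, then join the list.
joined_output = "\n".join(["%s=%s" % (k, v) for k, v in env.items()])

print(joined_output)
```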

                    Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5. -

                    Example 6.11. for Loop in MP3FileInfo

                    
                    -    tagDataMap = {"title"   : (  3,  33, stripnulls),
                    -                  "artist"  : ( 33,  63, stripnulls),
                    -                  "album"   : ( 63,  93, stripnulls),
                    -                  "year"    : ( 93,  97, stripnulls),
                    -                  "comment" : ( 97, 126, stripnulls),
                    -                  "genre"   : (127, 128, ord)}
                    -    .
                    -    .
                    -    .
                    -            if tagdata[:3] == "TAG":
                    -                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                    -                    self[tag] = parseFunc(tagdata[start:end])
                    -
                      -
                    1. tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those - are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note - that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference. -
                    2. This looks complicated, but it's not. The structure of the for variables matches the structure of the elements of the list returned by items. Remember that items returns a list of tuples of the form (key, value). The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls. -
                    3. Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like. -
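The nested unpacking and slicing can be tried outside the class on a hand-built tag. This is a sketch under assumptions: it uses Python 3 text strings instead of the byte strings read from a real file, a plain dictionary `info` in place of self, and a 128-character tag assembled from the values shown in the earlier reading example.

```python
def stripnulls(data):
    "strip whitespace and nulls"
    return data.replace("\00", "").strip()


tag_data_map = {"title":   (3, 33, stripnulls),
                "artist":  (33, 63, stripnulls),
                "album":   (63, 93, stripnulls),
                "year":    (93, 97, stripnulls),
                "comment": (97, 126, stripnulls),
                "genre":   (127, 128, ord)}

# Assemble a fake 128-character tag: fixed-length fields padded with nulls.
tagdata = ("TAG"
           + "KAIRO".ljust(30, "\0")
           + "DJ MARY-JANE".ljust(30, "\0")
           + "Rave Mix".ljust(30, "\0")
           + "2000"
           + "http://mp3.com/DJMARYJANE".ljust(30, "\0")
           + "\037")

info = {}
if tagdata[:3] == "TAG":
    # Each value is a (start, end, function) tuple, unpacked in the for header.
    for tag, (start, end, parse_func) in tag_data_map.items():
        info[tag] = parse_func(tagdata[start:end])
```

Each pass through the loop slices one fixed-length field out of the tag and hands it to that field's own parsing function.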

                      6.4. Using sys.modules

                      -

                      Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules. + + +[for loop stuff was here] + + + + +

                      Example 6.12. Introducing sys.modules

                      >>> import sys        
                       >>> print '\n'.join(sys.modules.keys()) 
                       win32api
                      @@ -2353,608 +1910,17 @@ may already be familiar with from working on the command line.
                       
                    4. Python Library Reference documents the os module and the os.path module. -

                      6.6. Putting It All Together

                      -

                      Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all - fits together. -

                      Example 6.21. listDirectory

                      
                      -def listDirectory(directory, fileExtList):     
                      -    "get list of file info objects for files of particular extensions"
                      -    fileList = [os.path.normcase(f)
                      -                for f in os.listdir(directory)]           
                      -    fileList = [os.path.join(directory, f) 
                      -               for f in fileList
                      -                if os.path.splitext(f)[1] in fileExtList]        
                      -    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):       
                      -        "get file info class from filename extension"           
                      -        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]        
                      -        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo 
                      -    return [getFileInfoClass(f)(f) for f in fileList]            
                      -
                        -
                      1. listDirectory is the main attraction of this entire module. It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in - that directory. And it does it in just a few straightforward lines of code. -
                      2. As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList). -
                      3. Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function. The nested function getFileInfoClass can be called only from the function in which it is defined, listDirectory. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code - it. -
                      4. Now that you've seen the os module, this line should make more sense. It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting. So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo. -
                      5. Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually - exists in this module. If it does, you return the class, otherwise you return the base class FileInfo. This is a very important point: this function returns a class. Not an instance of a class, but the class itself. -
                      6. For each file in the “interesting files” list (fileList), you call getFileInfoClass with the filename (f). Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever - it is) and pass the filename (f again), to the __init__ method. As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a - list of the resulting instances. -

                        Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined -that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own -module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class: -HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results. -
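The name-based dispatch at the heart of listDirectory can be sketched on its own. Several things here are assumptions for illustration: the code is Python 3, a `types.SimpleNamespace` called `handlers` stands in for the module looked up through sys.modules, the classes are stripped down to subclasses of dict, and the three-argument form of getattr replaces the and/or expression.

```python
import os
import types


class FileInfo(dict):
    "fallback handler: stores only the filename"
    def __init__(self, filename=None):
        dict.__init__(self)
        self["name"] = filename


class MP3FileInfo(FileInfo):
    "handler chosen for .mp3 files (parsing omitted in this sketch)"


# Stand-in for the module object; the original uses sys.modules[FileInfo.__module__].
handlers = types.SimpleNamespace(FileInfo=FileInfo, MP3FileInfo=MP3FileInfo)


def get_file_info_class(filename, module=handlers):
    "get file info class from filename extension"
    subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
    # Return the named class if the namespace has it, otherwise the base class.
    return getattr(module, subclass, FileInfo)


mp3_class = get_file_info_class("/music/ap/mahadeva.mp3")
other_class = get_file_info_class("notes.txt")
instance = mp3_class("/music/ap/mahadeva.mp3")
```

Adding a new handler then means defining one more class with a matching name; the lookup function itself does not change.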

                        6.7. Summary

                        -

                        The fileinfo.py program introduced in Chapter 5 should now make perfect sense. -

                        
                        -"""Framework for getting filetype-specific metadata.
                         
                        -Instantiate appropriate class with filename. Returned object acts like a
                        -dictionary, with key-value pairs for each piece of metadata.
                        -    import fileinfo
                        -    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
                        -    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])
                         
                        -Or use listDirectory function to get info on all files in a directory.
                        -    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
                        -        ...
                         
                        -Framework can be extended by adding classes for particular file types, e.g.
                        -HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for
                        -parsing its files appropriately; see MP3FileInfo for example.
                        -"""
                        -import os
                        -import sys
                        -from UserDict import UserDict
                         
                        -def stripnulls(data):
                        -    "strip whitespace and nulls"
                        -    return data.replace("\00", "").strip()
                         
                        -class FileInfo(UserDict):
                        -    "store file metadata"
                        -    def __init__(self, filename=None):
                        -        UserDict.__init__(self)
                        -        self["name"] = filename
                        +[HTML stuff was here]
                         
                        -class MP3FileInfo(FileInfo):
                        -    "store ID3v1.0 MP3 tags"
                        -    tagDataMap = {"title"   : (  3,  33, stripnulls),
                        -                  "artist"  : ( 33,  63, stripnulls),
                        -                  "album"   : ( 63,  93, stripnulls),
                        -                  "year"    : ( 93,  97, stripnulls),
                        -                  "comment" : ( 97, 126, stripnulls),
                        -                  "genre"   : (127, 128, ord)}
                         
                        -    def __parse(self, filename):
                        -        "parse ID3v1.0 tags from MP3 file"
                        -        self.clear()
                        -        try:             
                        -            fsock = open(filename, "rb", 0)
                        -            try:         
                        -                fsock.seek(-128, 2)        
                        -                tagdata = fsock.read(128)  
                        -            finally:     
                        -                fsock.close()              
                        -            if tagdata[:3] == "TAG":
                        -                for tag, (start, end, parseFunc) in self.tagDataMap.items():
                        -                    self[tag] = parseFunc(tagdata[start:end])
                        -        except IOError:  
                        -            pass         
                         
                        -    def __setitem__(self, key, item):
                        -        if key == "name" and item:
                        -            self.__parse(item)
                        -        FileInfo.__setitem__(self, key, item)
                         
                        -def listDirectory(directory, fileExtList):    
                        -    "get list of file info objects for files of particular extensions"
                        -    fileList = [os.path.normcase(f)
                        -                for f in os.listdir(directory)]           
                        -    fileList = [os.path.join(directory, f) 
                        -               for f in fileList
                        -                if os.path.splitext(f)[1] in fileExtList] 
                        -    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):      
                        -        "get file info class from filename extension"           
                        -        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]       
                        -        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
                        -    return [getFileInfoClass(f)(f) for f in fileList]           
                         
                        -if __name__ == "__main__":
                        -    for info in listDirectory("/music/_singles/", [".mp3"]):
                        -        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
                        -        print
                        -

                        Before diving into the next chapter, make sure you're comfortable doing the following things: -

                        - -
                        -

                        Chapter 8. HTML Processing

                        -

                        8.1. Diving in

                        -

                        I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions. -

                        Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the docstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how -any of these class methods ever get called. Don't worry, all will be revealed in due time. -

                        Example 8.1. BaseHTMLProcessor.py

                        -

                        If you have not already done so, you can download this and other examples used in this book. -

                        
                        -from sgmllib import SGMLParser
                        -import htmlentitydefs
                        -
                        -class BaseHTMLProcessor(SGMLParser):
                        -    def reset(self):     
                        -        # extend (called by SGMLParser.__init__)
                        -        self.pieces = []
                        -        SGMLParser.reset(self)
                        -
                        -    def unknown_starttag(self, tag, attrs):
                        -        # called for each start tag
                        -        # attrs is a list of (attr, value) tuples
                        -        # e.g. for <pre class=screen>, tag="pre", attrs=[("class", "screen")]
                        -        # Ideally we would like to reconstruct original tag and attributes, but
                        -        # we may end up quoting attribute values that weren't quoted in the source
                        -        # document, or we may change the type of quotes around the attribute value
                        -        # (single to double quotes).
                        -        # Note that improperly embedded non-HTML code (like client-side Javascript)
                        -        # may be parsed incorrectly by the ancestor, causing runtime script errors.
                        -        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
                        -        # to ensure that it will pass through this parser unaltered (in handle_comment).
                        -        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
                        -        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
                        -
                        -    def unknown_endtag(self, tag):         
                        -        # called for each end tag, e.g. for </pre>, tag will be "pre"
                        -        # Reconstruct the original end tag.
                        -        self.pieces.append("</%(tag)s>" % locals())
                        -
                        -    def handle_charref(self, ref):         
                        -        # called for each character reference, e.g. for "&#160;", ref will be "160"
                        -        # Reconstruct the original character reference.
                        -        self.pieces.append("&#%(ref)s;" % locals())
                        -
                        -    def handle_entityref(self, ref):       
                        -        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
                        -        # Reconstruct the original entity reference.
                        -        self.pieces.append("&%(ref)s" % locals())
                        -        # standard HTML entities are closed with a semicolon; other entities are not
                        -        if htmlentitydefs.entitydefs.has_key(ref):
                        -            self.pieces.append(";")
                        -
                        -    def handle_data(self, text):           
                        -        # called for each block of plain text, i.e. outside of any tag and
                        -        # not containing any character or entity references
                        -        # Store the original text verbatim.
                        -        self.pieces.append(text)
                        -
                        -    def handle_comment(self, text):        
                        -        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
                        -        # Reconstruct the original comment.
                        -        # It is especially important that the source document enclose client-side
                        -        # code (like Javascript) within comments so it can pass through this
                        -        # processor undisturbed; see comments in unknown_starttag for details.
                        -        self.pieces.append("<!--%(text)s-->" % locals())
                        -
                        -    def handle_pi(self, text):             
                        -        # called for each processing instruction, e.g. <?instruction>
                        -        # Reconstruct original processing instruction.
                        -        self.pieces.append("<?%(text)s>" % locals())
                        -
                        -    def handle_decl(self, text):
                        -        # called for the DOCTYPE, if present, e.g.
                        -        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
                        -        #     "http://www.w3.org/TR/html4/loose.dtd">
                        -        # Reconstruct original DOCTYPE
                        -        self.pieces.append("<!%(text)s>" % locals())
                        -
                        -    def output(self):              
                        -        """Return processed HTML as a single string"""
                        -        return "".join(self.pieces)

                        Example 8.2. dialect.py

                        
                        -import re
                        -from BaseHTMLProcessor import BaseHTMLProcessor
                        -
                        -class Dialectizer(BaseHTMLProcessor):
                        -    subs = ()
                        -
                        -    def reset(self):
                        -        # extend (called from __init__ in ancestor)
                        -        # Reset all data attributes
                        -        self.verbatim = 0
                        -        BaseHTMLProcessor.reset(self)
                        -
                        -    def start_pre(self, attrs):            
                        -        # called for every <pre> tag in HTML source
                        -        # Increment verbatim mode count, then handle tag like normal
                        -        self.verbatim += 1                 
                        -        self.unknown_starttag("pre", attrs)
                        -
                        -    def end_pre(self):   
                        -        # called for every </pre> tag in HTML source
                        -        # Decrement verbatim mode count
                        -        self.unknown_endtag("pre")         
                        -        self.verbatim -= 1                 
                        -
                        -    def handle_data(self, text):    
                        -        # override
                        -        # called for every block of text in HTML source
                        -        # If in verbatim mode, save text unaltered;
                        -        # otherwise process the text with a series of substitutions
                        -        self.pieces.append(self.verbatim and text or self.process(text))
                        -
                        -    def process(self, text):
                        -        # called from handle_data
                        -        # Process text block by performing series of regular expression
                        -        # substitutions (actual substitutions are defined in descendant)
                        -        for fromPattern, toPattern in self.subs:
                        -            text = re.sub(fromPattern, toPattern, text)
                        -        return text
                        -
                        -class ChefDialectizer(Dialectizer):
                        -    """convert HTML to Swedish Chef-speak
                        -    
                        -    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
                        -    """
                        -    subs = ((r'a([nu])', r'u\1'),
                        -            (r'A([nu])', r'U\1'),
                        -            (r'a\B', r'e'),
                        -            (r'A\B', r'E'),
                        -            (r'en\b', r'ee'),
                        -            (r'\Bew', r'oo'),
                        -            (r'\Be\b', r'e-a'),
                        -            (r'\be', r'i'),
                        -            (r'\bE', r'I'),
                        -            (r'\Bf', r'ff'),
                        -            (r'\Bir', r'ur'),
                        -            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
                        -            (r'\bow', r'oo'),
                        -            (r'\bo', r'oo'),
                        -            (r'\bO', r'Oo'),
                        -            (r'the', r'zee'),
                        -            (r'The', r'Zee'),
                        -            (r'th\b', r't'),
                        -            (r'\Btion', r'shun'),
                        -            (r'\Bu', r'oo'),
                        -            (r'\BU', r'Oo'),
                        -            (r'v', r'f'),
                        -            (r'V', r'F'),
                        -            (r'w', r'w'),
                        -            (r'W', r'W'),
                        -            (r'([a-z])[.]', r'\1. Bork Bork Bork!'))
                        -
                        -class FuddDialectizer(Dialectizer):
                        -    """convert HTML to Elmer Fudd-speak"""
                        -    subs = ((r'[rl]', r'w'),
                        -            (r'qu', r'qw'),
                        -            (r'th\b', r'f'),
                        -            (r'th', r'd'),
                        -            (r'n[.]', r'n, uh-hah-hah-hah.'))
                        -
                        -class OldeDialectizer(Dialectizer):
                        -    """convert HTML to mock Middle English"""
                        -    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
                        -            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
                        -            (r'ick\b', r'yk'),
                        -            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
                        -            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
                        -            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
                        -            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
                        -            (r'([aeiou])re\b', r'\1r'),
                        -            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
                        -            (r'tion\b', r'cioun'),
                        -            (r'ion\b', r'ioun'),
                        -            (r'aid', r'ayde'),
                        -            (r'ai', r'ey'),
                        -            (r'ay\b', r'y'),
                        -            (r'ay', r'ey'),
                        -            (r'ant', r'aunt'),
                        -            (r'ea', r'ee'),
                        -            (r'oa', r'oo'),
                        -            (r'ue', r'e'),
                        -            (r'oe', r'o'),
                        -            (r'ou', r'ow'),
                        -            (r'ow', r'ou'),
                        -            (r'\bhe', r'hi'),
                        -            (r've\b', r'veth'),
                        -            (r'se\b', r'e'),
                        -            (r"'s\b", r'es'),
                        -            (r'ic\b', r'ick'),
                        -            (r'ics\b', r'icc'),
                        -            (r'ical\b', r'ick'),
                        -            (r'tle\b', r'til'),
                        -            (r'll\b', r'l'),
                        -            (r'ould\b', r'olde'),
                        -            (r'own\b', r'oune'),
                        -            (r'un\b', r'onne'),
                        -            (r'rry\b', r'rye'),
                        -            (r'est\b', r'este'),
                        -            (r'pt\b', r'pte'),
                        -            (r'th\b', r'the'),
                        -            (r'ch\b', r'che'),
                        -            (r'ss\b', r'sse'),
                        -            (r'([wybdp])\b', r'\1e'),
                        -            (r'([rnt])\b', r'\1\1e'),
                        -            (r'from', r'fro'),
                        -            (r'when', r'whan'))
                        -
                        -def translate(url, dialectName="chef"):
                        -    """fetch URL and translate using dialect
                        -    
                        -    dialect in ("chef", "fudd", "olde")"""
                        -    import urllib    
                        -    sock = urllib.urlopen(url)         
                        -    htmlSource = sock.read()           
                        -    sock.close()     
                        -    parserName = "%sDialectizer" % dialectName.capitalize()
                        -    parserClass = globals()[parserName]  
                        -    parser = parserClass()               
                        -    parser.feed(htmlSource)
                        -    parser.close()         
                        -    return parser.output() 
                        -
                        -def test(url):
                        -    """test all dialects against URL"""
                        -    for dialect in ("chef", "fudd", "olde"):
                        -        outfile = "%s.html" % dialect
                        -        fsock = open(outfile, "wb")
                        -        fsock.write(translate(url, dialect))
                        -        fsock.close()
                        -        import webbrowser
                        -        webbrowser.open_new(outfile)
                        -
                        -if __name__ == "__main__":
                        -    test("http://diveintopython3.org/odbchelper_list.html")

                        Example 8.3. Output of dialect.py

                        -

                        Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the - code listings and screen examples were left untouched. -

                        
                        -<div class=abstract>
                        -<p>Lists awe <span class=application>Pydon</span>'s wowkhowse datatype.
                        -If youw onwy expewience wif wists is awways in
                        -<span class=application>Visuaw Basic</span> ow (God fowbid) de datastowe
                        -in <span class=application>Powewbuiwdew</span>, bwace youwsewf fow
                        -<span class=application>Pydon</span> wists.</p>
                        -</div>
                        -
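To see the text translation on its own, without any HTML parsing, here is a minimal sketch that applies the FuddDialectizer substitutions from dialect.py to a plain string. The standalone process function and the FUDD_SUBS name are assumptions made for illustration; in dialect.py the same loop lives in Dialectizer.process and reads self.subs.

```python
import re

# Substitution pairs copied from FuddDialectizer.subs in dialect.py.
FUDD_SUBS = ((r'[rl]', r'w'),
             (r'qu', r'qw'),
             (r'th\b', r'f'),
             (r'th', r'd'),
             (r'n[.]', r'n, uh-hah-hah-hah.'))

def process(text, subs):
    """Apply each (pattern, replacement) pair to the text, in order."""
    for from_pattern, to_pattern in subs:
        text = re.sub(from_pattern, to_pattern, text)
    return text

translated = process("Lists are Python's workhorse datatype.", FUDD_SUBS)
print(translated)
```

The order of the pairs matters: each substitution sees the output of the one before it, which is why th\b is listed ahead of the more general th.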

                        8.2. Introducing sgmllib.py

                        -

                        HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library. -

                        The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags -and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally. -

                        sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece, -it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method. -

                        SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them: -

                        -
                        -
                        Start tag
                        -
                        An HTML tag that starts a block, like <html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes. -
                        -
                        End tag
                        -
                        An HTML tag that ends a block, like </html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name. -
                        -
                        Character reference
                        -
                        An escaped character referenced by its decimal or hexadecimal equivalent, like &#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent. -
                        -
                        Entity reference
                        -
                        An HTML entity, like &copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity. -
                        -
                        Comment
                        -
                        An HTML comment, enclosed in <!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment. -
                        -
                        Processing instruction
                        -
                        An HTML processing instruction, enclosed in <? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction. -
                        -
                        Declaration
                        -
                        An HTML declaration, such as a DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration. -
                        -
                        Text data
                        -
                        A block of text. Anything that doesn't fit into the other 7 categories. When found, SGMLParser calls handle_data with the text. -
                        -
                        - - -
                        ImportantPython 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1. -

                        sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing -the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments. - - -
                        TipIn the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces. -

                        Example 8.4. Sample test of sgmllib.py

                        -

                        Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.) -

                        -c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"
                        -
                        -<!DOCTYPE html
                        -  PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
                        -<html>
                        -   <head>
                        -      <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
                        -   
                        -      <title>Dive Into Python</title>
                        -      <link rel="stylesheet" href="diveintopython3.css" type="text/css">
                        -
                        -... rest of file omitted for brevity ...
                        -

                        Running this through the test suite of sgmllib.py yields this output:

                        -c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
                        -data: '\n\n'
                        -start tag: <html >
                        -data: '\n   '
                        -start tag: <head>
                        -data: '\n      '
                        -start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
                        -data: '\n   \n      '
                        -start tag: <title>
                        -data: 'Dive Into Python'
                        -end tag: </title>
                        -data: '\n      '
                        -start tag: <link rel="stylesheet" href="diveintopython3.css" type="text/css" >
                        -data: '\n      '
                        -
                        -... rest of output omitted for brevity ...
                        -
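If you are on Python 3, sgmllib is no longer in the standard library, so you can't run this test suite as shown. As a rough equivalent, the sketch below uses html.parser.HTMLParser, a different parser with similarly named handler methods, to record each piece as it is found. The EventPrinter class and the sample HTML string are assumptions made for illustration, not part of sgmllib.py.

```python
from html.parser import HTMLParser

class EventPrinter(HTMLParser):
    """Record every piece the parser reports, in the order it is found."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", (tag, attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_pi(self, data):
        self.events.append(("pi", data))

parser = EventPrinter()
parser.feed('<!DOCTYPE html><p class="intro">Dive &amp; Python&#160;3</p><!-- note -->')
parser.close()
for kind, value in parser.events:
    print("%s: %r" % (kind, value))
```

Unlike SGMLParser, HTMLParser does not look for start_tagname or end_tagname methods; every start tag goes to handle_starttag and every end tag to handle_endtag.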

                        Here's the roadmap for the rest of the chapter: -

                        -
                          -
                        • Subclass SGMLParser to create classes that extract interesting data out of HTML documents. - -
                        • Subclass SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces. - -
                        • Subclass BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags. - -
                        • Subclass Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data. - -
                        • Write a test suite that grabs a real web page from http://diveintopython3.org/ and processes it. - -
                        -

                        Along the way, you'll also learn about locals, globals, and dictionary-based string formatting. -
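As a quick preview of dictionary-based string formatting with locals, here is a minimal sketch modeled on the unknown_starttag method of BaseHTMLProcessor. The standalone start_tag function is an assumption made for illustration; it only shows how %(name)s placeholders are filled in from the dictionary that locals() returns.

```python
def start_tag(tag, attrs):
    """Rebuild a start tag from its name and a list of (attr, value) tuples."""
    strattrs = "".join(' %s="%s"' % (key, value) for key, value in attrs)
    # locals() returns a dictionary of this function's local variables,
    # so %(tag)s and %(strattrs)s are looked up by name.
    return "<%(tag)s%(strattrs)s>" % locals()

rebuilt = start_tag("pre", [("class", "screen")])
print(rebuilt)
```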

                        8.3. Extracting data from HTML documents

                        -

                        To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture. -

                        The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages. -

                        Example 8.5. Introducing urllib

                        ->>> import urllib   
                        ->>> sock = urllib.urlopen("http://diveintopython3.org/") 
                        ->>> htmlSource = sock.read()          
                        ->>> sock.close()    
                        ->>> print htmlSource
                        -<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
                        -      <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
                        -   <title>Dive Into Python</title>
                        -<link rel='stylesheet' href='diveintopython3.css' type='text/css'>
                        -<link rev='made' href='mailto:mark@diveintopython3.org'>
                        -<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
                        -<meta name='description' content='a free Python tutorial for experienced programmers'>
                        -</head>
                        -<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
                        -<table cellpadding='0' cellspacing='0' border='0' width='100%'>
                        -<tr><td class='header' width='1%' valign='top'>diveintopython3.org</td>
                        -<td width='99%' align='right'><hr size='1' noshade></td></tr>
                        -<tr><td class='tagline' colspan='2'>Python&nbsp;for&nbsp;experienced&nbsp;programmers</td></tr>
                        -
                        -[...snip...]
                        -
                          -
                        1. The urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages). -
                        2. The simplest use of urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object. -
                        3. The simplest thing to do with the file-like object returned by urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list. -
                        4. When you're done with the object, make sure to close it, just like a normal file object. -
                        5. You now have the complete HTML of the home page of http://diveintopython3.org/ in a string, and you're ready to parse it. -
                          -
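In Python 3, urlopen lives in urllib.request and read returns bytes rather than a string. The following is a minimal sketch of the same open, read, close sequence under that assumption. The fetch_html helper is hypothetical, and the data: URL is a stand-in used only so the example runs without a network connection; you would normally pass an http URL instead.

```python
from urllib.parse import quote
from urllib.request import urlopen

def fetch_html(url):
    """Open a URL, read everything, and close the file-like object."""
    # The with statement closes the object even if read() raises.
    with urlopen(url) as sock:
        raw = sock.read()
    # Assumes UTF-8; a real page may declare a different charset.
    return raw.decode("utf-8", errors="replace")

# Hypothetical sample page, embedded in a data: URL.
sample_url = "data:text/html," + quote("<title>Dive Into Python</title>")
html_source = fetch_html(sample_url)
print(html_source)
```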

                          If you have not already done so, you can download this and other examples used in this book. -

                          
                          -from sgmllib import SGMLParser
                          -
                          -class URLLister(SGMLParser):
                          -    def reset(self):            
                          -        SGMLParser.reset(self)
                          -        self.urls = []
                          -
                          -    def start_a(self, attrs):   
                          -        href = [v for k, v in attrs if k=='href']  
                          -        if href:
                          -            self.urls.extend(href)
                          -
                            -
                          1. reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization, - do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance. -
                          2. start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list. -
                          3. You can find out whether this <a> tag has an href attribute with a simple multi-variable list comprehension. -
                          4. String comparisons like k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs. -

                            Example 8.7. Using urllister.py

                            ->>> import urllib, urllister
                            ->>> usock = urllib.urlopen("http://diveintopython3.org/")
                            ->>> parser = urllister.URLLister()
                            ->>> parser.feed(usock.read())         
                            ->>> usock.close()   
                            ->>> parser.close()  
                            ->>> for url in parser.urls: print url 
                            -toc/index.html
                            -#download
                            -#languages
                            -toc/index.html
                            -appendix/history.html
                            -download/diveintopython3-html-5.0.zip
                            -download/diveintopython3-pdf-5.0.zip
                            -download/diveintopython3-word-5.0.zip
                            -download/diveintopython3-text-5.0.zip
                            -download/diveintopython3-html-flat-5.0.zip
                            -download/diveintopython3-xml-5.0.zip
                            -download/diveintopython3-common-5.0.zip
                            -
                            -
                            -... rest of output omitted for brevity ...
                            -
                              -
                            1. Call the feed method, defined in SGMLParser, to get HTML into the parser. -[1] It takes a string, which is what usock.read() returns. -
                            2. Like files, you should close your URL objects as soon as you're done with them. -
                            3. You should close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed. -
                            4. Once the parser is closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.) -
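The same feed, close, and read-the-results pattern can be sketched in Python 3, where sgmllib is not available. This version assumes html.parser.HTMLParser as a stand-in for SGMLParser; because HTMLParser has no start_a hook, the tag name is checked inside handle_starttag instead. The sample HTML string is made up for illustration.

```python
from html.parser import HTMLParser

class URLLister(HTMLParser):
    def reset(self):
        # reset is called by HTMLParser.__init__, so initialize state here.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive already converted to lowercase.
        if tag == "a":
            self.urls.extend(v for k, v in attrs if k == "href" and v is not None)

parser = URLLister()
parser.feed('<a href="toc/index.html">Contents</a> '
            '<a name="top"></a> '
            '<A HREF="#download">Get it</A>')
parser.close()  # flush anything the parser is still buffering
for url in parser.urls:
    print(url)
```

The second link has no href attribute, so it contributes nothing to parser.urls.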

                              8.4. Introducing BaseHTMLProcessor.py

                              -

                              SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it - finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll - take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer. -

                              BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data. -

                              Example 8.8. Introducing BaseHTMLProcessor

                              
                              -class BaseHTMLProcessor(SGMLParser):
                              -    def reset(self):      
                              -        self.pieces = []
                              -        SGMLParser.reset(self)
                              -
                              -    def unknown_starttag(self, tag, attrs): 
                              -        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
                              -        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
                              -
                              -    def unknown_endtag(self, tag):          
                              -        self.pieces.append("</%(tag)s>" % locals())
                              -
                              -    def handle_charref(self, ref):          
                              -        self.pieces.append("&#%(ref)s;" % locals())
                              -
                              -    def handle_entityref(self, ref):        
                              -        self.pieces.append("&%(ref)s" % locals())
                              -        if htmlentitydefs.entitydefs.has_key(ref):
                              -            self.pieces.append(";")
                              -
                              -    def handle_data(self, text):            
                              -        self.pieces.append(text)
                              -
                              -    def handle_comment(self, text):         
                              -        self.pieces.append("<!--%(text)s-->" % locals())
                              -
                              -    def handle_pi(self, text):              
                              -        self.pieces.append("<?%(text)s>" % locals())
                              -
                              -    def handle_decl(self, text):
                              -        self.pieces.append("<!%(text)s>" % locals())
                              -
                                -
                              1. reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but Python is much more efficient at dealing with lists.[2]
                              2. Since BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter. -
                              3. Reconstructing end tags is much simpler; just take the tag name and wrap it in the </...> brackets. -
                              4. When SGMLParser finds a character reference, it calls handle_charref with the bare reference. If the HTML document contains the reference &#160;, ref will be 160. Reconstructing the original complete character reference just involves wrapping ref in &#...; characters. -
                              5. Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference requires wrapping ref in &...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.)
                              6. Blocks of text are simply appended to self.pieces unaltered. -
                              7. HTML comments are wrapped in <!--...--> characters. -
                              8. Processing instructions are wrapped in <?...> characters. - - -
                                Important: The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script within HTML comments.

                                Example 8.9. BaseHTMLProcessor output

                                
                                -    def output(self):               
                                -        """Return processed HTML as a single string"""
                                -        return "".join(self.pieces) 
                                -
                                  -
                                1. This is the one method in BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it. -
                                2. If you prefer, you could use the join method of the string module instead: string.join(self.pieces, "")
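The collect-then-join pattern behind output can be tried on its own. This is only a minimal sketch: the fragments below are made-up stand-ins for what the handler methods would append, and the string method form of join is used because string.join exists only in Python 2's string module.

```python
# Hypothetical fragments, standing in for what the handler methods append.
pieces = []
for fragment in ("<p>", "Hello, ", "world", "</p>"):
    pieces.append(fragment)   # the list simply grows by one item per call

# Build the complete string once, only when somebody asks for it.
html = "".join(pieces)
print(html)
```

Nothing is concatenated until the final join, which is the same trade the output method makes.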
                                  -

                                  Further reading

                                  -

                                  8.5. locals and globals

                                  Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.

                                  Remember locals? You first saw it here: @@ -3050,605 +2016,17 @@ print "z=",z

                                3. This prints x= 1, not x= 2.
                                4. After being burned by locals, you might think that this wouldn't change the value of z, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals. So any changes to the dictionary returned by globals directly affect your global variables.
                                5. This prints z= 8, not z= 7. -
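If you want to see the difference between the two functions in isolation, here is a small sketch. The function names try_locals, try_globals, and read_z are invented for illustration; the point is only that writing to the dictionary from locals() inside a function leaves the real local variable alone, while writing to the dictionary from globals() changes the module namespace.

```python
z = 7  # a module-level (global) variable

def try_locals():
    x = 1
    locals()["x"] = 2     # changes a dictionary, not the real local variable
    return x              # still 1

def try_globals():
    globals()["z"] = 8    # changes the actual global namespace

def read_z():
    return z              # looks z up in the global namespace

print(try_locals())
try_globals()
print(read_z())
```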

                                  8.6. Dictionary-based string formatting

                                  -

                                  Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're constantly switching between reading the string and reading the tuple of values.

                                  There is an alternative form of string formatting that uses dictionaries instead of tuples of values. -

                                  Example 8.13. Introducing dictionary-based string formatting

                                  ->>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
                                  ->>> "%(pwd)s" % params
                                  -'secret'
                                  ->>> "%(pwd)s is not a good password for %(uid)s" % params 
                                  -'secret is not a good password for sa'
                                  ->>> "%(database)s of mind, %(database)s of body" % params 
                                  -'master of mind, master of body'
                                  -
                                    -
                                  1. Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple %s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker.
                                  2. Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the formatting will fail with a KeyError.
                                  3. You can even specify the same key twice; each occurrence will be replaced with the same value. -
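The KeyError mentioned above is easy to provoke. The sketch below reuses the params dictionary from the example; the "port" key is a deliberately missing name chosen for illustration.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Every name in the format string is looked up as a key in params.
message = "%(uid)s connects to %(database)s on %(server)s" % params
print(message)

try:
    "%(port)s" % params        # "port" is not a key in params
    missing = None
except KeyError as err:
    missing = err.args[0]      # the key that could not be found
print(missing)
```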

                                    So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of meaningful keys and values already. Like locals.

                                    Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py

                                    
                                    -    def handle_comment(self, text):        
                                    -        self.pieces.append("<!--%(text)s-->" % locals()) 
                                    -
                                    -
                                      -
                                    1. Using the built-in locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables within your string (in this case, text, which was passed to the method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'.

                                      Example 8.15. More dictionary-based string formatting

                                      
                                      -    def unknown_starttag(self, tag, attrs):
                                      -        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) 
                                      -        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())    
                                      -
                                      -
                                        -
                                      1. When this method is called, attrs is a list of key/value tuples, just like the items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down: -
                                        -
                                          -
                                        1. Suppose attrs is [('href', 'index.html'), ('title', 'Go to home page')]. -
                                        2. In the first round of the list comprehension, key will get 'href', and value will get 'index.html'. -
                                        3. The string formatting ' %s="%s"' % (key, value) will resolve to ' href="index.html"'. This string becomes the first element of the list comprehension's return value. -
                                        4. In the second round, key will get 'title', and value will get 'Go to home page'. -
                                        5. The string formatting will resolve to ' title="Go to home page"'. -
                                        6. The list comprehension returns a list of these two resolved strings, and "".join combines both elements of this list into the single string ' href="index.html" title="Go to home page"', which is assigned to strattrs. +[XML stuff was here] -
                                        -
                                      2. Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is 'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces. - -
                                        Important: Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a slight performance hit in making the call to locals, since locals builds a copy of the local namespace.
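To make the trade-off concrete, here is a rough sketch of the same start-tag reconstruction written as two standalone functions. Both function names are hypothetical; the first follows the locals approach from the example, and the second passes an explicit dictionary with only the two names the format string needs, which avoids the call to locals at the cost of a little repetition.

```python
def build_start_tag(tag, attrs):
    # Same formatting as unknown_starttag, using locals() as the dictionary.
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    return "<%(tag)s%(strattrs)s>" % locals()

def build_start_tag_explicit(tag, attrs):
    # Alternative: spell out the two keys instead of copying the namespace.
    strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
    return "<%(tag)s%(strattrs)s>" % {"tag": tag, "strattrs": strattrs}

attrs = [("href", "index.html"), ("title", "Go to home page")]
print(build_start_tag("a", attrs))
print(build_start_tag_explicit("a", attrs))
```

Both calls produce the same string; which one reads better is a matter of taste.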

                                        8.7. Quoting attribute values

                                        -

                                        A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor. -

                                        BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes at all. It is this last side effect that you can take advantage of.

                                        Example 8.16. Quoting attribute values

                                        ->>> htmlSource = """        
                                        -...    <html>
                                        -...    <head>
                                        -...    <title>Test page</title>
                                        -...    </head>
                                        -...    <body>
                                        -...    <ul>
                                        -...    <li><a href=index.html>Home</a></li>
                                        -...    <li><a href=toc.html>Table of contents</a></li>
                                        -...    <li><a href=history.html>Revision history</a></li>
                                        -...    </body>
                                        -...    </html>
                                        -...    """
                                        ->>> from BaseHTMLProcessor import BaseHTMLProcessor
                                        ->>> parser = BaseHTMLProcessor()
                                        ->>> parser.feed(htmlSource) 
                                        ->>> print parser.output()   
                                        -<html>
                                        -<head>
                                        -<title>Test page</title>
                                        -</head>
                                        -<body>
                                        -<ul>
                                        -<li><a href="index.html">Home</a></li>
                                        -<li><a href="toc.html">Table of contents</a></li>
                                        -<li><a href="history.html">Revision history</a></li>
                                        -</body>
                                        -</html>
                                        -
                                          -
                                        1. Note that the attribute values of the href attributes in the <a> tags are not properly quoted. (Also note that you're using triple quotes for something other than a docstring. And directly in the IDE, no less. They're very useful.) -
                                        2. Feed the parser. -
                                        3. Using the output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string.
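The sgmllib module was removed in Python 3, so the session above only runs under Python 2. If you want to try the same idea on Python 3, the sketch below uses html.parser.HTMLParser as a stand-in. That is an assumption on my part, not the book's code: the class name AttributeQuoter is made up, the handler names differ from SGMLParser's (handle_starttag instead of unknown_starttag, for example), and only start tags, end tags, and text are reconstructed.

```python
from html.parser import HTMLParser  # Python 3 standard library

class AttributeQuoter(HTMLParser):
    """Hypothetical, reduced stand-in for BaseHTMLProcessor."""

    def reset(self):
        # Called by HTMLParser.__init__, much like SGMLParser's reset.
        self.pieces = []
        HTMLParser.reset(self)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; a valueless attribute
        # would arrive with value None, which this sketch does not handle.
        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
        self.pieces.append("<%s%s>" % (tag, strattrs))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        return "".join(self.pieces)

parser = AttributeQuoter()
parser.feed("<ul><li><a href=index.html>Home</a></li></ul>")
parser.close()
print(parser.output())
```

Comments, declarations, and entity references are left out to keep the sketch short, so it is not a drop-in replacement for the full class.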

                                          8.8. Introducing dialect.py

                                          -

                                          Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered. -

                                          To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre. -

                                          Example 8.17. Handling specific tags

                                          
                                          -    def start_pre(self, attrs):             
                                          -        self.verbatim += 1
                                          -        self.unknown_starttag("pre", attrs) 
                                           
                                          -    def end_pre(self):    
                                          -        self.unknown_endtag("pre")          
                                          -        self.verbatim -= 1
                                          -
                                            -
                                          1. start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just like unknown_starttag takes. -
                                          2. In the reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter. (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use. -
                                          3. That's it, that's the only special processing you do for <pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing. -
                                          4. end_pre is called every time SGMLParser finds a </pre> tag. Since end tags cannot contain attributes, the method takes no parameters.
                                          5. First, you want to do the default processing, just like any other end tag. -
                                          6. Second, you decrement your counter to signal that this <pre> block has been closed. -

                                            At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding. -

                                            Example 8.18. SGMLParser

                                            
                                            -    def finish_starttag(self, tag, attrs):               
                                            -        try:        
                                            -            method = getattr(self, 'start_' + tag)       
                                            -        except AttributeError:         
                                            -            try:    
                                            -                method = getattr(self, 'do_' + tag)      
                                            -            except AttributeError:    
                                            -                self.unknown_starttag(tag, attrs)        
                                            -                return -1             
                                            -            else:   
                                            -                self.handle_starttag(tag, method, attrs) 
                                            -                return 0              
                                            -        else:       
                                            -            self.stack.append(tag)    
                                            -            self.handle_starttag(tag, method, attrs)    
                                            -            return 1 
                                             
                                            -    def handle_starttag(self, tag, method, attrs):      
                                            -        method(attrs)
                                            -
                                              -
                                            1. At this point, SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
                                            2. The “magic” of SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr, called on an instance, finds methods defined in that instance's own class, even when that class is a descendant of the class whose code is making the call. Here the object is self, the current instance. So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
                                            3. getattr raises an AttributeError if the method it's looking for doesn't exist on the object (neither in its own class nor in any ancestor class), but that's okay, because you wrapped the call to getattr inside a try...except block and explicitly caught the AttributeError.
                                            4. Since you didn't find a start_xxx method, you'll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag. But you can use either naming scheme; as you can see, SGMLParser tries both for every tag. (You shouldn't define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.) -
                                            5. Another AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag. -
                                            6. Remember, try...except blocks can have an else clause, which runs if no exception is raised in the try block. Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
                                            7. By the way, don't worry about these different return values; in theory they mean something, but they're never actually used. Don't worry about the self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.
                                            8. start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named, or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
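The lookup order described above (start_xxx first, then do_xxx, then the fallback) can be sketched without sgmllib at all. Everything in this block is hypothetical: TinyDispatcher and PreAware are invented names, and the sketch uses the three-argument form of getattr with a default of None in place of the nested try...except blocks, which is a variation on the technique rather than what SGMLParser itself does.

```python
class TinyDispatcher(object):
    """Hypothetical class mimicking the lookup order in finish_starttag."""

    def __init__(self):
        self.log = []

    def dispatch(self, tag, attrs):
        # Prefer start_<tag>, then do_<tag>, then the generic fallback.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag, None)
            if method is not None:
                method(attrs)      # method is a bound method object
                return
        self.unknown_starttag(tag, attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append(("unknown", tag))


class PreAware(TinyDispatcher):
    # The base class never mentions these names; getattr finds them anyway.
    def start_pre(self, attrs):
        self.log.append(("start_pre", attrs))

    def do_br(self, attrs):
        self.log.append(("do_br", attrs))


p = PreAware()
p.dispatch("pre", [])
p.dispatch("br", [])
p.dispatch("p", [])
print(p.log)
```

The base class finds start_pre and do_br even though they are defined only in the descendant, because the lookup happens on self.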

                                              Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that, you need to override the handle_data method.

                                              Example 8.19. Overriding the handle_data method

                                              
                                              -    def handle_data(self, text):     
                                              -        self.pieces.append(self.verbatim and text or self.process(text)) 
                                              -
                                                -
                                              1. handle_data is called with only one argument, the text to process. -
                                              2. In the ancestor BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
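The and-or trick has one catch worth seeing in action: if the middle value is false (an empty string, say), the expression falls through to the last part. The sketch below compares it with the conditional expression that Python gained in version 2.5, after this text was written. The process function here is a made-up placeholder that wraps text in brackets so the difference is visible; it is not the real substitution logic.

```python
def process(text):
    # Hypothetical substitution step, standing in for Dialectizer.process.
    return "[" + text + "]"

def choose_and_or(verbatim, text):
    # The and-or trick, as used in handle_data.
    return verbatim and text or process(text)

def choose_conditional(verbatim, text):
    # Conditional expression (Python 2.5 and later).
    return text if verbatim else process(text)

print(choose_and_or(1, "abc"))
print(choose_and_or(0, "abc"))
print(repr(choose_and_or(1, "")))        # empty text falls through to process
print(repr(choose_conditional(1, "")))   # empty text passes through unaltered
```

For non-empty text the two forms agree, so the one-liner in handle_data behaves as described as long as it is never handed an empty string.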

                                                You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough for one chapter.

                                                8.9. Putting it all together

                                                -

                                                It's time to put everything you've learned so far to good use. I hope you were paying attention. -

                                                Example 8.20. The translate function, part 1

                                                
                                                -def translate(url, dialectName="chef"): 
                                                -    import urllib     
                                                -    sock = urllib.urlopen(url)          
                                                -    htmlSource = sock.read()           
                                                -    sock.close()     
                                                -
                                                -
                                                  -
                                                1. The translate function has an optional argument dialectName, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute. -
                                                2. Hey, wait a minute, there's an import statement in this function! That's perfectly legal in Python. You're used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll appreciate this.)
                                                3. Now you get the source of the given URL. -

                                                  Example 8.21. The translate function, part 2: curiouser and curiouser

                                                  
                                                  -    parserName = "%sDialectizer" % dialectName.capitalize() 
                                                  -    parserClass = globals()[parserName]   
                                                  -    parser = parserClass()                
                                                  -
                                                  -
                                                    -
                                                  1. capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.
                                                  2. You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer. -
                                                  3. Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.
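The three steps above fit in a few lines if you strip away the HTML processing. In this sketch the two classes are empty placeholders that merely borrow the naming scheme, and make_parser is a hypothetical helper, not part of dialect.py.

```python
class ChefDialectizer(object):
    pass   # empty placeholder; the real class does text substitutions

class FuddDialectizer(object):
    pass   # empty placeholder

def make_parser(dialectName="chef"):
    # 'chef' -> 'Chef' -> 'ChefDialectizer'
    parserName = "%sDialectizer" % dialectName.capitalize()
    # Look the class up by name in the module's global namespace;
    # an unknown dialect name raises KeyError here.
    parserClass = globals()[parserName]
    # Calling the class object creates an instance.
    return parserClass()

print(type(make_parser("chef")).__name__)
print(type(make_parser("FUDD")).__name__)   # capitalize also lowercases the rest
```

Adding a third class that follows the same naming scheme would require no change to make_parser.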

                                                    Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName. -

                                                    Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file. -

                                                    Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted -value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page. -

                                                    Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't -seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an -appropriately named file in the plug-ins directory (like foodialect.py, which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go. -
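Here is a rough sketch of how such a lookup might go. It is only one possible approach, using importlib.import_module from the standard library of current Python versions rather than whatever technique the later chapter covers. The helper name load_dialectizer and the naming scheme are assumptions, and the foodialect module is built in memory so the sketch runs by itself; in a real plug-in setup it would be an actual foodialect.py file on the import path.

```python
import importlib
import sys
import types

# Hypothetical plug-in module, created in memory for this sketch only.
plugin = types.ModuleType("foodialect")
exec("class FooDialectizer:\n    pass\n", plugin.__dict__)
sys.modules["foodialect"] = plugin

def load_dialectizer(dialectName):
    """Return the Dialectizer class for dialectName (assumed naming scheme)."""
    moduleName = "%sdialect" % dialectName.lower()          # 'foo' -> 'foodialect'
    className = "%sDialectizer" % dialectName.capitalize()  # 'foo' -> 'FooDialectizer'
    module = importlib.import_module(moduleName)            # import by string name
    return getattr(module, className)                       # look up the class in it

parserClass = load_dialectizer("foo")
parser = parserClass()
print(parserClass.__name__)
```

The point of the design is that translate never needs a list of dialects: the file name and class name are both derived from the one string it is given.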

                                                    Example 8.22. The translate function, part 3

                                                    
                                                    -    parser.feed(htmlSource) 
                                                    -    parser.close()          
                                                    -    return parser.output()  
                                                    -
                                                    -
                                                      -
                                                    1. After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. You had the entire HTML source in a single string, so you only had to call feed once. However, you can call feed as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you - were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same. -
                                                    2. Because feed maintains an internal buffer, you should always call the parser's close method when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing - the last few bytes. -
                                                    3. Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you've buffered and returns them in a single string. -
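If you want to see the "feed it a little at a time" idea in action, here is a small sketch. One loud assumption: it uses html.parser.HTMLParser from the Python 3 standard library as a stand-in for the sgmllib-based BaseHTMLProcessor, and the TextCollector class and parse_in_chunks function are hypothetical names made up for the illustration.

```python
from html.parser import HTMLParser

class TextCollector(HTMLParser):
    """Stand-in for BaseHTMLProcessor: buffers text pieces, joins on output()."""
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_data(self, data):
        self.pieces.append(data)

    def output(self):
        return "".join(self.pieces)

def parse_in_chunks(htmlSource, chunkSize):
    parser = TextCollector()
    # feed can be called as often as you like; the parser keeps its own buffer
    for start in range(0, len(htmlSource), chunkSize):
        parser.feed(htmlSource[start:start + chunkSize])
    parser.close()  # flush whatever is still sitting in the internal buffer
    return parser.output()

htmlSource = "<html><body><p>Hello, <b>world</b>!</p></body></html>"
allAtOnce = parse_in_chunks(htmlSource, len(htmlSource))
inPieces = parse_in_chunks(htmlSource, 8)
print(allAtOnce == inPieces)
```

Even though an 8-character chunk regularly cuts a tag in half, the parser holds the incomplete piece until the rest arrives, so both calls produce the same text.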

                                                      And just like that, you've “translated” a web page, given nothing but a URL and the name of a dialect. -

                                                      -

                                                      Further reading

                                                      -
                                                        -
                                                      • You thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer. Unfortunately, source code does not appear to be available. -
                                                      -

                                                      8.10. Summary

                                                      -

                                                      Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways. -

                                                      -
                                                        -
                                                      • parsing the HTML looking for something specific - -
                                                      • aggregating the results, like the URL lister -
                                                      • altering the structure along the way, like the attribute quoter -
                                                      • transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer -
                                                      -

                                                      Along with these examples, you should be comfortable doing all of the following things: -

                                                      - -


                                                      -
                                                      -

                                                      [1] The technical term for a parser like SGMLParser is a consumer: it consumes HTML and breaks it down. Presumably, the name feed was chosen to fit into the whole “consumer” motif. Personally, it makes me think of an exhibit in the zoo where there's just a dark cage with no trees or plants or - evidence of life of any kind, but if you stand perfectly still and look really closely you can make out two beady eyes staring - back at you from the far left corner, but you convince yourself that that's just your mind playing tricks on you, and the - only way you can tell that the whole thing isn't just an empty cage is a small innocuous sign on the railing that reads, “Do not feed the parser.” But maybe that's just me. In any event, it's an interesting mental image. -

                                                      -

                                                      [2] The reason Python is better at lists than strings is that lists are mutable but strings are immutable. This means that appending to a list - just adds the element and updates the index. Since strings cannot be changed after they are created, code like s = s + newpiece will create an entirely new string out of the concatenation of the original and the new piece, then throw away the original - string. This involves a lot of expensive memory management, and the amount of effort involved increases as the string gets - longer, so doing s = s + newpiece in a loop is deadly. In technical terms, appending n items to a list is O(n), while appending n items to a string is O(n^2). -
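A quick sketch of the two approaches side by side, with hypothetical function names. Both return the same string; the difference described above is in how much copying happens along the way, and how much that matters in practice may depend on the Python implementation you are running.

```python
def build_with_concatenation(pieces):
    s = ""
    for piece in pieces:
        s = s + piece  # builds a brand-new string on every pass
    return s

def build_with_join(pieces):
    buffer = []
    for piece in pieces:
        buffer.append(piece)  # cheap: the list grows in place
    return "".join(buffer)    # one concatenation at the very end

pieces = ["<p>", "Hello", ", ", "world", "</p>"]
print(build_with_concatenation(pieces) == build_with_join(pieces))
```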

                                                      -

                                                      [3] I don't get out much. -

                                                      -

                                                      [4] All right, it's not that common a question. It's not up there with “What editor should I use to write Python code?” (answer: Emacs) or “Is Python better or worse than Perl?” (answer: “Perl is worse than Python because people wanted it worse.” -Larry Wall, 10/14/1998) But questions about HTML processing pop up in one form or another about once a month, and among those questions, this is a popular one. -

                                                      -

                                                      Chapter 9. XML Processing

                                                      -

                                                      9.1. Diving in

                                                      -

                                                      These next two chapters are about XML processing in Python. It would be helpful if you already knew what an XML document looks like: that it's made up of structured tags that form a hierarchy of elements, and so on. If this doesn't make -sense to you, there are many XML tutorials that can explain the basics. -

                                                      If you're not particularly interested in XML, you should still read these chapters, which cover important topics like Python packages, Unicode, command line arguments, and how to use getattr for method dispatching. -

                                                      Being a philosophy major is not required, although if you have ever had the misfortune of being subjected to the writings -of Immanuel Kant, you will appreciate the example program a lot more than if you majored in something useful, like computer -science. -

                                                      There are two basic ways to work with XML. One is called SAX (“Simple API for XML”), and it works by reading the XML a little bit at a time and calling a method for each element it finds. (If you read Chapter 8, HTML Processing, this should sound familiar, because that's how the sgmllib module works.) The other is called DOM (“Document Object Model”), and it works by reading in the entire XML document at once and creating an internal representation of it using native Python classes linked in a tree structure. Python has standard modules for both kinds of parsing, but this chapter will only deal with using the DOM. -
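To make the difference concrete, here is a small sketch that pulls the same information out of a tiny document both ways, using the standard xml.sax and xml.dom.minidom modules. The sample XML and the RefCollector class are made up for the illustration; they are not part of the Kant grammar files.

```python
import xml.sax
from xml.dom import minidom

# Made-up sample document for this sketch
xmlSource = "<grammar><ref id='a'><p>one</p></ref><ref id='b'><p>two</p></ref></grammar>"

# SAX: the parser calls a method for each element as it reads the input
class RefCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "ref":
            self.ids.append(attrs["id"])

handler = RefCollector()
xml.sax.parseString(xmlSource.encode("utf-8"), handler)
saxIds = handler.ids

# DOM: read the whole document into a tree first, then walk the tree
xmldoc = minidom.parseString(xmlSource)
domIds = [ref.attributes["id"].value
          for ref in xmldoc.getElementsByTagName("ref")]

print(saxIds, domIds)
```

With SAX you collect what you need while the parser is running; with the DOM you can go back and query the finished tree as many times as you like, which is what kgp.py relies on.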

                                                      The following is a complete Python program which generates pseudo-random output based on a context-free grammar defined in an XML format. Don't worry yet if you don't understand what that means; you'll examine both the program's input and its output -in more depth throughout these next two chapters. -

                                                      Example 9.1. kgp.py

                                                      -

                                                      If you have not already done so, you can download this and other examples used in this book. -

                                                      
                                                      -"""Kant Generator for Python
                                                      -
                                                      -Generates mock philosophy based on a context-free grammar
                                                      -
                                                      -Usage: python kgp.py [options] [source]
                                                      -
                                                      -Options:
                                                      -  -g ..., --grammar=...  use specified grammar file or URL
                                                      -  -h, --help              show this help
                                                      -  -d                      show debugging information while parsing
                                                      -
                                                      -Examples:
                                                      -  kgp.py                  generates several paragraphs of Kantian philosophy
                                                      -  kgp.py -g husserl.xml   generates several paragraphs of Husserl
                                                      -  kgp.py "<xref id='paragraph'/>"  generates a paragraph of Kant
                                                      -  kgp.py template.xml     reads from template.xml to decide what to generate
                                                      -"""
                                                      -from xml.dom import minidom
                                                      -import random
                                                      -import toolbox
                                                      -import sys
                                                      -import getopt
                                                      -
                                                      -_debug = 0
                                                      -
                                                      -class NoSourceError(Exception): pass
                                                      -
                                                      -class KantGenerator:
                                                      -    """generates mock philosophy based on a context-free grammar"""
                                                      -
                                                      -    def __init__(self, grammar, source=None):
                                                      -        self.loadGrammar(grammar)
                                                      -        self.loadSource(source and source or self.getDefaultSource())
                                                      -        self.refresh()
                                                      -
                                                      -    def _load(self, source):
                                                      -        """load XML input source, return parsed XML document
                                                      -
                                                      -        - a URL of a remote XML file ("http://diveintopython3.org/kant.xml")
                                                      -        - a filename of a local XML file ("~/diveintopython3/common/py/kant.xml")
                                                      -        - standard input ("-")
                                                      -        - the actual XML document, as a string
                                                      -        """
                                                      -        sock = toolbox.openAnything(source)
                                                      -        xmldoc = minidom.parse(sock).documentElement
                                                      -        sock.close()
                                                      -        return xmldoc
                                                      -
                                                      -    def loadGrammar(self, grammar):       
                                                      -        """load context-free grammar"""   
                                                      -        self.grammar = self._load(grammar)
                                                      -        self.refs = {}  
                                                      -        for ref in self.grammar.getElementsByTagName("ref"):
                                                      -            self.refs[ref.attributes["id"].value] = ref     
                                                      -
                                                      -    def loadSource(self, source):
                                                      -        """load source"""
                                                      -        self.source = self._load(source)
                                                      -
                                                      -    def getDefaultSource(self):
                                                      -        """guess default source of the current grammar
                                                      -        
                                                      -        The default source will be one of the <ref>s that is not
                                                      -        cross-referenced. This sounds complicated but it's not.
                                                      -        Example: The default source for kant.xml is
                                                      -        "<xref id='section'/>", because 'section' is the one <ref>
                                                      -        that is not <xref>'d anywhere in the grammar.
                                                      -        In most grammars, the default source will produce the
                                                      -        longest (and most interesting) output.
                                                      -        """
                                                      -        xrefs = {}
                                                      -        for xref in self.grammar.getElementsByTagName("xref"):
                                                      -            xrefs[xref.attributes["id"].value] = 1
                                                      -        xrefs = xrefs.keys()
                                                      -        standaloneXrefs = [e for e in self.refs.keys() if e not in xrefs]
                                                      -        if not standaloneXrefs:
                                                      -            raise NoSourceError("can't guess source, and no source specified")
                                                      -        return '<xref id="%s"/>' % random.choice(standaloneXrefs)
                                                      -        
                                                      -    def reset(self):
                                                      -        """reset parser"""
                                                      -        self.pieces = []
                                                      -        self.capitalizeNextWord = 0
                                                      -
                                                      -    def refresh(self):
                                                      -        """reset output buffer, re-parse entire source file, and return output
                                                      -        
                                                      -        Since parsing involves a good deal of randomness, this is an
                                                      -        easy way to get new output without having to reload a grammar file
                                                      -        each time.
                                                      -        """
                                                      -        self.reset()
                                                      -        self.parse(self.source)
                                                      -        return self.output()
                                                      -
                                                      -    def output(self):
                                                      -        """output generated text"""
                                                      -        return "".join(self.pieces)
                                                      -
                                                      -    def randomChildElement(self, node):
                                                      -        """choose a random child element of a node
                                                      -        
                                                      -        This is a utility method used by do_xref and do_choice.
                                                      -        """
                                                      -        choices = [e for e in node.childNodes
                                                      -                   if e.nodeType == e.ELEMENT_NODE]
                                                      -        chosen = random.choice(choices)            
                                                      -        if _debug:               
                                                      -            sys.stderr.write('%s available choices: %s\n' % \
                                                      -                (len(choices), [e.toxml() for e in choices]))
                                                      -            sys.stderr.write('Chosen: %s\n' % chosen.toxml())
                                                      -        return chosen            
                                                      -
                                                      -    def parse(self, node):         
                                                      -        """parse a single XML node
                                                      -        
                                                      -        A parsed XML document (from minidom.parse) is a tree of nodes
                                                      -        of various types. Each node is represented by an instance of the
                                                      -        corresponding Python class (Element for a tag, Text for
                                                      -        text data, Document for the top-level document). The following
                                                      -        statement constructs the name of a class method based on the type
                                                      -        of node we're parsing ("parse_Element" for an Element node,
                                                      -        "parse_Text" for a Text node, etc.) and then calls the method.
                                                      -        """
                                                      -        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
                                                      -        parseMethod(node)
                                                      -
                                                      -    def parse_Document(self, node):
                                                      -        """parse the document node
                                                      -        
                                                      -        The document node by itself isn't interesting (to us), but
                                                      -        its only child, node.documentElement, is: it's the root node
                                                      -        of the grammar.
                                                      -        """
                                                      -        self.parse(node.documentElement)
                                                      -
                                                      -    def parse_Text(self, node):    
                                                      -        """parse a text node
                                                      -        
                                                      -        The text of a text node is usually added to the output buffer
                                                      -        verbatim. The one exception is that <p class='sentence'> sets
                                                      -        a flag to capitalize the first letter of the next word. If
                                                      -        that flag is set, we capitalize the text and reset the flag.
                                                      -        """
                                                      -        text = node.data
                                                      -        if self.capitalizeNextWord:
                                                      -            self.pieces.append(text[0].upper())
                                                      -            self.pieces.append(text[1:])
                                                      -            self.capitalizeNextWord = 0
                                                      -        else:
                                                      -            self.pieces.append(text)
                                                      -
                                                      -    def parse_Element(self, node): 
                                                      -        """parse an element
                                                      -        
                                                      -        An XML element corresponds to an actual tag in the source:
                                                      -        <xref id='...'>, <p chance='...'>, <choice>, etc.
                                                      -        Each element type is handled in its own method. Like we did in
                                                      -        parse(), we construct a method name based on the name of the
                                                      -        element ("do_xref" for an <xref> tag, etc.) and
                                                      -        call the method.
                                                      -        """
                                                      -        handlerMethod = getattr(self, "do_%s" % node.tagName)
                                                      -        handlerMethod(node)
                                                      -
                                                      -    def parse_Comment(self, node):
                                                      -        """parse a comment
                                                      -        
                                                      -        The grammar can contain XML comments, but we ignore them
                                                      -        """
                                                      -        pass
                                                      -    
                                                      -    def do_xref(self, node):
                                                      -        """handle <xref id='...'> tag
                                                      -        
                                                      -        An <xref id='...'> tag is a cross-reference to a <ref id='...'>
                                                      -        tag. <xref id='sentence'/> evaluates to a randomly chosen child of
                                                      -        <ref id='sentence'>.
                                                      -        """
                                                      -        id = node.attributes["id"].value
                                                      -        self.parse(self.randomChildElement(self.refs[id]))
                                                      -
                                                      -    def do_p(self, node):
                                                      -        """handle <p> tag
                                                      -        
                                                      -        The <p> tag is the core of the grammar. It can contain almost
                                                      -        anything: freeform text, <choice> tags, <xref> tags, even other
                                                      -        <p> tags. If a "class='sentence'" attribute is found, a flag
                                                      -        is set and the next word will be capitalized. If a "chance='X'"
                                                      -        attribute is found, there is an X% chance that the tag will be
                                                      -        evaluated (and therefore a (100-X)% chance that it will be
                                                      -        completely ignored)
                                                      -        """
                                                      -        keys = node.attributes.keys()
                                                      -        if "class" in keys:
                                                      -            if node.attributes["class"].value == "sentence":
                                                      -                self.capitalizeNextWord = 1
                                                      -        if "chance" in keys:
                                                      -            chance = int(node.attributes["chance"].value)
                                                      -            doit = (chance > random.randrange(100))
                                                      -        else:
                                                      -            doit = 1
                                                      -        if doit:
                                                      -            for child in node.childNodes: self.parse(child)
                                                      -
                                                      -    def do_choice(self, node):
                                                      -        """handle <choice> tag
                                                      -        
                                                      -        A <choice> tag contains one or more <p> tags. One <p> tag
                                                      -        is chosen at random and evaluated; the rest are ignored.
                                                      -        """
                                                      -        self.parse(self.randomChildElement(node))
                                                      -
                                                      -def usage():
                                                      -    print(__doc__)
                                                      -
                                                      -def main(argv):       
                                                      -    grammar = "kant.xml"                
                                                      -    try:              
                                                      -        opts, args = getopt.getopt(argv, "hg:d", ["help", "grammar="])
                                                      -    except getopt.GetoptError:          
                                                      -        usage()       
                                                      -        sys.exit(2)   
                                                      -    for opt, arg in opts:               
                                                      -        if opt in ("-h", "--help"):     
                                                      -            usage()   
                                                      -            sys.exit()
                                                      -        elif opt == '-d':               
                                                      -            global _debug               
                                                      -            _debug = 1
                                                      -        elif opt in ("-g", "--grammar"):
                                                      -            grammar = arg               
                                                      -    
                                                      -    source = "".join(args)              
                                                      -
                                                      -    k = KantGenerator(grammar, source)
                                                      -    print(k.output())
                                                      -
                                                      -if __name__ == "__main__":
                                                      -    main(sys.argv[1:])
                                                      -
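The getattr dispatching in parse can be tried out in isolation before you dig into the rest of the program. This is a stripped-down sketch, not a piece of kgp.py: the NodeLister class and the sample XML are invented here, and the only thing borrowed from the listing is the trick of building a method name from the node's class name.

```python
from xml.dom import minidom

class NodeLister:
    """Hypothetical class that records what it sees instead of generating text."""
    def __init__(self):
        self.seen = []

    def parse(self, node):
        # "parse_Element" for an Element node, "parse_Text" for a Text node, etc.
        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
        parseMethod(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Element(self, node):
        self.seen.append("element:%s" % node.tagName)
        for child in node.childNodes:
            self.parse(child)

    def parse_Text(self, node):
        self.seen.append("text:%s" % node.data)

    def parse_Comment(self, node):
        pass  # comments are ignored, as in the listing above

lister = NodeLister()
lister.parse(minidom.parseString("<p>Hello <b>world</b><!-- note --></p>"))
print(lister.seen)
```

Supporting a new node type means adding one more parse_ method; the parse method itself never changes.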

                                                      Example 9.2. toolbox.py

                                                      
                                                      -"""Miscellaneous utility functions"""
                                                      -
                                                      -def openAnything(source):            
                                                      -    """URI, filename, or string --> stream
                                                      -
                                                      -    This function lets you define parsers that take any input source
                                                      -    (URL, pathname to local or network file, or actual data as a string)
                                                      -    and deal with it in a uniform manner. Returned object is guaranteed
                                                      -    to have all the basic stdio read methods (read, readline, readlines).
                                                      -    Just .close() the object when you're done with it.
                                                      -    
                                                      -    Examples:
                                                      -    >>> from xml.dom import minidom
                                                      -    >>> sock = openAnything("http://localhost/kant.xml")
                                                      -    >>> doc = minidom.parse(sock)
                                                      -    >>> sock.close()
                                                      -    >>> sock = openAnything("c:\\inetpub\\wwwroot\\kant.xml")
                                                      -    >>> doc = minidom.parse(sock)
                                                      -    >>> sock.close()
                                                      -    >>> sock = openAnything("<ref id='conjunction'><text>and</text><text>or</text></ref>")
                                                      -    >>> doc = minidom.parse(sock)
                                                      -    >>> sock.close()
                                                      -    """
                                                      -    if hasattr(source, "read"):
                                                      -        return source
                                                      -
                                                      -    if source == '-':
                                                      -        import sys
                                                      -        return sys.stdin
                                                      -
                                                      -    # try to open with urllib (if source is http, ftp, or file URL)
                                                      -    import urllib       
                                                      -    try:                
                                                      -        return urllib.urlopen(source)     
                                                      -    except (IOError, OSError):            
                                                      -        pass            
                                                      -    
                                                      -    # try to open with native open function (if source is pathname)
                                                      -    try:                
                                                      -        return open(source)               
                                                      -    except (IOError, OSError):            
                                                      -        pass            
                                                      -    
                                                      -    # treat source as string
                                                      -    import StringIO     
                                                      -    return StringIO.StringIO(str(source)) 
                                                      -

                                                      Run the program kgp.py by itself, and it will parse the default XML-based grammar, in kant.xml, and print several paragraphs' worth of philosophy in the style of Immanuel Kant. -

                                                      Example 9.3. Sample output of kgp.py

                                                      [you@localhost kgp]$ python kgp.py
                                                      -     As is shown in the writings of Hume, our a priori concepts, in
                                                      -reference to ends, abstract from all content of knowledge; in the study
                                                      -of space, the discipline of human reason, in accordance with the
                                                      -principles of philosophy, is the clue to the discovery of the
                                                      -Transcendental Deduction. The transcendental aesthetic, in all
                                                      -theoretical sciences, occupies part of the sphere of human reason
                                                      -concerning the existence of our ideas in general; still, the
                                                      -never-ending regress in the series of empirical conditions constitutes
                                                      -the whole content for the transcendental unity of apperception. What
                                                      -we have alone been able to show is that, even as this relates to the
                                                      -architectonic of human reason, the Ideal may not contradict itself, but
                                                      -it is still possible that it may be in contradictions with the
                                                      -employment of the pure employment of our hypothetical judgements, but
                                                      -natural causes (and I assert that this is the case) prove the validity
                                                      -of the discipline of pure reason. As we have already seen, time (and
                                                      -it is obvious that this is true) proves the validity of time, and the
                                                      -architectonic of human reason, in the full sense of these terms,
                                                      -abstracts from all content of knowledge. I assert, in the case of the
                                                      -discipline of practical reason, that the Antinomies are just as
                                                      -necessary as natural causes, since knowledge of the phenomena is a
                                                      -posteriori.
                                                      -    The discipline of human reason, as I have elsewhere shown, is by
                                                      -its very nature contradictory, but our ideas exclude the possibility of
                                                      -the Antinomies. We can deduce that, on the contrary, the pure
                                                      -employment of philosophy, on the contrary, is by its very nature
                                                      -contradictory, but our sense perceptions are a representation of, in
                                                      -the case of space, metaphysics. The thing in itself is a
                                                      -representation of philosophy. Applied logic is the clue to the
                                                      -discovery of natural causes. However, what we have alone been able to
                                                      -show is that our ideas, in other words, should only be used as a canon
                                                      -for the Ideal, because of our necessary ignorance of the conditions.
                                                      -
                                                      -[...snip...]

                                                      This is, of course, complete gibberish. Well, not complete gibberish. It is syntactically and grammatically correct (although -very verbose -- Kant wasn't what you would call a get-to-the-point kind of guy). Some of it may actually be true (or at least -the sort of thing that Kant would have agreed with), some of it is blatantly false, and most of it is simply incoherent. -But all of it is in the style of Immanuel Kant. -

                                                      Let me repeat that this is much, much funnier if you are now or have ever been a philosophy major. -

                                                      The interesting thing about this program is that there is nothing Kant-specific about it. All the content in the previous -example was derived from the grammar file, kant.xml. If you tell the program to use a different grammar file (which you can specify on the command line), the output will be -completely different. -

                                                      Example 9.4. Simpler output from kgp.py

                                                      [you@localhost kgp]$ python kgp.py -g binary.xml
                                                      -00101001
                                                      -[you@localhost kgp]$ python kgp.py -g binary.xml
                                                      -10110100

                                                      You will take a closer look at the structure of the grammar file later in this chapter. For now, all you need to know is -that the grammar file defines the structure of the output, and the kgp.py program reads through the grammar and makes random decisions about which words to plug in where.
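The idea of "reading through the grammar and making random decisions" can be sketched in a few lines. This is a minimal illustration written for Python 3, not the actual kgp.py code: the expand function and the inline GRAMMAR string are assumptions made for this example, and the grammar is a copy of binary.xml with the byte definition kept on one line.

```python
import random
from xml.dom import minidom

# Inline copy of binary.xml (assumption: embedded here so the sketch is self-contained).
GRAMMAR = """<?xml version="1.0"?>
<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""


def expand(doc, ref_id):
    """Return one random expansion of the <ref> whose id is ref_id (hypothetical helper)."""
    refs = {ref.getAttribute("id"): ref for ref in doc.getElementsByTagName("ref")}
    # Each <p> directly under the <ref> is one possible choice.
    choices = [node for node in refs[ref_id].childNodes
               if node.nodeType == node.ELEMENT_NODE and node.tagName == "p"]
    chosen = random.choice(choices)
    parts = []
    for child in chosen.childNodes:
        if child.nodeType == child.TEXT_NODE:
            parts.append(child.data)  # literal text is copied as-is
        elif child.nodeType == child.ELEMENT_NODE and child.tagName == "xref":
            parts.append(expand(doc, child.getAttribute("id")))  # follow the cross-reference
    return "".join(parts)


doc = minidom.parseString(GRAMMAR)
byte = expand(doc, "byte")
print(byte)  # eight random binary digits, different on each run
```

Swapping in a different grammar string changes the output completely, which is the point the text makes about kant.xml and binary.xml.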

                                                      9.2. Packages

                                                      Actually parsing an XML document is very simple: one line of code. However, before you get to that line of code, you need to take a short detour to talk about packages. @@ -3707,111 +2085,6 @@ areas simultaneously). package architecture. It's one of the many things Python is good at, so take advantage of it.

                                                      9.3. Parsing XML

                                                      As I was saying, actually parsing an XML document is very simple: one line of code. Where you go from there is up to you. -

                                                      Example 9.8. Loading an XML document (for real this time)

                                                      ->>> from xml.dom import minidom      
                                                      ->>> xmldoc = minidom.parse('~/diveintopython3/common/py/kgp/binary.xml')  
                                                      ->>> xmldoc         
                                                      -<xml.dom.minidom.Document instance at 010BE87C>
                                                      ->>> print xmldoc.toxml()             
                                                      -<?xml version="1.0" ?>
                                                      -<grammar>
                                                      -<ref id="bit">
                                                      -  <p>0</p>
                                                      -  <p>1</p>
                                                      -</ref>
                                                      -<ref id="byte">
                                                      -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                      -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                      -</ref>
                                                      -</grammar>
                                                      -
                                                        -
                                                      1. As you saw in the previous section, this imports the minidom module from the xml.dom package. -
                                                      2. Here is the one line of code that does all the work: minidom.parse takes one argument and returns a parsed representation of the XML document. The argument can be many things; in this case, it's simply a filename of an XML document on my local disk. (To follow along, you'll need to change the path to point to your downloaded examples directory.) - But you can also pass a file object, or even a file-like object. You'll take advantage of this flexibility later in this chapter. -
                                                      3. The object returned from minidom.parse is a Document object, a descendant of the Node class. This Document object is the root level of a complex tree-like structure of interlocking Python objects that completely represent the XML document you passed to minidom.parse. -
                                                      4. toxml is a method of the Node class (and is therefore available on the Document object you got from minidom.parse). toxml returns, as a string, the XML that this Node represents; the print statement is what displays it. For the Document node, that string is the entire XML document. -
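The same steps can be tried without a file on disk. The following is a small sketch for Python 3, assuming a shortened grammar held in a string (XML_TEXT is a name made up for this example). It uses minidom.parseString for text and minidom.parse with an in-memory file-like object, which illustrates the flexibility mentioned in note 2.

```python
import io
from xml.dom import minidom

# Assumption: a shortened grammar kept in a string instead of binary.xml on disk.
XML_TEXT = '<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>'

# parseString takes the XML text itself rather than a filename.
xmldoc = minidom.parseString(XML_TEXT)

# parse also accepts an open file or file-like object; BytesIO stands in for a file here.
same_doc = minidom.parse(io.BytesIO(XML_TEXT.encode("utf-8")))

# toxml returns a string; nothing is displayed until you print it.
root_xml = xmldoc.documentElement.toxml()
print(root_xml)
print(same_doc.documentElement.toxml() == root_xml)
```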

                                                        Now that you have an XML document in memory, you can start traversing through it. -

                                                        Example 9.9. Getting child nodes

                                                        ->>> xmldoc.childNodes    
                                                        -[<DOM Element: grammar at 17538908>]
                                                        ->>> xmldoc.childNodes[0] 
                                                        -<DOM Element: grammar at 17538908>
                                                        ->>> xmldoc.firstChild    
                                                        -<DOM Element: grammar at 17538908>
                                                        -
                                                          -
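                                                        1. Every Node has a childNodes attribute, which is a list of that node's child Node objects. In this example the Document has only one child node, the root element of the XML document (in this case, the grammar element). A document that starts with a DOCTYPE declaration or a top-level comment has additional child nodes, so xmldoc.documentElement is the more dependable way to reach the root element. -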
                                                        2. To get the first (and in this case, the only) child node, just use regular list syntax. Remember, there is nothing special - going on here; this is just a regular Python list of regular Python objects. -
                                                        3. Since getting the first child node of a node is a useful and common activity, the Node class has a firstChild attribute, which is synonymous with childNodes[0]. (There is also a lastChild attribute, which is synonymous with childNodes[-1].) -
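As a quick check of the three notes above, here is a minimal Python 3 sketch. The tiny inline document is an assumption made for the example; it has no DOCTYPE and no whitespace between tags, so the child lists contain only elements.

```python
from xml.dom import minidom

# Assumption: a tiny document with no DOCTYPE and no whitespace between tags.
xmldoc = minidom.parseString('<grammar><ref id="bit"/><ref id="byte"/></grammar>')

root = xmldoc.childNodes[0]                      # plain list indexing
same_root = xmldoc.firstChild is root            # firstChild is childNodes[0]
via_document_element = xmldoc.documentElement is root

refs = root.childNodes
first_matches = root.firstChild is refs[0]
last_matches = root.lastChild is refs[-1]        # lastChild is childNodes[-1]

print(len(xmldoc.childNodes), len(refs))
```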

                                                          Example 9.10. toxml works on any node

                                                          ->>> grammarNode = xmldoc.firstChild
                                                          ->>> print grammarNode.toxml() 
                                                          -<grammar>
                                                          -<ref id="bit">
                                                          -  <p>0</p>
                                                          -  <p>1</p>
                                                          -</ref>
                                                          -<ref id="byte">
                                                          -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                          -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                          -</ref>
                                                          -</grammar>
                                                          -
                                                            -
                                                          1. Since the toxml method is defined in the Node class, it is available on any XML node, not just the Document element. -

                                                            Example 9.11. Child nodes can be text

                                                            ->>> grammarNode.childNodes
                                                            -[<DOM Text node "\n">, <DOM Element: ref at 17533332>, \
                                                            -<DOM Text node "\n">, <DOM Element: ref at 17549660>, <DOM Text node "\n">]
                                                            ->>> print grammarNode.firstChild.toxml()    
                                                            -
                                                            -
                                                            -
                                                            ->>> print grammarNode.childNodes[1].toxml() 
                                                            -<ref id="bit">
                                                            -  <p>0</p>
                                                            -  <p>1</p>
                                                            -</ref>
                                                            ->>> print grammarNode.childNodes[3].toxml() 
                                                            -<ref id="byte">
                                                            -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                            -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                            -</ref>
                                                            ->>> print grammarNode.lastChild.toxml()     
                                                            -
                                                            -
                                                            -
                                                            -
                                                              -
                                                            1. Looking at the XML in binary.xml, you might think that the grammar has only two child nodes, the two ref elements. But you're missing something: the carriage returns! After the '<grammar>' and before the first '<ref>' is a carriage return, and this text counts as a child node of the grammar element. Similarly, there is a carriage return after each '</ref>'; these also count as child nodes. So grammarNode.childNodes is actually a list of 5 objects: 3 Text objects and 2 Element objects. -
                                                            2. The first child is a Text object representing the carriage return after the '<grammar>' tag and before the first '<ref>' tag. -
                                                            3. The second child is an Element object representing the first ref element. -
                                                            4. The fourth child is an Element object representing the second ref element. -
                                                            5. The last child is a Text object representing the carriage return after the '</ref>' end tag and before the '</grammar>' end tag. -
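If the whitespace-only Text nodes get in the way, one common approach is to filter on nodeType. The sketch below is for Python 3 and uses a trimmed-down inline grammar (an assumption for this example) that keeps the line breaks between tags.

```python
from xml.dom import minidom

# Assumption: a trimmed-down grammar that keeps the line breaks between tags.
xml_text = '<grammar>\n<ref id="bit"/>\n<ref id="byte"/>\n</grammar>'
grammar_node = minidom.parseString(xml_text).documentElement

# Text nodes report the node name '#text'; elements report their tag name.
kinds = [child.nodeName for child in grammar_node.childNodes]
print(kinds)

# Keep only the Element children and skip the whitespace-only Text nodes.
elements = [child for child in grammar_node.childNodes
            if child.nodeType == child.ELEMENT_NODE]
ids = [element.getAttribute("id") for element in elements]
print(ids)
```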

                                                              Example 9.12. Drilling down all the way to text

                                                              ->>> grammarNode
                                                              -<DOM Element: grammar at 19167148>
                                                              ->>> refNode = grammarNode.childNodes[1] 
                                                              ->>> refNode
                                                              -<DOM Element: ref at 17987740>
                                                              ->>> refNode.childNodes
                                                              -[<DOM Text node "\n">, <DOM Text node "  ">, <DOM Element: p at 19315844>, \
                                                              -<DOM Text node "\n">, <DOM Text node "  ">, \
                                                              -<DOM Element: p at 19462036>, <DOM Text node "\n">]
                                                              ->>> pNode = refNode.childNodes[2]
                                                              ->>> pNode
                                                              -<DOM Element: p at 19315844>
                                                              ->>> print pNode.toxml()                 
                                                              -<p>0</p>
                                                              ->>> pNode.firstChild  
                                                              -<DOM Text node "0">
                                                              ->>> pNode.firstChild.data               
                                                              -u'0'
                                                              -
                                                                -
                                                              1. As you saw in the previous example, the first ref element is grammarNode.childNodes[1], since childNodes[0] is a Text node for the carriage return. -
                                                              2. The ref element has its own set of child nodes, one for the carriage return, a separate one for the spaces, one for the p element, and so forth. -
                                                              3. You can even use the toxml method here, deeply nested within the document. -
                                                              4. The p element has only one child node (you can't tell that from this example, but look at pNode.childNodes if you don't believe me), and it is a Text node for the single character '0'. -
                                                              5. The .data attribute of a Text node gives you the actual string that the text node represents. But what is that 'u' in front of the string? The answer to that deserves its own section. - @@ -3823,411 +2096,8 @@ u'0' - - - -
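For comparison, here is the same drill-down as a minimal sketch for Python 3, where every string is already Unicode and no u prefix appears. The element_children helper and the inline grammar are assumptions made for this example.

```python
from xml.dom import minidom

# Assumption: the 'bit' part of binary.xml, embedded as a string.
xml_text = '<grammar>\n<ref id="bit">\n  <p>0</p>\n  <p>1</p>\n</ref>\n</grammar>'
grammar_node = minidom.parseString(xml_text).documentElement


def element_children(node):
    """Return only the Element children of node (hypothetical helper)."""
    return [child for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE]


ref_node = element_children(grammar_node)[0]   # the first ref element
p_node = element_children(ref_node)[0]         # the first p element inside it
text_node = p_node.firstChild                  # the only child of that p element
value = text_node.data                         # the actual string
print(repr(value))
```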

                                                                Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which -you can customize. -

                                                                Example 9.15. sitecustomize.py

                                                                
                                                                -# sitecustomize.py 
                                                                -# this file can be anywhere in your Python path,
                                                                -# but it usually goes in ${pythondir}/lib/site-packages/
                                                                -import sys
                                                                -sys.setdefaultencoding('iso-8859-1') 
                                                                -
                                                                -
                                                                  -
                                                                1. sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically. As the comment mentions, it can go anywhere - (as long as import can find it), but it usually goes in the site-packages directory within your Python lib directory. -
                                                                2. The setdefaultencoding function sets, well, the default encoding. This is the encoding scheme that Python will try to use whenever it needs to auto-coerce a unicode string into a regular string. -

                                                                  Example 9.16. Effects of setting the default encoding

                                                                  ->>> import sys
                                                                  ->>> sys.getdefaultencoding() 
                                                                  -'iso-8859-1'
                                                                  ->>> s = u'La Pe\xf1a'
                                                                  ->>> print s
                                                                  -La Peña
                                                                  -
                                                                    -
                                                                  1. This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even - call sys.setdefaultencoding after Python has started up. Dig into site.py and search for “setdefaultencoding” to find out how.) -
                                                                  2. Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it. - - - - - -(More Unicode stuff was here) - - - - - - - -
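The examples above depend on Python 2 behavior. As a hedged sketch of the same idea in Python 3, where str is always Unicode and sys.setdefaultencoding is not available, you choose the encoding explicitly at the point where text becomes bytes. The variable names here are made up for the example.

```python
s = "La Pe\xf1a"

# Encoding is explicit: choose iso-8859-1 where the text is turned into bytes.
latin1_bytes = s.encode("iso-8859-1")
round_trip = latin1_bytes.decode("iso-8859-1")

# ASCII cannot represent the n-with-tilde character, so this conversion fails.
try:
    s.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(latin1_bytes, round_trip == s, ascii_failed)
```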

                                                                    9.5. Searching for elements

                                                                    -

                                                                    Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within - your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName. -

                                                                    For this section, you'll be using the binary.xml grammar file, which looks like this: -

                                                                    Example 9.20. binary.xml

                                                                    <?xml version="1.0"?>
                                                                    -<!DOCTYPE grammar PUBLIC "-//diveintopython3.org//DTD Kant Generator Pro v1.0//EN" "kgp.dtd">
                                                                    -<grammar>
                                                                    -<ref id="bit">
                                                                    -  <p>0</p>
                                                                    -  <p>1</p>
                                                                    -</ref>
                                                                    -<ref id="byte">
                                                                    -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                                    -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                                    -</ref>
                                                                    -</grammar>

                                                                    It has two refs, 'bit' and 'byte'. A bit is either a '0' or '1', and a byte is 8 bits. -

                                                                    Example 9.21. Introducing getElementsByTagName

                                                                    ->>> from xml.dom import minidom
                                                                    ->>> xmldoc = minidom.parse('binary.xml')
                                                                    ->>> reflist = xmldoc.getElementsByTagName('ref') 
                                                                    ->>> reflist
                                                                    -[<DOM Element: ref at 136138108>, <DOM Element: ref at 136144292>]
                                                                    ->>> print reflist[0].toxml()
                                                                    -<ref id="bit">
                                                                    -  <p>0</p>
                                                                    -  <p>1</p>
                                                                    -</ref>
                                                                    ->>> print reflist[1].toxml()
                                                                    -<ref id="byte">
                                                                    -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                                    -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                                    -</ref>
                                                                    -
                                                                    -
                                                                      -
                                                                    1. getElementsByTagName takes one argument, the name of the element you wish to find. It returns a list of Element objects, corresponding to the XML elements that have that name. In this case, you find two ref elements. -
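Here is the same search as a small self-contained sketch for Python 3. The grammar is embedded as a string (an assumption, so that no file is needed), and getAttribute is used to show which ref elements were found.

```python
from xml.dom import minidom

# Assumption: binary.xml embedded as a string, with the byte definition on one line.
BINARY_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

# Find every ref element, however deeply it is nested in the document.
reflist = xmldoc.getElementsByTagName("ref")
ref_ids = [ref.getAttribute("id") for ref in reflist]
print(ref_ids)
```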

                                                                      Example 9.22. Every element is searchable

                                                                      ->>> firstref = reflist[0]    
                                                                      ->>> print firstref.toxml()
                                                                      -<ref id="bit">
                                                                      -  <p>0</p>
                                                                      -  <p>1</p>
                                                                      -</ref>
                                                                      ->>> plist = firstref.getElementsByTagName("p") 
                                                                      ->>> plist
                                                                      -[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>]
                                                                      ->>> print plist[0].toxml()   
                                                                      -<p>0</p>
                                                                      ->>> print plist[1].toxml()
                                                                      -<p>1</p>
                                                                      -
                                                                        -
                                                                      1. Continuing from the previous example, the first object in your reflist is the 'bit' ref element. -
                                                                      2. You can use the same getElementsByTagName method on this Element to find all the <p> elements within the 'bit' ref element. -
                                                                      3. Just as before, the getElementsByTagName method returns a list of all the elements it found. In this case, you have two, one for each bit. -
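The nested search above can be sketched as a standalone script. This is a minimal sketch for Python 3 rather than the Python 2 interpreter session shown in the example, and the BINARY_XML string is a trimmed stand-in for binary.xml (an assumption, so the script does not need the file on disk).

```python
from xml.dom import minidom

# Trimmed stand-in for binary.xml (assumption: the real file has more xref elements).
BINARY_XML = """<grammar>
<ref id="bit">
  <p>0</p>
  <p>1</p>
</ref>
<ref id="byte">
  <p><xref id="bit"/><xref id="bit"/></p>
</ref>
</grammar>"""

xmldoc = minidom.parseString(BINARY_XML)

# Search the whole document for ref elements, then search inside the first one.
reflist = xmldoc.getElementsByTagName("ref")
firstref = reflist[0]
plist = firstref.getElementsByTagName("p")

for p in plist:
    print(p.toxml())
```

Because the second search starts from firstref rather than xmldoc, only the p elements inside the 'bit' ref are returned.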

                                                                        Example 9.23. Searching is actually recursive

                                                                        ->>> plist = xmldoc.getElementsByTagName("p") 
                                                                        ->>> plist
                                                                        -[<DOM Element: p at 136140116>, <DOM Element: p at 136142172>, <DOM Element: p at 136146124>]
                                                                        ->>> plist[0].toxml()       
                                                                        -'<p>0</p>'
                                                                        ->>> plist[1].toxml()
                                                                        -'<p>1</p>'
                                                                        ->>> plist[2].toxml()       
                                                                        -'<p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                                        -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>'
                                                                        -
                                                                          -
                                                                        1. Note carefully the difference between this and the previous example. Previously, you were searching for p elements within firstref, but here you are searching for p elements within xmldoc, the root-level object that represents the entire XML document. This does find the p elements nested within the ref elements within the root grammar element. -
                                                                        2. The first two p elements are within the first ref (the 'bit' ref). -
                                                                        3. The last p element is the one within the second ref (the 'byte' ref). -
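To see the recursion for yourself, it can help to ask each p element which ref it lives in. The following is a rough Python 3 sketch, again using a shortened inline copy of the grammar (an assumption) instead of the real binary.xml.

```python
from xml.dom import minidom

# Shortened inline copy of the grammar (assumption, not the real binary.xml).
BINARY_XML = (
    '<grammar>'
    '<ref id="bit"><p>0</p><p>1</p></ref>'
    '<ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>'
    '</grammar>'
)
xmldoc = minidom.parseString(BINARY_XML)

# getElementsByTagName descends through every level of the document.
all_p = xmldoc.getElementsByTagName("p")
owners = [p.parentNode.getAttribute("id") for p in all_p]

# By contrast, grammar has no p elements among its direct children.
root = xmldoc.documentElement
direct_p = [node for node in root.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.tagName == "p"]

print(owners)
print(direct_p)
```

The owners list shows two p elements under 'bit' and one under 'byte', while direct_p is empty because every p sits one level further down.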

                                                                          9.6. Accessing element attributes

                                                                          -

                                                                          XML elements can have one or more attributes, and it is incredibly simple to access them once you have parsed an XML document. -

                                                                          For this section, you'll be using the binary.xml grammar file that you saw in the previous section. - - -
                                                                          Note: This section may be a little confusing, because of some overlapping terminology. Elements in an XML document have attributes, and Python objects also have attributes. When you parse an XML document, you get a bunch of Python objects that represent all the pieces of the XML document, and some of these Python objects represent attributes of the XML elements. But the (Python) objects that represent the (XML) attributes also have (Python) attributes, which are used to access various parts of the (XML) attribute that the object represents. I told you it was confusing. I am open to suggestions on how to distinguish these - more clearly. -

                                                                          Example 9.24. Accessing element attributes

                                                                          ->>> xmldoc = minidom.parse('binary.xml')
                                                                          ->>> reflist = xmldoc.getElementsByTagName('ref')
                                                                          ->>> bitref = reflist[0]
                                                                          ->>> print bitref.toxml()
                                                                          -<ref id="bit">
                                                                          -  <p>0</p>
                                                                          -  <p>1</p>
                                                                          -</ref>
                                                                          ->>> bitref.attributes          
                                                                          -<xml.dom.minidom.NamedNodeMap instance at 0x81e0c9c>
                                                                          ->>> bitref.attributes.keys()    
                                                                          -[u'id']
                                                                          ->>> bitref.attributes.values() 
                                                                          -[<xml.dom.minidom.Attr instance at 0x81d5044>]
                                                                          ->>> bitref.attributes["id"]    
                                                                          -<xml.dom.minidom.Attr instance at 0x81d5044>
                                                                          -
                                                                            -
                                                                          1. Each Element object has an attribute called attributes, which is a NamedNodeMap object. This sounds scary, but it's not, because a NamedNodeMap is an object that acts like a dictionary, so you already know how to use it. -
                                                                          2. Treating the NamedNodeMap as a dictionary, you can get a list of the names of the attributes of this element by using attributes.keys(). This element has only one attribute, 'id'. -
                                                                          3. Attribute names, like all other text in an XML document, are stored in unicode. -
                                                                          4. Again treating the NamedNodeMap as a dictionary, you can get a list of the values of the attributes by using attributes.values(). The values are themselves objects, of type Attr. You'll see how to get useful information out of this object in the next example. -
                                                                          5. Still treating the NamedNodeMap as a dictionary, you can access an individual attribute by name, using normal dictionary syntax. (Readers who have been - paying extra-close attention will already know how the NamedNodeMap class accomplishes this neat trick: by defining a __getitem__ special method. Other readers can take comfort in the fact that they don't need to understand how it works in order to use it effectively.) -
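The dictionary-like behaviour described above can be sketched in Python 3 as follows. This is a minimal sketch that assumes a one-ref inline document rather than binary.xml; the list() calls are there because in Python 3 keys() and values() may hand back views instead of lists.

```python
from xml.dom import minidom

# One-ref inline document (assumption, standing in for binary.xml).
xmldoc = minidom.parseString(
    '<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>')
bitref = xmldoc.getElementsByTagName("ref")[0]

attrs = bitref.attributes          # a NamedNodeMap
names = list(attrs.keys())         # attribute names
nodes = list(attrs.values())       # Attr objects
id_attr = attrs["id"]              # dictionary-style lookup by name

print(names, len(attrs), "id" in attrs)
```

The lookup by name and the values() list give you the same Attr object, which is what you would expect from a real dictionary.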

                                                                            Example 9.25. Accessing individual attributes

                                                                            ->>> a = bitref.attributes["id"]
                                                                            ->>> a
                                                                            -<xml.dom.minidom.Attr instance at 0x81d5044>
                                                                            ->>> a.name  
                                                                            -u'id'
                                                                            ->>> a.value 
                                                                            -u'bit'
                                                                            -
                                                                              -
                                                                            1. The Attr object completely represents a single XML attribute of a single XML element. The name of the attribute (the same name as you used to find this object in the bitref.attributes NamedNodeMap pseudo-dictionary) is stored in a.name. -
                                                                            2. The actual text value of this XML attribute is stored in a.value. - - -
                                                                              Note: Like a dictionary, attributes of an XML element have no ordering. Attributes may happen to be listed in a certain order in the original XML document, and the Attr objects may happen to be listed in a certain order when the XML document is parsed into Python objects, but these orders are arbitrary and should carry no special meaning. You should always access individual attributes - by name, like the keys of a dictionary. -
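Here is a short Python 3 sketch of by-name access that does not care about attribute order. The kind attribute is hypothetical, added only so the element has more than one attribute; getAttribute and hasAttribute are minidom's shortcuts for reading a value without touching the Attr object.

```python
from xml.dom import minidom

# The "kind" attribute is hypothetical; it is listed first on purpose.
xmldoc = minidom.parseString('<ref kind="demo" id="bit"><p>0</p></ref>')
ref = xmldoc.documentElement

# Look the attribute up by name, never by position.
id_attr = ref.attributes["id"]
print(id_attr.name, id_attr.value)

# getAttribute returns the value string directly, and an empty
# string when the element has no attribute of that name.
print(ref.getAttribute("kind"))
print(repr(ref.getAttribute("missing")))
print(ref.hasAttribute("missing"))
```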

                                                                              9.7. Segue

                                                                              -

                                                                              OK, that's it for the hard-core XML stuff. The next chapter will continue to use these same example programs, but focus on - other aspects that make the program more flexible: using streams for input processing, using getattr for method dispatching, and using command-line flags to allow users to reconfigure the program without changing the code. -

                                                                              Before moving on to the next chapter, you should be comfortable doing all of these things: -

                                                                              - -


                                                                              -
                                                                              -

                                                                              [5] This, sadly, is still an oversimplification. Unicode has now been extended to handle ancient Chinese, Korean, and Japanese texts, which had so - many different characters that the 2-byte Unicode system could not represent them all. But Python doesn't currently support that out of the box, and I don't know if there is a project afoot to add it. You've reached the - limits of my expertise, sorry.

                                                                              Chapter 10. Scripts and Streams

                                                                              -

                                                                              10.1. Abstracting input sources

                                                                              -

                                                                              One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object. -

                                                                              Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close -it when they're done. But they don't. Instead, they take a file-like object. -

                                                                              In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When -called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left -off and returns the next chunk of data. -

                                                                              This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on -disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply -calls the object's read method, the function can handle any kind of input source without specific code to handle each kind. -

                                                                              In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object. -
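To make the idea concrete, here is a minimal Python 3 sketch of a home-made file-like object. The ChunkSource class is hypothetical (it is not part of Python or of this book's code); all it offers is a read method with an optional size parameter, which is the whole contract described above.

```python
from xml.dom import minidom


class ChunkSource:
    """Hypothetical file-like object that serves a string through read()."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def read(self, size=-1):
        # No size (or a negative one) means "everything that is left".
        if size is None or size < 0:
            end = len(self._text)
        else:
            end = min(self._pos + size, len(self._text))
        chunk = self._text[self._pos:end]
        self._pos = end          # remember where the next read picks up
        return chunk


source = ChunkSource("abcdefgh")
print(source.read(3))    # first chunk
print(source.read(3))    # next chunk, picking up where the last one stopped
print(source.read())     # whatever is left

# minidom.parse only needs the read method, so it accepts this object too.
xmldoc = minidom.parse(ChunkSource("<grammar><ref id='bit'/></grammar>"))
print(xmldoc.documentElement.tagName)
```

Nothing in ChunkSource knows about XML, and nothing in minidom.parse knows about ChunkSource; the read method is the only thing they agree on.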

                                                                              Example 10.1. Parsing XML from a file

                                                                              ->>> from xml.dom import minidom
                                                                              ->>> fsock = open('binary.xml')    
                                                                              ->>> xmldoc = minidom.parse(fsock) 
                                                                              ->>> fsock.close()                 
                                                                              ->>> print xmldoc.toxml()          
                                                                              -<?xml version="1.0" ?>
                                                                              -<grammar>
                                                                              -<ref id="bit">
                                                                              -  <p>0</p>
                                                                              -  <p>1</p>
                                                                              -</ref>
                                                                              -<ref id="byte">
                                                                              -  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                                              -<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                                              -</ref>
                                                                              -</grammar>
                                                                              -
                                                                                -
                                                                              1. First, you open the file on disk. This gives you a file object. -
                                                                              2. You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk. -
                                                                              3. Be sure to call the close method of the file object after you're done with it. minidom.parse will not do this for you. -
                                                                              4. Calling the toxml() method on the returned XML document prints out the entire thing. -

                                                                                Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're -just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet. -

                                                                                Example 10.2. Parsing XML from a URL

                                                                                ->>> import urllib
                                                                                ->>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 
                                                                                ->>> xmldoc = minidom.parse(usock)            
                                                                                ->>> usock.close()          
                                                                                ->>> print xmldoc.toxml()   
                                                                                -<?xml version="1.0" ?>
                                                                                -<rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
                                                                                - xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                                                                                -
                                                                                -<channel>
                                                                                -<title>Slashdot</title>
                                                                                -<link>http://slashdot.org/</link>
                                                                                -<description>News for nerds, stuff that matters</description>
                                                                                -</channel>
                                                                                -
                                                                                -<image>
                                                                                -<title>Slashdot</title>
                                                                                -<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
                                                                                -<link>http://slashdot.org/</link>
                                                                                -</image>
                                                                                -
                                                                                -<item>
                                                                                -<title>To HDTV or Not to HDTV?</title>
                                                                                -<link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link>
                                                                                -</item>
                                                                                -
                                                                                -[...snip...]
                                                                                -
                                                                                  -
                                                                                1. As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page. -
                                                                                2. Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects. -
                                                                                3. As soon as you're done with it, be sure to close the file-like object that urlopen gives you. -
                                                                                4. By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site. -

                                                                                  Example 10.3. Parsing XML from a string (the easy but inflexible way)

                                                                                  ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                  ->>> xmldoc = minidom.parseString(contents) 
                                                                                  ->>> print xmldoc.toxml()
                                                                                  -<?xml version="1.0" ?>
                                                                                  -<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                                                                                  -
                                                                                    -
                                                                                  1. minidom has a function, parseString, which takes an entire XML document as a string and parses it. You can use this instead of minidom.parse if you know you already have your entire XML document in a string. -

                                                                                    OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a -file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying. -

                                                                                    If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO. -

                                                                                    Example 10.4. Introducing StringIO

                                                                                    ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                    ->>> import StringIO
                                                                                    ->>> ssock = StringIO.StringIO(contents)   
                                                                                    ->>> ssock.read()        
                                                                                    -"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                    ->>> ssock.read()        
                                                                                    -''
                                                                                    ->>> ssock.seek(0)       
                                                                                    ->>> ssock.read(15)      
                                                                                    -'<grammar><ref i'
                                                                                    ->>> ssock.read(15)
                                                                                    -"d='bit'><p>0</p"
                                                                                    ->>> ssock.read()
                                                                                    -'><p>1</p></ref></grammar>'
                                                                                    ->>> ssock.close()       
                                                                                    -
                                                                                      -
                                                                                    1. The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance. -
                                                                                    2. Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string. -
                                                                                    3. Calling read again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any - more without explicitly seeking to the beginning of the file. The StringIO object works the same way. -
                                                                                    4. You can explicitly seek to the beginning of the string, just like seeking through a file, by using the seek method of the StringIO object. -
                                                                                    5. You can also read the string in chunks, by passing a size parameter to the read method. -
                                                                                    6. At any time, read will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term -file-like object. -
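If you are on Python 3, the StringIO module no longer exists under that name; the closest equivalent is the StringIO class in the io module. The sketch below replays the same steps with io.StringIO (a swapped-in class, not the module used in the example above).

```python
import io

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
ssock = io.StringIO(contents)

whole = ssock.read()      # the entire string
empty = ssock.read()      # nothing left to read
ssock.seek(0)             # back to the beginning
first = ssock.read(15)    # first 15 characters
second = ssock.read(15)   # next 15 characters
rest = ssock.read()       # everything that remains
ssock.close()

print(first, second, rest, sep="\n")
```

The three chunks add up to the original string, just as they do in the Python 2 session.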

                                                                                      Example 10.5. Parsing XML from a string (the file-like object way)

                                                                                      ->>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                      ->>> ssock = StringIO.StringIO(contents)
                                                                                      ->>> xmldoc = minidom.parse(ssock) 
                                                                                      ->>> ssock.close()
                                                                                      ->>> print xmldoc.toxml()
                                                                                      -<?xml version="1.0" ?>
                                                                                      -<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                                                                                      -
                                                                                        -
                                                                                      1. Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string. -
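As a quick sanity check, you can confirm that both routes end up with the same document. This is a small Python 3 sketch that assumes io.StringIO in place of the StringIO module.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# Route 1: wrap the string in a file-like object and use minidom.parse.
with io.StringIO(contents) as ssock:
    from_stream = minidom.parse(ssock)

# Route 2: hand the string straight to minidom.parseString.
from_string = minidom.parseString(contents)

print(from_stream.toxml() == from_string.toxml())
```

The with statement closes the StringIO object for you, which takes the place of the explicit ssock.close() call.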

                                                                                        So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well. -

                                                                                        Example 10.6. openAnything

                                                                                        
                                                                                        -def openAnything(source):
                                                                                        -    # try to open with urllib (if source is http, ftp, or file URL)
                                                                                        -    import urllib       
                                                                                        -    try:                
                                                                                        -        return urllib.urlopen(source)      
                                                                                        -    except (IOError, OSError):            
                                                                                        -        pass            
                                                                                        -
                                                                                        -    # try to open with native open function (if source is pathname)
                                                                                        -    try:                
                                                                                        -        return open(source)                
                                                                                        -    except (IOError, OSError):            
                                                                                        -        pass            
                                                                                        -
                                                                                        -    # treat source as string
                                                                                        -    import StringIO     
                                                                                        -    return StringIO.StringIO(str(source))  
                                                                                        -
                                                                                          -
                                                                                        1. The openAnything function takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed. -
                                                                                        2. First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding. If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute. -
                                                                                        3. On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between different platforms, so you'd probably - get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors. -
                                                                                        4. By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that. (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.) -
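The fallback chain described in the notes above can be sketched in modern Python as follows. This is a Python 3 adaptation, not the book's code: the name open_anything, the use of urllib.request and io.StringIO in place of urllib and StringIO, the up-front str() conversion, and the exact exception classes caught are all assumptions made for illustration.

```python
import io
import urllib.request


def open_anything(source):
    """Return a file-like object for a URL, a local path, or literal data."""
    text = str(source)  # assumption: normalize to a string up front

    # 1. Try the source as a URL. urlopen raises ValueError when the string
    #    has no URL scheme, and URLError (a subclass of OSError) on failure.
    try:
        return urllib.request.urlopen(text)
    except (OSError, ValueError):
        pass

    # 2. Try the source as a path to a local file.
    try:
        return open(text)
    except OSError:
        pass

    # 3. Nothing else worked, so treat the source as the data itself.
    return io.StringIO(text)
```

Because every branch returns an object with a read method, the caller never needs to know which of the three cases applied.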

                                                                                          Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it. -

                                                                                          Example 10.7. Using openAnything in kgp.py

                                                                                          
                                                                                          -class KantGenerator:
                                                                                          -    def _load(self, source):
                                                                                          -        sock = toolbox.openAnything(source)
                                                                                          -        xmldoc = minidom.parse(sock).documentElement
                                                                                          -        sock.close()
                                                                                          -        return xmldoc

                                                                                          10.2. Standard input, output, and error

                                                                                          -

                                                                                          UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. -

                                                                                          Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.) -

                                                                                          Example 10.8. Introducing stdout and stderr

                                                                                          ->>> for i in range(3):
                                                                                          -...    print 'Dive in'             
                                                                                          -Dive in
                                                                                          -Dive in
                                                                                          -Dive in
                                                                                          ->>> import sys
                                                                                          ->>> for i in range(3):
                                                                                          -...    sys.stdout.write('Dive in') 
                                                                                          -Dive inDive inDive in
                                                                                          ->>> for i in range(3):
                                                                                          -...    sys.stderr.write('Dive in') 
                                                                                          -Dive inDive inDive in
                                                                                          -
                                                                                            -
                                                                                          1. As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. -
                                                                                          2. stdout is a file-like object; calling its write function will print out whatever string you give it. In fact, this is what the print statement really does; it adds a newline to the end of the string you're printing, and calls sys.stdout.write. -
                                                                                          3. In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add newlines for you; if you want them, add them yourself. -
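The difference between print and write can be checked without a terminal. The sketch below is written for Python 3, where print is a function, and uses an io.StringIO buffer as a stand-in for sys.stdout; the names buffer and captured are illustrative only.

```python
import io

buffer = io.StringIO()               # stands in for sys.stdout

for i in range(3):
    print('Dive in', file=buffer)    # print appends '\n' after each string

for i in range(3):
    buffer.write('Dive in')          # write emits exactly what it is given

captured = buffer.getvalue()
```

The first loop contributes three separate lines; the second contributes one run of text with no line breaks, matching the interactive session above.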

                                                                                            stdout and stderr are both file-like objects, like the ones you discussed in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file- or file-like object to them to redirect their output. -

                                                                                            Example 10.9. Redirecting output

                                                                                            -[you@localhost kgp]$ python stdout.py
                                                                                            -Dive in
                                                                                            -[you@localhost kgp]$ cat out.log
                                                                                            -This message will be logged instead of displayed

                                                                                            (On Windows, you can use type instead of cat to display the contents of a file.) -

                                                                                            If you have not already done so, you can download this and other examples used in this book. -

                                                                                            
                                                                                            -#stdout.py
                                                                                            -import sys
                                                                                            -
                                                                                            -print 'Dive in'      
                                                                                            -saveout = sys.stdout 
                                                                                            -fsock = open('out.log', 'w')           
                                                                                            -sys.stdout = fsock   
                                                                                            -print 'This message will be logged instead of displayed' 
                                                                                            -sys.stdout = saveout 
                                                                                            -fsock.close()        
                                                                                            -
                                                                                            -
                                                                                              -
                                                                                            1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line). -
                                                                                            2. Always save stdout before redirecting it, so you can set it back to normal later. -
                                                                                            3. Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten. -
                                                                                            4. Redirect all further output to the new file you just opened. -
                                                                                            5. This will be “printed” to the log file only; it will not be visible in the IDE window or on the screen. -
                                                                                            6. Set stdout back to the way it was before you mucked with it. -
                                                                                            7. Close the log file. -
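The save, redirect, restore sequence in the notes above can be wrapped so that stdout is restored even if the code in between fails. This is a minimal Python 3 sketch, not the book's script: run_redirected and greet are hypothetical names, and an in-memory io.StringIO stands in for out.log.

```python
import io
import sys


def run_redirected(func, target):
    """Call func() with sys.stdout temporarily pointing at target."""
    saveout = sys.stdout          # save the original before touching it
    sys.stdout = target           # all print output now goes to target
    try:
        func()
    finally:
        sys.stdout = saveout      # restore even if func() raises


def greet():
    print('This message will be logged instead of displayed')


log = io.StringIO()               # in-memory stand-in for out.log
run_redirected(greet, log)
```

In Python 3.4 and later, the standard library's contextlib.redirect_stdout offers the same save-and-restore behavior as a context manager.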

                                                                                              Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout. -

                                                                                              Example 10.10. Redirecting error information

                                                                                              -[you@localhost kgp]$ python stderr.py
                                                                                              -[you@localhost kgp]$ cat error.log
                                                                                              -Traceback (most recent call last):
                                                                                              -  File "stderr.py", line 5, in ?
                                                                                              -    raise Exception, 'this error will be logged'
                                                                                              -Exception: this error will be logged

                                                                                              If you have not already done so, you can download this and other examples used in this book. -

                                                                                              
                                                                                              -#stderr.py
                                                                                              -import sys
                                                                                              -
                                                                                              -fsock = open('error.log', 'w')               
                                                                                              -sys.stderr = fsock         
                                                                                              -raise Exception, 'this error will be logged'  
                                                                                              -
                                                                                              -
                                                                                                -
                                                                                              1. Open the log file where you want to store debugging information. -
                                                                                              2. Redirect standard error by assigning the file object of the newly-opened log file to stderr. -
                                                                                              3. Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log. -
                                                                                              4. Also note that you're not explicitly closing your log file, nor are you setting stderr back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for you, and it doesn't make any difference that stderr is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for stdout, if you expect to go do other stuff within the same script afterwards. -
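When the script needs to keep running after the error, stderr should be restored just like stdout. The following Python 3 sketch is one way to do that, under the assumption that the exception is caught and reported with traceback.print_exc (which writes to sys.stderr by default) rather than left to crash the program; errlog is an in-memory stand-in for error.log.

```python
import io
import sys
import traceback

saveerr = sys.stderr
errlog = io.StringIO()            # in-memory stand-in for error.log
sys.stderr = errlog
try:
    try:
        raise Exception('this error will be logged')
    except Exception:
        # print_exc writes the traceback to the current sys.stderr
        traceback.print_exc()
finally:
    sys.stderr = saveerr          # restore so later errors reach the screen

logged = errlog.getvalue()
```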

                                                                                                Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright. -

                                                                                                Example 10.11. Printing to stderr

                                                                                                ->>> print 'entering function'
                                                                                                -entering function
                                                                                                ->>> import sys
                                                                                                ->>> print >> sys.stderr, 'entering function' 
                                                                                                -entering function
                                                                                                -
                                                                                                -
                                                                                                  -
                                                                                                1. This shorthand syntax of the print statement can be used to write to any open file, or file-like object. In this case, you can redirect a single print statement to stderr without affecting subsequent print statements. -
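In Python 3 the print >> syntax no longer exists; the print function takes a file argument instead. A small sketch of the same one-off redirection, where debug is a hypothetical helper name and capture is an in-memory stand-in for a real stream:

```python
import io
import sys


def debug(message, stream=None):
    """Write one line to stream (sys.stderr by default)."""
    if stream is None:
        stream = sys.stderr       # looked up at call time, so redirection is honored
    print(message, file=stream)   # Python 3 spelling of: print >> stream, message


capture = io.StringIO()
debug('entering function', capture)
```

As with the shorthand in the example above, only this one call is redirected; ordinary print calls elsewhere still go to stdout.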

                                                                                                  Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even Windows users unless you were ever fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input.

                                                                                                  Example 10.12. Chaining commands

                                                                                                   [you@localhost kgp]$ python kgp.py -g binary.xml         
                                                                                                   01100111
                                                                                                  @@ -4271,101 +2141,17 @@ def openAnything(source):
                                                                                                   [... snip ...]
                                                                                                  1. This is the openAnything function from toolbox.py, which you previously examined in Section 10.1, “Abstracting input sources”. All you've done is add three lines of code at the beginning of the function to check if the source is “-”; if so, you return sys.stdin. Really, that's it! Remember, stdin is a file-like object with a read method, so the rest of the code (in kgp.py, where you call openAnything) doesn't change a bit. -
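The “-” check can be tried in isolation. The Python 3 sketch below is a simplification under stated assumptions: open_source is a hypothetical name, the URL and filename attempts are left out, and piped input is simulated by temporarily replacing sys.stdin with an io.StringIO.

```python
import io
import sys


def open_source(source):
    """Return sys.stdin for '-', otherwise wrap the string as data."""
    if source == '-':
        return sys.stdin          # already a file-like object with read()
    # Simplification: the URL and filename attempts are omitted here.
    return io.StringIO(str(source))


# Simulate piped input by temporarily replacing sys.stdin.
savein = sys.stdin
sys.stdin = io.StringIO('<grammar/>')
try:
    piped = open_source('-').read()
finally:
    sys.stdin = savein
```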

                                                                                                    10.3. Caching node lookups

                                                                                                    -

                                                                                                    kgp.py employs several tricks which may or may not be useful to you in your XML processing. The first one takes advantage of the consistent structure of the input documents to build a cache of nodes. -

                                                                                                    A grammar file defines a series of ref elements. Each ref contains one or more p elements, which can contain a lot of different things, including xrefs. Whenever you encounter an xref, you look for a corresponding ref element with the same id attribute, and choose one of the ref element's children and parse it. (You'll see how this random choice is made in the next section.) -

                                                                                                    This is how you build up the grammar: define ref elements for the smallest pieces, then define ref elements which "include" the first ref elements by using xref, and so forth. Then you parse the "largest" reference and follow each xref, and eventually output real text. The text you output depends on the (random) decisions you make each time you fill in an xref, so the output is different each time. -

                                                                                                    This is all very flexible, but there is one downside: performance. When you find an xref and need to find the corresponding ref element, you have a problem. The xref has an id attribute, and you want to find the ref element that has that same id attribute, but there is no easy way to do that. The slow way to do it would be to get the entire list of ref elements each time, then manually loop through and look at each id attribute. The fast way is to do that once and build a cache, in the form of a dictionary. -

                                                                                                    Example 10.14. loadGrammar

                                                                                                    
                                                                                                    -    def loadGrammar(self, grammar):       
                                                                                                    -        self.grammar = self._load(grammar)
                                                                                                    -        self.refs = {}   
                                                                                                    -        for ref in self.grammar.getElementsByTagName("ref"): 
                                                                                                    -            self.refs[ref.attributes["id"].value] = ref       
                                                                                                    -
                                                                                                      -
                                                                                                    1. Start by creating an empty dictionary, self.refs. -
                                                                                                    2. As you saw in Section 9.5, “Searching for elements”, getElementsByTagName returns a list of all the elements of a particular name. You can easily get a list of all the ref elements, then simply loop through that list. -
                                                                                                    3. As you saw in Section 9.6, “Accessing element attributes”, you can access individual attributes of an element by name, using standard dictionary syntax. So the keys of the self.refs dictionary will be the values of the id attribute of each ref element. -
                                                                                                    4. The values of the self.refs dictionary will be the ref elements themselves. As you saw in Section 9.3, “Parsing XML”, each element, each node, each comment, each piece of text in a parsed XML document is an object. -
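To make the trade-off concrete, here is a Python 3 sketch that contrasts the slow scan with the one-time cache. The miniature grammar, the function name find_ref_slow, and the module-level refs dictionary are hypothetical; kgp.py stores the cache on the instance as self.refs.

```python
from xml.dom import minidom

# Hypothetical miniature grammar, for illustration only.
GRAMMAR = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p><xref id="bit"/><xref id="bit"/></p></ref>
</grammar>"""

grammar = minidom.parseString(GRAMMAR).documentElement


def find_ref_slow(grammar, ref_id):
    """Scan every ref element on each lookup."""
    for ref in grammar.getElementsByTagName('ref'):
        if ref.attributes['id'].value == ref_id:
            return ref
    return None


# Build the cache once: id attribute value -> ref element.
refs = {}
for ref in grammar.getElementsByTagName('ref'):
    refs[ref.attributes['id'].value] = ref
```

Both approaches return the same node object; the dictionary simply avoids walking the whole document for every xref.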

                                                                                                      Once you build this cache, whenever you come across an xref and need to find the ref element with the same id attribute, you can simply look it up in self.refs. -

                                                                                                      Example 10.15. Using the ref element cache

                                                                                                      
                                                                                                      -    def do_xref(self, node):
                                                                                                      -        id = node.attributes["id"].value
                                                                                                      -        self.parse(self.randomChildElement(self.refs[id]))

                                                                                                      You'll explore the randomChildElement function in the next section. -

                                                                                                      10.4. Finding direct children of a node

                                                                                                      -

                                                                                                      Another useful technique when parsing XML documents is finding all the direct child elements of a particular element. For instance, in the grammar files, a ref element can have several p elements, each of which can contain many things, including other p elements. You want to find just the p elements that are children of the ref, not p elements that are children of other p elements. -

                                                                                                      You might think you could simply use getElementsByTagName for this, but you can't. getElementsByTagName searches recursively and returns a single list for all the elements it finds. Since p elements can contain other p elements, you can't use getElementsByTagName, because it would return nested p elements that you don't want. To find only direct child elements, you'll need to do it yourself. -

                                                                                                      Example 10.16. Finding direct child elements

                                                                                                      
                                                                                                      -    def randomChildElement(self, node):
                                                                                                      -        choices = [e for e in node.childNodes
                                                                                                      - if e.nodeType == e.ELEMENT_NODE]   
                                                                                                      -        chosen = random.choice(choices)             
                                                                                                      -        return chosen            
                                                                                                      -
                                                                                                        -
                                                                                                      1. As you saw in Example 9.9, “Getting child nodes”, the childNodes attribute returns a list of all the child nodes of an element. -
                                                                                                      2. However, as you saw in Example 9.11, “Child nodes can be text”, the list returned by childNodes contains all different types of nodes, including text nodes. That's not what you're looking for here. You only want the children that are elements. -
                                                                                                      3. Each node has a nodeType attribute, which can be ELEMENT_NODE, TEXT_NODE, COMMENT_NODE, or any number of other values. The complete list of possible values is in the __init__.py file in the xml.dom package. (See Section 9.2, “Packages” for more on packages.) But you're just interested in nodes that are elements, so you can filter the list to only include those nodes whose nodeType is ELEMENT_NODE. -
                                                                                                      4. Once you have a list of actual elements, choosing a random one is easy. Python comes with a module called random which includes several useful functions. The random.choice function takes a list of any number of items and returns a random item. For example, if the ref element contains several p elements, then choices would be a list of p elements, and chosen would end up being assigned exactly one of them, selected at random. -
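The difference between the recursive search and the direct-child filter is easy to see on a small document. This Python 3 sketch uses a hypothetical fragment with one nested p element; the variable names all_p and direct are illustrative.

```python
import random
from xml.dom import minidom

# Hypothetical fragment: a ref with two direct p children,
# the second of which contains a nested p.
doc = minidom.parseString(
    '<ref id="x"><p>a</p>\n<p>b<p>nested</p></p></ref>')
ref = doc.documentElement

# Recursive search: finds the nested p as well.
all_p = ref.getElementsByTagName('p')

# Direct children only; the whitespace text node between them is skipped.
direct = [e for e in ref.childNodes
          if e.nodeType == e.ELEMENT_NODE]

chosen = random.choice(direct)
```

Here getElementsByTagName finds three p elements, while the filtered childNodes list holds only the two that sit directly under the ref.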

                                                                                                        10.5. Creating separate handlers by node type

                                                                                                        -

                                                                                                        The third useful XML processing tip involves separating your code into logical functions, based on node types and element names. Parsed XML documents are made up of various types of nodes, each represented by a Python object. The root level of the document itself is represented by a Document object. The Document then contains one or more Element objects (for actual XML tags), each of which may contain other Element objects, Text objects (for bits of text), or Comment objects (for embedded comments). Python makes it easy to write a dispatcher to separate the logic for each node type. -

                                                                                                        Example 10.17. Class names of parsed XML objects

                                                                                                        ->>> from xml.dom import minidom
                                                                                                        ->>> xmldoc = minidom.parse('kant.xml') 
                                                                                                        ->>> xmldoc
                                                                                                        -<xml.dom.minidom.Document instance at 0x01359DE8>
                                                                                                        ->>> xmldoc.__class__ 
                                                                                                        -<class xml.dom.minidom.Document at 0x01105D40>
                                                                                                        ->>> xmldoc.__class__.__name__          
                                                                                                        -'Document'
                                                                                                        -
                                                                                                          -
                                                                                                        1. Assume for a moment that kant.xml is in the current directory. -
                                                                                                        2. As you saw in Section 9.2, “Packages”, the object returned by parsing an XML document is a Document object, as defined in minidom.py in the xml.dom package. As you saw in Section 5.4, “Instantiating Classes”, __class__ is a built-in attribute of every Python object. -
                                                                                                        3. Furthermore, __name__ is a built-in attribute of every Python class, and it is a string. This string is not mysterious; it's the same as the class name you type when you define a class yourself. (See Section 5.3, “Defining Classes”.) -
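The same attribute lookup works for every node type, not just the document. A short Python 3 sketch, using a hypothetical inline document instead of kant.xml so that it runs without any files:

```python
from xml.dom import minidom

# Hypothetical inline document containing a comment and some text.
xmldoc = minidom.parseString('<kant><!-- note -->text</kant>')
root = xmldoc.documentElement

# Collect the class name of the document, the root element,
# the comment node, and the text node, in that order.
names = [node.__class__.__name__
         for node in (xmldoc, root, root.firstChild, root.lastChild)]
```

These four strings are exactly the suffixes a dispatcher can use to pick a handler for each node.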

                                                                                                          Fine, so now you can get the class name of any particular XML node (since each XML node is represented as a Python object). How can you use this to your advantage to separate the logic of parsing each node type? The answer is getattr, which you first saw in Section 4.4, “Getting Object References With getattr”. -

                                                                                                          Example 10.18. parse, a generic XML node dispatcher

                                                                                                          
                                                                                                          -    def parse(self, node):          
                                                                                                          -        parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)  
                                                                                                          -        parseMethod(node) 
                                                                                                          -
                                                                                                            -
                                                                                                          1. First off, notice that you're constructing a larger string based on the class name of the node you were passed (in the node argument). So if you're passed a Document node, you're constructing the string 'parse_Document', and so forth. -
                                                                                                          2. Now you can treat that string as a function name, and get a reference to the function itself using getattr.
                                                                                                          3. Finally, you can call that function and pass the node itself as an argument. The next example shows the definitions of each of these functions. -
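The dispatch pattern can be exercised end to end with a much smaller class than KantGenerator. The Python 3 sketch below is a simplified illustration: TextCollector is a hypothetical class, and its parse_Element handler just recurses into child nodes, which is an assumption and not how kgp.py handles elements.

```python
from xml.dom import minidom


class TextCollector:
    """Hypothetical class: gathers text with one handler per node class."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Build the handler name from the node's class, then look it up.
        handler = getattr(self, 'parse_%s' % node.__class__.__name__)
        handler(node)

    def parse_Document(self, node):
        self.parse(node.documentElement)

    def parse_Element(self, node):
        for child in node.childNodes:
            self.parse(child)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        pass                      # comments contribute no output


collector = TextCollector()
collector.parse(minidom.parseString('<p>Dive <!-- aside --><b>in</b></p>'))
result = ''.join(collector.pieces)
```

Supporting a new node type means adding one more parse_ method; the parse method itself never changes.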

                                                                                                            Example 10.19. Functions called by the parse dispatcher

                                                                                                            
                                                                                                            -    def parse_Document(self, node): 
                                                                                                            -        self.parse(node.documentElement)
                                                                                                             
                                                                                                            -    def parse_Text(self, node):    
                                                                                                            -        text = node.data
                                                                                                            -        if self.capitalizeNextWord:
                                                                                                            -            self.pieces.append(text[0].upper())
                                                                                                            -            self.pieces.append(text[1:])
                                                                                                            -            self.capitalizeNextWord = 0
                                                                                                            -        else:
                                                                                                            -            self.pieces.append(text)
                                                                                                             
                                                                                                            -    def parse_Comment(self, node): 
                                                                                                            -        pass
                                                                                                             
                                                                                                            -    def parse_Element(self, node): 
                                                                                                            -        handlerMethod = getattr(self, "do_%s" % node.tagName)
                                                                                                            -        handlerMethod(node)
                                                                                                            -
                                                                                                              -
                                                                                                            1. parse_Document is only ever called once, since there is only one Document node in an XML document, and only one Document object in the parsed XML representation. It simply turns around and parses the root element of the grammar file. -
                                                                                                            2. parse_Text is called on nodes that represent bits of text. The function itself does some special processing to handle automatic capitalization - of the first word of a sentence, but otherwise simply appends the represented text to a list. -
                                                                                                            3. parse_Comment is just a pass, since you don't care about embedded comments in the grammar files. Note, however, that you still need to define the function - and explicitly make it do nothing. If the function did not exist, the generic parse function would fail as soon as it stumbled on a comment, because it would try to find the non-existent parse_Comment function. Defining a separate function for every node type, even ones you don't use, allows the generic parse function to stay simple and dumb. -
                                                                                                            4. The parse_Element method is actually itself a dispatcher, based on the name of the element's tag. The basic idea is the same: take what distinguishes - elements from each other (their tag names) and dispatch to a separate function for each of them. You construct a string like -'do_xref' (for an <xref> tag), find a function of that name, and call it. And so forth for each of the other tag names that might be found in the - course of parsing a grammar file (<p> tags, <choice> tags). -
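The dispatch pattern described above can be tried in isolation. What follows is a minimal sketch, not the book's own class: MiniDispatcher is a hypothetical name, and its parse_Element simply recurses into child nodes instead of dispatching again on the tag name. It assumes the standard xml.dom.minidom parser, whose node classes are named Document, Element, Text, and Comment.

```python
from xml.dom import minidom


class MiniDispatcher:
    """Hypothetical stand-in for the book's class; collects text from an XML tree."""

    def __init__(self):
        self.pieces = []

    def parse(self, node):
        # Build a method name from the node's class name, e.g. 'parse_Text',
        # then look up the bound method on this object and call it.
        parse_method = getattr(self, "parse_%s" % node.__class__.__name__)
        parse_method(node)

    def parse_Document(self, node):
        # A Document has exactly one root element; start there.
        self.parse(node.documentElement)

    def parse_Element(self, node):
        # Simplification (an assumption, not the book's logic):
        # recurse into the children rather than dispatching on node.tagName.
        for child in node.childNodes:
            self.parse(child)

    def parse_Text(self, node):
        self.pieces.append(node.data)

    def parse_Comment(self, node):
        # Must exist, even though it does nothing; otherwise getattr
        # would raise AttributeError on the first comment node.
        pass


doc = minidom.parseString("<p>Hello, <!-- ignored --><b>world</b></p>")
dispatcher = MiniDispatcher()
dispatcher.parse(doc)
output = "".join(dispatcher.pieces)
print(output)
```

If you delete parse_Comment from this sketch, the call to parse fails with an AttributeError as soon as it reaches the comment, which is the failure the third callout warns about.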

In this example, the dispatch functions parse and parse_Element simply find other methods in the same class. If your processing is very complex (or you have many different tag names), you could break up your code into separate modules, and use dynamic importing to import each module and call whatever functions you needed. Dynamic importing will be discussed in Chapter 16, Functional Programming.
+
+
+[more XML stuff was here]
+
+
+
+
+

                                                                                                              10.6. Handling command-line arguments

Python fully supports creating programs that can be run on the command line, complete with command-line arguments and either short- or long-style flags to specify various options. None of this is XML-specific, but this script makes good use of command-line processing, so it seemed like a good time to mention it.
@@ -4578,184 +2364,11 @@ def main(argv):
-

                                                                                                              -

                                                                                                              Chapter 13. Unit Testing

                                                                                                              -

                                                                                                              13.1. Introduction to Roman numerals

                                                                                                              -

                                                                                                              In previous chapters, you “dived in” by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen before the code gets written. -

                                                                                                              In the next few chapters, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman -numerals. You saw the mechanics of constructing and validating Roman numerals in Section 7.3, “Case Study: Roman Numerals”, but now let's step back and consider what it would take to expand that into a two-way utility. -

                                                                                                              The rules for Roman numerals lead to a number of interesting observations: -

                                                                                                              -
                                                                                                                -
                                                                                                              1. There is only one correct way to represent a particular number as Roman numerals. -
2. The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (i.e. it can only be read one way).
+[unit testing stuff was here]
-
                                                                                                              3. There is a limited range of numbers that can be expressed as Roman numerals, specifically 1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent - that its normal value should be multiplied by 1000, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from 1 to 3999.) -
                                                                                                              4. There is no way to represent 0 in Roman numerals. (Amazingly, the ancient Romans had no concept of 0 as a number. Numbers were for counting things you had; how can you count what you don't have?) -
                                                                                                              5. There is no way to represent negative numbers in Roman numerals. -
                                                                                                              6. There is no way to represent fractions or non-integer numbers in Roman numerals. -
                                                                                                              -

                                                                                                              Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals? -

                                                                                                              roman.py requirements

                                                                                                              -
                                                                                                                -
                                                                                                              1. to_roman() should return the Roman numeral representation for all integers 1 to 3999. -
                                                                                                              2. to_roman() should fail when given an integer outside the range 1 to 3999. - -
                                                                                                              3. to_roman() should fail when given a non-integer number. - -
                                                                                                              4. from_roman() should take a valid Roman numeral and return the number that it represents. - -
                                                                                                              5. from_roman() should fail when given an invalid Roman numeral. - -
                                                                                                              6. If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number - you started with. So from_roman(to_roman(n)) == n for all n in 1..3999. - -
                                                                                                              7. to_roman() should always return a Roman numeral using uppercase letters. - -
                                                                                                              8. from_roman() should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input). - -
                                                                                                              -
                                                                                                              -

                                                                                                              Further reading

                                                                                                              -
                                                                                                                -
                                                                                                              • This site has more on Roman numerals, including a fascinating history of how Romans and other civilizations really used them (short answer: haphazardly and inconsistently). - -
                                                                                                              -

                                                                                                              13.5. Testing for failure

                                                                                                              -

                                                                                                              It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect. -

                                                                                                              Remember the other requirements for to_roman(): -

                                                                                                              -
                                                                                                                -
                                                                                                              1. to_roman() should fail when given an integer outside the range 1 to 3999. - -
                                                                                                              2. to_roman() should fail when given a non-integer number. - -
                                                                                                              -

                                                                                                              In Python, functions indicate failure by raising exceptions, and the unittest module provides methods for testing whether a function raises a particular exception when given bad input. -

                                                                                                              Example 13.3. Testing bad input to to_roman()

                                                                                                              
                                                                                                              -class ToRomanBadInput(unittest.TestCase):          
                                                                                                              -    def testTooLarge(self):      
                                                                                                              -        """to_roman should fail with large input""" 
                                                                                                              -        self.assertRaises(roman.OutOfRangeError, roman.to_roman, 4000) 
                                                                                                              -
                                                                                                              -    def testZero(self):          
                                                                                                              -        """to_roman should fail with 0 input"""     
                                                                                                              -        self.assertRaises(roman.OutOfRangeError, roman.to_roman, 0)    
                                                                                                              -
                                                                                                              -    def testNegative(self):      
                                                                                                              -        """to_roman should fail with negative input"""                
                                                                                                              -        self.assertRaises(roman.OutOfRangeError, roman.to_roman, -1)  
                                                                                                              -
                                                                                                              -    def testNonInteger(self):    
                                                                                                              -        """to_roman should fail with non-integer input"""             
                                                                                                              -        self.assertRaises(roman.NotIntegerError, roman.to_roman, 0.5)  
                                                                                                              -
                                                                                                                -
1. The TestCase class of the unittest module provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments you're passing that function. (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.) Pay close attention to what you're doing here: instead of calling to_roman() directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for you. All you do is give it the exception (roman.OutOfRangeError), the function (to_roman()), and to_roman()'s arguments (4000), and assertRaises takes care of calling to_roman() and checking to make sure that it raises roman.OutOfRangeError. (Also note that you're passing the to_roman() function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned recently how handy it is that everything in Python is an object, including functions and exceptions?) -
                                                                                                              2. Along with testing numbers that are too large, you need to test numbers that are too small. Remember, Roman numerals cannot - express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative). In testZero, you are testing that to_roman() raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed. -
                                                                                                              3. Requirement #3 specifies that to_roman() cannot accept a non-integer number, so here you test to make sure that to_roman() raises a roman.NotIntegerError exception when called with 0.5. If to_roman() does not raise a roman.NotIntegerError, this test is considered failed. -
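To see what assertRaises saves you, here is a rough sketch that writes the same check twice: once by hand with try...except, and once with assertRaises. The to_roman stub and the OutOfRangeError class below are stand-ins defined inline so the sketch runs on its own; they are assumptions, not the real roman.py.

```python
import unittest


class OutOfRangeError(ValueError):
    """Stand-in for roman.OutOfRangeError (assumed definition)."""


def to_roman(n):
    """Stand-in for roman.to_roman: performs only the range check."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    return "stub"


class ManualVersusAssertRaises(unittest.TestCase):
    def test_manual(self):
        """The long way: call the function and inspect the outcome yourself."""
        try:
            to_roman(4000)
        except OutOfRangeError:
            pass  # the expected exception: the test passes
        else:
            self.fail("OutOfRangeError not raised")

    def test_assert_raises(self):
        """The short way: pass the exception, the function, and its arguments."""
        self.assertRaises(OutOfRangeError, to_roman, 4000)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ManualVersusAssertRaises)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Both tests check the same thing. Notice that the second one passes to_roman itself, without parentheses, so that assertRaises can make the call inside its own try...except block.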

                                                                                                                The next two requirements are similar to the first three, except they apply to from_roman() instead of to_roman(): -

                                                                                                                -
                                                                                                                  -
                                                                                                                1. from_roman() should take a valid Roman numeral and return the number that it represents. - -
                                                                                                                2. from_roman() should fail when given an invalid Roman numeral. - -
                                                                                                                -

                                                                                                                Requirement #4 is handled in the same way as requirement #1, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements -#2 and #3, by testing a series of bad inputs and making sure from_roman() raises the appropriate exception. -

                                                                                                                Example 13.4. Testing bad input to from_roman()

                                                                                                                
                                                                                                                -class FromRomanBadInput(unittest.TestCase):  
                                                                                                                -    def testTooManyRepeatedNumerals(self):   
                                                                                                                -        """from_roman should fail with too many repeated numerals"""              
                                                                                                                -        for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):             
                                                                                                                -            self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s) 
                                                                                                                -
                                                                                                                -    def testRepeatedPairs(self):             
                                                                                                                -        """from_roman should fail with repeated pairs of numerals"""              
                                                                                                                -        for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):               
                                                                                                                -            self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s)
                                                                                                                -
                                                                                                                -    def testMalformedAntecedent(self):       
                                                                                                                -        """from_roman should fail with malformed antecedents""" 
                                                                                                                -        for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
                                                                                                                -'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):     
                                                                                                                -            self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s)
                                                                                                                -
                                                                                                                  -
                                                                                                                1. Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to to_roman(). I will briefly note that you have another exception: roman.InvalidRomanNumeralError. That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError). You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter. -
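Until you get to the real definitions, it may help to see how little a custom exception requires. This is only a sketch under an assumption: the three names come from the tests above, but deriving them from ValueError is a guess for illustration, not necessarily how roman.py will define them.

```python
# Assumed definitions; the real roman.py may choose a different base class.
class OutOfRangeError(ValueError):
    """Raised when a number falls outside 1..3999."""


class NotIntegerError(ValueError):
    """Raised when a non-integer number is passed in."""


class InvalidRomanNumeralError(ValueError):
    """Raised when a string is not a valid Roman numeral."""


# A custom exception is raised and caught like any built-in one.
caught = None
try:
    raise InvalidRomanNumeralError("Invalid Roman numeral: IIII")
except ValueError as error:  # also catches the subclass
    caught = error

print(type(caught).__name__, "-", caught)
```

Each class body is nothing but a docstring. Everything else is inherited, which is why assertRaises can tell the three exceptions apart by class alone.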

                                                                                                                  13.6. Testing for sanity

                                                                                                                  -

                                                                                                                  Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions - where one converts A to B and the other converts B to A. In these cases, it is useful to create a “sanity check” to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering - any other sort of bug. -

                                                                                                                  Consider this requirement: -

                                                                                                                  -
                                                                                                                    -
                                                                                                                  1. If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number - you started with. So from_roman(to_roman(n)) == n for all n in 1..3999. - -
                                                                                                                  -

                                                                                                                  Example 13.5. Testing to_roman() against from_roman()

                                                                                                                  
                                                                                                                  -class SanityCheck(unittest.TestCase):        
                                                                                                                  -    def testSanity(self):  
                                                                                                                  -        """from_roman(to_roman(n))==n for all n"""
                                                                                                                  -        for integer in range(1, 4000):         
                                                                                                                  -            numeral = roman.to_roman(integer) 
                                                                                                                  -            result = roman.from_roman(numeral)
                                                                                                                  -            self.assertEqual(integer, result) 
                                                                                                                  -
                                                                                                                    -
1. You've seen the range function before, but here it is called with two arguments, which produces the integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000). Thus, 1..3999, which is the valid range for converting to Roman numerals. -
                                                                                                                  2. I just wanted to mention in passing that integer is not a keyword in Python; here it's just a variable name like any other. -
                                                                                                                  3. The actual testing logic here is straightforward: take a number (integer), convert it to a Roman numeral (numeral), then convert it back to a number (result) and make sure you end up with the same number you started with. If not, assertEqual will raise an exception and the test will immediately be considered failed. If all the numbers match, assertEqual will always return silently, the entire testSanity method will eventually return silently, and the test will be considered passed. -
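The round trip is easy to try outside of unittest. The sketch below uses a compact pair of converters written only for illustration: ROMAN_MAP, to_roman, and from_roman here are assumptions, they do no input validation at all, and they are not the roman.py module that the tests above import.

```python
# Numerals and their values, largest first (illustrative table).
ROMAN_MAP = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)


def to_roman(n):
    """Convert an integer to a Roman numeral (no range or type checks)."""
    result = []
    for numeral, value in ROMAN_MAP:
        while n >= value:
            result.append(numeral)
            n -= value
    return "".join(result)


def from_roman(s):
    """Convert a Roman numeral to an integer (assumes s is valid)."""
    total = 0
    index = 0
    for numeral, value in ROMAN_MAP:
        while s[index:index + len(numeral)] == numeral:
            total += value
            index += len(numeral)
    return total


# The sanity check: every number must survive the round trip unchanged.
mismatches = [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]
print("round-trip failures:", len(mismatches))
```

A round trip like this catches bugs where the two functions disagree with each other, but it cannot catch a bug they share. That is why the known-value tests are still needed alongside it.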

                                                                                                                    The last two requirements are different from the others because they seem both arbitrary and trivial: -

                                                                                                                    -
                                                                                                                      -
                                                                                                                    1. to_roman() should always return a Roman numeral using uppercase letters. - -
                                                                                                                    2. from_roman() should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input). - -
                                                                                                                    -

                                                                                                                    In fact, they are somewhat arbitrary. You could, for instance, have stipulated that from_roman() accept lowercase and mixed case input. But they are not completely arbitrary; if to_roman() is always returning uppercase output, then from_roman() must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying -the behavior up front. And if it's worth specifying, it's worth testing. -

                                                                                                                    Example 13.6. Testing for case

                                                                                                                    
                                                                                                                    -class CaseCheck(unittest.TestCase): 
                                                                                                                    -    def testToRomanCase(self):      
                                                                                                                    -        """to_roman should always return uppercase"""  
                                                                                                                    -        for integer in range(1, 4000):                
                                                                                                                    -            numeral = roman.to_roman(integer)          
                                                                                                                    -            self.assertEqual(numeral, numeral.upper())         
                                                                                                                    -
                                                                                                                    -    def testFromRomanCase(self):    
                                                                                                                    -        """from_roman should only accept uppercase input"""
                                                                                                                    -        for integer in range(1, 4000):                
                                                                                                                    -            numeral = roman.to_roman(integer)          
                                                                                                                    -            roman.from_roman(numeral.upper())  
                                                                                                                    -            self.assertRaises(roman.InvalidRomanNumeralError,
                                                                                                                    -            roman.from_roman, numeral.lower())   
                                                                                                                    -
                                                                                                                      -
                                                                                                                    1. The most interesting thing about this test case is all the things it doesn't test. It doesn't test that the value returned - from to_roman() is right or even consistent; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might - be tempted to combine this with the sanity check, since both run through the entire range of values and call to_roman(). -[6] But that would violate one of the fundamental rules: each test case should answer only a single question. Imagine that you combined this case check with the sanity check, and - then that test case failed. You would need to do further analysis to figure out which part of the test case failed to determine - what the problem was. If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure - sign that you've mis-designed your test cases. -
                                                                                                                    2. There's a similar lesson to be learned here: even though “you know” that to_roman() always returns uppercase, you are explicitly converting its return value to uppercase here to test that from_roman() accepts uppercase input. Why? Because the fact that to_roman() always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always - returned lowercase, the testToRomanCase test case would need to change, but this test case would still work. This was another of the fundamental rules: each test case must be able to work in isolation from any of the others. Every test case is an island. -
                                                                                                                    3. Note that you're not assigning the return value of from_roman() to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return - value; it just tests that from_roman() accepts the uppercase input without raising an exception. -
                                                                                                                    4. This is a complicated line, but it's very similar to what you did in the ToRomanBadInput and FromRomanBadInput tests. You are testing to make sure that calling a particular function (roman.from_roman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError). If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different - exception, or returning a value without raising an exception at all), the test fails. -
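The assertRaises pattern from the last callout can be tried without the real roman module. The following is only a sketch: InvalidRomanNumeralError and from_roman_stub are hypothetical stand-ins defined here for illustration, not the book's implementation, and the stub does no real Roman numeral parsing.

```python
import unittest


class InvalidRomanNumeralError(ValueError):
    """Hypothetical stand-in for roman.InvalidRomanNumeralError."""


def from_roman_stub(s):
    """Hypothetical stand-in for roman.from_roman.

    It only enforces the case rule: anything that is not all
    uppercase is rejected. It does not parse Roman numerals.
    """
    if not s or s != s.upper():
        raise InvalidRomanNumeralError(s)
    return len(s)  # placeholder value; the test below ignores it


class CaseCheck(unittest.TestCase):
    def test_from_roman_case(self):
        """the stub should only accept uppercase input"""
        for numeral in ("I", "IV", "MCMLXXII"):
            # Must not raise; the return value is deliberately discarded.
            from_roman_stub(numeral.upper())
            # Pass the callable and its argument separately, so that
            # assertRaises makes the call itself and can catch the exception.
            self.assertRaises(InvalidRomanNumeralError,
                              from_roman_stub, numeral.lower())
```

Saved in a file, this can be run with python -m unittest. Note that assertRaises receives the function object and its argument, not the result of calling the function.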

                                                                                                                      In the next chapter, you'll see how to write code that passes these tests. -



                                                                                                                      -
                                                                                                                      -

                                                                                                                      [6] “I can resist everything except temptation.” --Oscar Wilde

                                                                                                                      Chapter 14. Test-First Programming

                                                                                                                      14.1. roman.py, stage 1

                                                                                                                      @@ -5478,11 +3091,17 @@ OK
                                                                                                                      NoteWhen all of your tests pass, stop coding. -
                                                                                                                      -

                                                                                                                      Chapter 16. Functional Programming

                                                                                                                      -

                                                                                                                      16.1. Diving in

                                                                                                                      -

                                                                                                                      In Chapter 13, Unit Testing, you learned about the philosophy of unit testing. In Chapter 14, Test-First Programming, you stepped through the implementation of basic unit tests in Python. In Chapter 15, Refactoring, you saw how unit testing makes large-scale refactoring easier. This chapter will build on those sample programs, but here - we will focus more on advanced Python-specific techniques, rather than on unit testing itself. + + + + + +[functional programming stuff was here] + + + + +

                                                                                                                      The following is a complete Python program that acts as a cheap and simple regression testing framework. It takes unit tests that you've written for individual modules, collects them all into one big test suite, and runs them all at once. I actually use this script as part of the build process for this book; I have unit tests for several of the example programs (not just the roman.py module featured in Chapter 13, Unit Testing), and the first thing my automated build script does is run this program to make sure all my examples still work. If this @@ -5557,6 +3176,11 @@ OK

                                                                                                                    5. The first 5 tests are from apihelpertest.py, which tests the example script from Chapter 4, The Power Of Introspection.
                                                                                                                    6. The next 5 tests are from odbchelpertest.py, which tests the example script from Chapter 2, Your First Python Program.
                                                                                                                    7. The rest are from romantest.py, which you studied in depth in Chapter 13, Unit Testing. + + + + +

                                                                                                                      16.2. Finding the path

                                                                                                                      When running Python scripts from the command line, it is sometimes useful to know where the currently running script is located on disk.

                                                                                                                      This is one of those obscure little tricks that is virtually impossible to figure out on your own, but simple to remember @@ -5642,116 +3266,17 @@ def regressionTest():

                                                                                                                    8. The rest of the function is the same.

                                                                                                                      This technique will allow you to re-use this regression.py script on multiple projects. Just put the script in a common directory, then change to the project's directory before running it. All of that project's unit tests will be found and tested, instead of the unit tests in the common directory where regression.py is located. -

                                                                                                                      16.3. Filtering lists revisited

                                                                                                                      -

                                                                                                                      You're already familiar with using list comprehensions to filter lists. There is another way to accomplish this same thing, which some people feel is more expressive. -

                                                                                                                      Python has a built-in filter function which takes two arguments, a function and a list, and returns a list. -[7] The function passed as the first argument to filter must itself take one argument, and the list that filter returns will contain all the elements from the list passed to filter for which the function passed to filter returns true. -

                                                                                                                      Got all that? It's not as difficult as it sounds. -

                                                                                                                      Example 16.7. Introducing filter

                                                                                                                      ->>> def odd(n):                 
                                                                                                                      -...    return n % 2
                                                                                                                      -...    
                                                                                                                      ->>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                                                                                                                      ->>> filter(odd, li)             
                                                                                                                      -[1, 3, 5, 9, -3]
                                                                                                                      ->>> [e for e in li if odd(e)]   
                                                                                                                      -[1, 3, 5, 9, -3]
                                                                                                                      ->>> filteredList = []
                                                                                                                      ->>> for n in li:                
                                                                                                                      -...    if odd(n):
                                                                                                                      -...        filteredList.append(n)
                                                                                                                      -...    
                                                                                                                      ->>> filteredList
                                                                                                                      -[1, 3, 5, 9, -3]
                                                                                                                      -
                                                                                                                        -
                                                                                                                      1. odd uses the built-in modulo operator “%” to return 1 (a true value) if n is odd and 0 (a false value) if n is even. -
                                                                                                                      2. filter takes two arguments, a function (odd) and a list (li). It loops through the list and calls odd with each element. If odd returns a true value (remember, any non-zero value is true in Python), then the element is included in the returned list, otherwise it is filtered out. The result is a list of only the odd - numbers from the original list, in the same order as they appeared in the original. -
                                                                                                                      3. You could accomplish the same thing using list comprehensions, as you saw in Section 4.5, “Filtering Lists”. -
                                                                                                                      4. You could also accomplish the same thing with a for loop. Depending on your programming background, this may seem more “straightforward”, but functions like filter are much more expressive. Not only is filter easier to write, it's easier to read, too. Reading the for loop is like standing too close to a painting; you see all the details, but it may take a few seconds to be able to step - back and see the bigger picture: “Oh, you're just filtering the list!” -
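The interactive session above reflects Python 2, where filter returns a list. As a sketch for readers on Python 3 (an assumption about your interpreter, not something this chapter covers), filter returns a lazy iterator there, so it is wrapped in list() below to get the same result. The three approaches from the callouts are shown side by side.

```python
def odd(n):
    # n % 2 is 1 for odd integers and 0 for even ones,
    # including negative integers such as -3.
    return n % 2


li = [1, 2, 3, 5, 9, 10, 256, -3]

# Python 3: filter returns an iterator; list() materializes it.
filtered = list(filter(odd, li))

# The same result with a list comprehension.
comprehended = [e for e in li if odd(e)]

# The same result with an explicit for loop.
looped = []
for n in li:
    if odd(n):
        looped.append(n)

print(filtered)      # [1, 3, 5, 9, -3]
print(comprehended)  # [1, 3, 5, 9, -3]
print(looped)        # [1, 3, 5, 9, -3]
```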

                                                                                                                        Example 16.8. filter in regression.py

                                                                                                                        
                                                                                                                        -    files = os.listdir(path)              
                                                                                                                        -    test = re.compile(r"test\.py$", re.IGNORECASE)           
                                                                                                                        -    files = filter(test.search, files)    
                                                                                                                        -
                                                                                                                          -
                                                                                                                        1. As you saw in Section 16.2, “Finding the path”, path may contain the full or partial pathname of the directory of the currently running script, or it may contain an empty string - if the script is being run from the current directory. Either way, files will end up with the names of the files in the same directory as this script you're running. -
                                                                                                                        2. This is a compiled regular expression. As you saw in Section 15.3, “Refactoring”, if you're going to use the same regular expression over and over, you should compile it for faster performance. The compiled - object has a search method which takes a single argument, the string to search. If the regular expression matches the string, the search method returns a Match object containing information about the regular expression match; otherwise it returns None, the Python null value. -
                                                                                                                        3. For each element in the files list, you're going to call the search method of the compiled regular expression object, test. If the regular expression matches, the method will return a Match object, which Python considers to be true, so the element will be included in the list returned by filter. If the regular expression does not match, the search method will return None, which Python considers to be false, so the element will not be included. -
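To see the bound search method acting as a filter function without touching the disk, here is a small sketch. The list of filenames is made up for illustration and stands in for the result of os.listdir(path); the list() wrapper is only needed on Python 3, where filter returns an iterator.

```python
import re

# Hypothetical directory listing, used here in place of os.listdir(path).
files = ["apihelper.py", "apihelpertest.py", "romantest.PY",
         "notes.txt", "testdata.csv"]

# A raw string keeps the backslash in \. intact for the regex engine.
test = re.compile(r"test\.py$", re.IGNORECASE)

# test.search returns a match object (true) or None (false) for each
# name, so only names ending in "test.py" (in any case) are kept.
test_files = list(filter(test.search, files))

print(test_files)  # ['apihelpertest.py', 'romantest.PY']
```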

                                                                                                                          Historical note. Versions of Python prior to 2.0 did not have list comprehensions, so you couldn't filter using list comprehensions; the filter function was the only game in town. Even with the introduction of list comprehensions in 2.0, some people still prefer the -old-style filter (and its companion function, map, which you'll see later in this chapter). Both techniques work at the moment, so which one you use is a matter of style. -There is discussion that map and filter might be deprecated in a future version of Python, but no decision has been made. -

                                                                                                                          Example 16.9. Filtering using list comprehensions instead

                                                                                                                          
                                                                                                                          -    files = os.listdir(path)             
                                                                                                                          -    test = re.compile(r"test\.py$", re.IGNORECASE)          
                                                                                                                          -    files = [f for f in files if test.search(f)] 
                                                                                                                          -
                                                                                                                            -
                                                                                                                          1. This will accomplish exactly the same result as using the filter function. Which way is more expressive? That's up to you. -

                                                                                                                            16.4. Mapping lists revisited

                                                                                                                            -

                                                                                                                            You're already familiar with using list comprehensions to map one list into another. There is another way to accomplish the same thing, using the built-in map function. It works much the same way as the filter function. -

                                                                                                                            Example 16.10. Introducing map

                                                                                                                            ->>> def double(n):
                                                                                                                            -...    return n*2
                                                                                                                            -...    
                                                                                                                            ->>> li = [1, 2, 3, 5, 9, 10, 256, -3]
                                                                                                                            ->>> map(double, li)     
                                                                                                                            -[2, 4, 6, 10, 18, 20, 512, -6]
                                                                                                                            ->>> [double(n) for n in li]               
                                                                                                                            -[2, 4, 6, 10, 18, 20, 512, -6]
                                                                                                                            ->>> newlist = []
                                                                                                                            ->>> for n in li:        
                                                                                                                            -...    newlist.append(double(n))
                                                                                                                            -...    
                                                                                                                            ->>> newlist
                                                                                                                            -[2, 4, 6, 10, 18, 20, 512, -6]
                                                                                                                            -
                                                                                                                              -
                                                                                                                            1. map takes a function and a list[8] and returns a new list by calling the function with each element of the list in order. In this case, the function simply - multiplies each element by 2. -
                                                                                                                            2. You could accomplish the same thing with a list comprehension. List comprehensions were first introduced in Python 2.0; map has been around forever. -
                                                                                                                            3. You could, if you insist on thinking like a Visual Basic programmer, use a for loop to accomplish the same thing. -

                                                                                                                              Example 16.11. map with lists of mixed datatypes

                                                                                                                              ->>> li = [5, 'a', (2, 'b')]
                                                                                                                              ->>> map(double, li)     
                                                                                                                              -[10, 'aa', (2, 'b', 2, 'b')]
                                                                                                                              -
                                                                                                                                -
                                                                                                                              1. As a side note, I'd like to point out that map works just as well with lists of mixed datatypes, as long as the function you're using correctly handles each type. In this - case, the double function simply multiplies the given argument by 2, and Python Does The Right Thing depending on the datatype of the argument. For integers, this means actually multiplying it by 2; for - strings, it means concatenating the string with itself; for tuples, it means making a new tuple that has all of the elements - of the original, then all of the elements of the original again. -
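The map sessions above can be reproduced as a script. This is a sketch that assumes Python 3, where map returns an iterator rather than a list, hence the list() calls; on Python 2 the wrapper is harmless.

```python
def double(n):
    # Works for any type that supports multiplication by an integer.
    return n * 2


li = [1, 2, 3, 5, 9, 10, 256, -3]

# Python 3: map returns an iterator; list() materializes it.
doubled = list(map(double, li))

# Mixed datatypes: integers are multiplied, strings and tuples are repeated.
mixed = list(map(double, [5, 'a', (2, 'b')]))

print(doubled)  # [2, 4, 6, 10, 18, 20, 512, -6]
print(mixed)    # [10, 'aa', (2, 'b', 2, 'b')]
```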

                                                                                                                                All right, enough play time. Let's look at some real code. -

                                                                                                                                Example 16.12. map in regression.py

                                                                                                                                
                                                                                                                                -    filenameToModuleName = lambda f: os.path.splitext(f)[0] 
                                                                                                                                -    moduleNames = map(filenameToModuleName, files)          
                                                                                                                                -
                                                                                                                                  -
                                                                                                                                1. As you saw in Section 4.7, “Using lambda Functions”, lambda defines an inline function. And as you saw in Example 6.17, “Splitting Pathnames”, os.path.splitext takes a filename and returns a tuple (name, extension). So filenameToModuleName is a function which will take a filename and strip off the file extension, and return just the name. -
                                                                                                                                2. Calling map takes each filename listed in files, passes it to the function filenameToModuleName, and returns a list of the return values of each of those function calls. In other words, you strip the file extension off - of each filename, and store the list of all those stripped filenames in moduleNames. -
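The two lines above can be run on their own as a sketch. The files list below is written out by hand (using the test script names mentioned earlier in this chapter) instead of coming from the filtered directory listing, and list() is added on the assumption that you may be running Python 3.

```python
import os.path

# Hand-written stand-in for the filtered directory listing.
files = ["apihelpertest.py", "odbchelpertest.py", "romantest.py"]

# os.path.splitext returns a (name, extension) tuple; keep only the name.
filenameToModuleName = lambda f: os.path.splitext(f)[0]

# Python 3: map returns an iterator; list() materializes it.
moduleNames = list(map(filenameToModuleName, files))

print(moduleNames)  # ['apihelpertest', 'odbchelpertest', 'romantest']
```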

                                                                                                                                  As you'll see in the rest of the chapter, you can extend this type of data-centric thinking all the way to the final goal, -which is to define and execute a single test suite that contains the tests from all of those individual test suites. -

                                                                                                                                  16.5. Data-centric programming

                                                                                                                                  -

                                                                                                                                  By now you're probably scratching your head wondering why this is better than using for loops and straight function calls. And that's a perfectly valid question. Mostly, it's a matter of perspective. Using -map and filter forces you to center your thinking around your data. -

                                                                                                                                  In this case, you started with no data at all; the first thing you did was get the directory path of the current script, and get a list of files in that directory. That was the bootstrap, and it gave you real data to work -with: a list of filenames. -

                                                                                                                                  However, you knew you didn't care about all of those files, only the ones that were actually test suites. You had too much data, so you needed to filter it. How did you know which data to keep? You needed a test to decide, so you defined one and passed it to the filter function. In this case you used a regular expression to decide, but the concept would be the same regardless of how you -constructed the test. -

                                                                                                                                  Now you had the filenames of each of the test suites (and only the test suites, since everything else had been filtered out), -but you really wanted module names instead. You had the right amount of data, but it was in the wrong format. So you defined a function that would transform a single filename into a module name, and you mapped that function onto -the entire list. From one filename, you can get a module name; from a list of filenames, you can get a list of module names. -

                                                                                                                                  Instead of filter, you could have used a for loop with an if statement. Instead of map, you could have used a for loop with a function call. But using for loops like that is busywork. At best, it simply wastes time; at worst, it introduces obscure bugs. For instance, you need -to figure out how to test for the condition “is this file a test suite?” anyway; that's the application-specific logic, and no language can write that for us. But once you've figured that out, -do you really want to go to all the trouble of defining a new empty list and writing a for loop and an if statement and manually calling append to add each element to the new list if it passes the condition and then keeping track of which variable holds the new filtered -data and which one holds the old unfiltered data? Why not just define the test condition, then let Python do the rest of that work for us? -

                                                                                                                                  Oh sure, you could try to be fancy and delete elements in place without creating a new list. But you've been burned by that -before. Trying to modify a data structure that you're looping through can be tricky. You delete an element, then loop to -the next element, and suddenly you've skipped one. Is Python one of the languages that works that way? How long would it take you to figure it out? Would you remember for certain whether -it was safe the next time you tried? Programmers spend so much time and make so many mistakes dealing with purely technical -issues like this, and it's all pointless. It doesn't advance your program at all; it's just busywork. -
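For the curious, Python lists do behave this way. The sketch below uses made-up data to show the skipped element, then the data-centric alternative that avoids the problem entirely.

```python
li = [1, 3, 5, 2, 4]

# Buggy: remove odd numbers from the list while looping over it.
# After each removal the remaining elements shift left, so the
# loop steps over the element that slid into the current slot.
buggy = list(li)
for n in buggy:
    if n % 2:
        buggy.remove(n)

# Data-centric: describe what to keep and build a new list.
evens = [n for n in li if not n % 2]

print(buggy)  # [3, 2, 4]  (3 was skipped and never removed)
print(evens)  # [2, 4]
```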

                                                                                                                                  I resisted list comprehensions when I first learned Python, and I resisted filter and map even longer. I insisted on making my life more difficult, sticking to the familiar way of for loops and if statements and step-by-step code-centric programming. And my Python programs looked a lot like Visual Basic programs, detailing every step of every operation in every function. And they had all the same types of little problems -and obscure bugs. And it was all pointless. -

                                                                                                                                  Let it all go. Busywork code is not important. Data is important. And data is not difficult. It's only data. If you have -too much, filter it. If it's not what you want, map it. Focus on the data; leave the busywork behind. + + + + + +[more functional programming stuff was here] + + + + +

                                                                                                                                  16.6. Dynamically importing modules

                                                                                                                                  OK, enough philosophizing. Let's talk about dynamically importing modules.

                                                                                                                                  First, let's look at how you normally import modules. The import module syntax looks in the search path for the named module and imports it by name. You can even import multiple modules at once @@ -5924,6 +3449,16 @@ if __name__ == "__main__":

                                                                                                                                  [7] Technically, the second argument to filter can be any sequence, including lists, tuples, and custom classes that act like lists by defining the __getitem__ special method. If possible, filter will return the same datatype as you give it, so filtering a list returns a list, but filtering a tuple returns a tuple.

                                                                                                                                  [8] Again, I should point out that map can take a list, a tuple, or any object that acts like a sequence. See previous footnote about filter. + + + + + + + + + +

                                                                                                                                  Chapter 18. Performance Tuning

                                                                                                                                  Performance tuning is a many-splendored thing. Just because Python is an interpreted language doesn't mean you shouldn't worry about code optimization. But don't worry about it too much. diff --git a/examples/beauregard-100x100.jpg b/examples/beauregard-100x100.jpg new file mode 100644 index 0000000..5f004a5 Binary files /dev/null and b/examples/beauregard-100x100.jpg differ diff --git a/files.html b/files.html index 6696971..ed89858 100644 --- a/files.html +++ b/files.html @@ -26,6 +26,399 @@ body{counter-reset:h1 12} OK, so a string is a sequence of Unicode characters. But a file on disk is not a sequence of Unicode characters; a file on disk is a sequence of bytes. So if you read a “text file” from disk, how does Python convert that sequence of bytes into a sequence of characters? The answer is that it decodes the bytes according to a specific character encoding algorithm, and returns a sequence of Unicode characters, otherwise known as a string. --> +

                                                                                                                                  File Objects

                                                                                                                                  + +

                                                                                                                                  Python has a built-in function, open(), for opening a file on disk. The open() function returns a file object, which has methods and attributes for getting information about and manipulating the file. + +

                                                                                                                                  +>>> image = open('examples/beauregard-100x100.jpg', 'rb')
                                                                                                                                  +>>> image
                                                                                                                                  +<io.BufferedReader object at 0x00C7A390>
                                                                                                                                  +>>> image.mode
                                                                                                                                  +'rb'
                                                                                                                                  +>>> image.name
                                                                                                                                  +'examples/beauregard-100x100.jpg'
                                                                                                                                  +>>>
                                                                                                                                  +
                                                                                                                                  >>> f = open("/music/_singles/kairo.mp3", "rb") 
                                                                                                                                  +>>> f       
                                                                                                                                  +<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                                                                                                                                  +>>> f.mode  
                                                                                                                                  +'rb'
                                                                                                                                  +>>> f.name  
                                                                                                                                  +'/music/_singles/kairo.mp3'
                                                                                                                                  +
                                                                                                                                    +
1. The open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename, is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode. (print open.__doc__ displays a great explanation of all the possible modes.) +
                                                                                                                                  2. The open function returns an object (by now, this should not surprise you). A file object has several useful attributes. +
                                                                                                                                  3. The mode attribute of a file object tells you in which mode the file was opened. +
                                                                                                                                  4. The name attribute of a file object tells you the name of the file that the file object has open. +
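The callouts above can be tried without any particular file on disk. The following is a minimal sketch, assuming Python 3; the scratch file name sample.bin and the variable names are made up for illustration and are not part of the book's examples.

```python
import os
import tempfile

# Create a small scratch file so the example does not depend on an
# existing file (the name "sample.bin" is purely illustrative).
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "sample.bin")
with open(path, "wb") as out:
    out.write(b"\xff\xd8\xff\xe0")

# Open with an explicit mode: reading, binary.
f = open(path, "rb")
binary_mode = f.mode      # the mode the file was opened in
binary_name = f.name      # the name that was passed to open()
f.close()

# Open with the mode omitted: reading, text mode.
g = open(path)
default_mode = g.mode
g.close()

print(binary_mode, binary_name, default_mode)
```

Only the filename is required here; leaving out the mode gives you a file object opened for reading in text mode, as described in callout 1.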

                                                                                                                                    6.2.1. Reading Files

                                                                                                                                    +

                                                                                                                                    After you open a file, the first thing you'll want to do is read from it, as shown in the next example. +

                                                                                                                                    Example 6.4. Reading a File

                                                                                                                                    +
                                                                                                                                    +
                                                                                                                                    +>>> image
                                                                                                                                    +<io.BufferedReader object at 0x00C7A390>
                                                                                                                                    +>>> image.tell()
                                                                                                                                    +0
                                                                                                                                    +>>> data = image.read(3)
                                                                                                                                    +>>> data
                                                                                                                                    +b'\xff\xd8\xff'
                                                                                                                                    +>>> image.tell()
                                                                                                                                    +3
                                                                                                                                    +>>> image.seek(0)
                                                                                                                                    +0
                                                                                                                                    +>>> data = image.read()
                                                                                                                                    +>>> len(data)
                                                                                                                                    +3150
                                                                                                                                    +
                                                                                                                                    +>>> f
                                                                                                                                    +<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                                                                                                                                    +>>> f.tell()              
                                                                                                                                    +0
                                                                                                                                    +>>> f.seek(-128, 2)       
                                                                                                                                    +>>> f.tell()              
                                                                                                                                    +7542909
                                                                                                                                    +>>> tagData = f.read(128) 
                                                                                                                                    +>>> tagData
                                                                                                                                    +'TAGKAIRO****THE BEST GOA         ***DJ MARY-JANE***            
                                                                                                                                    +Rave Mix    2000http://mp3.com/DJMARYJANE     \037'
                                                                                                                                    +>>> f.tell()              
                                                                                                                                    +7543037
                                                                                                                                    +
                                                                                                                                      +
                                                                                                                                    1. A file object maintains state about the file it has open. The tell method of a file object tells you your current position in the open file. Since you haven't done anything with this file yet, the current position is 0, which is the beginning of the file. +
                                                                                                                                    2. The seek method of a file object moves to another position in the open file. The second parameter specifies what the first one means; +0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file. +
                                                                                                                                    3. The tell method confirms that the current file position has moved. +
                                                                                                                                    4. The read method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional parameter specifies the maximum number of bytes to read. If no parameter is specified, read will read until the end of the file. (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data is assigned to the tagData variable, and the current position is updated based on how many bytes were read. +
                                                                                                                                    5. The tell method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position has been incremented by 128. +
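To see tell, seek, and read working together on a file whose contents you control, here is a small sketch, assuming Python 3. It writes a 200-byte scratch file (a stand-in for the MP3 file, not the real thing) and reads its last 128 bytes.

```python
import os
import tempfile

# Build a 200-byte scratch file whose bytes are 0, 1, 2, ... 199,
# so positions in the file are easy to check by eye.
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "tail-demo.bin")
with open(path, "wb") as out:
    out.write(bytes(range(200)))

f = open(path, "rb")
start = f.tell()            # nothing read yet, so the position is 0
f.seek(-128, 2)             # 2 means "relative to the end of the file"
before = f.tell()           # 200 - 128
tail = f.read(128)          # read the final 128 bytes
after = f.tell()            # the position advanced by the bytes read
f.close()

print(start, before, len(tail), after)
```

The constant os.SEEK_END has the value 2, so f.seek(-128, os.SEEK_END) is an equivalent, more readable spelling of the same call.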

                                                                                                                                      6.2.2. Closing Files

                                                                                                                                      +

                                                                                                                                      Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's + important to close files as soon as you're finished with them. +

                                                                                                                                      Example 6.5. Closing a File

                                                                                                                                      +>>> f
                                                                                                                                      +<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                                                                                                                                      +>>> f.closed       
                                                                                                                                      +False
                                                                                                                                      +>>> f.close()      
                                                                                                                                      +>>> f
                                                                                                                                      +<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
                                                                                                                                      +>>> f.closed       
                                                                                                                                      +True
                                                                                                                                      +>>> f.seek(0)      
                                                                                                                                      +Traceback (innermost last):
                                                                                                                                      +  File "<interactive input>", line 1, in ?
                                                                                                                                      +ValueError: I/O operation on closed file
                                                                                                                                      +>>> f.tell()
                                                                                                                                      +Traceback (innermost last):
                                                                                                                                      +  File "<interactive input>", line 1, in ?
                                                                                                                                      +ValueError: I/O operation on closed file
                                                                                                                                      +>>> f.read()
                                                                                                                                      +Traceback (innermost last):
                                                                                                                                      +  File "<interactive input>", line 1, in ?
                                                                                                                                      +ValueError: I/O operation on closed file
                                                                                                                                      +>>> f.close()      
                                                                                                                                      +
                                                                                                                                        +
                                                                                                                                      1. The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is False). +
                                                                                                                                      2. To close a file, call the close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any) that the system hadn't gotten around to actually writing yet, and releases the system resources. +
                                                                                                                                      3. The closed attribute confirms that the file is closed. +
                                                                                                                                      4. Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed; they all raise an exception. +
                                                                                                                                      5. Calling close on a file object whose file is already closed does not raise an exception; it fails silently. +
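The same behavior can be checked in a script rather than at the interactive prompt. This is a minimal sketch, assuming Python 3 and a throwaway scratch file; the exact wording of the ValueError message varies between Python versions and file types, so the sketch only records that the exception was raised.

```python
import os
import tempfile

# A throwaway file to open and close (the name is illustrative).
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "close-demo.bin")
with open(path, "wb") as out:
    out.write(b"some bytes")

f = open(path, "rb")
closed_before = f.closed    # the file is still open
f.close()
closed_after = f.closed     # the file object survives, but is closed

# Any method that manipulates an open file now raises ValueError.
read_failed = False
try:
    f.read()
except ValueError:
    read_failed = True

# Closing an already-closed file is harmless and raises nothing.
f.close()

print(closed_before, closed_after, read_failed)
```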

                                                                                                                                        6.2.3. Handling I/O Errors

                                                                                                                                        +

Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle + errors. +

                                                                                                                                        Example 6.6. File Objects in MP3FileInfo

                                                                                                                                        
+        try:
+            fsock = open(filename, "rb", 0)
+            try:
+                fsock.seek(-128, 2)
+                tagdata = fsock.read(128)
+            finally:
+                fsock.close()
+            . . .
+        except IOError:
+            pass
                                                                                                                                        +
                                                                                                                                          +
                                                                                                                                        1. Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a try...except block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.) +
                                                                                                                                        2. The open function may raise an IOError. (Maybe the file doesn't exist.) +
                                                                                                                                        3. The seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.) +
                                                                                                                                        4. The read method may raise an IOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.) +
                                                                                                                                        5. This is new: a try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before. +
                                                                                                                                        6. At last, you handle your IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the next line of code after the try...except block. +
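The nested try...finally and try...except pattern described above can be packaged as a small function. This is a sketch under stated assumptions, not the code from fileinfo.py: the function name read_tail and the scratch files are hypothetical, and it assumes Python 3, where IOError is an alias of OSError.

```python
import os
import tempfile


def read_tail(filename, size=128):
    """Return the last `size` bytes of a file, or None if it can't be read."""
    try:
        fsock = open(filename, "rb")
        try:
            fsock.seek(-size, 2)        # may raise OSError
            return fsock.read(size)     # may raise OSError
        finally:
            fsock.close()               # runs even when returning or raising
    except OSError:
        # "Handling" the error here means explicitly doing nothing useful.
        return None


# Scratch data for the demonstration (names are illustrative only).
scratch_dir = tempfile.mkdtemp()
good_path = os.path.join(scratch_dir, "good.bin")
with open(good_path, "wb") as out:
    out.write(bytes(range(200)))
missing_path = os.path.join(scratch_dir, "does-not-exist.bin")

print(read_tail(good_path) == bytes(range(72, 200)))
print(read_tail(missing_path))
```

The inner finally clause only runs once open has succeeded, so close is never called on a file that was never opened.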

                                                                                                                                          6.2.4. Writing to Files

                                                                                                                                          +

                                                                                                                                          As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes: +

                                                                                                                                          +
                                                                                                                                            +
                                                                                                                                          • "Append" mode will add data to the end of the file. +
• "Write" mode will overwrite the file. +
                                                                                                                                          +

                                                                                                                                          Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly + "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open + it and start writing. +

                                                                                                                                          Example 6.7. Writing to Files

                                                                                                                                          +>>> logfile = open('test.log', 'w') 
                                                                                                                                          +>>> logfile.write('test succeeded') 
                                                                                                                                          +>>> logfile.close()
                                                                                                                                          +>>> print file('test.log').read()   
                                                                                                                                          +test succeeded
                                                                                                                                          +>>> logfile = open('test.log', 'a') 
                                                                                                                                          +>>> logfile.write('line 2')
                                                                                                                                          +>>> logfile.close()
                                                                                                                                          +>>> print file('test.log').read()   
                                                                                                                                          +test succeededline 2
                                                                                                                                          +
                                                                                                                                          +
                                                                                                                                            +
1. You start boldly by creating the new file test.log, or overwriting the existing file, and opening it for writing. (The second parameter "w" means open the file for writing.) Yes, that's as dangerous as it sounds. I hope you didn't care about the previous contents of that file, because it's gone now. +
                                                                                                                                          2. You can add data to the newly opened file with the write method of the file object returned by open. +
                                                                                                                                          3. file is a synonym for open. This one-liner opens the file, reads its contents, and prints them. +
                                                                                                                                          4. You happen to know that test.log exists (since you just finished writing to it), so you can open it and append to it. (The "a" parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening the file for appending will create the file if necessary. But appending will never harm the existing contents of the file. +
5. As you can see, both the original line you wrote and the second line you appended are now in test.log. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the file doesn't include them. You can write a line break with the "\n" (newline) character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line. +
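Here is the same write-then-append sequence as a script, as a minimal sketch assuming Python 3 (where print is a function and file() no longer exists). The log file lives in a temporary directory so that nothing you care about gets overwritten; the variable names are illustrative.

```python
import os
import tempfile

scratch_dir = tempfile.mkdtemp()
log_path = os.path.join(scratch_dir, "test.log")

# "w" creates the file, or truncates it if it already exists.
with open(log_path, "w") as logfile:
    logfile.write("test succeeded")

# "a" also creates the file if needed, but never harms existing contents.
with open(log_path, "a") as logfile:
    logfile.write("line 2")

with open(log_path) as logfile:
    smooshed = logfile.read()       # no line break was ever written

# Write the line breaks explicitly to get one entry per line.
with open(log_path, "w") as logfile:
    logfile.write("test succeeded\n")
with open(log_path, "a") as logfile:
    logfile.write("line 2\n")

with open(log_path) as logfile:
    separated = logfile.read()

print(repr(smooshed))
print(repr(separated))
```

The with statement closes each file as soon as its block ends, which has the same effect as calling close explicitly.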
                                                                                                                                            +


                                                                                                                                            10.1. Abstracting input sources

                                                                                                                                            +

                                                                                                                                            One of Python's greatest strengths is its dynamic binding, and one powerful use of dynamic binding is the file-like object. +

                                                                                                                                            Many functions which require an input source could simply take a filename, go open the file for reading, read it, and close +it when they're done. But they don't. Instead, they take a file-like object. +

                                                                                                                                            In the simplest case, a file-like object is any object with a read method with an optional size parameter, which returns a string. When called with no size parameter, it reads everything there is to read from the input source and returns all the data as a single string. When +called with a size parameter, it reads that much from the input source and returns that much data; when called again, it picks up where it left +off and returns the next chunk of data. +

                                                                                                                                            This is how reading from real files works; the difference is that you're not limiting yourself to real files. The input source could be anything: a file on +disk, a web page, even a hard-coded string. As long as you pass a file-like object to the function, and the function simply +calls the object's read method, the function can handle any kind of input source without specific code to handle each kind. +
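As a rough sketch of the idea, the class below is a hypothetical file-like object (the names HardCodedSource and count_characters are invented for illustration and are not part of the book's code). It is written in Python 3 syntax and implements only a read method with an optional size parameter, which is all the consuming function relies on.

```python
class HardCodedSource:
    """Hypothetical file-like object that serves a fixed string via read()."""

    def __init__(self, data):
        self._data = data
        self._pos = 0

    def read(self, size=-1):
        # No size (or a negative one) means "everything not yet read".
        if size is None or size < 0:
            size = len(self._data) - self._pos
        chunk = self._data[self._pos:self._pos + size]
        self._pos += len(chunk)   # remember where the next read starts
        return chunk


def count_characters(source):
    """Works with any object that has a read method: file, socket, or this."""
    return len(source.read())


src = HardCodedSource("hello world")
first = src.read(5)    # reads the first five characters
rest = src.read()      # picks up where the previous read left off
empty = src.read()     # nothing left to read
total = count_characters(HardCodedSource("hello world"))
```

The function count_characters never checks what kind of object it was given; it only calls read, so a real file opened with open would work just as well.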

                                                                                                                                            In case you were wondering how this relates to XML processing, minidom.parse is one such function which can take a file-like object. +

                                                                                                                                            Example 10.1. Parsing XML from a file

                                                                                                                                            +>>> from xml.dom import minidom
                                                                                                                                            +>>> fsock = open('binary.xml')    
                                                                                                                                            +>>> xmldoc = minidom.parse(fsock) 
                                                                                                                                            +>>> fsock.close()                 
                                                                                                                                            +>>> print xmldoc.toxml()          
                                                                                                                                            +<?xml version="1.0" ?>
                                                                                                                                            +<grammar>
                                                                                                                                            +<ref id="bit">
                                                                                                                                            +  <p>0</p>
                                                                                                                                            +  <p>1</p>
                                                                                                                                            +</ref>
                                                                                                                                            +<ref id="byte">
                                                                                                                                            +  <p><xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/>\
                                                                                                                                            +<xref id="bit"/><xref id="bit"/><xref id="bit"/><xref id="bit"/></p>
                                                                                                                                            +</ref>
                                                                                                                                            +</grammar>
                                                                                                                                            +
                                                                                                                                              +
                                                                                                                                            1. First, you open the file on disk. This gives you a file object. +
                                                                                                                                            2. You pass the file object to minidom.parse, which calls the read method of fsock and reads the XML document from the file on disk. +
                                                                                                                                            3. Be sure to call the close method of the file object after you're done with it. minidom.parse will not do this for you. +
                                                                                                                                            4. Calling the toxml() method on the returned XML document prints out the entire thing. +

                                                                                                                                              Well, that all seems like a colossal waste of time. After all, you've already seen that minidom.parse can simply take the filename and do all the opening and closing nonsense automatically. And it's true that if you know you're +just going to be parsing a local file, you can pass the filename and minidom.parse is smart enough to Do The Right Thing™. But notice how similar -- and easy -- it is to parse an XML document straight from the Internet. +

                                                                                                                                              Example 10.2. Parsing XML from a URL

                                                                                                                                              +>>> import urllib
                                                                                                                                              +>>> usock = urllib.urlopen('http://slashdot.org/slashdot.rdf') 
                                                                                                                                              +>>> xmldoc = minidom.parse(usock)            
                                                                                                                                              +>>> usock.close()          
                                                                                                                                              +>>> print xmldoc.toxml()   
                                                                                                                                              +<?xml version="1.0" ?>
                                                                                                                                              +<rdf:RDF xmlns="http://my.netscape.com/rdf/simple/0.9/"
                                                                                                                                              + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
                                                                                                                                              +
                                                                                                                                              +<channel>
                                                                                                                                              +<title>Slashdot</title>
                                                                                                                                              +<link>http://slashdot.org/</link>
                                                                                                                                              +<description>News for nerds, stuff that matters</description>
                                                                                                                                              +</channel>
                                                                                                                                              +
                                                                                                                                              +<image>
                                                                                                                                              +<title>Slashdot</title>
                                                                                                                                              +<url>http://images.slashdot.org/topics/topicslashdot.gif</url>
                                                                                                                                              +<link>http://slashdot.org/</link>
                                                                                                                                              +</image>
                                                                                                                                              +
                                                                                                                                              +<item>
                                                                                                                                              +<title>To HDTV or Not to HDTV?</title>
                                                                                                                                              +<link>http://slashdot.org/article.pl?sid=01/12/28/0421241</link>
                                                                                                                                              +</item>
                                                                                                                                              +
                                                                                                                                              +[...snip...]
                                                                                                                                              +
                                                                                                                                                +
                                                                                                                                              1. As you saw in a previous chapter, urlopen takes a web page URL and returns a file-like object. Most importantly, this object has a read method which returns the HTML source of the web page. +
                                                                                                                                              2. Now you pass the file-like object to minidom.parse, which obediently calls the read method of the object and parses the XML data that the read method returns. The fact that this XML data is now coming straight from a web page is completely irrelevant. minidom.parse doesn't know about web pages, and it doesn't care about web pages; it just knows about file-like objects. +
                                                                                                                                              3. As soon as you're done with it, be sure to close the file-like object that urlopen gives you. +
                                                                                                                                              4. By the way, this URL is real, and it really is XML. It's an XML representation of the current headlines on Slashdot, a technical news and gossip site. +

                                                                                                                                                Example 10.3. Parsing XML from a string (the easy but inflexible way)

                                                                                                                                                +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                                                                                +>>> xmldoc = minidom.parseString(contents) 
                                                                                                                                                +>>> print xmldoc.toxml()
                                                                                                                                                +<?xml version="1.0" ?>
                                                                                                                                                +<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                                                                                                                                                +
                                                                                                                                                  +
                                                                                                                                                1. minidom has a method, parseString, which takes an entire XML document as a string and parses it. You can use this instead of minidom.parse if you know you already have your entire XML document in a string. +

                                                                                                                                                  OK, so you can use the minidom.parse function for parsing both local files and remote URLs, but for parsing strings, you use... a different function. That means that if you want to be able to take input from a +file, a URL, or a string, you'll need special logic to check whether it's a string, and call the parseString function instead. How unsatisfying. +

                                                                                                                                                  If there were a way to turn a string into a file-like object, then you could simply pass this object to minidom.parse. And in fact, there is a module specifically designed for doing just that: StringIO. +

                                                                                                                                                  Example 10.4. Introducing StringIO

                                                                                                                                                  +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                                                                                  +>>> import StringIO
                                                                                                                                                  +>>> ssock = StringIO.StringIO(contents)   
                                                                                                                                                  +>>> ssock.read()        
                                                                                                                                                  +"<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                                                                                  +>>> ssock.read()        
                                                                                                                                                  +''
                                                                                                                                                  +>>> ssock.seek(0)       
                                                                                                                                                  +>>> ssock.read(15)      
                                                                                                                                                  +'<grammar><ref i'
                                                                                                                                                  +>>> ssock.read(15)
                                                                                                                                                  +"d='bit'><p>0</p"
                                                                                                                                                  +>>> ssock.read()
                                                                                                                                                  +'><p>1</p></ref></grammar>'
                                                                                                                                                  +>>> ssock.close()       
                                                                                                                                                  +
                                                                                                                                                    +
                                                                                                                                                  1. The StringIO module contains a single class, also called StringIO, which allows you to turn a string into a file-like object. The StringIO class takes the string as a parameter when creating an instance. +
                                                                                                                                                  2. Now you have a file-like object, and you can do all sorts of file-like things with it. Like read, which returns the original string. +
                                                                                                                                                  3. Calling read again returns an empty string. This is how real file objects work too; once you read the entire file, you can't read any more without explicitly seeking to the beginning of the file. The StringIO object works the same way. +
                                                                                                                                                  4. You can explicitly seek to the beginning of the string, just like seeking through a file, by using the seek method of the StringIO object. +
                                                                                                                                                  5. You can also read the string in chunks, by passing a size parameter to the read method. +
                                                                                                                                                  6. At any time, read will return the rest of the string that you haven't read yet. All of this is exactly how file objects work; hence the term +file-like object. +
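For readers on Python 3, the same steps can be sketched as follows. The one assumption to note is the import: in Python 3 the StringIO class lives in the io module rather than in a module of its own. The variable names are chosen for illustration.

```python
import io

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
ssock = io.StringIO(contents)

whole = ssock.read()            # the entire string
after_end = ssock.read()        # already at the end, so nothing is left

ssock.seek(0)                   # go back to the beginning
first_chunk = ssock.read(15)    # the first 15 characters
second_chunk = ssock.read(15)   # the next 15 characters
remainder = ssock.read()        # whatever has not been read yet
ssock.close()
```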

                                                                                                                                                    Example 10.5. Parsing XML from a string (the file-like object way)

                                                                                                                                                    +>>> contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"
                                                                                                                                                    +>>> ssock = StringIO.StringIO(contents)
                                                                                                                                                    +>>> xmldoc = minidom.parse(ssock) 
                                                                                                                                                    +>>> ssock.close()
                                                                                                                                                    +>>> print xmldoc.toxml()
                                                                                                                                                    +<?xml version="1.0" ?>
                                                                                                                                                    +<grammar><ref id="bit"><p>0</p><p>1</p></ref></grammar>
                                                                                                                                                    +
                                                                                                                                                      +
                                                                                                                                                    1. Now you can pass the file-like object (really a StringIO) to minidom.parse, which will call the object's read method and happily parse away, never knowing that its input came from a hard-coded string. +
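A minimal Python 3 sketch of the same technique, assuming io.StringIO in place of the Python 2 StringIO module. Instead of printing the document, it inspects the parsed tree to confirm that minidom.parse read the string through the file-like object.

```python
import io
from xml.dom import minidom

contents = "<grammar><ref id='bit'><p>0</p><p>1</p></ref></grammar>"

# Wrap the string in a file-like object and hand it to minidom.parse.
ssock = io.StringIO(contents)
xmldoc = minidom.parse(ssock)
ssock.close()

root = xmldoc.documentElement                  # the <grammar> element
ref = root.getElementsByTagName("ref")[0]      # the single <ref> element
paragraphs = root.getElementsByTagName("p")    # both <p> elements
```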

                                                                                                                                                      So now you know how to use a single function, minidom.parse, to parse an XML document stored on a web page, in a local file, or in a hard-coded string. For a web page, you use urlopen to get a file-like object; for a local file, you use open; and for a string, you use StringIO. Now let's take it one step further and generalize these differences as well. +

                                                                                                                                                      Example 10.6. openAnything

                                                                                                                                                      
                                                                                                                                                      +def openAnything(source):
                                                                                                                                                      +    # try to open with urllib (if source is http, ftp, or file URL)
                                                                                                                                                      +    import urllib       
                                                                                                                                                      +    try:                
                                                                                                                                                      +        return urllib.urlopen(source)      
                                                                                                                                                      +    except (IOError, OSError):            
                                                                                                                                                      +        pass            
                                                                                                                                                      +
                                                                                                                                                      +    # try to open with native open function (if source is pathname)
                                                                                                                                                      +    try:                
                                                                                                                                                      +        return open(source)                
                                                                                                                                                      +    except (IOError, OSError):            
                                                                                                                                                      +        pass            
                                                                                                                                                      +
                                                                                                                                                      +    # treat source as string
                                                                                                                                                      +    import StringIO     
                                                                                                                                                      +    return StringIO.StringIO(str(source))  
                                                                                                                                                      +
                                                                                                                                                        +
                                                                                                                                                      1. The openAnything function takes a single parameter, source, and returns a file-like object. source is a string of some sort; it can either be a URL (like 'http://slashdot.org/slashdot.rdf'), a full or partial pathname to a local file (like 'binary.xml'), or a string that contains actual XML data to be parsed. +
                                                                                                                                                      2. First, you see if source is a URL. You do this through brute force: you try to open it as a URL and silently ignore errors caused by trying to open something which is not a URL. This is actually elegant in the sense that, if urllib ever supports new types of URLs in the future, you will also support them without recoding. If urllib is able to open source, then the return kicks you out of the function immediately and the following try statements never execute. +
                                                                                                                                                      3. On the other hand, if urllib yelled at you and told you that source wasn't a valid URL, you assume it's a path to a file on disk and try to open it. Again, you don't do anything fancy to check whether source is a valid filename or not (the rules for valid filenames vary wildly between platforms, so you'd probably get them wrong anyway). Instead, you just blindly open the file, and silently trap any errors. +
                                                                                                                                                      4. By this point, you need to assume that source is a string that has hard-coded data in it (since nothing else worked), so you use StringIO to create a file-like object out of it and return that. (In fact, since you're using the str function, source doesn't even need to be a string; it could be any object, and you'll use its string representation, as defined by its __str__ special method.) +
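The fall-through logic described above can also be sketched in Python 3. This is an illustrative port, not the book's code: the name open_anything, the use of urllib.request and io.StringIO, and the sample data are all assumptions. One behavioral difference is worth labeling: in Python 3, urlopen raises ValueError for a string that is not a URL at all, so the sketch catches that alongside OSError.

```python
import io
import os
import tempfile
import urllib.request


def open_anything(source):
    """Return a file-like object for a URL, a pathname, or literal data."""
    # Try urllib first (http, ftp, or file URL). A string that is not a
    # URL at all raises ValueError in Python 3, so trap that too.
    try:
        return urllib.request.urlopen(source)
    except (OSError, ValueError):
        pass

    # Next, try the native open function (source is a pathname).
    try:
        return open(source)
    except OSError:
        pass

    # Nothing else worked: treat source as the data itself.
    return io.StringIO(str(source))


# Literal XML data falls through to the StringIO branch.
sock = open_anything("<grammar><ref id='bit'/></grammar>")
from_string = sock.read()
sock.close()

# A pathname is handled by the second branch (temporary file for the demo).
path = os.path.join(tempfile.mkdtemp(), "binary.xml")
with open(path, "w") as f:
    f.write("<grammar/>")
sock = open_anything(path)
from_file = sock.read()
sock.close()
```

In this sketch the URL branch returns an object whose read method yields bytes, while the other two branches yield text. A parser such as minidom.parse accepts either, but code that handles the data directly would need to account for the difference.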

                                                                                                                                                        Now you can use this openAnything function in conjunction with minidom.parse to make a function that takes a source that refers to an XML document somehow (either as a URL, or a local filename, or a hard-coded XML document in a string) and parses it. +

                                                                                                                                                        Example 10.7. Using openAnything in kgp.py

                                                                                                                                                        
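                                                                                                                                                        +from xml.dom import minidom
                                                                                                                                                        +import toolbox
                                                                                                                                                        +
                                                                                                                                                        +class KantGenerator:
                                                                                                                                                        +    def _load(self, source):
                                                                                                                                                        +        # source may be a URL, a local filename, or a string of XML
                                                                                                                                                        +        sock = toolbox.openAnything(source)
                                                                                                                                                        +        xmldoc = minidom.parse(sock).documentElement
                                                                                                                                                        +        sock.close()
                                                                                                                                                        +        return xmldoc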
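If you want to try the parsing half without the toolbox module, here is a rough sketch. It assumes Python 3, and load_from_string is a hypothetical stand-in: the file-like object is always an in-memory buffer, where openAnything would also try a URL or a filename.

```python
import io
from xml.dom import minidom


def load_from_string(xml_text):
    """Parse XML held in a string by way of a file-like object.

    Hypothetical stand-in for the openAnything + minidom.parse pairing.
    """
    sock = io.BytesIO(xml_text.encode("utf-8"))
    try:
        # minidom.parse accepts a filename or an open file-like object
        return minidom.parse(sock).documentElement
    finally:
        sock.close()  # close the source even if parsing fails
```

The try...finally block is a small design choice: it closes the source even when the XML turns out to be malformed and minidom.parse raises an exception.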

                                                                                                                                                        10.2. Standard input, output, and error

                                                                                                                                                        +

                                                                                                                                                        UNIX users are already familiar with the concept of standard input, standard output, and standard error. This section is for the rest of you. +

                                                                                                                                                        Standard output and standard error (commonly abbreviated stdout and stderr) are pipes that are built into every UNIX system. When you print something, it goes to the stdout pipe; when your program crashes and prints out debugging information (like a traceback in Python), it goes to the stderr pipe. Both of these pipes are ordinarily just connected to the terminal window where you are working, so when a program prints, you see the output, and when a program crashes, you see the debugging information. (If you're working on a system with a window-based Python IDE, stdout and stderr default to your “Interactive Window”.) +

                                                                                                                                                        Example 10.8. Introducing stdout and stderr

                                                                                                                                                        +>>> for i in range(3):
                                                                                                                                                        +...    print 'Dive in'             
                                                                                                                                                        +Dive in
                                                                                                                                                        +Dive in
                                                                                                                                                        +Dive in
                                                                                                                                                        +>>> import sys
                                                                                                                                                        +>>> for i in range(3):
                                                                                                                                                        +...    sys.stdout.write('Dive in') 
                                                                                                                                                        +Dive inDive inDive in
                                                                                                                                                        +>>> for i in range(3):
                                                                                                                                                        +...    sys.stderr.write('Dive in') 
                                                                                                                                                        +Dive inDive inDive in
                                                                                                                                                        +
                                                                                                                                                          +
                                                                                                                                                        1. As you saw in Example 6.9, “Simple Counters”, you can use Python's built-in range function to build simple counter loops that repeat something a set number of times. +
                                                                                                                                                        2. stdout is a file-like object; calling its write method will print out whatever string you give it. In fact, this is what the print statement really does; it adds a newline to the end of the string you're printing, and calls sys.stdout.write. +
                                                                                                                                                        3. In the simplest case, stdout and stderr send their output to the same place: the Python IDE (if you're in one), or the terminal (if you're running Python from the command line). Like stdout, stderr does not add newlines for you; if you want them, add them yourself. +
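The difference between printing and calling write directly can be sketched with two tiny helpers. The names write_raw and write_line are hypothetical, and the code is Python 3 (an assumption; the session above is Python 2). Either helper works with sys.stdout, sys.stderr, or any other object that has a write method.

```python
import io


def write_raw(stream, text):
    """Write text exactly as given; nothing is appended."""
    stream.write(text)


def write_line(stream, text):
    """Write text followed by a newline, roughly what print does."""
    stream.write(text + "\n")


# Collect the output in memory so the two results can be compared.
raw = io.StringIO()
lines = io.StringIO()
for i in range(3):
    write_raw(raw, "Dive in")
    write_line(lines, "Dive in")
```

After the loop, raw holds the three strings run together on one line, and lines holds them on three separate lines.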

                                                                                                                                                          stdout and stderr are both file-like objects, like the ones you saw in Section 10.1, “Abstracting input sources”, but they are both write-only. They have no read method, only write. Still, they are file-like objects, and you can assign any other file or file-like object to them to redirect their output. +

                                                                                                                                                          Example 10.9. Redirecting output

                                                                                                                                                          +[you@localhost kgp]$ python stdout.py
                                                                                                                                                          +Dive in
                                                                                                                                                          +[you@localhost kgp]$ cat out.log
                                                                                                                                                          +This message will be logged instead of displayed

                                                                                                                                                          (On Windows, you can use type instead of cat to display the contents of a file.) +

                                                                                                                                                          If you have not already done so, you can download this and other examples used in this book. +

                                                                                                                                                          
                                                                                                                                                          +#stdout.py
                                                                                                                                                          +import sys
                                                                                                                                                          +
                                                                                                                                                          +print 'Dive in'      
                                                                                                                                                          +saveout = sys.stdout 
                                                                                                                                                          +fsock = open('out.log', 'w')           
                                                                                                                                                          +sys.stdout = fsock   
                                                                                                                                                          +print 'This message will be logged instead of displayed' 
                                                                                                                                                          +sys.stdout = saveout 
                                                                                                                                                          +fsock.close()        
                                                                                                                                                          +
                                                                                                                                                          +
                                                                                                                                                            +
                                                                                                                                                          1. This will print to the IDE “Interactive Window” (or the terminal, if running the script from the command line). +
                                                                                                                                                          2. Always save stdout before redirecting it, so you can set it back to normal later. +
                                                                                                                                                          3. Open a file for writing. If the file doesn't exist, it will be created. If the file does exist, it will be overwritten. +
                                                                                                                                                          4. Redirect all further output to the new file you just opened. +
                                                                                                                                                          5. This will be “printed” to the log file only; it will not be visible in the IDE window or on the screen. +
                                                                                                                                                          6. Set stdout back to the way it was before you mucked with it. +
                                                                                                                                                          7. Close the log file. +
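The save, redirect, and restore steps above can be wrapped in a function so that stdout is put back even if something goes wrong in between. This is a sketch under assumptions, not the book's script: log_instead_of_display is a hypothetical name, and the code uses Python 3's print function.

```python
import sys


def log_instead_of_display(message, path):
    """Print message into the file at path, then restore sys.stdout."""
    saveout = sys.stdout              # remember the current stdout
    with open(path, "w") as fsock:    # the file is closed on exit
        sys.stdout = fsock            # redirect all further output
        try:
            print(message)            # goes to the file, not the screen
        finally:
            sys.stdout = saveout      # restored even if print raises
```

You would call it as log_instead_of_display('This message will be logged instead of displayed', 'out.log'); afterwards, ordinary printing goes wherever it went before.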

                                                                                                                                                            Redirecting stderr works exactly the same way, using sys.stderr instead of sys.stdout. +

                                                                                                                                                            Example 10.10. Redirecting error information

                                                                                                                                                            +[you@localhost kgp]$ python stderr.py
                                                                                                                                                            +[you@localhost kgp]$ cat error.log
                                                                                                                                                            +Traceback (most recent call last):
                                                                                                                                                            +  File "stderr.py", line 5, in ?
                                                                                                                                                            +    raise Exception, 'this error will be logged'
                                                                                                                                                            +Exception: this error will be logged


                                                                                                                                                            
                                                                                                                                                            +#stderr.py
                                                                                                                                                            +import sys
                                                                                                                                                            +
                                                                                                                                                            +fsock = open('error.log', 'w')               
                                                                                                                                                            +sys.stderr = fsock         
                                                                                                                                                            +raise Exception, 'this error will be logged'  
                                                                                                                                                            +
                                                                                                                                                            +
                                                                                                                                                              +
                                                                                                                                                            1. Open the log file where you want to store debugging information. +
                                                                                                                                                            2. Redirect standard error by assigning the file object of the newly-opened log file to stderr. +
                                                                                                                                                            3. Raise an exception. Note from the screen output that this does not print anything on screen. All the normal traceback information has been written to error.log. +
                                                                                                                                                            4. Also note that you're not explicitly closing your log file, nor are you setting stderr back to its original value. This is fine, since once the program crashes (because of the exception), Python will clean up and close the file for you, and it doesn't make any difference that stderr is never restored, since, as I mentioned, the program crashes and Python ends. Restoring the original is more important for stdout, if you expect to do other things within the same script afterwards. +

                                                                                                                                                              Since it is so common to write error messages to standard error, there is a shorthand syntax that can be used instead of going through the hassle of redirecting it outright. +
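If the script needs to keep running after the error, stderr has to be restored, just like stdout. Here is one way that might look. It is a sketch, not the book's stderr.py: run_and_log_errors is a hypothetical name, the code is Python 3, and unlike the script above it catches the exception instead of letting the program crash.

```python
import sys
import traceback


def run_and_log_errors(func, path):
    """Call func(); if it raises, write the traceback to the file at path.

    Returns True if func() finished normally, False if it raised.
    """
    saveerr = sys.stderr
    with open(path, "w") as fsock:
        sys.stderr = fsock            # redirect standard error
        try:
            func()
            return True
        except Exception:
            traceback.print_exc()     # writes to the current sys.stderr
            return False
        finally:
            sys.stderr = saveerr      # put stderr back either way
```

Because the exception is caught, nothing appears on screen and the caller decides what to do next; the log file ends up with the same kind of traceback you saw in error.log.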

                                                                                                                                                              Example 10.11. Printing to stderr

                                                                                                                                                              +>>> print 'entering function'
                                                                                                                                                              +entering function
                                                                                                                                                              +>>> import sys
                                                                                                                                                              +>>> print >> sys.stderr, 'entering function' 
                                                                                                                                                              +entering function
                                                                                                                                                              +
                                                                                                                                                              +
                                                                                                                                                                +
                                                                                                                                                              1. This shorthand syntax of the print statement can be used to write to any open file, or file-like object. In this case, you can redirect a single print statement to stderr without affecting subsequent print statements. +
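For comparison, here is how the same one-off redirection might look in Python 3, where print is a function and takes a file argument instead of the >> syntax. The version is an assumption (the session above is Python 2), and debug is a hypothetical helper name.

```python
import sys


def debug(message, stream=None):
    """Send one message to stream (default: whatever sys.stderr is now)."""
    if stream is None:
        stream = sys.stderr           # looked up at call time
    print(message, file=stream)       # other print calls are unaffected
```

Looking up sys.stderr inside the function, not in the parameter default, means the helper still does the right thing if stderr has been redirected after the function was defined.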

                                                                                                                                                                Standard input, on the other hand, is a read-only file object, and it represents the data flowing into the program from some previous program. This will likely not make much sense to classic Mac OS users, or even to Windows users who were never fluent on the MS-DOS command line. The way it works is that you can construct a chain of commands in a single line, so that one program's output becomes the input for the next program in the chain. The first program simply outputs to standard output (without doing any special redirecting itself, just doing normal print statements or whatever), and the next program reads from standard input, and the operating system takes care of connecting one program's output to the next program's input. +
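The second program in such a chain can be very small. The sketch below is a hypothetical filter (the name shout is made up, and the code is Python 3); it reads whatever arrives on standard input and writes an upper-cased copy to standard output.

```python
import sys


def shout(instream=None, outstream=None):
    """Copy lines from instream to outstream, upper-cased.

    Defaults to standard input and standard output, so the function
    can sit at the receiving end of a chain of commands.
    """
    instream = sys.stdin if instream is None else instream
    outstream = sys.stdout if outstream is None else outstream
    for line in instream:             # file-like objects yield lines
        outstream.write(line.upper())
```

If you saved this in a file called shout.py (a made-up name) and added a call to shout() at the bottom, then a command line like python stdout.py | python shout.py would feed the first script's printed output into the second one. The optional arguments are there so you can also pass in any other file-like objects.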

                                                                                                                                                                © 2001–9 Mark Pilgrim diff --git a/http-web-services.html b/http-web-services.html index ad17fd5..289693d 100644 --- a/http-web-services.html +++ b/http-web-services.html @@ -836,15 +836,31 @@ user-agent: Python-httplib2/$Rev: 259 $

                                                                                                                                                                Further Reading

                                                                                                                                                                +

                                                                                                                                                                httplib2: +

                                                                                                                                                                + +

                                                                                                                                                                HTTP caching: + +

                                                                                                                                                                +

                                                                                                                                                                RFCs: + +

                                                                                                                                                                +

                                                                                                                                                                © 2001–9 Mark Pilgrim