Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including
- carriage returns and other quote characters. You can use them anywhere, but you'll see them most often used when defining
- a docstring.
-
-
- | Triple quotes are also an easy way to define a string with both single and double quotes, like qq/.../ in Perl.
-Everything between the triple quotes is the function's docstring, which documents what the function does. A docstring, if it exists, must be the first thing defined in a function (that is, the first thing after the colon). You don't technically
-need to give your function a docstring, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the docstring is available at runtime as an attribute of the function.
-
-
- | Many Python IDEs use the docstring to provide context-sensitive documentation, so that when you type a function name, its docstring appears as a tooltip. This can be incredibly helpful, but it's only as good as the docstrings you write.
-
@@ -1930,238 +1705,20 @@ exceptions, errors occur immediately, and you can handle them in a standard way
Python Reference Manual discusses the inner workings of the try...except block.
-6.2. Working with File Objects
-Python has a built-in function, open, for opening a file on disk. open returns a file object, which has methods and attributes for getting information about and manipulating the opened file.
- Example 6.3. Opening a File
->>> f = open("/music/_singles/kairo.mp3", "rb") ①
->>> f ②
-<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
->>> f.mode ③
-'rb'
->>> f.name ④
-'/music/_singles/kairo.mp3'
-
-- The
open function can take up to three parameters: a filename, a mode, and a buffering parameter. Only the first one, the filename,
- is required; the other two are optional. If not specified, the file is opened for reading in text mode. Here you are opening the file for reading in binary mode.
- (print open.__doc__ displays a great explanation of all the possible modes.)
- - The
open function returns an object (by now, this should not surprise you). A file object has several useful attributes.
- - The mode attribute of a file object tells you in which mode the file was opened.
-
- The name attribute of a file object tells you the name of the file that the file object has open.
-
6.2.1. Reading Files
-After you open a file, the first thing you'll want to do is read from it, as shown in the next example.
- Example 6.4. Reading a File
->>> f
-<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
->>> f.tell() ①
-0
->>> f.seek(-128, 2) ②
->>> f.tell() ③
-7542909
->>> tagData = f.read(128) ④
->>> tagData
-'TAGKAIRO****THE BEST GOA ***DJ MARY-JANE***
-Rave Mix 2000http://mp3.com/DJMARYJANE \037'
->>> f.tell() ⑤
-7543037
-
-- A file object maintains state about the file it has open. The
tell method of a file object tells you your current position in the open file. Since you haven't done anything with this file
- yet, the current position is 0, which is the beginning of the file.
- - The
seek method of a file object moves to another position in the open file. The second parameter specifies what the first one means;
-0 means move to an absolute position (counting from the start of the file), 1 means move to a relative position (counting from the current position), and 2 means move to a position relative to the end of the file. Since the MP3 tags you're looking for are stored at the end of the file, you use 2 and tell the file object to move to a position 128 bytes from the end of the file.
- - The
tell method confirms that the current file position has moved.
- - The
read method reads a specified number of bytes from the open file and returns a string with the data that was read. The optional
- parameter specifies the maximum number of bytes to read. If no parameter is specified, read will read until the end of the file. (You could have simply said read() here, since you know exactly where you are in the file and you are, in fact, reading the last 128 bytes.) The read data
- is assigned to the tagData variable, and the current position is updated based on how many bytes were read.
- - The
tell method confirms that the current position has moved. If you do the math, you'll see that after reading 128 bytes, the position
- has been incremented by 128.
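The tell, seek, and read calls described above can be tried without an MP3 on hand. What follows is a minimal sketch, assuming Python 3 (where a binary read returns bytes rather than the strings shown in this chapter) and a made-up temporary file standing in for kairo.mp3. os.SEEK_END is the named constant for the value 2 used above.

```python
import os
import tempfile

# Build a stand-in file: 1000 filler bytes followed by a 128-byte "tag".
# The contents are invented purely for illustration.
payload = b"\x00" * 1000 + b"TAG" + b"x" * 125

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

f = open(path, "rb")
start = f.tell()               # nothing read yet, so the position is 0
f.seek(-128, os.SEEK_END)      # 128 bytes before the end of the file
before_read = f.tell()         # file size minus 128
tag_data = f.read(128)         # the last 128 bytes
after_read = f.tell()          # advanced by the number of bytes read
f.close()
os.remove(path)

print(start, before_read, after_read, tag_data[:3])
```

The difference between the last two positions is exactly the number of bytes that read returned.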
-6.2.2. Closing Files
-Open files consume system resources, and depending on the file mode, other programs may not be able to access them. It's
- important to close files as soon as you're finished with them.
- Example 6.5. Closing a File
->>> f
-<open file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
->>> f.closed ①
-False
->>> f.close() ②
->>> f
-<closed file '/music/_singles/kairo.mp3', mode 'rb' at 010E3988>
->>> f.closed ③
-True
->>> f.seek(0) ④
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-ValueError: I/O operation on closed file
->>> f.tell()
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-ValueError: I/O operation on closed file
->>> f.read()
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-ValueError: I/O operation on closed file
->>> f.close() ⑤
-
-- The closed attribute of a file object indicates whether the object has a file open or not. In this case, the file is still open (closed is
False).
- - To close a file, call the
close method of the file object. This frees the lock (if any) that you were holding on the file, flushes buffered writes (if any)
- that the system hadn't gotten around to actually writing yet, and releases the system resources.
- - The closed attribute confirms that the file is closed.
-
- Just because a file is closed doesn't mean that the file object ceases to exist. The variable f will continue to exist until it goes out of scope or gets manually deleted. However, none of the methods that manipulate an open file will work once the file has been closed;
- they all raise an exception.
-
- Calling
close on a file object whose file is already closed does not raise an exception; it fails silently.
-6.2.3. Handling I/O Errors
-Now you've seen enough to understand the file handling code in the fileinfo.py sample code from the previous chapter. This example shows how to safely open and read from a file and gracefully handle
- errors.
- Example 6.6. File Objects in MP3FileInfo
-    try: ①
-        fsock = open(filename, "rb", 0) ②
-        try:
-            fsock.seek(-128, 2) ③
-            tagdata = fsock.read(128) ④
-        finally: ⑤
-            fsock.close()
-        .
-        .
-        .
-    except IOError: ⑥
-        pass
-
-- Because opening and reading files is risky and may raise an exception, all of this code is wrapped in a
try...except block. (Hey, isn't standardized indentation great? This is where you start to appreciate it.)
- - The
open function may raise an IOError. (Maybe the file doesn't exist.)
- - The
seek method may raise an IOError. (Maybe the file is smaller than 128 bytes.)
- - The
read method may raise an IOError. (Maybe the disk has a bad sector, or it's on a network drive and the network just went down.)
- - This is new: a
try...finally block. Once the file has been opened successfully by the open function, you want to make absolutely sure that you close it, even if an exception is raised by the seek or read methods. That's what a try...finally block is for: code in the finally block will always be executed, even if something in the try block raises an exception. Think of it as code that gets executed on the way out, regardless of what happened before.
- - At last, you handle your
IOError exception. This could be the IOError exception raised by the call to open, seek, or read. Here, you really don't care, because all you're going to do is ignore it silently and continue. (Remember, pass is a Python statement that does nothing.) That's perfectly legal; “handling” an exception can mean explicitly doing nothing. It still counts as handled, and processing will continue normally on the
- next line of code after the try...except block.
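The nesting of try...finally inside try...except can be exercised on its own. This is only a sketch, assuming Python 3, where IOError is an alias of OSError. The helper name read_tail is hypothetical, and unlike the MP3FileInfo code it returns None on failure instead of passing silently, so the outcome is visible.

```python
import os
import tempfile

def read_tail(filename, size=128):
    """Return the last `size` bytes of a file, or None on an I/O error."""
    try:
        fsock = open(filename, "rb")
        try:
            fsock.seek(-size, os.SEEK_END)
            return fsock.read(size)
        finally:
            fsock.close()      # runs whether or not seek or read raised
    except OSError:            # IOError is the same class in Python 3
        return None

# Try it on a throwaway 200-byte file, then on a file that is gone.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a" * 72 + b"b" * 128)
tail = read_tail(tmp.name)
os.remove(tmp.name)
missing = read_tail(tmp.name)  # open raises, the helper returns None
```

Note that the finally block sits inside the outer try, after open has succeeded; if open itself fails, there is no file object to close.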
-6.2.4. Writing to Files
-As you would expect, you can also write to files in much the same way that you read from them. There are two basic file modes:
-
-
-- "Append" mode will add data to the end of the file.
-
- "write" mode will overwrite the file.
-
- Either mode will create the file automatically if it doesn't already exist, so there's never a need for any sort of fiddly
- "if the log file doesn't exist yet, create a new empty file just so you can open it for the first time" logic. Just open
- it and start writing.
- Example 6.7. Writing to Files
->>> logfile = open('test.log', 'w') ①
->>> logfile.write('test succeeded') ②
->>> logfile.close()
->>> print file('test.log').read() ③
-test succeeded
->>> logfile = open('test.log', 'a') ④
->>> logfile.write('line 2')
->>> logfile.close()
->>> print file('test.log').read() ⑤
-test succeededline 2
-
-
-- You start boldly by either creating the new file
test.log or overwriting the existing one, and opening it for writing. (The second parameter "w" means open the file for writing.) Yes, that's all as dangerous as it sounds. I hope you didn't care about the previous
- contents of that file, because it's gone now.
- - You can add data to the newly opened file with the
write method of the file object returned by open.
- file is a synonym for open. This one-liner opens the file, reads its contents, and prints them.
-- You happen to know that
test.log exists (since you just finished writing to it), so you can open it and append to it. (The "a" parameter means open the file for appending.) Actually you could do this even if the file didn't exist, because opening
- the file for appending will create the file if necessary. But appending will never harm the existing contents of the file.
- - As you can see, both the original line you wrote and the second line you appended are now in
test.log. Also note that line breaks are not included. Since you didn't write them explicitly to the file either time, the
- file doesn't include them. You can write a line break with the "\n" character. Since you didn't do this, everything you wrote to the file ended up smooshed together on the same line.
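To see the difference that an explicit "\n" makes, here is a small sketch, assuming Python 3. It writes to a temporary directory rather than the current one, and it uses the with statement, which closes each file the same way the explicit close calls above do.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    log_path = os.path.join(tmpdir, "test.log")

    with open(log_path, "w") as logfile:   # "w" creates or overwrites
        logfile.write("test succeeded")
    with open(log_path, "a") as logfile:   # "a" creates or appends
        logfile.write("line 2")
    with open(log_path) as logfile:
        run_together = logfile.read()      # no "\n" was written

    with open(log_path, "w") as logfile:   # overwriting discards the old text
        logfile.write("test succeeded\n")
    with open(log_path, "a") as logfile:
        logfile.write("line 2\n")
    with open(log_path) as logfile:
        lines = logfile.read().splitlines()

print(repr(run_together))
print(lines)
```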
-
- Further Reading on File Handling
-
- 6.3. Iterating with for Loops
- Like most other languages, Python has for loops. The only reason you haven't seen them until now is that Python is good at so many other things that you don't need them as often.
- Most other languages don't have a powerful list datatype like Python, so you end up doing a lot of manual work, specifying a start, end, and step to define a range of integers or characters
-or other iterable entities. But in Python, a for loop simply iterates over a list, the same way list comprehensions work.
- Example 6.8. Introducing the for Loop
->>> li = ['a', 'b', 'e']
->>> for s in li: ①
-...     print s ②
-a
-b
-e
->>> print "\n".join(li) ③
-a
-b
-e
-
-- The syntax for a
for loop is similar to list comprehensions. li is a list, and s will take the value of each element in turn, starting from the first element.
- - Like an
if statement or any other indented block, a for loop can have any number of lines of code in it.
- - This is the reason you haven't seen the
for loop yet: you haven't needed it yet. It's amazing how often you use for loops in other languages when all you really want is a join or a list comprehension.
-Doing a “normal” (by Visual Basic standards) counter for loop is also simple.
- Example 6.9. Simple Counters
->>> for i in range(5): ①
-...     print i
-0
-1
-2
-3
-4
->>> li = ['a', 'b', 'c', 'd', 'e']
->>> for i in range(len(li)): ②
-...     print li[i]
-a
-b
-c
-d
-e
-
-
-- As you saw in Example 3.20, “Assigning Consecutive Values”,
range produces a list of integers, which you then loop through. I know it looks a bit odd, but it is occasionally (and I stress
-occasionally) useful to have a counter loop.
- - Don't ever do this. This is Visual Basic-style thinking. Break out of it. Just iterate through the list, as shown in the previous example.
-
for loops are not just for simple counters. They can iterate through all kinds of things. Here is an example of using a for loop to iterate through a dictionary.
-
Example 6.10. Iterating Through a Dictionary
->>> import os
->>> for k, v in os.environ.items(): ① ②
-...     print "%s=%s" % (k, v)
-USERPROFILE=C:\Documents and Settings\mpilgrim
-OS=Windows_NT
-COMPUTERNAME=MPILGRIM
-USERNAME=mpilgrim
-[...snip...]
->>> print "\n".join(["%s=%s" % (k, v)
-... for k, v in os.environ.items()]) ③
-USERPROFILE=C:\Documents and Settings\mpilgrim
-OS=Windows_NT
-COMPUTERNAME=MPILGRIM
-USERNAME=mpilgrim
-[...snip...]
-
-- os.environ is a dictionary of the environment variables defined on your system. In Windows, these are your user and system variables
- accessible from MS-DOS. In UNIX, they are the variables exported in your shell's startup scripts. In classic Mac OS (before Mac OS X), there is no concept of environment variables, so this dictionary is empty.
-
os.environ.items() returns a list of tuples: [(key1, value1), (key2, value2), ...]. The for loop iterates through this list. The first round, it assigns key1 to k and value1 to v, so k = USERPROFILE and v = C:\Documents and Settings\mpilgrim. In the second round, k gets the second key, OS, and v gets the corresponding value, Windows_NT.
-- With multi-variable assignment and list comprehensions, you can replace the entire
for loop with a single statement. Whether you actually do this in real code is a matter of personal coding style. I like it
- because it makes it clear that what I'm doing is mapping a dictionary into a list, then joining the list into a single string.
- Other programmers prefer to write this out as a for loop. The output is the same in either case, although this version is slightly faster, because there is only one print statement instead of many.
-Now we can look at the for loop in MP3FileInfo, from the sample fileinfo.py program introduced in Chapter 5.
- Example 6.11. for Loop in MP3FileInfo
-    tagDataMap = {"title"   : (  3,  33, stripnulls),
-                  "artist"  : ( 33,  63, stripnulls),
-                  "album"   : ( 63,  93, stripnulls),
-                  "year"    : ( 93,  97, stripnulls),
-                  "comment" : ( 97, 126, stripnulls),
-                  "genre"   : (127, 128, ord)} ①
-    .
-    .
-    .
-        if tagdata[:3] == "TAG":
-            for tag, (start, end, parseFunc) in self.tagDataMap.items(): ②
-                self[tag] = parseFunc(tagdata[start:end]) ③
-
-- tagDataMap is a class attribute that defines the tags you're looking for in an MP3 file. Tags are stored in fixed-length fields. Once you read the last 128 bytes of the file, bytes 3 through 32 of those
- are always the song title, 33 through 62 are always the artist name, 63 through 92 are the album name, and so forth. Note
- that tagDataMap is a dictionary of tuples, and each tuple contains two integers and a function reference.
-
- This looks complicated, but it's not. The structure of the
for variables matches the structure of the elements of the list returned by items. Remember that items returns a list of tuples of the form (key, value). The first element of that list is ("title", (3, 33, <function stripnulls>)), so the first time around the loop, tag gets "title", start gets 3, end gets 33, and parseFunc gets the function stripnulls.
- - Now that you've extracted all the parameters for a single MP3 tag, saving the tag data is easy. You slice tagdata from start to end to get the actual data for this tag, call parseFunc to post-process the data, and assign this as the value for the key tag in the pseudo-dictionary self. After iterating through all the elements in tagDataMap, self has the values for all the tags, and you know what that looks like.
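The loop over tagDataMap can be run against a hand-built tag to watch the slicing happen. The following is a sketch only, assuming Python 3 (so the tag is bytes and stripnulls decodes it). The field helper and the tag contents are invented for the illustration, and a plain dictionary stands in for the pseudo-dictionary self.

```python
def stripnulls(data):
    """Strip nulls and whitespace from a bytes field, then decode it."""
    return data.replace(b"\x00", b"").strip().decode("latin-1")

tag_data_map = {
    "title":   (3, 33, stripnulls),
    "artist":  (33, 63, stripnulls),
    "album":   (63, 93, stripnulls),
    "year":    (93, 97, stripnulls),
    "comment": (97, 126, stripnulls),
    "genre":   (127, 128, ord),
}

def field(text, width):
    """Hypothetical helper: pad text with nulls to a fixed width."""
    return text.encode("latin-1").ljust(width, b"\x00")

# A made-up 128-byte tag: 3 + 30 + 30 + 30 + 4 + 30 + 1 bytes.
tagdata = (b"TAG" + field("KAIRO", 30) + field("DJ MARY-JANE", 30)
           + field("Rave Mix", 30) + field("2000", 4)
           + field("sample comment", 30) + bytes([31]))

info = {}
if tagdata[:3] == b"TAG":
    # Each value unpacks into two slice bounds and a parsing function.
    for tag, (start, end, parse_func) in tag_data_map.items():
        info[tag] = parse_func(tagdata[start:end])

print(info)
```

The one-byte slice for genre goes through ord instead of stripnulls, which is why that value comes back as an integer.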
-
6.4. Using sys.modules
-Modules, like everything else in Python, are objects. Once imported, you can always get a reference to a module through the global dictionary sys.modules.
+
+
+[for loop stuff was here]
+
+
+
+
+
Example 6.12. Introducing sys.modules
>>> import sys ①
>>> print '\n'.join(sys.modules.keys()) ②
win32api
@@ -2353,608 +1910,17 @@ may already be familiar with from working on the command line.
- Python Library Reference documents the
os module and the os.path module.
-6.6. Putting It All Together
-Once again, all the dominoes are in place. You've seen how each line of code works. Now let's step back and see how it all
- fits together.
- Example 6.21. listDirectory
-def listDirectory(directory, fileExtList): ①
-    "get list of file info objects for files of particular extensions"
-    fileList = [os.path.normcase(f)
-                for f in os.listdir(directory)]
-    fileList = [os.path.join(directory, f)
-                for f in fileList
-                if os.path.splitext(f)[1] in fileExtList] ②
-    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]): ③
-        "get file info class from filename extension"
-        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:] ④
-        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo ⑤
-    return [getFileInfoClass(f)(f) for f in fileList] ⑥
-
-listDirectory is the main attraction of this entire module. It takes a directory (like c:\music\_singles\ in my case) and a list of interesting file extensions (like ['.mp3']), and it returns a list of class instances that act like dictionaries that contain metadata about each interesting file in
- that directory. And it does it in just a few straightforward lines of code.
-- As you saw in the previous section, this line of code gets a list of the full pathnames of all the files in directory that have an interesting file extension (as specified by fileExtList).
-
- Old-school Pascal programmers may be familiar with them, but most people give me a blank stare when I tell them that Python supports nested functions -- literally, a function within a function. The nested function
getFileInfoClass can be called only from the function in which it is defined, listDirectory. As with any other function, you don't need an interface declaration or anything fancy; just define the function and code
- it.
- - Now that you've seen the
os module, this line should make more sense. It gets the extension of the file (os.path.splitext(filename)[1]), forces it to uppercase (.upper()), slices off the dot ([1:]), and constructs a class name out of it with string formatting. So c:\music\ap\mahadeva.mp3 becomes .mp3 becomes .MP3 becomes MP3 becomes MP3FileInfo.
- - Having constructed the name of the handler class that would handle this file, you check to see if that handler class actually
- exists in this module. If it does, you return the class, otherwise you return the base class
FileInfo. This is a very important point: this function returns a class. Not an instance of a class, but the class itself.
- - For each file in the “interesting files” list (fileList), you call
getFileInfoClass with the filename (f). Calling getFileInfoClass(f) returns a class; you don't know exactly which class, but you don't care. You then create an instance of this class (whatever
- it is) and pass the filename (f again), to the __init__ method. As you saw earlier in this chapter, the __init__ method of FileInfo sets self["name"], which triggers __setitem__, which is overridden in the descendant (MP3FileInfo) to parse the file appropriately to pull out the file's metadata. You do all that for each interesting file and return a
- list of the resulting instances.
-Note that listDirectory is completely generic. It doesn't know ahead of time which types of files it will be getting, or which classes are defined
-that could potentially handle those files. It inspects the directory for the files to process, and then introspects its own
-module to see what special handler classes (like MP3FileInfo) are defined. You can extend this program to handle other types of files simply by defining an appropriately-named class:
-HTMLFileInfo for HTML files, DOCFileInfo for Word .doc files, and so forth. listDirectory will handle them all, without modification, by handing off the real work to the appropriate classes and collating the results.
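The name-based handoff can be shown in isolation. This is a rough sketch, assuming Python 3. It looks the class up in globals() (the same trick dialect.py uses for its dialect classes) rather than in sys.modules, and the two classes are empty stand-ins, not the real fileinfo classes.

```python
import os

class FileInfo(dict):
    """Stand-in base class: just records the filename."""
    def __init__(self, filename=None):
        super().__init__(name=filename)

class MP3FileInfo(FileInfo):
    """Stand-in handler; the real one would parse the tags."""

def get_file_info_class(filename, namespace=None):
    """Return the handler class named after the file extension."""
    namespace = globals() if namespace is None else namespace
    ext = os.path.splitext(filename)[1].upper()[1:]    # ".mp3" -> "MP3"
    return namespace.get("%sFileInfo" % ext, FileInfo)  # fall back to base

handlers = [get_file_info_class(f)(f)
            for f in ["/music/ap/mahadeva.mp3", "/music/ap/notes.txt"]]
print([type(h).__name__ for h in handlers])
```

Defining another class with a matching name, say HTMLFileInfo, would be picked up by the lookup with no change to get_file_info_class.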
- 6.7. Summary
-The fileinfo.py program introduced in Chapter 5 should now make perfect sense.
-
-"""Framework for getting filetype-specific metadata.
-Instantiate appropriate class with filename. Returned object acts like a
-dictionary, with key-value pairs for each piece of metadata.
-    import fileinfo
-    info = fileinfo.MP3FileInfo("/music/ap/mahadeva.mp3")
-    print "\\n".join(["%s=%s" % (k, v) for k, v in info.items()])
-Or use listDirectory function to get info on all files in a directory.
-    for info in fileinfo.listDirectory("/music/ap/", [".mp3"]):
-        ...
-Framework can be extended by adding classes for particular file types, e.g.
-HTMLFileInfo, MPGFileInfo, DOCFileInfo. Each class is completely responsible for
-parsing its files appropriately; see MP3FileInfo for example.
-"""
-import os
-import sys
-from UserDict import UserDict
-def stripnulls(data):
-    "strip whitespace and nulls"
-    return data.replace("\00", "").strip()
-class FileInfo(UserDict):
-    "store file metadata"
-    def __init__(self, filename=None):
-        UserDict.__init__(self)
-        self["name"] = filename
+[HTML stuff was here]
-class MP3FileInfo(FileInfo):
-    "store ID3v1.0 MP3 tags"
-    tagDataMap = {"title"   : (  3,  33, stripnulls),
-                  "artist"  : ( 33,  63, stripnulls),
-                  "album"   : ( 63,  93, stripnulls),
-                  "year"    : ( 93,  97, stripnulls),
-                  "comment" : ( 97, 126, stripnulls),
-                  "genre"   : (127, 128, ord)}
-    def __parse(self, filename):
-        "parse ID3v1.0 tags from MP3 file"
-        self.clear()
-        try:
-            fsock = open(filename, "rb", 0)
-            try:
-                fsock.seek(-128, 2)
-                tagdata = fsock.read(128)
-            finally:
-                fsock.close()
-            if tagdata[:3] == "TAG":
-                for tag, (start, end, parseFunc) in self.tagDataMap.items():
-                    self[tag] = parseFunc(tagdata[start:end])
-        except IOError:
-            pass
-    def __setitem__(self, key, item):
-        if key == "name" and item:
-            self.__parse(item)
-        FileInfo.__setitem__(self, key, item)
-def listDirectory(directory, fileExtList):
-    "get list of file info objects for files of particular extensions"
-    fileList = [os.path.normcase(f)
-                for f in os.listdir(directory)]
-    fileList = [os.path.join(directory, f)
-                for f in fileList
-                if os.path.splitext(f)[1] in fileExtList]
-    def getFileInfoClass(filename, module=sys.modules[FileInfo.__module__]):
-        "get file info class from filename extension"
-        subclass = "%sFileInfo" % os.path.splitext(filename)[1].upper()[1:]
-        return hasattr(module, subclass) and getattr(module, subclass) or FileInfo
-    return [getFileInfoClass(f)(f) for f in fileList]
-if __name__ == "__main__":
-    for info in listDirectory("/music/_singles/", [".mp3"]):
-        print "\n".join(["%s=%s" % (k, v) for k, v in info.items()])
-        print
- Before diving into the next chapter, make sure you're comfortable doing the following things:
-
-
-
- Chapter 8. HTML Processing
- 8.1. Diving in
- I often see questions on comp.lang.python like “How can I list all the [headers|images|links] in my HTML document?” “How do I parse/translate/munge the text of my HTML document but leave the tags alone?” “How can I add/remove/quote attributes of all my HTML tags at once?” This chapter will answer all of these questions.
- Here is a complete, working Python program in two parts. The first part, BaseHTMLProcessor.py, is a generic tool to help you process HTML files by walking through the tags and text blocks. The second part, dialect.py, is an example of how to use BaseHTMLProcessor.py to translate the text of an HTML document but leave the tags alone. Read the docstrings and comments to get an overview of what's going on. Most of it will seem like black magic, because it's not obvious how
-any of these class methods ever get called. Don't worry, all will be revealed in due time.
- Example 8.1. BaseHTMLProcessor.py
- If you have not already done so, you can download this and other examples used in this book.
-
-from sgmllib import SGMLParser
-import htmlentitydefs
-
-class BaseHTMLProcessor(SGMLParser):
-    def reset(self):
-        # extend (called by SGMLParser.__init__)
-        self.pieces = []
-        SGMLParser.reset(self)
-
-    def unknown_starttag(self, tag, attrs):
-        # called for each start tag
-        # attrs is a list of (attr, value) tuples
-        # e.g. for <pre class=screen>, tag="pre", attrs=[("class", "screen")]
-        # Ideally we would like to reconstruct original tag and attributes, but
-        # we may end up quoting attribute values that weren't quoted in the source
-        # document, or we may change the type of quotes around the attribute value
-        # (single to double quotes).
-        # Note that improperly embedded non-HTML code (like client-side Javascript)
-        # may be parsed incorrectly by the ancestor, causing runtime script errors.
-        # All non-HTML code must be enclosed in HTML comment tags (<!-- code -->)
-        # to ensure that it will pass through this parser unaltered (in handle_comment).
-        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
-        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
-
-    def unknown_endtag(self, tag):
-        # called for each end tag, e.g. for </pre>, tag will be "pre"
-        # Reconstruct the original end tag.
-        self.pieces.append("</%(tag)s>" % locals())
-
-    def handle_charref(self, ref):
-        # called for each character reference, e.g. for "&#160;", ref will be "160"
-        # Reconstruct the original character reference.
-        self.pieces.append("&#%(ref)s;" % locals())
-
-    def handle_entityref(self, ref):
-        # called for each entity reference, e.g. for "&copy;", ref will be "copy"
-        # Reconstruct the original entity reference.
-        self.pieces.append("&%(ref)s" % locals())
-        # standard HTML entities are closed with a semicolon; other entities are not
-        if htmlentitydefs.entitydefs.has_key(ref):
-            self.pieces.append(";")
-
-    def handle_data(self, text):
-        # called for each block of plain text, i.e. outside of any tag and
-        # not containing any character or entity references
-        # Store the original text verbatim.
-        self.pieces.append(text)
-
-    def handle_comment(self, text):
-        # called for each HTML comment, e.g. <!-- insert Javascript code here -->
-        # Reconstruct the original comment.
-        # It is especially important that the source document enclose client-side
-        # code (like Javascript) within comments so it can pass through this
-        # processor undisturbed; see comments in unknown_starttag for details.
-        self.pieces.append("<!--%(text)s-->" % locals())
-
-    def handle_pi(self, text):
-        # called for each processing instruction, e.g. <?instruction>
-        # Reconstruct original processing instruction.
-        self.pieces.append("<?%(text)s>" % locals())
-
-    def handle_decl(self, text):
-        # called for the DOCTYPE, if present, e.g.
-        # <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
-        #     "http://www.w3.org/TR/html4/loose.dtd">
-        # Reconstruct original DOCTYPE
-        self.pieces.append("<!%(text)s>" % locals())
-
-    def output(self):
-        """Return processed HTML as a single string"""
-        return "".join(self.pieces)
Example 8.2. dialect.py
-import re
-from BaseHTMLProcessor import BaseHTMLProcessor
-
-class Dialectizer(BaseHTMLProcessor):
-    subs = ()
-
-    def reset(self):
-        # extend (called from __init__ in ancestor)
-        # Reset all data attributes
-        self.verbatim = 0
-        BaseHTMLProcessor.reset(self)
-
-    def start_pre(self, attrs):
-        # called for every <pre> tag in HTML source
-        # Increment verbatim mode count, then handle tag like normal
-        self.verbatim += 1
-        self.unknown_starttag("pre", attrs)
-
-    def end_pre(self):
-        # called for every </pre> tag in HTML source
-        # Decrement verbatim mode count
-        self.unknown_endtag("pre")
-        self.verbatim -= 1
-
-    def handle_data(self, text):
-        # override
-        # called for every block of text in HTML source
-        # If in verbatim mode, save text unaltered;
-        # otherwise process the text with a series of substitutions
-        self.pieces.append(self.verbatim and text or self.process(text))
-
-    def process(self, text):
-        # called from handle_data
-        # Process text block by performing series of regular expression
-        # substitutions (actual substitutions are defined in descendant)
-        for fromPattern, toPattern in self.subs:
-            text = re.sub(fromPattern, toPattern, text)
-        return text
-
-class ChefDialectizer(Dialectizer):
-    """convert HTML to Swedish Chef-speak
-
-    based on the classic chef.x, copyright (c) 1992, 1993 John Hagerman
-    """
-    subs = ((r'a([nu])', r'u\1'),
-            (r'A([nu])', r'U\1'),
-            (r'a\B', r'e'),
-            (r'A\B', r'E'),
-            (r'en\b', r'ee'),
-            (r'\Bew', r'oo'),
-            (r'\Be\b', r'e-a'),
-            (r'\be', r'i'),
-            (r'\bE', r'I'),
-            (r'\Bf', r'ff'),
-            (r'\Bir', r'ur'),
-            (r'(\w*?)i(\w*?)$', r'\1ee\2'),
-            (r'\bow', r'oo'),
-            (r'\bo', r'oo'),
-            (r'\bO', r'Oo'),
-            (r'the', r'zee'),
-            (r'The', r'Zee'),
-            (r'th\b', r't'),
-            (r'\Btion', r'shun'),
-            (r'\Bu', r'oo'),
-            (r'\BU', r'Oo'),
-            (r'v', r'f'),
-            (r'V', r'F'),
-            (r'w', r'w'),
-            (r'W', r'W'),
-            (r'([a-z])[.]', r'\1. Bork Bork Bork!'))
-
-class FuddDialectizer(Dialectizer):
-    """convert HTML to Elmer Fudd-speak"""
-    subs = ((r'[rl]', r'w'),
-            (r'qu', r'qw'),
-            (r'th\b', r'f'),
-            (r'th', r'd'),
-            (r'n[.]', r'n, uh-hah-hah-hah.'))
-
-class OldeDialectizer(Dialectizer):
-    """convert HTML to mock Middle English"""
-    subs = ((r'i([bcdfghjklmnpqrstvwxyz])e\b', r'y\1'),
-            (r'i([bcdfghjklmnpqrstvwxyz])e', r'y\1\1e'),
-            (r'ick\b', r'yk'),
-            (r'ia([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
-            (r'e[ea]([bcdfghjklmnpqrstvwxyz])', r'e\1e'),
-            (r'([bcdfghjklmnpqrstvwxyz])y', r'\1ee'),
-            (r'([bcdfghjklmnpqrstvwxyz])er', r'\1re'),
-            (r'([aeiou])re\b', r'\1r'),
-            (r'ia([bcdfghjklmnpqrstvwxyz])', r'i\1e'),
-            (r'tion\b', r'cioun'),
-            (r'ion\b', r'ioun'),
-            (r'aid', r'ayde'),
-            (r'ai', r'ey'),
-            (r'ay\b', r'y'),
-            (r'ay', r'ey'),
-            (r'ant', r'aunt'),
-            (r'ea', r'ee'),
-            (r'oa', r'oo'),
-            (r'ue', r'e'),
-            (r'oe', r'o'),
-            (r'ou', r'ow'),
-            (r'ow', r'ou'),
-            (r'\bhe', r'hi'),
-            (r've\b', r'veth'),
-            (r'se\b', r'e'),
-            (r"'s\b", r'es'),
-            (r'ic\b', r'ick'),
-            (r'ics\b', r'icc'),
-            (r'ical\b', r'ick'),
-            (r'tle\b', r'til'),
-            (r'll\b', r'l'),
-            (r'ould\b', r'olde'),
-            (r'own\b', r'oune'),
-            (r'un\b', r'onne'),
-            (r'rry\b', r'rye'),
-            (r'est\b', r'este'),
-            (r'pt\b', r'pte'),
-            (r'th\b', r'the'),
-            (r'ch\b', r'che'),
-            (r'ss\b', r'sse'),
-            (r'([wybdp])\b', r'\1e'),
-            (r'([rnt])\b', r'\1\1e'),
-            (r'from', r'fro'),
-            (r'when', r'whan'))
-
-def translate(url, dialectName="chef"):
-    """fetch URL and translate using dialect
-
-    dialect in ("chef", "fudd", "olde")"""
-    import urllib
-    sock = urllib.urlopen(url)
-    htmlSource = sock.read()
-    sock.close()
-    parserName = "%sDialectizer" % dialectName.capitalize()
-    parserClass = globals()[parserName]
-    parser = parserClass()
-    parser.feed(htmlSource)
-    parser.close()
-    return parser.output()
-
-def test(url):
-    """test all dialects against URL"""
-    for dialect in ("chef", "fudd", "olde"):
-        outfile = "%s.html" % dialect
-        fsock = open(outfile, "wb")
-        fsock.write(translate(url, dialect))
-        fsock.close()
-        import webbrowser
-        webbrowser.open_new(outfile)
-
-if __name__ == "__main__":
-    test("http://diveintopython3.org/odbchelper_list.html")
Example 8.3. Output of dialect.py
- Running this script will translate Section 3.2, “Introducing Lists” into mock Swedish Chef-speak (from The Muppets), mock Elmer Fudd-speak (from Bugs Bunny cartoons), and mock Middle English (loosely based on Chaucer's The Canterbury Tales). If you look at the HTML source of the output pages, you'll see that all the HTML tags and attributes are untouched, but the text between the tags has been “translated” into the mock language. If you look closer, you'll see that, in fact, only the titles and paragraphs were translated; the
- code listings and screen examples were left untouched.
-
-<div class=abstract>
-<p>Lists awe <span class=application>Pydon</span>'s wowkhowse datatype.
-If youw onwy expewience wif wists is awways in
-<span class=application>Visuaw Basic</span> ow (God fowbid) de datastowe
-in <span class=application>Powewbuiwdew</span>, bwace youwsewf fow
-<span class=application>Pydon</span> wists.</p>
-</div>
-
8.2. Introducing sgmllib.py
- HTML processing is broken into three steps: breaking down the HTML into its constituent pieces, fiddling with the pieces, and reconstructing the pieces into HTML again. The first step is done by sgmllib.py, a part of the standard Python library.
- The key to understanding this chapter is to realize that HTML is not just text, it is structured text. The structure is derived from the more-or-less-hierarchical sequence of start tags
-and end tags. Usually you don't work with HTML this way; you work with it textually in a text editor, or visually in a web browser or web authoring tool. sgmllib.py presents HTML structurally.
- sgmllib.py contains one important class: SGMLParser. SGMLParser parses HTML into useful pieces, like start tags and end tags. As soon as it succeeds in breaking down some data into a useful piece,
-it calls a method on itself based on what it found. In order to use the parser, you subclass the SGMLParser class and override these methods. This is what I meant when I said that it presents HTML structurally: the structure of the HTML determines the sequence of method calls and the arguments passed to each method.
-
SGMLParser parses HTML into 8 kinds of data, and calls a separate method for each of them:
-
-
-- Start tag
-- An HTML tag that starts a block, like
<html>, <head>, <body>, or <pre>, or a standalone tag like <br> or <img>. When it finds a start tag tagname, SGMLParser will look for a method called start_tagname or do_tagname. For instance, when it finds a <pre> tag, it will look for a start_pre or do_pre method. If found, SGMLParser calls this method with a list of the tag's attributes; otherwise, it calls unknown_starttag with the tag name and list of attributes.
-
-- End tag
-- An HTML tag that ends a block, like
</html>, </head>, </body>, or </pre>. When it finds an end tag, SGMLParser will look for a method called end_tagname. If found, SGMLParser calls this method, otherwise it calls unknown_endtag with the tag name.
-
-- Character reference
-- An escaped character referenced by its decimal or hexadecimal equivalent, like
&#160;. When found, SGMLParser calls handle_charref with the text of the decimal or hexadecimal character equivalent.
-
-- Entity reference
-- An HTML entity, like
&copy;. When found, SGMLParser calls handle_entityref with the name of the HTML entity.
-
-- Comment
-- An HTML comment, enclosed in
<!-- ... -->. When found, SGMLParser calls handle_comment with the body of the comment.
-
-- Processing instruction
-- An HTML processing instruction, enclosed in
<? ... >. When found, SGMLParser calls handle_pi with the body of the processing instruction.
-
-- Declaration
-- An HTML declaration, such as a
DOCTYPE, enclosed in <! ... >. When found, SGMLParser calls handle_decl with the body of the declaration.
-
-- Text data
-- A block of text. Anything that doesn't fit into the other 7 categories. When found,
SGMLParser calls handle_data with the text.
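The eight kinds of data above can be watched in action with a small logging parser. This is only a sketch under a stated assumption: sgmllib was removed in Python 3, so the code swaps in the standard library's html.parser.HTMLParser, which reports the same kinds of pieces through similarly named handlers but uses a single handle_starttag/handle_endtag pair instead of the start_tagname/do_tagname lookup. EventLogger and its events list are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record every piece the parser reports (hypothetical helper class)."""

    def __init__(self):
        # convert_charrefs=False keeps character and entity references
        # as separate events instead of folding them into text data.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start tag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end tag", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_pi(self, data):
        self.events.append(("pi", data))

    def handle_decl(self, decl):
        self.events.append(("decl", decl))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed('<!DOCTYPE html><p class=x>&copy;&#160;hi<!-- note --></p>')
logger.close()
for event in logger.events:
    print(event)
```

Each tuple in events starts with the kind of piece that was found, so the order of the list mirrors the order of the method calls the parser made while walking the document.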
-
-
-
-
- | Python 2.0 had a bug where SGMLParser would not recognize declarations at all (handle_decl would never be called), which meant that DOCTYPEs were silently ignored. This is fixed in Python 2.1.
-sgmllib.py comes with a test suite to illustrate this. You can run sgmllib.py, passing the name of an HTML file on the command line, and it will print out the tags and other elements as it parses them. It does this by subclassing
-the SGMLParser class and defining unknown_starttag, unknown_endtag, handle_data and other methods which simply print their arguments.
-
-
- | In the ActivePython IDE on Windows, you can specify command line arguments in the “Run script” dialog. Separate multiple arguments with spaces.
-Example 8.4. Sample test of sgmllib.py
- Here is a snippet from the table of contents of the HTML version of this book. Of course your paths may vary. (If you haven't downloaded the HTML version of the book, you can do so at http://diveintopython3.org/.)
-
-c:\python23\lib> type "c:\downloads\diveintopython3\html\toc\index.html"
-
-<!DOCTYPE html
- PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
-<html>
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
-
- <title>Dive Into Python</title>
- <link rel="stylesheet" href="diveintopython3.css" type="text/css">
-
-... rest of file omitted for brevity ...
- Running this through the test suite of sgmllib.py yields this output:
-c:\python23\lib> python sgmllib.py "c:\downloads\diveintopython3\html\toc\index.html"
-data: '\n\n'
-start tag: <html >
-data: '\n '
-start tag: <head>
-data: '\n '
-start tag: <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" >
-data: '\n \n '
-start tag: <title>
-data: 'Dive Into Python'
-end tag: </title>
-data: '\n '
-start tag: <link rel="stylesheet" href="diveintopython3.css" type="text/css" >
-data: '\n '
-
-... rest of output omitted for brevity ...
- Here's the roadmap for the rest of the chapter:
-
-
-- Subclass
SGMLParser to create classes that extract interesting data out of HTML documents.
-
- - Subclass
SGMLParser to create BaseHTMLProcessor, which overrides all 8 handler methods and uses them to reconstruct the original HTML from the pieces.
-
- - Subclass
BaseHTMLProcessor to create Dialectizer, which adds some methods to process specific HTML tags specially, and overrides the handle_data method to provide a framework for processing the text blocks between the HTML tags.
-
- - Subclass
Dialectizer to create classes that define text processing rules used by Dialectizer.handle_data.
-
- - Write a test suite that grabs a real web page from
http://diveintopython3.org/ and processes it.
-
-
- Along the way, you'll also learn about locals, globals, and dictionary-based string formatting.
-
- To extract data from HTML documents, subclass the SGMLParser class and define methods for each tag or entity you want to capture.
- The first step to extracting data from an HTML document is getting some HTML. If you have some HTML lying around on your hard drive, you can use file functions to read it, but the real fun begins when you get HTML from live web pages.
-
->>> import urllib ①
->>> sock = urllib.urlopen("http://diveintopython3.org/") ②
->>> htmlSource = sock.read() ③
->>> sock.close() ④
->>> print htmlSource ⑤
-<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head>
- <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1'>
- <title>Dive Into Python</title>
-<link rel='stylesheet' href='diveintopython3.css' type='text/css'>
-<link rev='made' href='mailto:mark@diveintopython3.org'>
-<meta name='keywords' content='Python, Dive Into Python, tutorial, object-oriented, programming, documentation, book, free'>
-<meta name='description' content='a free Python tutorial for experienced programmers'>
-</head>
-<body bgcolor='white' text='black' link='#0000FF' vlink='#840084' alink='#0000FF'>
-<table cellpadding='0' cellspacing='0' border='0' width='100%'>
-<tr><td class='header' width='1%' valign='top'>diveintopython3.org</td>
-<td width='99%' align='right'><hr size='1' noshade></td></tr>
-<tr><td class='tagline' colspan='2'>Python for experienced programmers</td></tr>
-
-[...snip...]
-
-- The
urllib module is part of the standard Python library. It contains functions for getting information about and actually retrieving data from Internet-based URLs (mainly web pages).
- - The simplest use of
urllib is to retrieve the entire text of a web page using the urlopen function. Opening a URL is similar to opening a file. The return value of urlopen is a file-like object, which has some of the same methods as a file object.
- - The simplest thing to do with the file-like object returned by
urlopen is read, which reads the entire HTML of the web page into a single string. The object also supports readlines, which reads the text line by line into a list.
- - When you're done with the object, make sure to
close it, just like a normal file object.
- - You now have the complete HTML of the home page of
http://diveintopython3.org/ in a string, and you're ready to parse it.
-
- If you have not already done so, you can download this and other examples used in this book.
-
-from sgmllib import SGMLParser
-
-class URLLister(SGMLParser):
-    def reset(self): ①
-        SGMLParser.reset(self)
-        self.urls = []
-
-    def start_a(self, attrs): ②
-        href = [v for k, v in attrs if k=='href'] ③ ④
-        if href:
-            self.urls.extend(href)
-
-reset is called by the __init__ method of SGMLParser, and it can also be called manually once an instance of the parser has been created. So if you need to do any initialization,
- do it in reset, not in __init__, so that it will be re-initialized properly when someone re-uses a parser instance.
-start_a is called by SGMLParser whenever it finds an <a> tag. The tag may contain an href attribute, and/or other attributes, like name or title. The attrs parameter is a list of tuples, [(attribute, value), (attribute, value), ...]. Or it may be just an <a>, a valid (if useless) HTML tag, in which case attrs would be an empty list.
-- You can find out whether this
<a> tag has an href attribute with a simple multi-variable list comprehension.
- - String comparisons like
k=='href' are always case-sensitive, but that's safe in this case, because SGMLParser converts attribute names to lowercase while building attrs.
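For readers on Python 3, where sgmllib is no longer available, the same link-collecting idea can be sketched with html.parser.HTMLParser. This is an assumption-laden translation, not the book's urllister.py: LinkLister is a hypothetical name, and because HTMLParser has no start_a hook, the tag name is checked inside handle_starttag instead. As with SGMLParser, reset is called from __init__, and attribute names arrive in lowercase.

```python
from html.parser import HTMLParser


class LinkLister(HTMLParser):
    """Collect the href of every <a> tag (hypothetical Python 3 sketch)."""

    def reset(self):
        # HTMLParser.__init__ calls reset, so initializing here also
        # re-initializes the list when a parser instance is reused.
        super().reset()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) tuples; names are lowercased.
            href = [v for k, v in attrs if k == "href" and v is not None]
            self.urls.extend(href)


sample = ('<a href="toc/index.html">Contents</a> '
          '<A HREF="#download">Download</A> '
          '<a name="top">no link here</a>')
lister = LinkLister()
lister.feed(sample)
lister.close()
print(lister.urls)
```

The third anchor has no href attribute, so the list comprehension yields nothing for it and the list of URLs is left unchanged.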
-Example 8.7. Using urllister.py
->>> import urllib, urllister
->>> usock = urllib.urlopen("http://diveintopython3.org/")
->>> parser = urllister.URLLister()
->>> parser.feed(usock.read()) ①
->>> usock.close() ②
->>> parser.close() ③
->>> for url in parser.urls: print url ④
-toc/index.html
-#download
-#languages
-toc/index.html
-appendix/history.html
-download/diveintopython3-html-5.0.zip
-download/diveintopython3-pdf-5.0.zip
-download/diveintopython3-word-5.0.zip
-download/diveintopython3-text-5.0.zip
-download/diveintopython3-html-flat-5.0.zip
-download/diveintopython3-xml-5.0.zip
-download/diveintopython3-common-5.0.zip
-
-
-... rest of output omitted for brevity ...
-
-- Call the
feed method, defined in SGMLParser, to get HTML into the parser.[1] It takes a string, which is what usock.read() returns.
- - Like files, you should
close your URL objects as soon as you're done with them.
- - You should
close your parser object, too, but for a different reason. You've read all the data and fed it to the parser, but the feed method isn't guaranteed to have actually processed all the HTML you give it; it may buffer it, waiting for more. Be sure to call close to flush the buffer and force everything to be fully parsed.
- - Once the parser is
closed, the parsing is complete, and parser.urls contains a list of all the linked URLs in the HTML document. (Your output may look different, if the download links have been updated by the time you read this.)
-8.4. Introducing BaseHTMLProcessor.py
-SGMLParser doesn't produce anything by itself. It parses and parses and parses, and it calls a method for each interesting thing it
- finds, but the methods don't do anything. SGMLParser is an HTML consumer: it takes HTML and breaks it down into small, structured pieces. As you saw in the previous section, you can subclass SGMLParser to define classes that catch specific tags and produce useful things, like a list of all the links on a web page. Now you'll
- take this one step further by defining a class that catches everything SGMLParser throws at it and reconstructs the complete HTML document. In technical terms, this class will be an HTML producer.
-
BaseHTMLProcessor subclasses SGMLParser and provides all 8 essential handler methods: unknown_starttag, unknown_endtag, handle_charref, handle_entityref, handle_comment, handle_pi, handle_decl, and handle_data.
-
Example 8.8. Introducing BaseHTMLProcessor
-class BaseHTMLProcessor(SGMLParser):
-    def reset(self): ①
-        self.pieces = []
-        SGMLParser.reset(self)
-
-    def unknown_starttag(self, tag, attrs): ②
-        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs])
-        self.pieces.append("<%(tag)s%(strattrs)s>" % locals())
-
-    def unknown_endtag(self, tag): ③
-        self.pieces.append("</%(tag)s>" % locals())
-
-    def handle_charref(self, ref): ④
-        self.pieces.append("&#%(ref)s;" % locals())
-
-    def handle_entityref(self, ref): ⑤
-        self.pieces.append("&%(ref)s" % locals())
-        if htmlentitydefs.entitydefs.has_key(ref):
-            self.pieces.append(";")
-
-    def handle_data(self, text): ⑥
-        self.pieces.append(text)
-
-    def handle_comment(self, text): ⑦
-        self.pieces.append("<!--%(text)s-->" % locals())
-
-    def handle_pi(self, text): ⑧
-        self.pieces.append("<?%(text)s>" % locals())
-
-    def handle_decl(self, text):
-        self.pieces.append("<!%(text)s>" % locals())
-
-
-reset, called by SGMLParser.__init__, initializes self.pieces as an empty list before calling the ancestor method. self.pieces is a data attribute which will hold the pieces of the HTML document you're constructing. Each handler method will reconstruct the HTML that SGMLParser parsed, and each method will append that string to self.pieces. Note that self.pieces is a list. You might be tempted to define it as a string and just keep appending each piece to it. That would work, but
-Python is much more efficient at dealing with lists.[2]
-- Since
BaseHTMLProcessor does not define any methods for specific tags (like the start_a method in URLLister), SGMLParser will call unknown_starttag for every start tag. This method takes the tag (tag) and the list of attribute name/value pairs (attrs), reconstructs the original HTML, and appends it to self.pieces. The string formatting here is a little strange; you'll untangle that (and also the odd-looking locals function) later in this chapter.
- - Reconstructing end tags is much simpler; just take the tag name and wrap it in the
</...> brackets.
- - When
SGMLParser finds a character reference, it calls handle_charref with the bare reference. If the HTML document contains the reference &#160;, ref will be 160. Reconstructing the original complete character reference just involves wrapping ref in &#...; characters.
- - Entity references are similar to character references, but without the hash mark. Reconstructing the original entity reference
- requires wrapping ref in
&...; characters. (Actually, as an erudite reader pointed out to me, it's slightly more complicated than this. Only certain standard
-HTML entities end in a semicolon; other similar-looking entities do not. Luckily for us, the set of standard HTML entities is defined in a dictionary in a Python module called htmlentitydefs. Hence the extra if statement.)
- - Blocks of text are simply appended to self.pieces unaltered.
-
- HTML comments are wrapped in
<!--...--> characters.
- - Processing instructions are wrapped in
<?...> characters.
-
-
- | The HTML specification requires that all non-HTML (like client-side JavaScript) must be enclosed in HTML comments, but not all web pages do this properly (and all modern web browsers are forgiving if they don't). BaseHTMLProcessor is not forgiving; if script is improperly embedded, it will be parsed as if it were HTML. For instance, if the script contains less-than and equals signs, SGMLParser may incorrectly think that it has found tags and attributes. SGMLParser always converts tags and attribute names to lowercase, which may break the script, and BaseHTMLProcessor always encloses attribute values in double quotes (even if the original HTML document used single quotes or no quotes), which will certainly break the script. Always protect your client-side script
- within HTML comments.
-Example 8.9. BaseHTMLProcessor output
-    def output(self): ①
-        """Return processed HTML as a single string"""
-        return "".join(self.pieces) ②
-
-- This is the one method in
BaseHTMLProcessor that is never called by the ancestor SGMLParser. Since the other handler methods store their reconstructed HTML in self.pieces, this function is needed to join all those pieces into one string. As noted before, Python is great at lists and mediocre at strings, so you only create the complete string when somebody explicitly asks for it.
- - If you prefer, you could use the
join method of the string module instead: string.join(self.pieces, "")
- Further reading
-
8.5. locals and globals
Let's digress from HTML processing for a minute and talk about how Python handles variables. Python has two built-in functions, locals and globals, which provide dictionary-based access to local and global variables.
Remember locals? You first saw it here:
@@ -3050,605 +2016,17 @@ print "z=",z ⑤
- This prints
x= 1, not x= 2.
- After being burned by
locals, you might think that this wouldn't change the value of z, but it does. Due to internal differences in how Python is implemented (which I'd rather not go into, since I don't fully understand them myself), globals returns the actual global namespace, not a copy: the exact opposite behavior of locals. So any changes to the dictionary returned by globals directly affect your global variables.
- This prints
z= 8, not z= 7.
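The difference between the two dictionaries can be checked with a short experiment. This is a minimal sketch written for Python 3; the function names and the demo_z variable are hypothetical and exist only for illustration.

```python
def change_local():
    x = 1
    # locals() hands back a dictionary built from the local namespace;
    # writing to it does not rebind the real local variable x.
    locals()["x"] = 2
    return x


def change_global():
    # globals() is the module's actual namespace, so assignments made
    # through it create and rebind real global variables.
    globals()["demo_z"] = 7
    globals()["demo_z"] = 8
    return demo_z  # looked up as an ordinary global name


print("x =", change_local())
print("demo_z =", change_global())
```

The first function still sees its original value of x, while the second sees the value that was stored through the globals dictionary.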
-8.6. Dictionary-based string formatting
-Why did you learn about locals and globals? So you can learn about dictionary-based string formatting. As you recall, regular string formatting provides an easy way to insert values into strings. Values are listed in a tuple and inserted in order into the string in
-place of each formatting marker. While this is efficient, it is not always the easiest code to read, especially when multiple
-values are being inserted. You can't simply scan through the string in one pass and understand what the result will be; you're
-constantly switching between reading the string and reading the tuple of values.
- There is an alternative form of string formatting that uses dictionaries instead of tuples of values.
- Example 8.13. Introducing dictionary-based string formatting
->>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
->>> "%(pwd)s" % params①
-'secret'
->>> "%(pwd)s is not a good password for %(uid)s" % params ②
-'secret is not a good password for sa'
->>> "%(database)s of mind, %(database)s of body" % params ③
-'master of mind, master of body'
-
-- Instead of a tuple of explicit values, this form of string formatting uses a dictionary, params. And instead of a simple
%s marker in the string, the marker contains a name in parentheses. This name is used as a key in the params dictionary, and the corresponding value, secret, is substituted in place of the %(pwd)s marker.
- - Dictionary-based string formatting works with any number of named keys. Each key must exist in the given dictionary, or the
- formatting will fail with a
KeyError.
- - You can even specify the same key twice; each occurrence will be replaced with the same value.
-
So why would you use dictionary-based string formatting? Well, it does seem like overkill to set up a dictionary of keys
-and values simply to do string formatting in the next line; it's really most useful when you happen to have a dictionary of
-meaningful keys and values already. Like locals.
- Example 8.14. Dictionary-based string formatting in BaseHTMLProcessor.py
-    def handle_comment(self, text):
-        self.pieces.append("<!--%(text)s-->" % locals()) ①
-
-
-- Using the built-in
locals function is the most common use of dictionary-based string formatting. It means that you can use the names of local variables
- within your string (in this case, text, which was passed to the class method as an argument) and each named variable will be replaced by its value. If text is 'Begin page footer', the string formatting "<!--%(text)s-->" % locals() will resolve to the string '<!--Begin page footer-->'.
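The %(name)s form of the formatting operator still works in Python 3, so the behavior described here is easy to try. The sketch below reuses the params dictionary from Example 8.13; the helper functions comment_markup and missing_key are hypothetical names added for illustration.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Each %(key)s marker is replaced by the value stored under that key.
line = "%(pwd)s is not a good password for %(uid)s" % params


def comment_markup(text):
    # locals() supplies {"text": ...}, so the marker can name the argument.
    return "<!--%(text)s-->" % locals()


def missing_key():
    # A marker whose key is absent from the dictionary raises KeyError.
    try:
        "%(port)s" % params
    except KeyError as err:
        return err.args[0]


print(line)
print(comment_markup("Begin page footer"))
print(missing_key())
```

Passing locals() saves building a dictionary by hand, at the cost of tying the format string to the names of the local variables.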
-Example 8.15. More dictionary-based string formatting
-    def unknown_starttag(self, tag, attrs):
-        strattrs = "".join([' %s="%s"' % (key, value) for key, value in attrs]) ①
-        self.pieces.append("<%(tag)s%(strattrs)s>" % locals()) ②
-
-
-- When this method is called, attrs is a list of key/value tuples, just like the
items of a dictionary, which means you can use multi-variable assignment to iterate through it. This should be a familiar pattern by now, but there's a lot going on here, so let's break it down:
-
-
-- Suppose attrs is
[('href', 'index.html'), ('title', 'Go to home page')].
- - In the first round of the list comprehension, key will get
'href', and value will get 'index.html'.
- - The string formatting
' %s="%s"' % (key, value) will resolve to ' href="index.html"'. This string becomes the first element of the list comprehension's return value.
- - In the second round, key will get
'title', and value will get 'Go to home page'.
- - The string formatting will resolve to
' title="Go to home page"'.
- - The list comprehension returns a list of these two resolved strings, and strattrs will join both elements of this list together to form
' href="index.html" title="Go to home page"'.
+[XML stuff was here]
-
- - Now, using dictionary-based string formatting, you insert the value of tag and strattrs into a string. So if tag is
'a', the final result would be '<a href="index.html" title="Go to home page">', and that is what gets appended to self.pieces.
-
- | Using dictionary-based string formatting with locals is a convenient way of making complex string formatting expressions more readable, but it comes with a price. There is a
- slight performance hit in making the call to locals, since locals builds a copy of the local namespace.
-8.7. Quoting attribute values
-A common question on comp.lang.python is “I have a bunch of HTML documents with unquoted attribute values, and I want to properly quote them all. How can I do this?”[4] (This is generally precipitated by a project manager who has found the HTML-is-a-standard religion joining a large project and proclaiming that all pages must validate against an HTML validator. Unquoted attribute values are a common violation of the HTML standard.) Whatever the reason, unquoted attribute values are easy to fix by feeding HTML through BaseHTMLProcessor.
- BaseHTMLProcessor consumes HTML (since it's descended from SGMLParser) and produces equivalent HTML, but the HTML output is not identical to the input. Tags and attribute names will end up in lowercase, even if they started in uppercase
-or mixed case, and attribute values will be enclosed in double quotes, even if they started in single quotes or with no quotes
-at all. It is this last side effect that you can take advantage of.
-
Example 8.16. Quoting attribute values
->>> htmlSource = """ ①
-... <html>
-... <head>
-... <title>Test page</title>
-... </head>
-... <body>
-... <ul>
-... <li><a href=index.html>Home</a></li>
-... <li><a href=toc.html>Table of contents</a></li>
-... <li><a href=history.html>Revision history</a></li>
-... </body>
-... </html>
-... """
->>> from BaseHTMLProcessor import BaseHTMLProcessor
->>> parser = BaseHTMLProcessor()
->>> parser.feed(htmlSource) ②
->>> print parser.output() ③
-<html>
-<head>
-<title>Test page</title>
-</head>
-<body>
-<ul>
-<li><a href="index.html">Home</a></li>
-<li><a href="toc.html">Table of contents</a></li>
-<li><a href="history.html">Revision history</a></li>
-</body>
-</html>
-
-- Note that the attribute values of the
href attributes in the <a> tags are not properly quoted. (Also note that you're using triple quotes for something other than a docstring. And directly in the IDE, no less. They're very useful.)
- - Feed the parser.
-
- Using the
output function defined in BaseHTMLProcessor, you get the output as a single string, complete with quoted attribute values. While this may seem anti-climactic, think
- about how much has actually happened here: SGMLParser parsed the entire HTML document, breaking it down into tags, refs, data, and so forth; BaseHTMLProcessor used those elements to reconstruct pieces of HTML (which are still stored in parser.pieces, if you want to see them); finally, you called parser.output, which joined all the pieces of HTML into one string.
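The same consume-then-produce round trip can be sketched on Python 3. This is not BaseHTMLProcessor itself: sgmllib is gone from Python 3, so the sketch assumes html.parser.HTMLParser as a stand-in, and the class name Requoter is hypothetical. It handles only tags, text, and references, which is enough to show unquoted attribute values coming out quoted.

```python
from html import escape
from html.parser import HTMLParser


class Requoter(HTMLParser):
    """Rebuild HTML with double-quoted attribute values (hypothetical sketch)."""

    def __init__(self):
        # Keep references as separate events so they can be written back out.
        super().__init__(convert_charrefs=False)
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        parts = []
        for key, value in attrs:
            if value is None:
                parts.append(" %s" % key)  # attribute with no value
            else:
                # The parser unescapes values, so escape them again.
                parts.append(' %s="%s"' % (key, escape(value, quote=True)))
        self.pieces.append("<%s%s>" % (tag, "".join(parts)))

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, data):
        self.pieces.append(data)

    def handle_entityref(self, name):
        self.pieces.append("&%s;" % name)

    def handle_charref(self, name):
        self.pieces.append("&#%s;" % name)

    def output(self):
        """Return the processed HTML as a single string."""
        return "".join(self.pieces)


html_source = "<ul>\n<li><a href=index.html>Home</a></li>\n</ul>\n"
parser = Requoter()
parser.feed(html_source)
parser.close()
print(parser.output())
```

As in the original design, the pieces are kept in a list and joined only when output is called.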
-8.8. Introducing dialect.py
-Dialectizer is a simple (and silly) descendant of BaseHTMLProcessor. It runs blocks of text through a series of substitutions, but it makes sure that anything within a <pre>...</pre> block passes through unaltered.
-
To handle the <pre> blocks, you define two methods in Dialectizer: start_pre and end_pre.
-
-    def start_pre(self, attrs): ①
-        self.verbatim += 1 ②
-        self.unknown_starttag("pre", attrs) ③
-    def end_pre(self): ④
-        self.unknown_endtag("pre") ⑤
-        self.verbatim -= 1 ⑥
-
-start_pre is called every time SGMLParser finds a <pre> tag in the HTML source. (In a minute, you'll see exactly how this happens.) The method takes a single parameter, attrs, which contains the attributes of the tag (if any). attrs is a list of key/value tuples, just like unknown_starttag takes.
-- In the
reset method, you initialize a data attribute that serves as a counter for <pre> tags. Every time you hit a <pre> tag, you increment the counter; every time you hit a </pre> tag, you'll decrement the counter. (You could just use this as a flag and set it to 1 and reset it to 0, but it's just as easy to do it this way, and this handles the odd (but possible) case of nested <pre> tags.) In a minute, you'll see how this counter is put to good use.
- - That's it, that's the only special processing you do for
<pre> tags. Now you pass the list of attributes along to unknown_starttag so it can do the default processing.
- end_pre is called every time SGMLParser finds a </pre> tag. Since end tags cannot contain attributes, the method takes no parameters.
-- First, you want to do the default processing, just like any other end tag.
-
- Second, you decrement your counter to signal that this
<pre> block has been closed.
-At this point, it's worth digging a little further into SGMLParser. I've claimed repeatedly (and you've taken it on faith so far) that SGMLParser looks for and calls specific methods for each tag, if they exist. For instance, you just saw the definition of start_pre and end_pre to handle <pre> and </pre>. But how does this happen? Well, it's not magic, it's just good Python coding.
- Example 8.18. SGMLParser
-    def finish_starttag(self, tag, attrs): ①
-        try:
-            method = getattr(self, 'start_' + tag) ②
-        except AttributeError: ③
-            try:
-                method = getattr(self, 'do_' + tag) ④
-            except AttributeError:
-                self.unknown_starttag(tag, attrs) ⑤
-                return -1
-            else:
-                self.handle_starttag(tag, method, attrs) ⑥
-                return 0
-        else:
-            self.stack.append(tag)
-            self.handle_starttag(tag, method, attrs)
-            return 1 ⑦
-    def handle_starttag(self, tag, method, attrs):
-        method(attrs) ⑧
-
-- At this point,
SGMLParser has already found a start tag and parsed the attribute list. The only thing left to do is figure out whether there is a
- specific handler method for this tag, or whether you should fall back on the default method (unknown_starttag).
- - The “magic” of
SGMLParser is nothing more than your old friend, getattr. What you may not have realized before is that getattr will find methods defined in descendants of an object as well as the object itself. Here the object is self, the current instance. So if tag is 'pre', this call to getattr will look for a start_pre method on the current instance, which is an instance of the Dialectizer class.
- getattr raises an AttributeError if the method it's looking for doesn't exist in the object (or any of its descendants), but that's okay, because you wrapped
- the call to getattr inside a try...except block and explicitly caught the AttributeError.
-- Since you didn't find a
start_xxx method, you'll also look for a do_xxx method before giving up. This alternate naming scheme is generally used for standalone tags, like <br>, which have no corresponding end tag. But you can use either naming scheme; as you can see, SGMLParser tries both for every tag. (You shouldn't define both a start_xxx and do_xxx handler method for the same tag, though; only the start_xxx method will get called.)
- - Another
AttributeError, which means that the call to getattr failed with do_xxx. Since you found neither a start_xxx nor a do_xxx method for this tag, you catch the exception and fall back on the default method, unknown_starttag.
- - Remember,
try...except blocks can have an else clause, which is executed if no exception is raised in the try block. Logically, that means that you did find a do_xxx method for this tag, so you're going to call it.
- - By the way, don't worry about these different return values; in theory they mean something, but they're never actually used.
- Don't worry about the
self.stack.append(tag) either; SGMLParser keeps track internally of whether your start tags are balanced by appropriate end tags, but it doesn't do anything with this
- information either. In theory, you could use this module to validate that your tags were fully balanced, but it's probably
- not worth it, and it's beyond the scope of this chapter. You have better things to worry about right now.
- start_xxx and do_xxx methods are not called directly; the tag, method, and attributes are passed to this function, handle_starttag, so that descendants can override it and change the way all start tags are dispatched. You don't need that level of control, so you just let this method do its thing, which is to call
- the method (start_xxx or do_xxx) with the list of attributes. Remember, method is a function, returned from getattr, and functions are objects. (I know you're getting tired of hearing it, and I promise I'll stop saying it as soon as I run
- out of ways to use it to my advantage.) Here, the function object is passed into this dispatch method as an argument, and
- this method turns around and calls the function. At this point, you don't need to know what the function is, what it's named,
- or where it's defined; the only thing you need to know about the function is that it is called with one argument, attrs.
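The dispatch technique described above does not depend on SGMLParser and can be reproduced in a few lines. The sketch below is a simplified stand-in, not the library's code: TagDispatcher, PreAware, and the log attribute are hypothetical names, and it uses getattr's optional default argument in place of the nested try...except blocks.

```python
class TagDispatcher:
    """Look up a handler method by name at runtime (hypothetical sketch)."""

    def __init__(self):
        self.log = []

    def dispatch(self, tag, attrs):
        # Try start_<tag> first, then do_<tag>; None means "not defined".
        method = getattr(self, "start_" + tag, None)
        if method is None:
            method = getattr(self, "do_" + tag, None)
        if method is None:
            self.unknown_starttag(tag, attrs)
        else:
            # method is a bound method object; call it like any function.
            method(attrs)

    def unknown_starttag(self, tag, attrs):
        self.log.append(("unknown", tag))


class PreAware(TagDispatcher):
    # Handlers defined in a subclass are found by getattr(self, ...).
    def start_pre(self, attrs):
        self.log.append(("start_pre", attrs))

    def do_br(self, attrs):
        self.log.append(("do_br", attrs))


dispatcher = PreAware()
dispatcher.dispatch("pre", [])
dispatcher.dispatch("br", [])
dispatcher.dispatch("p", [("class", "intro")])
print(dispatcher.log)
```

The base class never mentions pre or br; the subclass only has to define a method with the right name for it to be picked up.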
-Now back to our regularly scheduled program: Dialectizer. When you left, you were in the process of defining specific handler methods for <pre> and </pre> tags. There's only one thing left to do, and that is to process text blocks with the pre-defined substitutions. For that,
-you need to override the handle_data method.
- Example 8.19. Overriding the handle_data method
-    def handle_data(self, text): ①
-        self.pieces.append(self.verbatim and text or self.process(text)) ②
-
-handle_data is called with only one argument, the text to process.
-- In the ancestor
BaseHTMLProcessor, the handle_data method simply appended the text to the output buffer, self.pieces. Here the logic is only slightly more complicated. If you're in the middle of a <pre>...</pre> block, self.verbatim will be some value greater than 0, and you want to put the text in the output buffer unaltered. Otherwise, you will call a separate method to process the
- substitutions, then put the result of that into the output buffer. In Python, this is a one-liner, using the and-or trick.
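The and-or expression in handle_data can be compared side by side with a conditional expression, which Python has supported since version 2.5. This is a small sketch with hypothetical helper names; process here is a stand-in that just wraps the text in brackets so that its effect is visible.

```python
def process(text):
    # Stand-in for the real substitution step: mark the text as processed.
    return "[" + text + "]"


def and_or(verbatim, text):
    # The and-or trick as used in handle_data.
    return verbatim and text or process(text)


def conditional(verbatim, text):
    # The equivalent conditional expression.
    return text if verbatim else process(text)


print(and_or(1, "abc"), conditional(1, "abc"))
print(and_or(0, "abc"), conditional(0, "abc"))
# The two forms differ when text is an empty string, which is false.
print(repr(and_or(1, "")), repr(conditional(1, "")))
```

Inside a verbatim block the and-or form still falls through to process when the text is empty, because an empty string is false. In handle_data that only matters if process does something other than return an empty string for empty input.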
-You're close to completely understanding Dialectizer. The only missing link is the nature of the text substitutions themselves. If you know any Perl, you know that when complex text substitutions are required, the only real solution is regular expressions. The classes
-later in dialect.py define a series of regular expressions that operate on the text between the HTML tags. But you just had a whole chapter on regular expressions. You don't really want to slog through regular expressions again, do you? God knows I don't. I think you've learned enough
-for one chapter.
- 8.9. Putting it all together
-It's time to put everything you've learned so far to good use. I hope you were paying attention.
- Example 8.20. The translate function, part 1
-def translate(url, dialectName="chef"): ①
-    import urllib ②
-    sock = urllib.urlopen(url) ③
-    htmlSource = sock.read()
-    sock.close()
-
-
-- The translate function has an optional argument dialectName, which is a string that specifies the dialect you'll be using. You'll see how this is used in a minute.
-- Hey, wait a minute, there's an import statement in this function! That's perfectly legal in Python. You're used to seeing import statements at the top of a program, which means that the imported module is available anywhere in the program. But you can also import modules within a function, which means that the imported module is only available within the function. If you have a module that is only ever used in one function, this is an easy way to make your code more modular. (When you find that your weekend hack has turned into an 800-line work of art and decide to split it up into a dozen reusable modules, you'll appreciate this.)
-- Now you get the source of the given URL.
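If you are following along in Python 3, where urllib.urlopen no longer exists, the same step can be sketched with urllib.request.urlopen. The function name fetch_source is invented for this illustration, and the data: URL is only there so the sketch runs without a network connection; any URL the standard library can open would do.

```python
def fetch_source(url):
    # Function-local import: the name urllib is bound only inside this function.
    import urllib.request
    sock = urllib.request.urlopen(url)
    try:
        # In Python 3, read() returns bytes, not a string.
        return sock.read()
    finally:
        sock.close()

# A data: URL carries its own content, so nothing is downloaded here.
html_source = fetch_source("data:text/html,%3Cp%3EHello%3C/p%3E")
print(html_source)
```

Note that the result is a bytes object; you would need to decode it before handing it to a parser that expects text.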
-
Example 8.21. The translate function, part 2: curiouser and curiouser
-    parserName = "%sDialectizer" % dialectName.capitalize() ①
-    parserClass = globals()[parserName] ②
-    parser = parserClass() ③
-
-
-- capitalize is a string method you haven't seen before; it simply capitalizes the first letter of a string and forces everything else to lowercase. Combined with some string formatting, you've taken the name of a dialect and transformed it into the name of the corresponding Dialectizer class. If dialectName is the string 'chef', parserName will be the string 'ChefDialectizer'.
-- You have the name of a class as a string (parserName), and you have the global namespace as a dictionary (globals()). Combined, you can get a reference to the class which the string names. (Remember, classes are objects, and they can be assigned to variables just like any other object.) If parserName is the string 'ChefDialectizer', parserClass will be the class ChefDialectizer.
-- Finally, you have a class object (parserClass), and you want an instance of the class. Well, you already know how to do that: call the class like a function. The fact that the class is being stored in a local variable makes absolutely no difference; you just call the local variable like a function, and out pops an instance of the class. If parserClass is the class ChefDialectizer, parser will be an instance of the class ChefDialectizer.
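The three steps above can be tried in isolation with a small sketch. The two classes here are empty stubs standing in for the real ones in dialect.py, and make_parser and its namespace parameter are hypothetical names added so the lookup can also be pointed at a dictionary other than globals().

```python
class ChefDialectizer:
    # Empty stub; the real class is defined in dialect.py.
    pass

class FuddDialectizer:
    # Empty stub; the real class is defined in dialect.py.
    pass

def make_parser(dialect_name="chef", namespace=None):
    if namespace is None:
        # globals() returns this module's global names as a dictionary.
        namespace = globals()
    # 'chef' -> 'ChefDialectizer'; capitalize() also lowercases the rest.
    parser_name = "%sDialectizer" % dialect_name.capitalize()
    # Look the class up by name; raises KeyError if no such class exists.
    parser_class = namespace[parser_name]
    # Calling the class object creates an instance.
    return parser_class()
```

Calling make_parser("fudd") at module level would return a FuddDialectizer instance, and an unknown dialect name would raise a KeyError, which is the price of looking names up at runtime.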
-Why bother? After all, there are only 3 Dialectizer classes; why not just use a case statement? (Well, there's no case statement in Python, but why not just use a series of if statements?) One reason: extensibility. The translate function has absolutely no idea how many Dialectizer classes you've defined. Imagine if you defined a new FooDialectizer tomorrow; translate would work by passing 'foo' as the dialectName.
- Even better, imagine putting FooDialectizer in a separate module, and importing it with from module import. You've already seen that this includes it in globals(), so translate would still work without modification, even though FooDialectizer was in a separate file.
- Now imagine that the name of the dialect is coming from somewhere outside the program, maybe from a database or from a user-inputted
-value on a form. You can use any number of server-side Python scripting architectures to dynamically generate web pages; this function could take a URL and a dialect name (both strings) in the query string of a web page request, and output the “translated” web page.
- Finally, imagine a Dialectizer framework with a plug-in architecture. You could put each Dialectizer class in a separate file, leaving only the translate function in dialect.py. Assuming a consistent naming scheme, the translate function could dynamically import the appropriate class from the appropriate file, given nothing but the dialect name. (You haven't
-seen dynamic importing yet, but I promise to cover it in a later chapter.) To add a new dialect, you would simply add an
-appropriately-named file in the plug-ins directory (like foodialect.py which contains the FooDialectizer class). Calling the translate function with the dialect name 'foo' would find the module foodialect.py, import the class FooDialectizer, and away you go.
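One way the plug-in idea might look, purely as a sketch: the standard library's importlib.import_module takes a module name as a string and returns the module. The function name load_dialectizer is invented here, and the naming scheme (a module foodialect containing a class FooDialectizer) is the assumption described in the paragraph above, not something dialect.py actually implements.

```python
import importlib

def load_dialectizer(dialect_name):
    # Assumed naming scheme: 'foo' -> module 'foodialect', class 'FooDialectizer'.
    module_name = "%sdialect" % dialect_name.lower()
    class_name = "%sDialectizer" % dialect_name.capitalize()
    # Import the module by name; raises ImportError if it cannot be found.
    module = importlib.import_module(module_name)
    # Fetch the class from the module; raises AttributeError if it is missing.
    return getattr(module, class_name)
```

With a file foodialect.py somewhere on the import path, load_dialectizer("foo")() would give you a FooDialectizer instance without translate ever mentioning that class by name.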
- Example 8.22. The translate function, part 3
-    parser.feed(htmlSource) ①
-    parser.close() ②
-    return parser.output() ③
-
-
-- After all that imagining, this is going to seem pretty boring, but the feed function is what does the entire transformation. You had the entire HTML source in a single string, so you only had to call feed once. However, you can call feed as often as you want, and the parser will just keep parsing. So if you were worried about memory usage (or you knew you were going to be dealing with very large HTML pages), you could set this up in a loop, where you read a few bytes of HTML and fed it to the parser. The result would be the same.
-- Because feed maintains an internal buffer, you should always call the parser's close method when you're done (even if you fed it all at once, like you did). Otherwise you may find that your output is missing the last few bytes.
-- Remember, output is the function you defined on BaseHTMLProcessor that joins all the pieces of output you've buffered and returns them in a single string.
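To see the "feed it in a loop" idea at work, here is a small sketch. It uses html.parser.HTMLParser from the Python 3 standard library as a stand-in, because sgmllib is not available there; HTMLParser also has feed and close methods. EchoParser, parse_in_chunks, and parse_at_once are hypothetical names, and the tiny chunk size is chosen only to force tags to be split across calls.

```python
from html.parser import HTMLParser

class EchoParser(HTMLParser):
    # Hypothetical minimal processor: rebuilds its input piece by piece.
    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored in this sketch.
        self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        self.pieces.append("</%s>" % tag)

    def handle_data(self, text):
        self.pieces.append(text)

    def output(self):
        # Join everything buffered so far into a single string.
        return "".join(self.pieces)

def parse_in_chunks(html_source, chunk_size=7):
    parser = EchoParser()
    # Feed a few characters at a time; incomplete tags wait in the buffer.
    for start in range(0, len(html_source), chunk_size):
        parser.feed(html_source[start:start + chunk_size])
    parser.close()  # flush whatever is still buffered
    return parser.output()

def parse_at_once(html_source):
    parser = EchoParser()
    parser.feed(html_source)
    parser.close()
    return parser.output()
```

Both functions should return the same string for the same input, which is the point of the first callout: how you slice the input does not change the result, as long as you call close at the end.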
-And just like that, you've “translated” a web page, given nothing but a URL and the name of a dialect.
-
- Further reading
-
-- You thought I was kidding about the server-side scripting idea. So did I, until I found this web-based dialectizer. Unfortunately, source code does not appear to be available.
-
- 8.10. Summary
- Python provides you with a powerful tool, sgmllib.py, to manipulate HTML by turning its structure into an object model. You can use this tool in many different ways.
-
-
-- parsing the HTML looking for something specific
-- aggregating the results, like the URL lister
-- altering the structure along the way, like the attribute quoter
-- transforming the HTML into something else by manipulating the text while leaving the tags alone, like the Dialectizer
-
- Along with these examples, you should be comfortable doing all of the following things: