mirror of https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
asterisms for everyone!
@@ -91,6 +91,8 @@ mark{display:inline}
</entry>
</feed></code></pre>
<p class=a>⁂
<h2 id=xml-intro>A 5-Minute Crash Course in XML</h2>
<p>If you already know about <abbr>XML</abbr>, you can skip this section.
@@ -173,6 +175,8 @@ mark{display:inline}
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
<p class=a>⁂
<h2 id=xml-structure>The Structure Of An Atom Feed</h2>
<p>Think of a weblog, or in fact any website with frequently updated content, like <a href=http://www.cnn.com/>CNN.com</a>. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment <i class=baa>&amp;</i> Video News”), a last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published a correction or fixed a typo), and a unique URL.
@@ -242,6 +246,8 @@ mark{display:inline}
<li>Finally, the end tag for the <code>entry</code> element, signaling the end of the metadata for this article.
</ol>
<p class=a>⁂
<h2 id=xml-parse>Parsing XML</h2>
<p>Python can parse <abbr>XML</abbr> documents in several ways. It has traditional <a href=http://en.wikipedia.org/wiki/XML#DOM><abbr>DOM</abbr></a> and <a href=http://en.wikipedia.org/wiki/Simple_API_for_XML><abbr>SAX</abbr></a> parsers, but I will focus on a different library called ElementTree.
@@ -320,6 +326,8 @@ mark{display:inline}
<li>The <code>updated</code> element has no attributes, so its <code>.attrib</code> is just an empty dictionary.
</ol>
<p class=a>⁂
<h2 id=xml-find>Searching For Nodes Within An XML Document</h2>
<p>So far, we’ve worked with this <abbr>XML</abbr> document “from the top down,” starting with the root element, getting its child elements, and so on throughout the document. But many uses of <abbr>XML</abbr> require you to find specific elements. Etree can do that, too.
@@ -433,6 +441,8 @@ StopIteration</samp></pre>
<p>Overall, ElementTree’s <code>findall()</code> method is a very powerful feature, but the query language can be a bit surprising. It is officially described as “<a href=http://effbot.org/zone/element-xpath.htm>limited support for XPath expressions</a>.” <a href=http://www.w3.org/TR/xpath>XPath</a> is a W3C standard for querying <abbr>XML</abbr> documents. ElementTree’s query language is similar enough to XPath to do basic searching, but dissimilar enough that it may annoy you if you already know XPath. Now let’s look at a third-party <abbr>XML</abbr> library that extends the ElementTree <abbr>API</abbr> with full XPath support.
<p class=a>⁂
<h2 id=xml-lxml>Going Further With lxml</h2>
<p><a href=http://codespeak.net/lxml/>lxml</a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you’ll need to <a href=http://codespeak.net/lxml/installation.html>install lxml manually</a>.
@@ -480,6 +490,8 @@ except ImportError:
<li>XPath expressions don’t always return a list of elements. Technically, the <abbr>DOM</abbr> of a parsed <abbr>XML</abbr> document doesn’t contain elements; it contains <i>nodes</i>. Depending on their type, nodes can be elements, attributes, or even text content. The result of an XPath query is a list of nodes. This query returns a list of text nodes: the text content (<code>text()</code>) of the <code>title</code> element (<code>atom:title</code>) that is a child of the current element (<code>./</code>).
</ol>
<p class=a>⁂
<h2 id=xml-generate>Generating XML</h2>
<p>Python’s support for <abbr>XML</abbr> is not limited to parsing existing documents. You can also create <abbr>XML</abbr> documents from scratch.
@@ -549,6 +561,8 @@ except ImportError:
<li>You can also apply “pretty printing” to the serialization, which inserts line breaks after end tags, and after start tags of elements that contain child elements but no text content. In technical terms, lxml adds “insignificant whitespace” to make the output more readable.
</ol>
<p class=a>⁂
<h2 id=xml-custom-parser>Customizing Your XML Parser</h2>
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr> — your browser doesn’t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)
@@ -610,6 +624,8 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
<p>It is important to reiterate that there is <strong>no guarantee of interoperability</strong> with “recovering” <abbr>XML</abbr> parsers. A different parser might decide that it recognized the <code>&amp;hellip;</code> entity from <abbr>HTML</abbr>, and replace it with <code>&amp;amp;hellip;</code> instead. Is that “better”? Maybe. Is it “more correct”? No, they are both equally incorrect. The correct behavior (according to the <abbr>XML</abbr> specification) is to halt and catch fire. If you’ve decided not to do that, you’re on your own.
<p class=a>⁂
<h2 id=furtherreading>Further Reading</h2>
<ul>