added note about list concatenation and memory usage. unrelatedly, added nonbreaking spaces around long dashes.
@@ -23,7 +23,7 @@ mark{display:inline}
<h2 id=divingin>Diving In</h2>
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But <abbr>XML</abbr> isn’t about code; it’s about data. One common use of <abbr>XML</abbr> is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>” like <a href=http://www.google.com/reader/>Google Reader</a>.
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
<pre><code class=pp>&lt;?xml version='1.0' encoding='utf-8'?>
@@ -320,9 +320,9 @@ mark{display:inline}
<samp class=pp>{}</samp></pre>
<ol>
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code>&lt;feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'&gt;</code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
<li>The fifth child — <code>[4]</code> in a <code>0</code>-based list — is the <code>link</code> element.
<li>The <code>link</code> element has three attributes: <code>href</code>, <code>type</code>, and <code>rel</code>.
<li>The fourth child — <code>[3]</code> in a <code>0</code>-based list — is the <code>updated</code> element.
<li>The <code>updated</code> element has no attributes, so its <code>.attrib</code> is just an empty dictionary.
</ol>
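<p>The attribute lookups described in the list above can be sketched with the standard-library <code>xml.etree.ElementTree</code> module. This is only an illustration under stated assumptions: <code>SAMPLE_FEED</code> is a hypothetical, trimmed-down stand-in for <code>feed.xml</code> (not the book’s actual file), and the names <code>ATOM_NS</code> and <code>XML_NS</code> are introduced here for readability.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, trimmed-down stand-in for feed.xml (not the book's actual file).
# The child order mirrors the one described above: updated is [3], link is [4].
SAMPLE_FEED = """\
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>example feed</title>
  <subtitle>a sample</subtitle>
  <id>tag:example.org,2009:feed</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# The xmlns declaration is not an attribute; xml:lang is, and its key is
# expanded to the built-in XML namespace.
root_attributes = root.attrib

link = root[4]      # fifth child in a 0-based list
updated = root[3]   # fourth child in a 0-based list

print(root_attributes)
print(link.tag, link.attrib)
print(updated.tag, updated.attrib)
```

<p>Note that the <code>xmlns</code> declaration itself never shows up in <code>attrib</code>; it is folded into each element’s tag name instead.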
@@ -348,7 +348,7 @@ mark{display:inline}
<samp class=pp>[]</samp></pre>
<ol>
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
<li>Each element — including the root element, but also child elements — has a <code>findall()</code> method. It finds all matching elements among the element’s children. But why aren’t there any results? Although it may not be obvious, this particular query only searches the element’s children. Since the root <code>feed</code> element has no child named <code>feed</code>, this query returns an empty list.
<li>This result may also surprise you. <a href=#divingin>There is an <code>author</code> element</a> in this document; in fact, there are three (one in each <code>entry</code>). But those <code>author</code> elements are not <em>direct children</em> of the root element; they are “grandchildren” (literally, a child element of a child element). If you want to look for <code>author</code> elements at any nesting level, you can do that, but the query format is slightly different.
</ol>
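<p>A minimal sketch of the “direct children only” behavior described above, again using the standard-library <code>xml.etree.ElementTree</code>. The feed text and the helper <code>q()</code> are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def q(local_name):
    """Hypothetical helper: build a namespace-qualified Atom tag name."""
    return "{%s}%s" % (ATOM_NS, local_name)


# Hypothetical sample: two entries, each with its own author.
SAMPLE_FEED = """\
<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>example feed</title>
  <entry><author><name>Ann</name></author><title>one</title></entry>
  <entry><author><name>Ann</name></author><title>two</title></entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# The root feed element has no child named feed.
feeds = root.findall(q("feed"))

# entry elements are direct children of the root, so they are found.
entries = root.findall(q("entry"))

# author elements are grandchildren of the root, so this finds nothing...
authors_under_root = root.findall(q("author"))

# ...but each entry sees its own author as a direct child.
authors_under_entry = entries[0].findall(q("author"))

print(len(feeds), len(entries), len(authors_under_root), len(authors_under_entry))
```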
@@ -391,7 +391,7 @@ mark{display:inline}
'type': 'text/html',
'rel': 'alternate'}</samp></pre>
<ol>
<li>This query — <code>//{http://www.w3.org/2005/Atom}link</code> — is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct children; I want <em>any</em> elements, regardless of nesting level.” So the result is a list of four <code>link</code> elements, not just one.
<li>The first result <em>is</em> a direct child of the root element. As you can see from its attributes, this is the feed-level alternate link that points to the <abbr>HTML</abbr> version of the website that the feed describes.
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
</ol>
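<p>The “any nesting level” search described above can be sketched as follows. One assumption to flag: recent versions of the standard-library ElementTree expect the relative form <code>.//</code> when <code>findall()</code> is called on an element, so this sketch uses that spelling rather than a bare <code>//</code>. The feed text, URLs, and the helper <code>q()</code> are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def q(local_name):
    """Hypothetical helper: build a namespace-qualified Atom tag name."""
    return "{%s}%s" % (ATOM_NS, local_name)


# Hypothetical sample: one feed-level link plus one link per entry.
SAMPLE_FEED = """\
<feed xmlns='http://www.w3.org/2005/Atom'>
  <link rel='alternate' type='text/html' href='http://example.org/'/>
  <entry><link rel='alternate' type='text/html' href='http://example.org/1'/></entry>
  <entry><link rel='alternate' type='text/html' href='http://example.org/2'/></entry>
  <entry><link rel='alternate' type='text/html' href='http://example.org/3'/></entry>
</feed>"""

root = ET.fromstring(SAMPLE_FEED)

# Direct children only: just the feed-level link.
direct_links = root.findall(q("link"))

# Descendants at any depth: the feed-level link and the three entry-level links,
# returned in document order.
all_links = root.findall(".//" + q("link"))

for link in all_links:
    print(link.attrib["href"])
```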
@@ -509,7 +509,7 @@ except ImportError:
<li>At any time, you can serialize any element (and its children) with the ElementTree <code>tostring()</code> function.
</ol>
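<p>As a rough illustration of serializing a namespaced element, here is a sketch that uses the standard-library <code>xml.etree.ElementTree</code> (the chapter’s own example may use a different library). The element content and the helper <code>q()</code> are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def q(local_name):
    """Hypothetical helper: build a namespace-qualified Atom tag name."""
    return "{%s}%s" % (ATOM_NS, local_name)


# Build a tiny tree in the Atom namespace.
feed = ET.Element(q("feed"), attrib={"{%s}lang" % XML_NS: "en"})
title = ET.SubElement(feed, q("title"))
title.text = "example feed"

# encoding="unicode" asks for a str instead of bytes.
serialized = ET.tostring(feed, encoding="unicode")
print(serialized)

# Parsing the serialized text gives back the same qualified tag names.
reparsed = ET.fromstring(serialized)
print(reparsed.tag, reparsed[0].tag)
```

<p>The serializer invents a prefix (such as <code>ns0</code>) for the Atom namespace rather than declaring a default namespace, yet the reparsed tags are unchanged.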
<p>Was that serialization surprising to you? The way ElementTree serializes namespaced <abbr>XML</abbr> elements is technically accurate but not optimal. The sample <abbr>XML</abbr> document at the beginning of this chapter defined a <i>default namespace</i> (<code>xmlns='http://www.w3.org/2005/Atom'</code>). Defining a default namespace is useful for documents — like Atom feeds — where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<code>&lt;feed&gt;</code>, <code>&lt;link&gt;</code>, <code>&lt;entry&gt;</code>). There is no need to use any prefixes unless you want to declare elements from another namespace.
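<p>To make the role of the default namespace concrete, here is a small sketch that parses the same hypothetical document written two ways, once with a default namespace and once with a prefix, and compares the tag names the standard-library parser reports. Both documents and the helper <code>tags()</code> are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using a default namespace and bare local names.
DEFAULT_FORM = (
    "<feed xmlns='http://www.w3.org/2005/Atom'>"
    "<title>example</title>"
    "</feed>"
)

# The same hypothetical document using an explicit prefix on every element.
PREFIXED_FORM = (
    "<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom'>"
    "<ns0:title>example</ns0:title>"
    "</ns0:feed>"
)


def tags(xml_text):
    """Hypothetical helper: list every element's qualified tag in document order."""
    return [element.tag for element in ET.fromstring(xml_text).iter()]


default_tags = tags(DEFAULT_FORM)
prefixed_tags = tags(PREFIXED_FORM)

print(default_tags)
print(default_tags == prefixed_tags)
```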
<p>An <abbr>XML</abbr> parser won’t “see” any difference between an <abbr>XML</abbr> document with a default namespace and an <abbr>XML</abbr> document with a prefixed namespace. The resulting <abbr>DOM</abbr> of this serialization:
@@ -566,7 +566,7 @@ except ImportError:
<h2 id=xml-custom-parser>Parsing Broken XML</h2>
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr> — your browser doesn’t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)
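<p>Draconian error handling is easy to observe with the standard-library parser. The following is a minimal sketch, assuming two hypothetical broken documents (a mismatched end tag and an undefined entity); the helper <code>first_error()</code> is introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents, each with one wellformedness error.
BROKEN_DOCUMENTS = {
    "mismatched tags": "<feed><title>oops</feed>",
    "undefined entity": "<feed><title>dive in&hellip;</title></feed>",
}

WELL_FORMED = "<feed><title>dive in</title></feed>"


def first_error(xml_text):
    """Hypothetical helper: return the ParseError raised for xml_text, or None."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return error
    return None


# The parser stops at the first error; no partial tree is returned.
errors = {name: first_error(text) for name, text in BROKEN_DOCUMENTS.items()}

for name, error in errors.items():
    print(name, "->", error)

print("well-formed ->", first_error(WELL_FORMED))
```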
<p>Some people (myself included) believe that it was a mistake for the inventors of <abbr>XML</abbr> to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for <abbr>XML</abbr> documents (like Atom feeds) that are published on the web and served over <abbr>HTTP</abbr>. Despite the maturity of <abbr>XML</abbr>, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.