mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
more xml chapter
@@ -42,7 +42,6 @@
|
||||
<entry>
|
||||
<author>
|
||||
<name>Mark</name>
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<title>A gentle introduction to video encoding, part 1: container formats</title>
|
||||
<link rel="alternate" type="text/html"
|
||||
|
||||
@@ -18,9 +18,9 @@ mark{display:inline}
|
||||
</blockquote>
|
||||
<p id=toc>
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But XML isn’t about code; it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>” like <a href=http://www.google.com/reader/>Google Reader</a>.
|
||||
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But <abbr>XML</abbr> isn’t about code; it’s about data. One common use of <abbr>XML</abbr> is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>” like <a href=http://www.google.com/reader/>Google Reader</a>.
|
||||
|
||||
<p>Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
||||
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre><code><?xml version="1.0" encoding="utf-8"?>
|
||||
@@ -68,7 +68,6 @@ mark{display:inline}
|
||||
<entry>
|
||||
<author>
|
||||
<name>Mark</name>
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<title>A gentle introduction to video encoding, part 1: container formats</title>
|
||||
<link rel="alternate" type="text/html"
|
||||
@@ -91,9 +90,9 @@ mark{display:inline}
|
||||
|
||||
<h2 id=xml-intro>A 5-Minute Crash Course in XML</h2>
|
||||
|
||||
<p>If you already know about XML, you can skip this section.
|
||||
<p>If you already know about <abbr>XML</abbr>, you can skip this section.
|
||||
|
||||
<p>XML is a generalized way of describing hierarchical structured data. An XML <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) XML document:
|
||||
<p><abbr>XML</abbr> is a generalized way of describing hierarchical structured data. An <abbr>XML</abbr> <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) <abbr>XML</abbr> document:
|
||||
|
||||
<pre class=nd><code><a><foo> <span>①</span></a>
|
||||
<a></foo> <span>②</span></a></code></pre>
|
||||
@@ -109,7 +108,7 @@ mark{display:inline}
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
<p>The first element in every XML document is called the <i>root element</i>. An XML document can only have one root element. The following is <strong>not an XML document</strong>, because it has two root elements:
|
||||
<p>The first element in every <abbr>XML</abbr> document is called the <i>root element</i>. An <abbr>XML</abbr> document can only have one root element. The following is <strong>not an <abbr>XML</abbr> document</strong>, because it has two root elements:
|
||||
|
||||
<pre class=nd><code><foo></foo>
|
||||
<bar></bar></code></pre>
|
||||
@@ -138,11 +137,11 @@ mark{display:inline}
|
||||
|
||||
<pre class=nd><code><foo></foo></code></pre>
|
||||
|
||||
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character in the start tag, you can skip the end tag altogether. The XML document in the previous example could be written like this instead:
|
||||
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character in the start tag, you can skip the end tag altogether. The <abbr>XML</abbr> document in the previous example could be written like this instead:
|
||||
|
||||
<pre class=nd><code><foo<mark>/</mark>></code></pre>
|
||||
|
||||
<p>Just as Python functions can be declared in different <i>modules</i>, XML elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
||||
<p>Just as Python functions can be declared in different <i>modules</i>, <abbr>XML</abbr> elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
||||
|
||||
<pre class=nd><code><a><feed <mark>xmlns="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
||||
<a> <title>dive into mark</title> <span>②</span></a>
|
||||
@@ -163,13 +162,13 @@ mark{display:inline}
|
||||
<li>The <code>title</code> element is also in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
||||
</ol>
|
||||
|
||||
<p>As far as an XML parser is concerned, the previous two XML documents are <em>identical</em>. Namespace + element name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (<code>atom:</code>) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the XML documents are the same.
|
||||
<p>As far as an <abbr>XML</abbr> parser is concerned, the previous two <abbr>XML</abbr> documents are <em>identical</em>. Namespace + element name = <abbr>XML</abbr> identity. Prefixes only exist to refer to namespaces, so the actual prefix name (<code>atom:</code>) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the <abbr>XML</abbr> documents are the same.
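<p>The identity rule can be checked with a short sketch. The two strings below are hypothetical miniature documents (not the chapter’s <code>feed.xml</code>), and <code>describe()</code> is a helper invented for this illustration; it simply collects each element’s tag, attributes, text, and children so the two parsed trees can be compared.

```python
import xml.etree.ElementTree as etree

# The same document written twice: once with a default namespace,
# once with an explicit (and arbitrary) "atom" prefix.
default_ns = ('<feed xmlns="http://www.w3.org/2005/Atom">'
              '<title>dive into mark</title></feed>')
prefixed = ('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'
            '<atom:title>dive into mark</atom:title></atom:feed>')


def describe(element):
    """Hypothetical helper: summarize an element and its children recursively."""
    return (element.tag, dict(element.attrib), element.text,
            [describe(child) for child in element])


a = describe(etree.fromstring(default_ns))
b = describe(etree.fromstring(prefixed))

# The prefix disappears after parsing; only namespace + local name remain.
same = (a == b)
print(a[0])   # the root tag in {namespace}localname form
print(same)
```

<p>Because the parser resolves prefixes to namespaces, neither tree keeps any trace of whether a prefix was used in the markup.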
|
||||
|
||||
<p>Finally, XML documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the XML specification</a> details how to resolve this Catch-22.)
|
||||
<p>Finally, <abbr>XML</abbr> documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the <abbr>XML</abbr> specification</a> details how to resolve this Catch-22.)
|
||||
|
||||
<pre class=nd><code><?xml version="1.0" <mark>encoding="utf-8"</mark>?></code></pre>
|
||||
|
||||
<p>And now you know just enough XML to be dangerous!
|
||||
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
|
||||
|
||||
<h2 id=xml-structure>The Structure Of An Atom Feed</h2>
|
||||
|
||||
@@ -199,13 +198,13 @@ mark{display:inline}
|
||||
<li>The subtitle of this feed is <code>currently between addictions</code>.
|
||||
<li>Every feed needs a globally unique identifier. See <a href=http://www.ietf.org/rfc/rfc4151.txt>RFC 4151</a> for how to create one.
|
||||
<li>This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified date of the most recent article.
|
||||
<li>Now things start to get interesting. This <code>link</code> element has no text content, but it has three attributes: <code>rel</code>, <code>type</code>, and <code>href</code>. The <code>rel</code> value tells you what kind of link this is; <code>rel="alternate"</code> means that this is a link to an alternate representation of this feed. The <code>type="text/html"</code> attribute means that this is a link to an HTML page. And the link target is given in the <code>href</code> attribute.
|
||||
<li>Now things start to get interesting. This <code>link</code> element has no text content, but it has three attributes: <code>rel</code>, <code>type</code>, and <code>href</code>. The <code>rel</code> value tells you what kind of link this is; <code>rel="alternate"</code> means that this is a link to an alternate representation of this feed. The <code>type="text/html"</code> attribute means that this is a link to an <abbr>HTML</abbr> page. And the link target is given in the <code>href</code> attribute.
|
||||
</ol>
|
||||
|
||||
<p>Now we know that this is a feed for a site named “dive into mark”, which is available at <a href=http://diveintomark.org/><code>http://diveintomark.org/</code></a> and was last updated on March 27, 2009.
|
||||
|
||||
<blockquote class=note>
|
||||
<p><span>☞</span>Although the order of elements can be relevant in some XML documents, it is not relevant in an Atom feed.
|
||||
<p><span>☞</span>Although the order of elements can be relevant in some <abbr>XML</abbr> documents, it is not relevant in an Atom feed.
|
||||
</blockquote>
|
||||
|
||||
<p>After the feed-level metadata is the list of the most recent articles. An article looks like this:
|
||||
@@ -232,17 +231,17 @@ mark{display:inline}
|
||||
<ol>
|
||||
<li>The <code>author</code> element tells who wrote this article: some guy named Mark, whom you can find loafing at <code>http://diveintomark.org/</code>. (This is the same as the alternate link in the feed metadata, but it doesn’t have to be. Many weblogs have multiple authors, each with their own personal website.)
|
||||
<li>The <code>title</code> element gives the title of the article, “Dive into history, 2009 edition”.
|
||||
<li>As with the feed-level alternate link, this <code>link</code> element gives the address of the HTML version of this article.
|
||||
<li>As with the feed-level alternate link, this <code>link</code> element gives the address of the <abbr>HTML</abbr> version of this article.
|
||||
<li>Entries, like feeds, need a unique identifier.
|
||||
<li>Entries have two dates: a first-published date (<code>published</code>) and a last-modified date (<code>updated</code>).
|
||||
<li>Entries can have an arbitrary number of categories. This article is filed under <code>diveintopython</code>, <code>docbook</code>, and <code>html</code>.
|
||||
<li>The <code>summary</code> element gives a brief summary of the article. (There is also a <code>content</code> element, not shown here, if you want to include the complete article text in your feed.) This <code>summary</code> element has the Atom-specific <code>type="html"</code> attribute, which specifies that this summary is a snippet of HTML, not plain text. This is important, since it has HTML-specific entities in it (<code>&mdash;</code> and <code>&hellip;</code>) which should be rendered as “—” and “…” rather than displayed directly.
|
||||
<li>The <code>summary</code> element gives a brief summary of the article. (There is also a <code>content</code> element, not shown here, if you want to include the complete article text in your feed.) This <code>summary</code> element has the Atom-specific <code>type="html"</code> attribute, which specifies that this summary is a snippet of <abbr>HTML</abbr>, not plain text. This is important, since it has <abbr>HTML</abbr>-specific entities in it (<code>&mdash;</code> and <code>&hellip;</code>) which should be rendered as “—” and “…” rather than displayed directly.
|
||||
<li>Finally, the end tag for the <code>entry</code> element, signaling the end of the metadata for this article.
|
||||
</ol>
|
||||
|
||||
<h2 id=xml-parse>Parsing XML</h2>
|
||||
|
||||
<p>Python can parse XML documents in several ways. It has traditional <a href=http://en.wikipedia.org/wiki/XML#DOM>DOM</a> and <a href=http://en.wikipedia.org/wiki/Simple_API_for_XML>SAX</a> parsers, but I will focus on a different library called Etree.
|
||||
<p>Python can parse <abbr>XML</abbr> documents in several ways. It has traditional <a href=http://en.wikipedia.org/wiki/XML#DOM>DOM</a> and <a href=http://en.wikipedia.org/wiki/Simple_API_for_XML>SAX</a> parsers, but I will focus on a different library called Etree.
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre class=screen>
|
||||
@@ -253,13 +252,13 @@ mark{display:inline}
|
||||
<samp><Element {http://www.w3.org/2005/Atom}feed at cd1eb0></samp></pre>
|
||||
<ol>
|
||||
<li>The Etree library is part of the Python standard library, in <code>xml.etree.ElementTree</code>.
|
||||
<li>The primary entry point for the Etree library is the <code>parse()</code> function, which can take a filename or a file-like object [FIXME xref]. This function parses the entire document at once. If memory is tight, there are ways to parse an XML document incrementally instead.
|
||||
<li>The primary entry point for the Etree library is the <code>parse()</code> function, which can take a filename or a file-like object [FIXME xref]. This function parses the entire document at once. If memory is tight, there are ways to parse an <abbr>XML</abbr> document incrementally instead.
|
||||
<li>The <code>parse()</code> function returns an object which represents the entire document. This is <em>not</em> the root element. To get a reference to the root element, call the <code>getroot()</code> method.
|
||||
<li>As expected, the root element is the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace. The string representation of this object reinforces an important point: an XML element is a combination of its namespace and its tag name (also called the <i>local name</i>). Every element in this document is in the Atom namespace, so the root element is represented as <code>{http://www.w3.org/2005/Atom}feed</code>.
|
||||
<li>As expected, the root element is the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace. The string representation of this object reinforces an important point: an <abbr>XML</abbr> element is a combination of its namespace and its tag name (also called the <i>local name</i>). Every element in this document is in the Atom namespace, so the root element is represented as <code>{http://www.w3.org/2005/Atom}feed</code>.
|
||||
</ol>
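<p>The points above can be tried without the sample file. This is a minimal sketch that assumes a tiny stand-in document (the <code>SAMPLE</code> bytes are hypothetical, not the real <code>feed.xml</code>) and passes it to <code>parse()</code> as a file-like object.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical stand-in for feed.xml, small enough to read at a glance.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
</feed>"""

tree = etree.parse(io.BytesIO(SAMPLE))  # parse() accepts a file-like object
root = tree.getroot()                   # the document object is not the root

# Split the "{namespace}localname" tag into its two parts.
namespace, _, local_name = root.tag[1:].partition("}")
print(namespace)
print(local_name)
```

<p>Splitting the tag by hand like this is only for illustration; in most code you would compare against the full <code>{<var>namespace</var>}<var>localname</var></code> string directly.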
|
||||
|
||||
<blockquote class=note>
|
||||
<p><span>☞</span>Etree represents XML elements as <code>{<var>namespace</var>}<var>localname</var></code>. You’ll see and use this format in multiple places in the Etree library.
|
||||
<p><span>☞</span>Etree represents <abbr>XML</abbr> elements as <code>{<var>namespace</var>}<var>localname</var></code>. You’ll see and use this format in multiple places in the Etree library.
|
||||
</blockquote>
|
||||
|
||||
<h3 id=xml-elements>Elements Are Lists</h3>
|
||||
@@ -294,7 +293,7 @@ mark{display:inline}
|
||||
|
||||
<h3 id=xml-attributes>Attributes Are Dictionaries</h3>
|
||||
|
||||
<p>XML isn’t just a collection of elements; each element can also have its own set of attributes. Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.
|
||||
<p><abbr>XML</abbr> isn’t just a collection of elements; each element can also have its own set of attributes. Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
@@ -311,7 +310,7 @@ mark{display:inline}
|
||||
<a><samp class=p>>>> </samp><kbd>root[3].attrib</kbd> <span>⑤</span></a>
|
||||
<samp>{}</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"></code>. The <code>xml:</code> prefix refers to a built-in namespace that every XML document can use without declaring it.
|
||||
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
|
||||
<li>The fifth child — <code>[4]</code> in a <code>0</code>-based list — is the <code>link</code> element.
|
||||
<li>The <code>link</code> element has three attributes: <code>href</code>, <code>type</code>, and <code>rel</code>.
|
||||
<li>The fourth child — <code>[3]</code> in a <code>0</code>-based list — is the <code>updated</code> element.
|
||||
@@ -320,37 +319,56 @@ mark{display:inline}
|
||||
|
||||
<h2 id=xml-find>Searching For Nodes Within An XML Document</h2>
|
||||
|
||||
<p>FIXME
|
||||
<p>So far, we’ve worked with this <abbr>XML</abbr> document “from the top down,” starting with the root element, getting its child elements, and so on throughout the document. But many uses of <abbr>XML</abbr> require you to find specific elements. Etree can do that, too.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}entry")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp></pre>
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>root.tag</kbd>
|
||||
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}feed")</kbd> <span>②</span></a>
|
||||
<samp>[]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>③</span></a>
|
||||
<samp>[]</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
|
||||
<li>Each element — including the root element, but also child elements — has a <code>findall()</code> method. It finds all matching elements among the element’s children.
|
||||
<li>What happened here? Although it may not be obvious, this particular <code>findall()</code> query only searches the element’s children. Since the root <code>feed</code> element has no child named <code>feed</code>, this query returns an empty list.
|
||||
<li>This result may also surprise you. <a href=#divingin>There is an <code>author</code> element</a> in this document; in fact, there are three (one in each <code>entry</code>). But those <code>author</code> elements are not <em>direct children</em> of the root element; they are “grandchildren” (literally, a child element of a child element). If you want to look for <code>author</code> elements at any nesting level, you can do that, but the query format is slightly different.
|
||||
</ol>
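<p>The “children only” behavior is easy to reproduce on a smaller document. The sketch below assumes a hypothetical two-entry feed rather than the chapter’s <code>feed.xml</code>, so the counts differ from the session above.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical miniature feed: two entries, each with one author.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <entry><author><name>Mark</name></author></entry>
  <entry><author><name>Mark</name></author></entry>
</feed>"""

root = etree.fromstring(SAMPLE)

entries = root.findall(NS + "entry")    # direct children of the root: found
feeds = root.findall(NS + "feed")       # the root is not its own child
authors = root.findall(NS + "author")   # grandchildren: not found from here

# Asking each entry for its own children does find the authors.
nested = [author for entry in entries for author in entry.findall(NS + "author")]

print(len(entries), len(feeds), len(authors), len(nested))
```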
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>feed_links = tree.findall("{http://www.w3.org/2005/Atom}link")</kbd>
|
||||
<samp class=p>>>> </samp><kbd>feed_links</kbd>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at e181b0>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>feed_links[0].attrib</kbd>
|
||||
<samp>{'href': 'http://diveintomark.org/',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp></pre>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>②</span></a>
|
||||
<samp>[]</samp>
|
||||
</pre>
|
||||
<ol>
|
||||
<li>For convenience, the <code>tree</code> object (returned from the <code>etree.parse()</code> function) has several methods that mirror the methods on the root element. The results are the same as if you had called the <code>tree.getroot().findall()</code> method.
|
||||
<li>Perhaps surprisingly, this query does not find the <code>author</code> elements in this document. Why not? Because this is just a shortcut for <code>tree.getroot().findall("{http://www.w3.org/2005/Atom}author")</code>, which means “find all the <code>author</code> elements that are children of the root element.” The <code>author</code> elements are not children of the root element; they’re children of the <code>entry</code> elements. Thus the query doesn’t return any matches.
|
||||
</ol>
|
||||
|
||||
<p>There <em>is</em> a way to search for <em>descendant</em> elements, <i>i.e.</i> children, grandchildren, and any element at any nesting level.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>all_links = tree.findall("//{http://www.w3.org/2005/Atom}link")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall("//{http://www.w3.org/2005/Atom}link")</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>all_links</kbd>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>all_links[0].attrib</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[0].attrib</kbd> <span>②</span></a>
|
||||
<samp>{'href': 'http://diveintomark.org/',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp>
|
||||
<samp class=p>>>> </samp><kbd>all_links[1].attrib</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[1].attrib</kbd> <span>③</span></a>
|
||||
<samp>{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp>
|
||||
@@ -362,6 +380,34 @@ mark{display:inline}
|
||||
<samp>{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp></pre>
|
||||
<ol>
|
||||
<li>This query — <code>//{http://www.w3.org/2005/Atom}link</code> — is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct children; I want <em>any</em> elements, regardless of nesting level.” So the result is a list of four <code>link</code> elements, not just one.
|
||||
<li>The first result <em>is</em> a direct child of the root element. As you can see from its attributes, this is the feed-level alternate link that points to the <abbr>HTML</abbr> version of the website that the feed describes.
|
||||
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
|
||||
</ol>
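<p>Here is a hedged sketch of the same descendant search on a hypothetical miniature feed. It writes the query as <code>.//</code> (relative to the element being searched), which is the form current versions of ElementTree expect when the search starts from an element; the <code>example.org</code> addresses are placeholders.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: one feed-level link plus one link per entry.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <entry>
    <link rel="alternate" type="text/html" href="http://example.org/first"/>
  </entry>
  <entry>
    <link rel="alternate" type="text/html" href="http://example.org/second"/>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

child_links = root.findall(NS + "link")        # direct children only
all_links = root.findall(".//" + NS + "link")  # any nesting level
hrefs = [link.get("href") for link in all_links]

# Element.iter() walks the same descendants in document order.
iterated = list(root.iter(NS + "link"))

print(len(child_links), len(all_links))
print(hrefs)
```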
|
||||
|
||||
<p>The <code>findall()</code> method has a few other tricks up its sleeve.
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>②</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
||||
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
||||
<ol>
|
||||
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means “elements anywhere (not just as children of the root element).” <code>{http://www.w3.org/2005/Atom}</code> means “only elements in the Atom namespace.” <code>*</code> means “elements with any local name.” And <code>[@href]</code> means “has an <code>href</code> attribute.”
|
||||
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
|
||||
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
|
||||
</ol>
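<p>As an alternative to string formatting, <code>findall()</code> also accepts a prefix-to-namespace mapping as its second argument. The sketch below assumes a hypothetical miniature feed and an arbitrary <code>atom</code> prefix, and it names the <code>link</code> and <code>author</code> elements explicitly instead of using a wildcard.

```python
import xml.etree.ElementTree as etree

# Prefix-to-namespace mapping; the prefix name "atom" is arbitrary.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed: only the first entry's author has a uri child.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <entry>
    <author><name>Mark</name><uri>http://diveintomark.org/</uri></author>
    <link rel="alternate" type="text/html" href="http://example.org/first"/>
  </entry>
  <entry>
    <author><name>Mark</name></author>
    <link rel="alternate" type="text/html" href="http://example.org/second"/>
  </entry>
</feed>"""

root = etree.fromstring(SAMPLE)

# Links at any level that carry an href attribute.
with_href = root.findall(".//atom:link[@href]", ATOM)

# Links whose href has one specific value.
home = root.findall(".//atom:link[@href='http://diveintomark.org/']", ATOM)

# Authors that have a uri element as a child.
with_uri = root.findall(".//atom:author[atom:uri]", ATOM)

print(len(with_href), len(home), len(with_uri))
```

<p>The mapping keeps long queries readable, at the cost of one more name to pass around.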
|
||||
|
||||
<p>Overall, ElementTree’s <code>findall()</code> method is a very powerful feature, but the query language can be a bit surprising. It is officially described as “<a href=http://effbot.org/zone/element-xpath.htm>limited support for XPath expressions</a>.” <a href=http://www.w3.org/TR/xpath>XPath</a> is a W3C standard for querying <abbr>XML</abbr> documents. ElementTree’s query language is similar enough to XPath to do basic searching, but dissimilar enough that it may annoy you if you already know XPath. Now let’s look at a third-party <abbr>XML</abbr> library that extends the ElementTree <abbr>API</abbr> with full XPath support.
|
||||
|
||||
<h2 id=xml-lxml>Going Further With lxml</h2>
|
||||
|
||||
@@ -459,12 +505,13 @@ StopIteration</samp></pre>
|
||||
<h2 id=furtherreading>Further Reading</h2>
|
||||
|
||||
<ul>
|
||||
<li><a href=http://en.wikipedia.org/wiki/XML>XML on Wikipedia.org</a>
|
||||
<li><a href=http://docs.python.org/3.0/library/xml.etree.elementtree.html>The ElementTree XML API</a>
|
||||
<li><a href=http://en.wikipedia.org/wiki/XML><abbr>XML</abbr> on Wikipedia.org</a>
|
||||
<li><a href=http://docs.python.org/3.0/library/xml.etree.elementtree.html>The ElementTree <abbr>XML</abbr> API</a>
|
||||
<li><a href=http://effbot.org/zone/element.htm>Elements and Element Trees</a>
|
||||
<li><a href=http://effbot.org/zone/element-xpath.htm>XPath Support in ElementTree</a>
|
||||
<li><a href=http://effbot.org/zone/element-iterparse.htm>The ElementTree iterparse Function</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing XML and HTML with lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and XSLT with lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with lxml</a>
|
||||
</ul>
|
||||
|
||||
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
||||
|
||||