mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
520 lines
38 KiB
HTML
<!DOCTYPE html>
|
|
<head>
|
|
<meta charset=utf-8>
|
|
<title>XML - Dive into Python 3</title>
|
|
<!--[if IE]><script src=html5.js></script><![endif]-->
|
|
<link rel=stylesheet href=dip3.css>
|
|
<style>
|
|
body{counter-reset:h1 13}
|
|
mark{display:inline}
|
|
</style>
|
|
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
|
|
</head>
|
|
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=25> <input type=submit name=root value=Search></div></form>
|
|
<p>You are here: <a href=index.html>Home</a> <span>‣</span> <a href=table-of-contents.html#xml>Dive Into Python 3</a> <span>‣</span>
|
|
<p id=level>Difficulty level: <span title=advanced>♦♦♦♦♢</span>
|
|
<h1>XML</h1>
|
|
<blockquote class=q>
|
|
<p><span>❝</span> In the archonship of Aristaechmus, Draco enacted his ordinances. <span>❞</span><br>— <a href="http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.01.0046;query=chapter%3D%235;layout=;loc=3.1">Aristotle</a>
|
|
</blockquote>
|
|
<p id=toc>
|
|
<h2 id=divingin>Diving In</h2>
|
|
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But <abbr>XML</abbr> isn’t about code; it’s about data. One common use of <abbr>XML</abbr> is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>” like <a href=http://www.google.com/reader/>Google Reader</a>.
|
|
|
|
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
|
|
|
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
|
<pre><code><?xml version="1.0" encoding="utf-8"?>
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
|
<title>dive into mark</title>
|
|
<subtitle>currently between addictions</subtitle>
|
|
<id>tag:diveintomark.org,2001-07-29:/</id>
|
|
<updated>2009-03-27T21:56:07Z</updated>
|
|
<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
|
|
<link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<title>Dive into history, 2009 edition</title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
|
|
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
|
|
<updated>2009-03-27T21:56:07Z</updated>
|
|
<published>2009-03-27T17:20:42Z</published>
|
|
<category scheme="http://diveintomark.org" term="diveintopython"/>
|
|
<category scheme="http://diveintomark.org" term="docbook"/>
|
|
<category scheme="http://diveintomark.org" term="html"/>
|
|
<summary type="html">Putting an entire chapter on one page sounds
|
|
bloated, but consider this &amp;mdash; my longest chapter so far
|
|
would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
|
|
On dialup.</summary>
|
|
</entry>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<title>Accessibility is a harsh mistress</title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
|
|
<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
|
|
<updated>2009-03-22T01:05:37Z</updated>
|
|
<published>2009-03-21T20:09:28Z</published>
|
|
<category scheme="http://diveintomark.org" term="accessibility"/>
|
|
<summary type="html">The accessibility orthodoxy does not permit people to
|
|
question the value of features that are rarely useful and rarely used.</summary>
|
|
</entry>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
</author>
|
|
<title>A gentle introduction to video encoding, part 1: container formats</title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats"/>
|
|
<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
|
|
<updated>2009-01-11T19:39:22Z</updated>
|
|
<published>2008-12-18T15:54:22Z</published>
|
|
<category scheme="http://diveintomark.org" term="asf"/>
|
|
<category scheme="http://diveintomark.org" term="avi"/>
|
|
<category scheme="http://diveintomark.org" term="encoding"/>
|
|
<category scheme="http://diveintomark.org" term="flv"/>
|
|
<category scheme="http://diveintomark.org" term="GIVE"/>
|
|
<category scheme="http://diveintomark.org" term="mp4"/>
|
|
<category scheme="http://diveintomark.org" term="ogg"/>
|
|
<category scheme="http://diveintomark.org" term="video"/>
|
|
<summary type="html">These notes will eventually become part of a
|
|
tech talk on video encoding.</summary>
|
|
</entry>
|
|
</feed></code></pre>
|
|
|
|
<h2 id=xml-intro>A 5-Minute Crash Course in XML</h2>
|
|
|
|
<p>If you already know about <abbr>XML</abbr>, you can skip this section.
|
|
|
|
<p><abbr>XML</abbr> is a generalized way of describing hierarchical structured data. An <abbr>XML</abbr> <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) <abbr>XML</abbr> document:
|
|
|
|
<pre class=nd><code><a><foo> <span>①</span></a>
|
|
<a></foo> <span>②</span></a></code></pre>
|
|
<ol>
|
|
<li>This is the <i>start tag</i> of the <code>foo</code> element.
|
|
<li>This is the matching <i>end tag</i> of the <code>foo</code> element. Like balancing parentheses in writing or mathematics or code, every start tag must be <i>closed</i> (matched) by a corresponding end tag.
|
|
</ol>
|
|
|
|
<p>Elements can be <i>nested</i> to any depth. An element <code>bar</code> inside an element <code>foo</code> is said to be a <i>subelement</i> or <i>child</i> of <code>foo</code>.
|
|
|
|
<pre class=nd><code><foo>
|
|
<mark><bar></bar></mark>
|
|
</foo>
|
|
</code></pre>
|
|
|
|
<p>The first element in every <abbr>XML</abbr> document is called the <i>root element</i>. An <abbr>XML</abbr> document can only have one root element. The following is <strong>not an <abbr>XML</abbr> document</strong>, because it has two root elements:
|
|
|
|
<pre class=nd><code><foo></foo>
|
|
<bar></bar></code></pre>
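<p>A parser enforces these rules for you. The following is a minimal sketch using the standard library's <code>xml.etree.ElementTree</code> module (introduced properly later in this chapter); the helper name <code>is_well_formed</code> is made up for illustration.

```python
import xml.etree.ElementTree as etree


def is_well_formed(text):
    """Return True if text parses as one complete XML document."""
    try:
        etree.fromstring(text)
    except etree.ParseError:
        return False
    return True


one_root = is_well_formed("<foo><bar></bar></foo>")   # a single root element
two_roots = is_well_formed("<foo></foo><bar></bar>")  # two root elements
unclosed = is_well_formed("<foo>")                    # start tag never closed
print(one_root, two_roots, unclosed)
```

<p>Only the first string is accepted; the other two raise <code>ParseError</code> inside the helper.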
|
|
|
|
<p>Elements can have <i>attributes</i>, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. <i>Attribute names</i> cannot be repeated within an element. <i>Attribute values</i> must be quoted.
|
|
|
|
<pre class=nd><code><a><foo <mark>lang="en"</mark>> <span>①</span></a>
|
|
<a> <bar <mark>lang="fr"</mark>></bar> <span>②</span></a>
|
|
</foo>
|
|
</code></pre>
|
|
<ol>
|
|
<li>The <code>foo</code> element has one attribute, named <code>lang</code>. The value of its <code>lang</code> attribute is <code>en</code>.
|
|
<li>The <code>bar</code> element has one attribute, named <code>lang</code>. The value of its <code>lang</code> attribute is <code>fr</code>. This doesn’t conflict with the <code>foo</code> element in any way. Each element has its own set of attributes.
|
|
</ol>
|
|
|
|
<p>If an element has more than one attribute, the ordering of the attributes is not significant. An element’s attributes form an unordered set of keys and values, like a Python dictionary.
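<p>The dictionary comparison can be taken fairly literally. As a rough sketch (the <code>dir</code> attribute is an invented second attribute, not part of the examples above), two start tags that list the same attributes in a different order produce equal dictionaries, and a repeated attribute name is rejected outright:

```python
import xml.etree.ElementTree as etree

# The same two attributes, listed in a different order.
a = etree.fromstring('<foo lang="en" dir="ltr"/>')
b = etree.fromstring('<foo dir="ltr" lang="en"/>')
same_attributes = a.attrib == b.attrib

# Repeating an attribute name within one element is not well-formed XML.
try:
    etree.fromstring('<foo lang="en" lang="fr"/>')
    duplicate_allowed = True
except etree.ParseError:
    duplicate_allowed = False

print(same_attributes, duplicate_allowed)
```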
|
|
|
|
<p>Elements can have <i>text content</i>.
|
|
|
|
<pre class=nd><code><foo lang="en">
|
|
<bar lang="fr"><mark>PapayaWhip</mark></bar>
|
|
</foo>
|
|
</code></pre>
|
|
|
|
<p>Elements that contain no text and no children are <i>empty</i>.
|
|
|
|
<pre class=nd><code><foo></foo></code></pre>
|
|
|
|
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character at the end of the start tag, you can skip the end tag altogether. The <abbr>XML</abbr> document in the previous example could be written like this instead:
|
|
|
|
<pre class=nd><code><foo<mark>/</mark>></code></pre>
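<p>To a parser the two spellings are interchangeable. A small sketch, assuming the standard library's <code>xml.etree.ElementTree</code> module:

```python
import xml.etree.ElementTree as etree

long_form = etree.fromstring("<foo></foo>")
short_form = etree.fromstring("<foo/>")

# Both spellings yield an element with the same name, no children, no text.
same_tag = long_form.tag == short_form.tag
both_childless = len(long_form) == 0 and len(short_form) == 0
both_textless = long_form.text is None and short_form.text is None
print(same_tag, both_childless, both_textless)
```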
|
|
|
|
<p>Just as Python functions can be declared in different <i>modules</i>, <abbr>XML</abbr> elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
|
|
|
<pre class=nd><code><a><feed <mark>xmlns="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
|
<a> <title>dive into mark</title> <span>②</span></a>
|
|
</feed>
|
|
</code></pre>
|
|
<ol>
|
|
<li>The <code>feed</code> element is in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
<li>The <code>title</code> element is also in the <code>http://www.w3.org/2005/Atom</code> namespace. The namespace declaration affects the element where it’s declared, plus all child elements.
|
|
</ol>
|
|
|
|
<p>You can also use an <code>xmlns:<var>prefix</var></code> declaration to define a namespace and associate it with a <i>prefix</i>. Then each element in that namespace must be explicitly declared with the prefix.
|
|
|
|
<pre class=nd><code><a><atom:feed <mark>xmlns:atom="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
|
<a> <atom:title>dive into mark</atom:title> <span>②</span></a>
|
|
</atom:feed></code></pre>
|
|
<ol>
|
|
<li>The <code>feed</code> element is in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
<li>The <code>title</code> element is also in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
</ol>
|
|
|
|
<p>As far as an <abbr>XML</abbr> parser is concerned, the previous two <abbr>XML</abbr> documents are <em>identical</em>. Namespace + element name = <abbr>XML</abbr> identity. Prefixes only exist to refer to namespaces, so the actual prefix name (<code>atom:</code>) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the <abbr>XML</abbr> documents are the same.
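<p>You can check this claim with a parser. The sketch below parses compact versions of the two documents (whitespace removed for brevity) and compares them element by element; the <code>describe</code> helper is invented for this illustration.

```python
import xml.etree.ElementTree as etree

default_ns = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<title>dive into mark</title>"
    "</feed>"
)
prefixed = etree.fromstring(
    '<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'
    "<atom:title>dive into mark</atom:title>"
    "</atom:feed>"
)


def describe(element):
    """List (name, attributes, text) for an element and all its descendants."""
    return [(e.tag, dict(e.attrib), e.text) for e in element.iter()]


identical = describe(default_ns) == describe(prefixed)
print(identical)
print(default_ns.tag)
```

<p>Neither the prefix nor the namespace declaration itself survives parsing; only the namespace and the element name do.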
|
|
|
|
<p>Finally, <abbr>XML</abbr> documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the <abbr>XML</abbr> specification</a> details how to resolve this Catch-22.)
|
|
|
|
<pre class=nd><code><?xml version="1.0" <mark>encoding="utf-8"</mark>?></code></pre>
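<p>The declaration matters when a document arrives as raw bytes. As a hedged sketch (the <code>word</code> element and its content are invented for illustration), the same text stored in two different encodings parses to the same string, because the parser reads the declared encoding before decoding the rest:

```python
import xml.etree.ElementTree as etree

text = "caf\u00e9"  # contains one non-ASCII character

latin1_bytes = (
    '<?xml version="1.0" encoding="iso-8859-1"?><word>' + text + "</word>"
).encode("iso-8859-1")
utf8_bytes = (
    '<?xml version="1.0" encoding="utf-8"?><word>' + text + "</word>"
).encode("utf-8")

# The bytes differ, but each document declares how to decode itself.
from_latin1 = etree.fromstring(latin1_bytes).text
from_utf8 = etree.fromstring(utf8_bytes).text
print(from_latin1 == from_utf8 == text)
```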
|
|
|
|
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
|
|
|
|
<h2 id=xml-structure>The Structure Of An Atom Feed</h2>
|
|
|
|
<p>Think of a weblog, or in fact any website with frequently updated content, like <a href=http://www.cnn.com/>CNN.com</a>. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment <i class=baa>&</i> Video News”), a last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published a correction or fixed a typo), and a unique URL.
|
|
|
|
<p>The Atom syndication format is designed to capture all of this information in a standard format. My weblog and CNN.com are wildly different in design, scope, and audience, but they both have the same basic structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles.
|
|
|
|
<p>At the top level is the <i>root element</i>, which every Atom feed shares: the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
|
|
<pre><code><a><feed xmlns="http://www.w3.org/2005/Atom" <span>①</span></a>
|
|
<a> xml:lang="en"> <span>②</span></a></code></pre>
|
|
<ol>
|
|
<li><code>http://www.w3.org/2005/Atom</code> is the Atom namespace.
|
|
<li>Any element can contain an <code>xml:lang</code> attribute, which declares the language of the element and its children. In this case, the <code>xml:lang</code> attribute is declared once on the root element, which means the entire feed is in English.
|
|
</ol>
|
|
|
|
<p>An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level <code>feed</code> element.
|
|
|
|
<pre><code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
|
<a> <title>dive into mark</title> <span>①</span></a>
|
|
<a> <subtitle>currently between addictions</subtitle> <span>②</span></a>
|
|
<a> <id>tag:diveintomark.org,2001-07-29:/</id> <span>③</span></a>
|
|
<a> <updated>2009-03-27T21:56:07Z</updated> <span>④</span></a>
|
|
<a> <link rel="alternate" type="text/html" href="http://diveintomark.org/"/> <span>⑤</span></a></code></pre>
|
|
<ol>
|
|
<li>The title of this feed is <code>dive into mark</code>.
|
|
<li>The subtitle of this feed is <code>currently between addictions</code>.
|
|
<li>Every feed needs a globally unique identifier. See <a href=http://www.ietf.org/rfc/rfc4151.txt>RFC 4151</a> for how to create one.
|
|
<li>This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified date of the most recent article.
|
|
<li>Now things start to get interesting. This <code>link</code> element has no text content, but it has three attributes: <code>rel</code>, <code>type</code>, and <code>href</code>. The <code>rel</code> value tells you what kind of link this is; <code>rel="alternate"</code> means that this is a link to an alternate representation of this feed. The <code>type="text/html"</code> attribute means that this is a link to an <abbr>HTML</abbr> page. And the link target is given in the <code>href</code> attribute.
|
|
</ol>
|
|
|
|
<p>Now we know that this is a feed for a site named “dive into mark” which is available at <a href=http://diveintomark.org/><code>http://diveintomark.org/</code></a> and was last updated on March 27, 2009.
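<p>Here is one way that same reading could be done in code. This is only a sketch: it uses the <code>find()</code> method of the standard library's <code>xml.etree.ElementTree</code> module, which this chapter has not covered yet, and it assumes the timestamp always ends in <code>Z</code> (Atom permits other UTC offsets too).

```python
import xml.etree.ElementTree as etree
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
</feed>"""

root = etree.fromstring(FEED)
title = root.find(ATOM + "title").text
home_page = root.find(ATOM + "link").get("href")

# Assumption: the timestamp is in UTC and written with a trailing "Z".
updated = datetime.strptime(
    root.find(ATOM + "updated").text, "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

print(title, home_page, updated.isoformat())
```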
|
|
|
|
<blockquote class=note>
|
|
<p><span>☞</span>Although the order of elements can be relevant in some <abbr>XML</abbr> documents, it is not relevant in an Atom feed.
|
|
</blockquote>
|
|
|
|
<p>After the feed-level metadata is the list of the most recent articles. An article looks like this:
|
|
|
|
<pre><code><entry>
|
|
<a> <author> <span>①</span></a>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<a> <title>Dive into history, 2009 edition</title> <span>②</span></a>
|
|
<a> <link rel="alternate" type="text/html" <span>③</span></a>
|
|
href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
|
|
<a> <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id> <span>④</span></a>
|
|
<a> <updated>2009-03-27T21:56:07Z</updated> <span>⑤</span></a>
|
|
<published>2009-03-27T17:20:42Z</published>
|
|
<a> <category scheme="http://diveintomark.org" term="diveintopython"/> <span>⑥</span></a>
|
|
<category scheme="http://diveintomark.org" term="docbook"/>
|
|
<category scheme="http://diveintomark.org" term="html"/>
|
|
<a> <summary type="html">Putting an entire chapter on one page sounds <span>⑦</span></a>
|
|
bloated, but consider this &amp;mdash; my longest chapter so far
|
|
would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
|
|
On dialup.</summary>
|
|
<a></entry> <span>⑧</span></a></code></pre>
|
|
<ol>
|
|
<li>The <code>author</code> element tells who wrote this article: some guy named Mark, whom you can find loafing at <code>http://diveintomark.org/</code>. (This is the same as the alternate link in the feed metadata, but it doesn’t have to be. Many weblogs have multiple authors, each with their own personal website.)
|
|
<li>The <code>title</code> element gives the title of the article, “Dive into history, 2009 edition”.
|
|
<li>As with the feed-level alternate link, this <code>link</code> element gives the address of the <abbr>HTML</abbr> version of this article.
|
|
<li>Entries, like feeds, need a unique identifier.
|
|
<li>Entries have two dates: a first-published date (<code>published</code>) and a last-modified date (<code>updated</code>).
|
|
<li>Entries can have an arbitrary number of categories. This article is filed under <code>diveintopython</code>, <code>docbook</code>, and <code>html</code>.
|
|
<li>The <code>summary</code> element gives a brief summary of the article. (There is also a <code>content</code> element, not shown here, if you want to include the complete article text in your feed.) This <code>summary</code> element has the Atom-specific <code>type="html"</code> attribute, which specifies that this summary is a snippet of <abbr>HTML</abbr>, not plain text. This is important, since it has <abbr>HTML</abbr>-specific entities in it (<code>&mdash;</code> and <code>&hellip;</code>) which should be rendered as “—” and “…” rather than displayed directly.
|
|
<li>Finally, the end tag for the <code>entry</code> element, signaling the end of the metadata for this article.
|
|
</ol>
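<p>The point about <code>type="html"</code> is easier to see in code. In this sketch the summary is trimmed from the entry above, and <code>html.unescape()</code> stands in for whatever <abbr>HTML</abbr> rendering a real feed reader would do; that substitution is an assumption made for illustration.

```python
import html
import xml.etree.ElementTree as etree

ATOM = "{http://www.w3.org/2005/Atom}"
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <summary type="html">consider this &amp;mdash; it loads in under 5 seconds&amp;hellip;</summary>
</entry>"""

summary = etree.fromstring(ENTRY).find(ATOM + "summary")

# The XML parser decodes &amp; to &, leaving HTML entity references in the text.
raw = summary.text

# Only treat the text as HTML when the feed says it is HTML.
rendered = html.unescape(raw) if summary.get("type") == "html" else raw
print(raw)
print(rendered)
```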
|
|
|
|
<h2 id=xml-parse>Parsing XML</h2>
|
|
|
|
<p>Python can parse <abbr>XML</abbr> documents in several ways. It has traditional <a href=http://en.wikipedia.org/wiki/XML#DOM>DOM</a> and <a href=http://en.wikipedia.org/wiki/Simple_API_for_XML>SAX</a> parsers, but I will focus on a different library called ElementTree (abbreviated here as Etree).
|
|
|
|
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
|
<pre class=screen>
|
|
<a><samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd> <span>①</span></a>
|
|
<a><samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd> <span>②</span></a>
|
|
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span>③</span></a>
|
|
<a><samp class=p>>>> </samp><kbd>root</kbd> <span>④</span></a>
|
|
<samp><Element {http://www.w3.org/2005/Atom}feed at cd1eb0></samp></pre>
|
|
<ol>
|
|
<li>The Etree library is part of the Python standard library, in <code>xml.etree.ElementTree</code>.
|
|
<li>The primary entry point for the Etree library is the <code>parse()</code> function, which can take a filename or a file-like object [FIXME xref]. This function parses the entire document at once. If memory is tight, there are ways to parse an <abbr>XML</abbr> document incrementally instead.
|
|
<li>The <code>parse()</code> function returns an object which represents the entire document. This is <em>not</em> the root element. To get a reference to the root element, call the <code>getroot()</code> method.
|
|
<li>As expected, the root element is the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace. The string representation of this object reinforces an important point: an <abbr>XML</abbr> element is a combination of its namespace and its tag name (also called the <i>local name</i>). Every element in this document is in the Atom namespace, so the root element is represented as <code>{http://www.w3.org/2005/Atom}feed</code>.
|
|
</ol>
|
|
|
|
<blockquote class=note>
|
|
<p><span>☞</span>Etree represents <abbr>XML</abbr> elements as <code>{<var>namespace</var>}<var>localname</var></code>. You’ll see and use this format in multiple places in the Etree library.
|
|
</blockquote>
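<p>Two details from this section can be sketched together: <code>parse()</code> accepts a file-like object as well as a filename, and the <code>{<var>namespace</var>}<var>localname</var></code> string is easy to take apart. The <code>split_tag</code> helper below is hypothetical, written for this illustration; it is not part of the library.

```python
import io
import xml.etree.ElementTree as etree


def split_tag(tag):
    """Split '{namespace}localname' into (namespace, localname)."""
    if tag.startswith("{"):
        namespace, _, localname = tag[1:].partition("}")
        return namespace, localname
    return None, tag  # element is not in any namespace


# An in-memory file-like object stands in for examples/feed.xml.
source = io.BytesIO(
    b'<feed xmlns="http://www.w3.org/2005/Atom"><title>dive into mark</title></feed>'
)
tree = etree.parse(source)
root = tree.getroot()

namespace, localname = split_tag(root.tag)
print(namespace, localname)
```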
|
|
|
|
<h3 id=xml-elements>Elements Are Lists</h3>
|
|
|
|
<p>In Etree, an element acts like a list. The items of the list are the element’s children.
|
|
|
|
<pre class=screen>
|
|
# continued from the previous example
|
|
<a><samp class=p>>>> </samp><kbd>root.tag</kbd> <span>①</span></a>
|
|
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
|
<a><samp class=p>>>> </samp><kbd>len(root)</kbd> <span>②</span></a>
|
|
<samp>8</samp>
|
|
<a><samp class=p>>>> </samp><kbd>for child in root:</kbd> <span>③</span></a>
|
|
<a><samp class=p>... </samp><kbd> print(child)</kbd> <span>④</span></a>
|
|
<samp class=p>... </samp>
|
|
<samp><Element {http://www.w3.org/2005/Atom}title at e2b5d0>
|
|
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
|
|
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
|
|
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
|
|
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b750></samp></pre>
|
|
<ol>
|
|
<li>Continuing from the previous example, the root element is <code>{http://www.w3.org/2005/Atom}feed</code>.
|
|
<li>The “length” of the root element is the number of child elements.
|
|
<li>You can use the element itself as an iterator to loop through all of its child elements.
|
|
<li>As you can see from the output, there are indeed 8 child elements: all of the feed-level metadata (<code>title</code>, <code>subtitle</code>, <code>id</code>, <code>updated</code>, and <code>link</code>) followed by the three <code>entry</code> elements.
|
|
</ol>
|
|
|
|
<p>You may have guessed this already, but I want to point it out explicitly: the list of child elements only includes <em>direct</em> children. Each of the <code>entry</code> elements contains its own children, but those are not included in the list. They would be included in the list of each <code>entry</code>’s children, but they are not included in the list of the <code>feed</code>’s children. There are ways to find elements no matter how deeply nested they are; we’ll look at two such ways later in this chapter.
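<p>A tiny document makes the distinction concrete. This sketch uses a stripped-down, namespace-free stand-in for the feed, and contrasts list-style access with the <code>iter()</code> method, which walks the whole subtree (including the element itself).

```python
import xml.etree.ElementTree as etree

root = etree.fromstring(
    "<feed><title/><entry><author><name/></author></entry></feed>"
)

# List-style access sees direct children only.
direct = [child.tag for child in root]
entry = root[1]
entry_children = [child.tag for child in entry]

# iter() visits the element itself and every descendant, in document order.
everything = [element.tag for element in root.iter()]

print(len(root), direct)
print(entry_children)
print(everything)
```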
|
|
|
|
<h3 id=xml-attributes>Attributes Are Dictionaries</h3>
|
|
|
|
<p><abbr>XML</abbr> isn’t just a collection of elements; each element can also have its own set of attributes. Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.
|
|
|
|
<pre class=screen>
|
|
# continuing from the previous example
|
|
<a><samp class=p>>>> </samp><kbd>root.attrib</kbd> <span>①</span></a>
|
|
<samp>{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}</samp>
|
|
<a><samp class=p>>>> </samp><kbd>root[4]</kbd> <span>②</span></a>
|
|
<samp><Element {http://www.w3.org/2005/Atom}link at e181b0></samp>
|
|
<a><samp class=p>>>> </samp><kbd>root[4].attrib</kbd> <span>③</span></a>
|
|
<samp>{'href': 'http://diveintomark.org/',
|
|
'type': 'text/html',
|
|
'rel': 'alternate'}</samp>
|
|
<a><samp class=p>>>> </samp><kbd>root[3]</kbd> <span>④</span></a>
|
|
<samp><Element {http://www.w3.org/2005/Atom}updated at e2b4e0></samp>
|
|
<a><samp class=p>>>> </samp><kbd>root[3].attrib</kbd> <span>⑤</span></a>
|
|
<samp>{}</samp></pre>
|
|
<ol>
|
|
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
|
|
<li>The fifth child — <code>[4]</code> in a <code>0</code>-based list — is the <code>link</code> element.
|
|
<li>The <code>link</code> element has three attributes: <code>href</code>, <code>type</code>, and <code>rel</code>.
|
|
<li>The fourth child — <code>[3]</code> in a <code>0</code>-based list — is the <code>updated</code> element.
|
|
<li>The <code>updated</code> element has no attributes, so its <code>.attrib</code> is just an empty dictionary.
|
|
</ol>
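<p>Since <code>attrib</code> is an ordinary dictionary, the usual dictionary habits apply. As a sketch, elements also offer a <code>get()</code> method that reads one attribute and accepts a default for missing ones; the feed below is cut down to two children, and the constant name <code>XML_LANG</code> is my own.

```python
import xml.etree.ElementTree as etree

# The built-in xml: prefix expands to this namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">'
    "<updated>2009-03-27T21:56:07Z</updated>"
    '<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>'
    "</feed>"
)
updated, link = root[0], root[1]

language = root.get(XML_LANG)               # namespaced attribute
href = link.get("href")                     # plain attribute
fallback = updated.get("href", "no href")   # missing attribute, with a default
attribute_names = sorted(link.attrib)

print(language, href, fallback, attribute_names)
```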
|
|
|
|
<h2 id=xml-find>Searching For Nodes Within An XML Document</h2>
|
|
|
|
<p>So far, we’ve worked with this <abbr>XML</abbr> document “from the top down,” starting with the root element, getting its child elements, and so on throughout the document. But many uses of <abbr>XML</abbr> require you to find specific elements. Etree can do that, too.
|
|
|
|
<pre class=screen>
|
|
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
|
<samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd>
|
|
<samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
|
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
|
<samp class=p>>>> </samp><kbd>root.tag</kbd>
|
|
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
|
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}feed")</kbd> <span>②</span></a>
|
|
<samp>[]</samp>
|
|
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>③</span></a>
|
|
<samp>[]</samp></pre>
|
|
<ol>
|
|
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
|
|
<li>Each element — including the root element, but also child elements — has a <code>findall()</code> method. It finds all matching elements among the element’s children. But why aren’t there any results? Although it may not be obvious, this particular query only searches the element’s children. Since the root <code>feed</code> element has no child named <code>feed</code>, this query returns an empty list.
|
|
<li>This result may also surprise you. <a href=#divingin>There is an <code>author</code> element</a> in this document; in fact, there are three (one in each <code>entry</code>). But those <code>author</code> elements are not <em>direct children</em> of the root element; they are “grandchildren” (literally, a child element of a child element). If you want to look for <code>author</code> elements at any nesting level, you can do that, but the query format is slightly different.
|
|
</ol>
|
|
|
|
<pre class=screen>
|
|
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
|
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
|
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>②</span></a>
|
|
<samp>[]</samp>
|
|
</pre>
|
|
<ol>
|
|
<li>For convenience, the <code>tree</code> object (returned from the <code>etree.parse()</code> function) has several methods that mirror the methods on the root element. The results are the same as if you had called the <code>tree.getroot().findall()</code> method.
|
|
<li>Perhaps surprisingly, this query does not find the <code>author</code> elements in this document. Why not? Because this is just a shortcut for <code>tree.getroot().findall("{http://www.w3.org/2005/Atom}author")</code>, which means “find all the <code>author</code> elements that are children of the root element.” The <code>author</code> elements are not children of the root element; they’re children of the <code>entry</code> elements. Thus the query doesn’t return any matches.
|
|
</ol>
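<p>The same behaviour can be reproduced on a feed small enough to read at a glance. In this sketch the feed is reduced to two entries, and the tree object is built by hand with <code>etree.ElementTree()</code> instead of <code>etree.parse()</code>.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.w3.org/2005/Atom}"
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><author><name>Mark</name></author></entry>"
    "<entry><author><name>Mark</name></author></entry>"
    "</feed>"
)
tree = etree.ElementTree(root)

# Searching from the tree is the same as searching from the root element.
via_tree = tree.findall(NS + "entry")
via_root = tree.getroot().findall(NS + "entry")

# author is a grandchild of the root, so a child-only query misses it ...
authors_from_root = tree.findall(NS + "author")
# ... but the same query run from an entry finds it.
authors_from_entry = via_root[0].findall(NS + "author")

print(len(via_tree), via_tree == via_root)
print(authors_from_root, len(authors_from_entry))
```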
|
|
|
|
<p>There <em>is</em> a way to search for <em>descendant</em> elements, <i>i.e.</i> children, grandchildren, and any element at any nesting level.
|
|
|
|
<pre class=screen>
|
|
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall("//{http://www.w3.org/2005/Atom}link")</kbd> <span>①</span></a>
|
|
<samp class=p>>>> </samp><kbd>all_links</kbd>
|
|
<samp>[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
|
|
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
|
|
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
|
|
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]</samp>
|
|
<a><samp class=p>>>> </samp><kbd>all_links[0].attrib</kbd> <span>②</span></a>
|
|
<samp>{'href': 'http://diveintomark.org/',
|
|
'type': 'text/html',
|
|
'rel': 'alternate'}</samp>
|
|
<a><samp class=p>>>> </samp><kbd>all_links[1].attrib</kbd> <span>③</span></a>
|
|
<samp>{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
|
|
'type': 'text/html',
|
|
'rel': 'alternate'}</samp>
|
|
<samp class=p>>>> </samp><kbd>all_links[2].attrib</kbd>
|
|
<samp>{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress',
|
|
'type': 'text/html',
|
|
'rel': 'alternate'}</samp>
|
|
<samp class=p>>>> </samp><kbd>all_links[3].attrib</kbd>
|
|
<samp>{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats',
|
|
'type': 'text/html',
|
|
'rel': 'alternate'}</samp></pre>
|
|
<ol>
|
|
<li>This query — <code>//{http://www.w3.org/2005/Atom}link</code> — is very similar to the previous examples, except for the two slashes at the beginning of the query. Those two slashes mean “don’t just look for direct children; I want <em>any</em> elements, regardless of nesting level.” So the result is a list of four <code>link</code> elements, not just one.
|
|
<li>The first result <em>is</em> a direct child of the root element. As you can see from its attributes, this is the feed-level alternate link that points to the <abbr>HTML</abbr> version of the website that the feed describes.
|
|
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
|
|
</ol>
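<p>One caveat, offered as a note rather than a correction: recent versions of the standard library prefer the spelling <code>.//</code> for this kind of search, and calling an element's <code>findall()</code> with a query that starts with a slash raises an error. The sketch below therefore uses <code>.//</code>; the entry-level addresses are invented <code>example.org</code> placeholders.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.w3.org/2005/Atom}"
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link href="http://diveintomark.org/"/>'
    '<entry><link href="http://example.org/first"/></entry>'
    '<entry><link href="http://example.org/second"/></entry>'
    "</feed>"
)

# Direct children only: just the feed-level link.
child_links = root.findall(NS + "link")

# Any nesting level: the feed-level link plus one per entry, in document order.
all_links = root.findall(".//" + NS + "link")
hrefs = [link.get("href") for link in all_links]

print(len(child_links), len(all_links))
print(hrefs)
```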
|
|
|
|
<p>The <code>findall()</code> method has a few other tricks up its sleeve.
|
|
|
|
<pre class=screen>
|
|
# continuing from the previous example
|
|
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>①</span></a>
|
|
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
|
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
|
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
|
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
|
|
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>②</span></a>
|
|
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
|
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>③</span></a>
|
|
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
|
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
|
<ol>
|
|
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means “elements anywhere (not just as children of the root element).” <code>{http://www.w3.org/2005/Atom}</code> means “only elements in the Atom namespace.” <code>*</code> means “elements with any local name.” And <code>[@href]</code> means “has an <code>href</code> attribute.”
|
|
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
|
|
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
|
|
</ol>
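<p>The three kinds of filter can be tried on a reduced feed. This is a sketch under a few assumptions: it uses the <code>.//</code> spelling, it replaces the namespace wildcard with a plain <code>*</code> (which matches elements in any namespace), and the entry-level addresses are invented placeholders.

```python
import xml.etree.ElementTree as etree

NS = "{http://www.w3.org/2005/Atom}"
root = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<link rel="alternate" href="http://diveintomark.org/"/>'
    "<entry>"
    "<author><name>Mark</name><uri>http://diveintomark.org/</uri></author>"
    '<link rel="alternate" href="http://example.org/first"/>'
    "</entry>"
    "<entry>"
    "<author><name>Mark</name></author>"
    '<link rel="alternate" href="http://example.org/second"/>'
    "</entry>"
    "</feed>"
)

# Any element, anywhere, that has an href attribute.
with_href = root.findall(".//*[@href]")

# Any element whose href attribute has one specific value.
home = root.findall(".//*[@href='http://diveintomark.org/']")

# author elements that have a uri child.
authors_with_uri = root.findall(".//{NS}author[{NS}uri]".format(NS=NS))

print(len(with_href), len(home), len(authors_with_uri))
```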
|
|
|
|
<p>Overall, ElementTree’s <code>findall()</code> method is a very powerful feature, but the query language can be a bit surprising. It is officially described as “<a href=http://effbot.org/zone/element-xpath.htm>limited support for XPath expressions</a>.” <a href=http://www.w3.org/TR/xpath>XPath</a> is a W3C standard for querying <abbr>XML</abbr> documents. ElementTree’s query language is similar enough to XPath to do basic searching, but dissimilar enough that it may annoy you if you already know XPath. Now let’s look at a third-party <abbr>XML</abbr> library that extends the ElementTree <abbr>API</abbr> with full XPath support.
<h2 id=xml-lxml>Going Further With lxml</h2>
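<p>lxml is a third-party library that implements the ElementTree <abbr>API</abbr> and extends it with full XPath support. Because the <abbr>API</abbr> is the same, the code from the earlier sections should work after changing only the import.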
<pre class=screen>
<samp class=p>>>> </samp><kbd>from lxml import etree</kbd>
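<samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd>
<samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd>
<samp class=p>>>> </samp><kbd>entries = root.findall("{http://www.w3.org/2005/Atom}entry")</kbd>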
</pre>
<p>From here on, the examples import <code>lxml.etree</code> explicitly, because the functions they use are specific to lxml.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
<samp class=p>>>> </samp><kbd>it = tree.iterfind("//{http://www.w3.org/2005/Atom}link")</kbd>
<samp class=p>>>> </samp><kbd>next(it)</kbd>
<Element {http://www.w3.org/2005/Atom}link at 122f1b0>
<samp class=p>>>> </samp><kbd>next(it)</kbd>
<Element {http://www.w3.org/2005/Atom}link at 122f1e0>
<samp class=p>>>> </samp><kbd>next(it)</kbd>
<Element {http://www.w3.org/2005/Atom}link at 122f210>
<samp class=p>>>> </samp><kbd>next(it)</kbd>
<Element {http://www.w3.org/2005/Atom}link at 122f1b0>
<samp class=p>>>> </samp><kbd>next(it)</kbd>
<samp class=traceback>Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration</samp></pre>
<pre class=screen>
<samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd>
<samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=NSMAP)</kbd>
<samp class=p>>>> </samp><kbd>entries</kbd>
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]</samp>
<samp class=p>>>> </samp><kbd>entry = entries[0]</kbd>
<samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=NSMAP)</kbd>
<samp>['Accessibility is a harsh mistress']</samp></pre>
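<p>For comparison, here is a rough standard-library approximation of the same query. It is a sketch under stated assumptions: the two-entry feed is a hypothetical stand-in for <code>examples/feed.xml</code>, and because ElementTree’s path language has no <code>text()</code> step, the title text is read with <code>findtext()</code> instead.

```python
import xml.etree.ElementTree as etree

# Hypothetical two-entry feed, standing in for examples/feed.xml.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category term="accessibility"/>
  </entry>
  <entry>
    <title>A gentle introduction to video encoding</title>
    <category term="video"/>
  </entry>
</feed>"""

# Maps the prefix used in the queries to the Atom namespace URI.
NSMAP = {"atom": "http://www.w3.org/2005/Atom"}
root = etree.fromstring(FEED)

# Find matching category elements, then step up to their parents.
entries = root.findall(
    ".//atom:category[@term='accessibility']/..", namespaces=NSMAP)

# findtext() returns the text of the first matching child element.
titles = [entry.findtext("atom:title", namespaces=NSMAP) for entry in entries]
print(titles)
```

<p>This covers the simple case, but anything beyond child, descendant, parent, and attribute tests still needs lxml’s <code>xpath()</code> method.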
<h3 id=xml-custom-parser>Customizing Your XML Parser</h3>
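<p>lxml also lets you configure the parser. Create an <code>XMLParser</code> object with the options you want, then pass it as the second argument to <code>parse()</code>. Here, <code>no_network=True</code> prevents network access when looking up external documents, <code>ns_clean=True</code> cleans up redundant namespace declarations, <code>recover=True</code> tells the parser to try to keep going past broken markup, and <code>remove_blank_text=True</code> and <code>remove_comments=True</code> discard ignorable whitespace and comments.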
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
<samp class=p>>>> </samp><kbd>parser = lxml.etree.XMLParser(no_network=True, ns_clean=True, recover=True, remove_blank_text=True, remove_comments=True)</kbd>
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml", parser)</kbd>
.
.
.
</pre>
<h3 id=xml-incremental>Incremental Parsing</h3>
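<p>Incremental parsing handles a document piece by piece as it is read, instead of building the whole tree in memory first.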
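<p>One way to do this is the <code>iterparse()</code> function in the standard library’s ElementTree module, which yields elements as the parser finishes them. The following is a minimal sketch, not a definitive recipe: the feed contents and the <code>iter_titles()</code> helper are hypothetical, made up for this illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical feed data; a real program would pass a filename or open file.
FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>"""

ENTRY = "{http://www.w3.org/2005/Atom}entry"
TITLE = "{http://www.w3.org/2005/Atom}title"


def iter_titles(source):
    """Yield the title of each entry as soon as the entry is fully parsed."""
    # An "end" event fires when an element's closing tag has been read.
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == ENTRY:
            yield elem.findtext(TITLE)
            # Discard the entry's children so memory use stays flat.
            elem.clear()


titles = list(iter_titles(io.BytesIO(FEED)))
print(titles)
```

<p>Calling <code>clear()</code> on each finished entry is what keeps memory use low on large documents; without it, the full tree still accumulates behind the scenes.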
<h2 id=xml-generate>Generating XML</h2>
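<p>ElementTree can create <abbr>XML</abbr> documents as well as parse them. To create an element, instantiate the <code>Element</code> class with a namespace-qualified tag name and, optionally, a dictionary of attributes.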
<pre class=screen>
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
<samp class=p>>>> </samp><kbd>new_feed = etree.Element("{http://www.w3.org/2005/Atom}feed",</kbd>
<samp class=p>... </samp><kbd> attrib={"{http://www.w3.org/XML/1998/namespace}lang": "en"})</kbd>
<samp class=p>>>> </samp><kbd>print(etree.tostring(new_feed, encoding="unicode"))</kbd>
<samp><ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
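<p>The standard library picks the prefix <code>ns0</code> for you. lxml’s <code>Element()</code> accepts an <code>nsmap</code> argument, a dictionary that maps namespace prefixes to namespace <abbr>URI</abbr>s, so you can control how the namespace is declared.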
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
<samp class=p>>>> </samp><kbd>NSMAP = {None: "http://www.w3.org/2005/Atom"}</kbd>
<samp class=p>>>> </samp><kbd>new_feed = lxml.etree.Element("feed", nsmap=NSMAP)</kbd>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
<samp><feed xmlns="http://www.w3.org/2005/Atom"/></samp>
<samp class=p>>>> </samp><kbd>new_feed.set("{http://www.w3.org/XML/1998/namespace}lang", "en")</kbd>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
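<p>To add a child element, call <code>SubElement()</code> with the parent element, the child’s tag name, and optionally a dictionary of attributes. Set the <code>text</code> attribute to give the element text content, and pass <code>pretty_print=True</code> when serializing to get line breaks and indentation.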
<pre class=screen>
<samp class=p>>>> </samp><kbd>title = lxml.etree.SubElement(new_feed, "title", attrib={"type":"html"})</kbd>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></feed></samp>
<samp class=p>>>> </samp><kbd>title.text = "dive into mark"</kbd>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into mark</title></feed></samp>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed, pretty_print=True))</kbd>
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title type="html">dive into mark</title>
</feed></samp></pre>
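<p>If lxml is not available, the standard library can produce a similar document. This is a sketch under one assumption worth stating loudly: calling <code>register_namespace()</code> with an empty prefix, so that the Atom namespace is serialized as the default namespace, changes a module-wide setting that affects every later serialization in the same program.

```python
import xml.etree.ElementTree as etree

ATOM = "http://www.w3.org/2005/Atom"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: an empty prefix asks the serializer to declare Atom as the
# default namespace instead of inventing a prefix such as ns0.
# This registration is global to the xml.etree.ElementTree module.
etree.register_namespace("", ATOM)

# Element tags must still be namespace-qualified when they are created.
new_feed = etree.Element("{%s}feed" % ATOM, attrib={XML_LANG: "en"})
title = etree.SubElement(new_feed, "{%s}title" % ATOM, attrib={"type": "html"})
title.text = "dive into mark"

# encoding="unicode" returns a str instead of bytes.
xml_text = etree.tostring(new_feed, encoding="unicode")
print(xml_text)
```

<p>Unlike the lxml version, every tag here is written in its fully qualified form, which makes the code noisier but leaves no doubt about which namespace each element belongs to.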
<h2 id=furtherreading>Further Reading</h2>
<ul>
<li><a href=http://en.wikipedia.org/wiki/XML><abbr>XML</abbr> on Wikipedia.org</a>
<li><a href=http://docs.python.org/3.0/library/xml.etree.elementtree.html>The ElementTree <abbr>XML</abbr> API</a>
<li><a href=http://effbot.org/zone/element.htm>Elements and Element Trees</a>
<li><a href=http://effbot.org/zone/element-xpath.htm>XPath Support in ElementTree</a>
<li><a href=http://effbot.org/zone/element-iterparse.htm>The ElementTree iterparse Function</a>
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with lxml</a>
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with lxml</a>
</ul>
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
<script src=jquery.js></script>
<script src=dip3.js></script>