mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
you wouldn't believe me if I told you
@@ -17,7 +17,7 @@ mark{display:inline}
|
||||
<p id=level>Difficulty level: <span title=advanced>♦♦♦♦♢</span>
|
||||
<h1>XML</h1>
|
||||
<blockquote class=q>
|
||||
<p><span>❝</span> In the archonship of Aristaechmus, Draco enacted his ordinances. <span>❞</span><br>— <a href="http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.01.0046;query=chapter%3D%235;layout=;loc=3.1">Aristotle</a>
|
||||
<p><span>❝</span> In the archonship of Aristaechmus, Draco enacted his ordinances. <span>❞</span><br>— <a href='http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.01.0046;query=chapter%3D%235;layout=;loc=3.1'>Aristotle</a>
|
||||
</blockquote>
|
||||
<p id=toc>
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
@@ -26,29 +26,29 @@ mark{display:inline}
|
||||
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre><code><?xml version="1.0" encoding="utf-8"?>
|
||||
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
<pre><code><?xml version='1.0' encoding='utf-8'?>
|
||||
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<title>dive into mark</title>
|
||||
<subtitle>currently between addictions</subtitle>
|
||||
<id>tag:diveintomark.org,2001-07-29:/</id>
|
||||
<updated>2009-03-27T21:56:07Z</updated>
|
||||
<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
|
||||
<link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
|
||||
<link rel='alternate' type='text/html' href='http://diveintomark.org/'/>
|
||||
<link rel='self' type='application/atom+xml' href='http://diveintomark.org/feed/'/>
|
||||
<entry>
|
||||
<author>
|
||||
<name>Mark</name>
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<title>Dive into history, 2009 edition</title>
|
||||
<link rel="alternate" type="text/html"
|
||||
href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
|
||||
<link rel='alternate' type='text/html'
|
||||
href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
|
||||
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
|
||||
<updated>2009-03-27T21:56:07Z</updated>
|
||||
<published>2009-03-27T17:20:42Z</published>
|
||||
<category scheme="http://diveintomark.org" term="diveintopython"/>
|
||||
<category scheme="http://diveintomark.org" term="docbook"/>
|
||||
<category scheme="http://diveintomark.org" term="html"/>
|
||||
<summary type="html">Putting an entire chapter on one page sounds
|
||||
<category scheme='http://diveintomark.org' term='diveintopython'/>
|
||||
<category scheme='http://diveintomark.org' term='docbook'/>
|
||||
<category scheme='http://diveintomark.org' term='html'/>
|
||||
<summary type='html'>Putting an entire chapter on one page sounds
|
||||
bloated, but consider this &amp;mdash; my longest chapter so far
|
||||
would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
|
||||
On dialup.</summary>
|
||||
@@ -59,13 +59,13 @@ mark{display:inline}
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<title>Accessibility is a harsh mistress</title>
|
||||
<link rel="alternate" type="text/html"
|
||||
href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
|
||||
<link rel='alternate' type='text/html'
|
||||
href='http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress'/>
|
||||
<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
|
||||
<updated>2009-03-22T01:05:37Z</updated>
|
||||
<published>2009-03-21T20:09:28Z</published>
|
||||
<category scheme="http://diveintomark.org" term="accessibility"/>
|
||||
<summary type="html">The accessibility orthodoxy does not permit people to
|
||||
<category scheme='http://diveintomark.org' term='accessibility'/>
|
||||
<summary type='html'>The accessibility orthodoxy does not permit people to
|
||||
question the value of features that are rarely useful and rarely used.</summary>
|
||||
</entry>
|
||||
<entry>
|
||||
@@ -73,20 +73,20 @@ mark{display:inline}
|
||||
<name>Mark</name>
|
||||
</author>
|
||||
<title>A gentle introduction to video encoding, part 1: container formats</title>
|
||||
<link rel="alternate" type="text/html"
|
||||
href="http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats"/>
|
||||
<link rel='alternate' type='text/html'
|
||||
href='http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats'/>
|
||||
<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
|
||||
<updated>2009-01-11T19:39:22Z</updated>
|
||||
<published>2008-12-18T15:54:22Z</published>
|
||||
<category scheme="http://diveintomark.org" term="asf"/>
|
||||
<category scheme="http://diveintomark.org" term="avi"/>
|
||||
<category scheme="http://diveintomark.org" term="encoding"/>
|
||||
<category scheme="http://diveintomark.org" term="flv"/>
|
||||
<category scheme="http://diveintomark.org" term="GIVE"/>
|
||||
<category scheme="http://diveintomark.org" term="mp4"/>
|
||||
<category scheme="http://diveintomark.org" term="ogg"/>
|
||||
<category scheme="http://diveintomark.org" term="video"/>
|
||||
<summary type="html">These notes will eventually become part of a
|
||||
<category scheme='http://diveintomark.org' term='asf'/>
|
||||
<category scheme='http://diveintomark.org' term='avi'/>
|
||||
<category scheme='http://diveintomark.org' term='encoding'/>
|
||||
<category scheme='http://diveintomark.org' term='flv'/>
|
||||
<category scheme='http://diveintomark.org' term='GIVE'/>
|
||||
<category scheme='http://diveintomark.org' term='mp4'/>
|
||||
<category scheme='http://diveintomark.org' term='ogg'/>
|
||||
<category scheme='http://diveintomark.org' term='video'/>
|
||||
<summary type='html'>These notes will eventually become part of a
|
||||
tech talk on video encoding.</summary>
|
||||
</entry>
|
||||
</feed></code></pre>
|
||||
@@ -120,8 +120,8 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can have <i>attributes</i>, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. <i>Attribute names</i> cannot be repeated within an element. <i>Attribute values</i> must be quoted.
|
||||
|
||||
<pre class=nd><code><a><foo <mark>lang="en"</mark>> <span>①</span></a>
|
||||
<a> <bar <mark>lang="fr"</mark>></bar> <span>②</span></a>
|
||||
<pre class=nd><code><a><foo <mark>lang='en'</mark>> <span>①</span></a>
|
||||
<a> <bar <mark>lang='fr'</mark>></bar> <span>②</span></a>
|
||||
</foo>
|
||||
</code></pre>
|
||||
<ol>
|
||||
@@ -133,8 +133,8 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can have <i>text content</i>.
|
||||
|
||||
<pre class=nd><code><foo lang="en">
|
||||
<bar lang="fr"><mark>PapayaWhip</mark></bar>
|
||||
<pre class=nd><code><foo lang='en'>
|
||||
<bar lang='fr'><mark>PapayaWhip</mark></bar>
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
@@ -148,7 +148,7 @@ mark{display:inline}
|
||||
|
||||
<p>Just as Python functions can be declared in different <i>modules</i>, <abbr>XML</abbr> elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
||||
|
||||
<pre class=nd><code><a><feed <mark>xmlns="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
||||
<pre class=nd><code><a><feed <mark>xmlns='http://www.w3.org/2005/Atom'</mark>> <span>①</span></a>
|
||||
<a> <title>dive into mark</title> <span>②</span></a>
|
||||
</feed>
|
||||
</code></pre>
|
||||
@@ -159,7 +159,7 @@ mark{display:inline}
|
||||
|
||||
<p>You can also use an <code>xmlns:<var>prefix</var></code> declaration to define a namespace and associate it with a <i>prefix</i>. Then each element in that namespace must be explicitly declared with the prefix.
|
||||
|
||||
<pre class=nd><code><a><atom:feed <mark>xmlns:atom="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
||||
<pre class=nd><code><a><atom:feed <mark>xmlns:atom='http://www.w3.org/2005/Atom'</mark>> <span>①</span></a>
|
||||
<a> <atom:title>dive into mark</atom:title> <span>②</span></a>
|
||||
</atom:feed></code></pre>
|
||||
<ol>
|
||||
@@ -171,7 +171,7 @@ mark{display:inline}
|
||||
|
||||
<p>Finally, <abbr>XML</abbr> documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the <abbr>XML</abbr> specification</a> details how to resolve this Catch-22.)
|
||||
|
||||
<pre class=nd><code><?xml version="1.0" <mark>encoding="utf-8"</mark>?></code></pre>
|
||||
<pre class=nd><code><?xml version='1.0' <mark>encoding='utf-8'</mark>?></code></pre>
|
||||
|
||||
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
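<p>As a quick check on the rules above, here is a minimal sketch that feeds two tiny documents to the standard library's ElementTree parser. The documents are assembled from the snippets in this section; the variable names (<code>DOC</code>, <code>FEED</code>) are made up for illustration.

```python
import xml.etree.ElementTree as etree

# A tiny document assembled from the snippets above: an encoding
# declaration, nested elements, attributes, and text content.
DOC = b"""<?xml version='1.0' encoding='utf-8'?>
<foo lang='en'>
  <bar lang='fr'>PapayaWhip</bar>
</foo>"""

root = etree.fromstring(DOC)
bar = root[0]                         # first (and only) child of <foo>
print(root.tag, root.attrib)          # element name and its attributes
print(bar.tag, bar.attrib, bar.text)  # child element, attributes, text content

# A default namespace applies to the element that declares it
# and to its children.
FEED = "<feed xmlns='http://www.w3.org/2005/Atom'><title>dive into mark</title></feed>"
feed = etree.fromstring(FEED)
print(feed.tag)     # the namespace is folded into the tag name
print(feed[0].tag)  # the child inherits the default namespace
```

<p>Note that the <code>xmlns</code> declaration does not show up as an attribute; the parser reports the namespace as part of each element's name instead.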
|
||||
|
||||
@@ -185,8 +185,8 @@ mark{display:inline}
|
||||
|
||||
<p>At the top level is the <i>root element</i>, which every Atom feed shares: the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
||||
|
||||
<pre><code><a><feed xmlns="http://www.w3.org/2005/Atom" <span>①</span></a>
|
||||
<a> xml:lang="en"> <span>②</span></a></code></pre>
|
||||
<pre><code><a><feed xmlns='http://www.w3.org/2005/Atom' <span>①</span></a>
|
||||
<a> xml:lang='en'> <span>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li><code>http://www.w3.org/2005/Atom</code> is the Atom namespace.
|
||||
<li>Any element can contain an <code>xml:lang</code> attribute, which declares the language of the element and its children. In this case, the <code>xml:lang</code> attribute is declared once on the root element, which means the entire feed is in English.
|
||||
@@ -194,18 +194,18 @@ mark{display:inline}
|
||||
|
||||
<p>An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level <code>feed</code> element.
|
||||
|
||||
<pre><code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
<pre><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<a> <title>dive into mark</title> <span>①</span></a>
|
||||
<a> <subtitle>currently between addictions</subtitle> <span>②</span></a>
|
||||
<a> <id>tag:diveintomark.org,2001-07-29:/</id> <span>③</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span>④</span></a>
|
||||
<a> <link rel="alternate" type="text/html" href="http://diveintomark.org/"/> <span>⑤</span></a></code></pre>
|
||||
<a> <link rel='alternate' type='text/html' href='http://diveintomark.org/'/> <span>⑤</span></a></code></pre>
|
||||
<ol>
|
||||
<li>The title of this feed is <code>dive into mark</code>.
|
||||
<li>The subtitle of this feed is <code>currently between addictions</code>.
|
||||
<li>Every feed needs a globally unique identifier. See <a href=http://www.ietf.org/rfc/rfc4151.txt>RFC 4151</a> for how to create one.
|
||||
<li>This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified date of the most recent article.
|
||||
<li>Now things start to get interesting. This <code>link</code> element has no text content, but it has three attributes: <code>rel</code>, <code>type</code>, and <code>href</code>. The <code>rel</code> value tells you what kind of link this is; <code>rel="alternate"</code> means that this is a link to an alternate representation of this feed. The <code>type="text/html"</code> attribute means that this is a link to an <abbr>HTML</abbr> page. And the link target is given in the <code>href</code> attribute.
|
||||
<li>Now things start to get interesting. This <code>link</code> element has no text content, but it has three attributes: <code>rel</code>, <code>type</code>, and <code>href</code>. The <code>rel</code> value tells you what kind of link this is; <code>rel='alternate'</code> means that this is a link to an alternate representation of this feed. The <code>type='text/html'</code> attribute means that this is a link to an <abbr>HTML</abbr> page. And the link target is given in the <code>href</code> attribute.
|
||||
</ol>
|
||||
|
||||
<p>Now we know that this is a feed for a site named “dive into mark”, which is available at <a href=http://diveintomark.org/><code>http://diveintomark.org/</code></a> and was last updated on March 27, 2009.
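<p>The <code>updated</code> value is a timestamp string, so turning it into a date takes one more step. The sketch below assumes the timestamp always has the exact shape used in this feed (UTC with a trailing <code>Z</code> and no fractional seconds); Atom permits other variants, which this format string would not handle.

```python
from datetime import datetime, timezone

# The feed-level <updated> value from the sample feed.
updated_text = '2009-03-27T21:56:07Z'

# Assumption: a literal "Z" suffix, meaning UTC. strptime() treats the "Z"
# in the format string as plain text, so the timezone is attached by hand.
updated = datetime.strptime(updated_text, '%Y-%m-%dT%H:%M:%SZ')
updated = updated.replace(tzinfo=timezone.utc)

print(updated.isoformat())
print(updated.strftime('%B %d, %Y'))
```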
|
||||
@@ -222,15 +222,15 @@ mark{display:inline}
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<a> <title>Dive into history, 2009 edition</title> <span>②</span></a>
|
||||
<a> <link rel="alternate" type="text/html" <span>③</span></a>
|
||||
href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
|
||||
<a> <link rel='alternate' type='text/html' <span>③</span></a>
|
||||
href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
|
||||
<a> <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id> <span>④</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span>⑤</span></a>
|
||||
<published>2009-03-27T17:20:42Z</published>
|
||||
<a> <category scheme="http://diveintomark.org" term="diveintopython"/> <span>⑥</span></a>
|
||||
<category scheme="http://diveintomark.org" term="docbook"/>
|
||||
<category scheme="http://diveintomark.org" term="html"/>
|
||||
<a> <summary type="html">Putting an entire chapter on one page sounds <span>⑦</span></a>
|
||||
<a> <category scheme='http://diveintomark.org' term='diveintopython'/> <span>⑥</span></a>
|
||||
<category scheme='http://diveintomark.org' term='docbook'/>
|
||||
<category scheme='http://diveintomark.org' term='html'/>
|
||||
<a> <summary type='html'>Putting an entire chapter on one page sounds <span>⑦</span></a>
|
||||
bloated, but consider this &amp;mdash; my longest chapter so far
|
||||
would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
|
||||
On dialup.</summary>
|
||||
@@ -242,7 +242,7 @@ mark{display:inline}
|
||||
<li>Entries, like feeds, need a unique identifier.
|
||||
<li>Entries have two dates: a first-published date (<code>published</code>) and a last-modified date (<code>updated</code>).
|
||||
<li>Entries can have an arbitrary number of categories. This article is filed under <code>diveintopython</code>, <code>docbook</code>, and <code>html</code>.
|
||||
<li>The <code>summary</code> element gives a brief summary of the article. (There is also a <code>content</code> element, not shown here, if you want to include the complete article text in your feed.) This <code>summary</code> element has the Atom-specific <code>type="html"</code> attribute, which specifies that this summary is a snippet of <abbr>HTML</abbr>, not plain text. This is important, since it has <abbr>HTML</abbr>-specific entities in it (<code>&mdash;</code> and <code>&hellip;</code>) which should be rendered as “—” and “…” rather than displayed directly.
|
||||
<li>The <code>summary</code> element gives a brief summary of the article. (There is also a <code>content</code> element, not shown here, if you want to include the complete article text in your feed.) This <code>summary</code> element has the Atom-specific <code>type='html'</code> attribute, which specifies that this summary is a snippet of <abbr>HTML</abbr>, not plain text. This is important, since it has <abbr>HTML</abbr>-specific entities in it (<code>&mdash;</code> and <code>&hellip;</code>) which should be rendered as “—” and “…” rather than displayed directly.
|
||||
<li>Finally, the end tag for the <code>entry</code> element, signaling the end of the metadata for this article.
|
||||
</ol>
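<p>The two layers of escaping in the <code>summary</code> element can be confusing, so here is a small sketch of them using only the standard library. The first step mimics what an <abbr>XML</abbr> parser does to the raw text; the second is what an <abbr>HTML</abbr>-aware consumer would do with a <code>type='html'</code> summary. The variable names are illustrative.

```python
import html
from xml.sax.saxutils import unescape

# Part of the summary as it appears in the raw feed file.
raw = 'consider this &amp;mdash; it loads in under 5 seconds&amp;hellip;'

# Step 1: XML unescaping, as an XML parser would do it.
# "&amp;" becomes "&", leaving HTML entity references in the text.
as_html = unescape(raw)
print(as_html)

# Step 2: because the summary is declared as HTML, resolve the
# HTML entities to get the characters a reader should see.
as_text = html.unescape(as_html)
print(as_text)
```

<p>A consumer that skipped the second step would display the literal entity references instead of the dash and the ellipsis.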
|
||||
|
||||
@@ -255,7 +255,7 @@ mark{display:inline}
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root</kbd> <span>④</span></a>
|
||||
<samp><Element {http://www.w3.org/2005/Atom}feed at cd1eb0></samp></pre>
|
||||
@@ -319,7 +319,7 @@ mark{display:inline}
|
||||
<a><samp class=p>>>> </samp><kbd>root[3].attrib</kbd> <span>⑤</span></a>
|
||||
<samp>{}</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
|
||||
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
|
||||
<li>The fifth child — <code>[4]</code> in a <code>0</code>-based list — is the <code>link</code> element.
|
||||
<li>The <code>link</code> element has three attributes: <code>href</code>, <code>type</code>, and <code>rel</code>.
|
||||
<li>The fourth child — <code>[3]</code> in a <code>0</code>-based list — is the <code>updated</code> element.
|
||||
@@ -334,17 +334,17 @@ mark{display:inline}
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd>
|
||||
<samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>root.tag</kbd>
|
||||
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}feed")</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}feed')</kbd> <span>②</span></a>
|
||||
<samp>[]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span>③</span></a>
|
||||
<samp>[]</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
|
||||
@@ -353,22 +353,22 @@ mark{display:inline}
|
||||
</ol>
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}author")</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span>②</span></a>
|
||||
<samp>[]</samp>
|
||||
</pre>
|
||||
<ol>
|
||||
<li>For convenience, the <code>tree</code> object (returned from the <code>etree.parse()</code> function) has several methods that mirror the methods on the root element. The results are the same as if you had called the <code>tree.getroot().findall()</code> method.
|
||||
<li>Perhaps surprisingly, this query does not find the <code>author</code> elements in this document. Why not? Because this is just a shortcut for <code>tree.getroot().findall("{http://www.w3.org/2005/Atom}author")</code>, which means “find all the <code>author</code> elements that are children of the root element.” The <code>author</code> elements are not children of the root element; they’re children of the <code>entry</code> elements. Thus the query doesn’t return any matches.
|
||||
<li>Perhaps surprisingly, this query does not find the <code>author</code> elements in this document. Why not? Because this is just a shortcut for <code>tree.getroot().findall('{http://www.w3.org/2005/Atom}author')</code>, which means “find all the <code>author</code> elements that are children of the root element.” The <code>author</code> elements are not children of the root element; they’re children of the <code>entry</code> elements. Thus the query doesn’t return any matches.
|
||||
</ol>
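<p>The children-only behavior is easy to confirm on a cut-down feed. This is a minimal sketch with an inline document rather than <code>examples/feed.xml</code>; the <code>ATOM</code> and <code>FEED</code> names are made up for the example.

```python
import xml.etree.ElementTree as etree

ATOM = '{http://www.w3.org/2005/Atom}'

# A cut-down feed: one feed-level title, one entry with an author and a title.
FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <title>dive into mark</title>
  <entry>
    <author><name>Mark</name></author>
    <title>Dive into history, 2009 edition</title>
  </entry>
</feed>"""

root = etree.fromstring(FEED)

# author is a grandchild of the root, so a plain query on the root misses it.
print(len(root.findall(ATOM + 'author')))

# Asking the entry element directly does find it.
entry = root.find(ATOM + 'entry')
print(len(entry.findall(ATOM + 'author')))

# Likewise, only the feed's own title is a child of the root.
print([t.text for t in root.findall(ATOM + 'title')])
```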
|
||||
|
||||
<p>There <em>is</em> a way to search for <em>descendant</em> elements, <i>i.e.</i> children, grandchildren, and any element at any nesting level.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall("//{http://www.w3.org/2005/Atom}link")</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>all_links</kbd>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
|
||||
@@ -400,7 +400,7 @@ mark{display:inline}
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>it = tree.getiterator("{http://www.w3.org/2005/Atom}link")</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>it = tree.getiterator('{http://www.w3.org/2005/Atom}link')</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>next(it)</kbd> <span>②</span></a>
|
||||
<Element {http://www.w3.org/2005/Atom}link at 122f1b0>
|
||||
<samp class=p>>>> </samp><kbd>next(it)</kbd>
|
||||
@@ -428,9 +428,9 @@ StopIteration</samp></pre>
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>from lxml import etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse("examples/feed.xml")</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall("{http://www.w3.org/2005/Atom}entry")</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>④</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp></pre>
|
||||
@@ -452,16 +452,16 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>②</span></a>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed.xml')</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')</kbd> <span>②</span></a>
|
||||
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>④</span></a>
|
||||
<samp class=p>>>> </samp><kbd>NS = '{http://www.w3.org/2005/Atom}'</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{NS}author[{NS}uri]'.format(NS=NS))</kbd> <span>④</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
||||
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
||||
<ol>
|
||||
@@ -475,18 +475,18 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed.xml')</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>②</span></a>
|
||||
<samp class=p>... </samp><kbd> namespaces=NSMAP)</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>entry = entries[0]</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=NSMAP)</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath('./atom:title/text()', namespaces=NSMAP)</kbd> <span>④</span></a>
|
||||
<samp>['Accessibility is a harsh mistress']</samp></pre>
|
||||
<ol>
|
||||
<li>To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is just a Python dictionary.
|
||||
<li>Here is an XPath query. The XPath expression searches for <code>category</code> elements (in the Atom namespace) that contain a <code>term</code> attribute with the value <code>accessibility</code>. But that’s not actually the query result. Look at the very end of the query string; did you notice the <code>/..</code> bit? That means “and then return the parent element of the <code>category</code> element you just found.” So this single XPath query will find all entries with a child element of <code><category term="accessibility"></code>.
|
||||
<li>Here is an XPath query. The XPath expression searches for <code>category</code> elements (in the Atom namespace) that contain a <code>term</code> attribute with the value <code>accessibility</code>. But that’s not actually the query result. Look at the very end of the query string; did you notice the <code>/..</code> bit? That means “and then return the parent element of the <code>category</code> element you just found.” So this single XPath query will find all entries with a child element of <code><category term='accessibility'></code>.
|
||||
<li>The <code>xpath()</code> method returns a list of element objects. In this document, there is only one entry with a <code>category</code> whose <code>term</code> is <code>accessibility</code>.
|
||||
<li>XPath expressions don’t always return a list of elements. Technically, the <abbr>DOM</abbr> of a parsed <abbr>XML</abbr> document doesn’t contain elements; it contains <i>nodes</i>. Depending on their type, nodes can be elements, attributes, or even text content. The result of an XPath query is a list of nodes. This query returns a list of text nodes: the text content (<code>text()</code>) of the <code>title</code> element (<code>atom:title</code>) that is a child of the current element (<code>./</code>).
|
||||
</ol>
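<p>If <code>lxml</code> is not available, a similar result can be approximated with the standard library. The sketch below is not a full XPath query: it relies on the limited path syntax that ElementTree's <code>find()</code> and <code>findall()</code> accept, together with the same kind of prefix mapping, and filters the entries in Python. The inline feed is a cut-down stand-in for <code>examples/feed.xml</code>.

```python
import xml.etree.ElementTree as etree

# Prefix-to-namespace mapping, same idea as the lxml example above.
NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}

FEED = """<feed xmlns='http://www.w3.org/2005/Atom'>
  <entry>
    <title>Dive into history, 2009 edition</title>
    <category scheme='http://diveintomark.org' term='docbook'/>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category scheme='http://diveintomark.org' term='accessibility'/>
  </entry>
</feed>"""

root = etree.fromstring(FEED)

# Keep each entry that has a category child whose term is 'accessibility'.
# Compare against None explicitly: an element with no children is falsy
# on some Python versions.
matches = [
    entry
    for entry in root.findall('atom:entry', NSMAP)
    if entry.find("atom:category[@term='accessibility']", NSMAP) is not None
]

titles = [entry.findtext('atom:title', namespaces=NSMAP) for entry in matches]
print(titles)
```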
|
||||
@@ -499,25 +499,25 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = etree.Element("{http://www.w3.org/2005/Atom}feed",</kbd> <span>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={"{http://www.w3.org/XML/1998/namespace}lang": "en"})</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',</kbd> <span>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(etree.tostring(new_feed))</kbd> <span>③</span></a>
|
||||
<samp><ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
|
||||
<samp><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></samp></pre>
|
||||
<ol>
|
||||
<li>To create a new element, instantiate the <code>Element</code> class. You pass the element name (namespace + local name) as the first argument. This statement creates a <code>feed</code> element in the Atom namespace. This will be our new document’s root element.
|
||||
<li>To add attributes to the newly created element, pass a dictionary of attribute names and values in the <var>attrib</var> argument. Note that the attribute name should be in the standard ElementTree format, <code>{<var>namespace</var>}<var>localname</var></code>.
|
||||
<li>At any time, you can serialize any element (and its children) with the ElementTree <code>tostring()</code> function.
|
||||
</ol>
|
||||
|
||||
<p>Was that serialization surprising to you? The way ElementTree serializes namespaced <abbr>XML</abbr> elements is technically accurate but not optimal. The sample <abbr>XML</abbr> document at the beginning of this chapter defined a <i>default namespace</i> (<code>xmlns="http://www.w3.org/2005/Atom"</code>). Defining a default namespace is useful for documents — like Atom feeds — where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<code><feed></code>, <code><link></code>, <code><entry></code>). There is no need to use any prefixes unless you want to declare elements from another namespace.
|
||||
<p>Was that serialization surprising to you? The way ElementTree serializes namespaced <abbr>XML</abbr> elements is technically accurate but not optimal. The sample <abbr>XML</abbr> document at the beginning of this chapter defined a <i>default namespace</i> (<code>xmlns='http://www.w3.org/2005/Atom'</code>). Defining a default namespace is useful for documents — like Atom feeds — where every element is in the same namespace, because you can declare the namespace once and declare each element with just its local name (<code><feed></code>, <code><link></code>, <code><entry></code>). There is no need to use any prefixes unless you want to declare elements from another namespace.
|
||||
|
||||
<p>An <abbr>XML</abbr> parser won’t “see” any difference between an <abbr>XML</abbr> document with a default namespace and an <abbr>XML</abbr> document with a prefixed namespace. The resulting <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code><ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/></code></pre>
|
||||
<pre class=nd><code><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>is identical to the <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/></code></pre>
|
||||
<pre class=nd><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>The only practical difference is that the second serialization is several characters shorter. If we were to recast our entire sample feed with a <code>ns0:</code> prefix in every start and end tag, it would add 4 characters per start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters. Assuming <a href=strings.html#byte-arrays>UTF-8 encoding</a>, that’s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
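<p>Both claims are easy to check for yourself. The sketch below uses the standard library's ElementTree to parse the two one-line serializations and compare the results, then redoes the overhead arithmetic. The variable names are invented for the example, and the figure of 79 start tags is taken from the paragraph above rather than counted from the feed.

```python
import xml.etree.ElementTree as etree

prefixed = "<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>"
default = "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>"

# After parsing, the prefix is gone: only {namespace}localname remains.
a = etree.fromstring(prefixed)
b = etree.fromstring(default)
same_tree = (a.tag == b.tag) and (a.attrib == b.attrib)
print(a.tag)
print(same_tree)

# Size difference for this single element: 'ns0:' on the tag
# plus ':ns0' on the namespace declaration.
single_element_overhead = len(prefixed) - len(default)
print(single_element_overhead)

# Overhead for the whole feed, following the text's own accounting.
start_tags = 79  # figure quoted in the text, not recounted here
feed_overhead = 4 * start_tags + 4
print(feed_overhead)
```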
|
||||
|
||||
@@ -525,13 +525,13 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {None: "http://www.w3.org/2005/Atom"}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = lxml.etree.Element("feed", nsmap=NSMAP)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {None: 'http://www.w3.org/2005/Atom'}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = lxml.etree.Element('feed', nsmap=NSMAP)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span>③</span></a>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom"/></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed.set("{http://www.w3.org/XML/1998/namespace}lang", "en")</kbd> <span>④</span></a>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom'/></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')</kbd> <span>④</span></a>
|
||||
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></samp></pre>
|
||||
<ol>
|
||||
<li>To start, define a namespace mapping as a dictionary. Dictionary values are namespaces; dictionary keys are the desired prefix. Using <code>None</code> as a prefix effectively declares a default namespace.
|
||||
<li>Now you can pass the <code>lxml</code>-specific <var>nsmap</var> argument when you create an element, and <code>lxml</code> will respect the namespace prefixes you’ve defined.
|
||||
@@ -542,16 +542,16 @@ except ImportError:
|
||||
<p>Are <abbr>XML</abbr> documents limited to one element per document? No, of course not. You can easily create child elements, too.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>title = lxml.etree.SubElement(new_feed, "title",</kbd> <span>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={"type":"html"})</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>title = lxml.etree.SubElement(new_feed, 'title',</kbd> <span>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={'type':'html'})</kbd> <span>②</span></a>
|
||||
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></feed></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>title.text = "dive into &hellip;"</kbd> <span>③</span></a>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'/></feed></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>title.text = 'dive into &hellip;'</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span>④</span></a>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into &amp;hellip;</title></feed></samp>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'>dive into &amp;hellip;</title></feed></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed, pretty_print=True))</kbd> <span>⑤</span></a>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
  <title type="html">dive into &amp;hellip;</title>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
  <title type='html'>dive into &amp;hellip;</title>
|
||||
</feed></samp></pre>
|
||||
<ol>
|
||||
<li>To create a child element of an existing element, call the <code>SubElement()</code> factory function. The only required arguments are the parent element (<var>new_feed</var> in this case) and the new element’s name. Since this child element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or prefix here.
|
||||
@@ -574,8 +574,8 @@ except ImportError:
|
||||
|
||||
<p>Here is a fragment of a broken <abbr>XML</abbr> document. I’ve highlighted the wellformedness error.
|
||||
|
||||
<pre class=nd><code><?xml version="1.0" encoding="utf-8"?>
|
||||
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
<pre class=nd><code><?xml version='1.0' encoding='utf-8'?>
|
||||
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<title>dive into <mark>&hellip;</mark></title>
|
||||
...
|
||||
</feed></code></pre>
|
||||
@@ -584,7 +584,7 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed-broken.xml")</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed-broken.xml')</kbd>
|
||||
<samp class=traceback>Traceback (most recent call last):
|
||||
File "<stdin>", line 1, in <module>
|
||||
File "lxml.etree.pyx", line 2693, in lxml.etree.parse (src/lxml/lxml.etree.c:52591)
|
||||
@@ -601,16 +601,16 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>parser = lxml.etree.XMLParser(recover=True)</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed-broken.xml", parser)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed-broken.xml', parser)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>parser.error_log</kbd> <span>③</span></a>
|
||||
<samp>examples/feed-broken.xml:3:28:FATAL:PARSER:ERR_UNDECLARED_ENTITY: Entity 'hellip' not defined</samp>
|
||||
<samp class=p>>>> </samp><kbd>tree.findall("{http://www.w3.org/2005/Atom}title")</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}title')</kbd>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}title at ead510>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>title = tree.findall("{http://www.w3.org/2005/Atom}title")[0]</kbd>
|
||||
<samp class=p>>>> </samp><kbd>title = tree.findall('{http://www.w3.org/2005/Atom}title')[0]</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>title.text</kbd> <span>④</span></a>
|
||||
<samp>'dive into '</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(tree.getroot()))</kbd> <span>⑤</span></a>
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<title>dive into </title>
|
||||
.
|
||||
. [rest of serialization snipped for brevity]
|
||||
@@ -619,7 +619,7 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
|
||||
<li>To create a custom parser, instantiate the <code>lxml.etree.XMLParser</code> class. It can take <a href=http://codespeak.net/lxml/parsing.html#parser-options>a number of different named arguments</a>. The one we’re interested in here is the <var>recover</var> argument. When set to <code>True</code>, the <abbr>XML</abbr> parser will try its best to “recover” from wellformedness errors.
|
||||
<li>To parse an <abbr>XML</abbr> document with your custom parser, pass the <var>parser</var> object as the second argument to the <code>parse()</code> function. Note that <code>lxml</code> does not raise an exception about the undefined <code>&hellip;</code> entity.
|
||||
<li>The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless of whether it is set to recover from those errors or not.)
|
||||
<li>Since it didn’t know what to do with the undefined <code>&hellip;</code> entity, the parser just silently dropped it. The text content of the <code>title</code> element becomes <code>"dive into "</code>.
|
||||
<li>Since it didn’t know what to do with the undefined <code>&hellip;</code> entity, the parser just silently dropped it. The text content of the <code>title</code> element becomes <code>'dive into '</code>.
|
||||
<li>As you can see from the serialization, the <code>&hellip;</code> entity didn’t get moved; it was simply dropped.
|
||||
</ol>
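<p>Put together, the steps above might look like the following script. It is a sketch under a few assumptions: <code>lxml</code> is a third-party library and may not be installed, so the import is guarded; the broken document is an inline byte string (<code>BROKEN</code>) rather than <code>examples/feed-broken.xml</code>; and <code>parse_leniently()</code> is a hypothetical helper name, not part of <code>lxml</code>. The exact text recovered after an error can depend on the underlying parser version, so the script prints what it finds instead of assuming it.

```python
try:
    import lxml.etree as etree  # third-party; may not be installed
except ImportError:
    etree = None

# Inline stand-in for the broken feed: &hellip; is never declared.
BROKEN = b"""<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into &hellip;</title>
</feed>"""


def parse_leniently(data):
    """Parse data with a recovering parser.

    Returns the root element and the list of error messages
    the parser logged along the way. (Hypothetical helper.)
    """
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    errors = [entry.message for entry in parser.error_log]
    return root, errors


if etree is not None:
    root, errors = parse_leniently(BROKEN)
    title = root.find('{http://www.w3.org/2005/Atom}title')
    print(repr(title.text))   # whatever text survived recovery
    for message in errors:    # the wellformedness errors that were skipped
        print(message)
```

<p>Checking the error log after a recovering parse is worth the extra lines: a document that parses “successfully” this way may still have lost data, as the dropped entity shows.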
|
||||
|
||||
@@ -640,6 +640,7 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with <code>lxml</code></a>
|
||||
</ul>
|
||||
|
||||
<p class=v><a rel=prev class=todo><span>☜</span></a> <a rel=next class=todo><span>☞</span></a>
|
||||
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
||||
<script src=j/jquery.js></script>
|
||||
<script src=j/dip3.js></script>
|
||||
|
||||