mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
syntax highlighting for everyone!
@@ -13,11 +13,11 @@ mark{display:inline}
|
||||
<meta name=viewport content='initial-scale=1.0'>
|
||||
</head>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=25> <input type=submit name=root value=Search></div></form>
|
||||
<p>You are here: <a href=index.html>Home</a> <span>‣</span> <a href=table-of-contents.html#xml>Dive Into Python 3</a> <span>‣</span>
|
||||
<p>You are here: <a href=index.html>Home</a> <span class=u>‣</span> <a href=table-of-contents.html#xml>Dive Into Python 3</a> <span class=u>‣</span>
|
||||
<p id=level>Difficulty level: <span title=advanced>♦♦♦♦♢</span>
|
||||
<h1>XML</h1>
|
||||
<blockquote class=q>
|
||||
<p><span>❝</span> In the archonship of Aristaechmus, Draco enacted his ordinances. <span>❞</span><br>— <a href='http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.01.0046;query=chapter%3D%235;layout=;loc=3.1'>Aristotle</a>
|
||||
<p><span class=u>❝</span> In the archonship of Aristaechmus, Draco enacted his ordinances. <span class=u>❞</span><br>— <a href='http://www.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus:text:1999.01.0046;query=chapter%3D%235;layout=;loc=3.1'>Aristotle</a>
|
||||
</blockquote>
|
||||
<p id=toc>
|
||||
<h2 id=divingin>Diving In</h2>
|
||||
@@ -26,7 +26,7 @@ mark{display:inline}
|
||||
<p>Here, then, is the <abbr>XML</abbr> data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre><code><?xml version='1.0' encoding='utf-8'?>
|
||||
<pre><code class=pp><?xml version='1.0' encoding='utf-8'?>
|
||||
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<title>dive into mark</title>
|
||||
<subtitle>currently between addictions</subtitle>
|
||||
@@ -99,8 +99,8 @@ mark{display:inline}
|
||||
|
||||
<p><abbr>XML</abbr> is a generalized way of describing hierarchical structured data. An <abbr>XML</abbr> <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) <abbr>XML</abbr> document:
|
||||
|
||||
<pre class=nd><code><a><foo> <span>①</span></a>
|
||||
<a></foo> <span>②</span></a></code></pre>
|
||||
<pre class=nd><code class=pp><a><foo> <span class=u>①</span></a>
|
||||
<a></foo> <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li>This is the <i>start tag</i> of the <code>foo</code> element.
|
||||
<li>This is the matching <i>end tag</i> of the <code>foo</code> element. Like balancing parentheses in writing or mathematics or code, every start tag must be <i>closed</i> (matched) by a corresponding end tag.
|
||||
@@ -108,20 +108,20 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can be <i>nested</i> to any depth. An element <code>bar</code> inside an element <code>foo</code> is said to be a <i>subelement</i> or <i>child</i> of <code>foo</code>.
|
||||
|
||||
<pre class=nd><code><foo>
|
||||
<pre class=nd><code class=pp><foo>
|
||||
<mark><bar></bar></mark>
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
<p>The first element in every <abbr>XML</abbr> document is called the <i>root element</i>. An <abbr>XML</abbr> document can only have one root element. The following is <strong>not an <abbr>XML</abbr> document</strong>, because it has two root elements:
|
||||
|
||||
<pre class=nd><code><foo></foo>
|
||||
<pre class=nd><code class=pp><foo></foo>
|
||||
<bar></bar></code></pre>
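<p>The rules so far (every start tag needs an end tag, and there can be only one root element) can be checked with a parser. The following is a minimal sketch, assuming the standard library's <code>xml.etree.ElementTree</code> module; the helper name <code>is_well_formed</code> is hypothetical and chosen only for this illustration.

```python
import xml.etree.ElementTree as etree


def is_well_formed(text):
    """Return True if text parses as a single XML document (hypothetical helper)."""
    try:
        etree.fromstring(text)
    except etree.ParseError:
        return False
    return True


# One root element with a properly closed child: accepted.
one_root = is_well_formed("<foo><bar></bar></foo>")

# Two root elements: rejected by the parser.
two_roots = is_well_formed("<foo></foo>\n<bar></bar>")

# A start tag with no matching end tag: rejected as well.
unclosed = is_well_formed("<foo>")

print(one_root, two_roots, unclosed)
```

<p>The parser reports both problems through the same exception type, <code>ParseError</code>, so a single <code>except</code> clause is enough for this kind of check.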
|
||||
|
||||
<p>Elements can have <i>attributes</i>, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. <i>Attribute names</i> cannot be repeated within an element. <i>Attribute values</i> must be quoted.
|
||||
|
||||
<pre class=nd><code><a><foo <mark>lang='en'</mark>> <span>①</span></a>
|
||||
<a> <bar <mark>lang='fr'</mark>></bar> <span>②</span></a>
|
||||
<pre class=nd><code class=pp><a><foo <mark>lang='en'</mark>> <span class=u>①</span></a>
|
||||
<a> <bar <mark>lang='fr'</mark>></bar> <span class=u>②</span></a>
|
||||
</foo>
|
||||
</code></pre>
|
||||
<ol>
|
||||
@@ -133,23 +133,23 @@ mark{display:inline}
|
||||
|
||||
<p>Elements can have <i>text content</i>.
|
||||
|
||||
<pre class=nd><code><foo lang='en'>
|
||||
<pre class=nd><code class=pp><foo lang='en'>
|
||||
<bar lang='fr'><mark>PapayaWhip</mark></bar>
|
||||
</foo>
|
||||
</code></pre>
|
||||
|
||||
<p>Elements that contain no text and no children are <i>empty</i>.
|
||||
|
||||
<pre class=nd><code><foo></foo></code></pre>
|
||||
<pre class=nd><code class=pp><foo></foo></code></pre>
|
||||
|
||||
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character in the start tag, you can skip the end tag altogether. The <abbr>XML</abbr> document in the previous example could be written like this instead:
|
||||
|
||||
<pre class=nd><code><foo<mark>/</mark>></code></pre>
|
||||
<pre class=nd><code class=pp><foo<mark>/</mark>></code></pre>
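<p>To see that the two spellings of an empty element mean the same thing, you can parse both and compare the results. This is a small sketch, assuming the standard library's <code>xml.etree.ElementTree</code> module; the variable names are made up for the illustration.

```python
import xml.etree.ElementTree as etree

long_form = etree.fromstring("<foo></foo>")
short_form = etree.fromstring("<foo/>")

# Both spellings yield an element with the same tag, no text, and no children.
same_tag = long_form.tag == short_form.tag
print(same_tag, long_form.text, short_form.text, len(long_form), len(short_form))
```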
|
||||
|
||||
<p>Just as Python functions can be declared in different <i>modules</i>, <abbr>XML</abbr> elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
||||
|
||||
<pre class=nd><code><a><feed <mark>xmlns='http://www.w3.org/2005/Atom'</mark>> <span>①</span></a>
|
||||
<a> <title>dive into mark</title> <span>②</span></a>
|
||||
<pre class=nd><code class=pp><a><feed <mark>xmlns='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<a> <title>dive into mark</title> <span class=u>②</span></a>
|
||||
</feed>
|
||||
</code></pre>
|
||||
<ol>
|
||||
@@ -159,8 +159,8 @@ mark{display:inline}
|
||||
|
||||
<p>You can also use an <code>xmlns:<var>prefix</var></code> declaration to define a namespace and associate it with a <i>prefix</i>. Then each element in that namespace must be explicitly declared with the prefix.
|
||||
|
||||
<pre class=nd><code><a><atom:feed <mark>xmlns:atom='http://www.w3.org/2005/Atom'</mark>> <span>①</span></a>
|
||||
<a> <atom:title>dive into mark</atom:title> <span>②</span></a>
|
||||
<pre class=nd><code class=pp><a><atom:feed <mark>xmlns:atom='http://www.w3.org/2005/Atom'</mark>> <span class=u>①</span></a>
|
||||
<a> <atom:title>dive into mark</atom:title> <span class=u>②</span></a>
|
||||
</atom:feed></code></pre>
|
||||
<ol>
|
||||
<li>The <code>feed</code> element is in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
||||
@@ -171,7 +171,7 @@ mark{display:inline}
|
||||
|
||||
<p>Finally, <abbr>XML</abbr> documents can contain <a href=strings.html#one-ring-to-rule-them-all>character encoding information</a> on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, <a href=http://www.w3.org/TR/REC-xml/#sec-guessing-no-ext-info>Section F of the <abbr>XML</abbr> specification</a> details how to resolve this Catch-22.)
|
||||
|
||||
<pre class=nd><code><?xml version='1.0' <mark>encoding='utf-8'</mark>?></code></pre>
|
||||
<pre class=nd><code class=pp><?xml version='1.0' <mark>encoding='utf-8'</mark>?></code></pre>
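<p>As a rough illustration of why the encoding declaration matters, the sketch below parses a byte string that is not valid UTF-8 but declares its own encoding. It assumes the standard library's <code>xml.etree.ElementTree</code> module and an invented one-element document.

```python
import xml.etree.ElementTree as etree

# The byte 0xE9 is "é" in ISO-8859-1; on its own it is not valid UTF-8.
# The declaration on the first line tells the parser how to decode it.
raw = b"<?xml version='1.0' encoding='iso-8859-1'?><foo>caf\xe9</foo>"

element = etree.fromstring(raw)
print(element.text)
```

<p>Passing bytes rather than a string lets the parser apply the declared encoding itself.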
|
||||
|
||||
<p>And now you know just enough <abbr>XML</abbr> to be dangerous!
|
||||
|
||||
@@ -185,8 +185,8 @@ mark{display:inline}
|
||||
|
||||
<p>At the top level is the <i>root element</i>, which every Atom feed shares: the <code>feed</code> element in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
||||
|
||||
<pre><code><a><feed xmlns='http://www.w3.org/2005/Atom' <span>①</span></a>
|
||||
<a> xml:lang='en'> <span>②</span></a></code></pre>
|
||||
<pre><code class=pp><a><feed xmlns='http://www.w3.org/2005/Atom' <span class=u>①</span></a>
|
||||
<a> xml:lang='en'> <span class=u>②</span></a></code></pre>
|
||||
<ol>
|
||||
<li><code>http://www.w3.org/2005/Atom</code> is the Atom namespace.
|
||||
<li>Any element can contain an <code>xml:lang</code> attribute, which declares the language of the element and its children. In this case, the <code>xml:lang</code> attribute is declared once on the root element, which means the entire feed is in English.
|
||||
@@ -194,12 +194,12 @@ mark{display:inline}
|
||||
|
||||
<p>An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level <code>feed</code> element.
|
||||
|
||||
<pre><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<a> <title>dive into mark</title> <span>①</span></a>
|
||||
<a> <subtitle>currently between addictions</subtitle> <span>②</span></a>
|
||||
<a> <id>tag:diveintomark.org,2001-07-29:/</id> <span>③</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span>④</span></a>
|
||||
<a> <link rel='alternate' type='text/html' href='http://diveintomark.org/'/> <span>⑤</span></a></code></pre>
|
||||
<pre><code class=pp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
|
||||
<a> <title>dive into mark</title> <span class=u>①</span></a>
|
||||
<a> <subtitle>currently between addictions</subtitle> <span class=u>②</span></a>
|
||||
<a> <id>tag:diveintomark.org,2001-07-29:/</id> <span class=u>③</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span class=u>④</span></a>
|
||||
<a> <link rel='alternate' type='text/html' href='http://diveintomark.org/'/> <span class=u>⑤</span></a></code></pre>
|
||||
<ol>
|
||||
<li>The title of this feed is <code>dive into mark</code>.
|
||||
<li>The subtitle of this feed is <code>currently between addictions</code>.
|
||||
@@ -211,30 +211,30 @@ mark{display:inline}
|
||||
<p>Now we know that this is a feed for a site named “dive into mark”, which is available at <a href=http://diveintomark.org/><code>http://diveintomark.org/</code></a> and was last updated on March 27, 2009.
|
||||
|
||||
<blockquote class=note>
|
||||
<p><span>☞</span>Although the order of elements can be relevant in some <abbr>XML</abbr> documents, it is not relevant in an Atom feed.
|
||||
<p><span class=u>☞</span>Although the order of elements can be relevant in some <abbr>XML</abbr> documents, it is not relevant in an Atom feed.
|
||||
</blockquote>
|
||||
|
||||
<p>After the feed-level metadata is the list of the most recent articles. An article looks like this:
|
||||
|
||||
<pre><code><entry>
|
||||
<a> <author> <span>①</span></a>
|
||||
<pre><code class=pp><entry>
|
||||
<a> <author> <span class=u>①</span></a>
|
||||
<name>Mark</name>
|
||||
<uri>http://diveintomark.org/</uri>
|
||||
</author>
|
||||
<a> <title>Dive into history, 2009 edition</title> <span>②</span></a>
|
||||
<a> <link rel='alternate' type='text/html' <span>③</span></a>
|
||||
<a> <title>Dive into history, 2009 edition</title> <span class=u>②</span></a>
|
||||
<a> <link rel='alternate' type='text/html' <span class=u>③</span></a>
|
||||
href='http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition'/>
|
||||
<a> <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id> <span>④</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span>⑤</span></a>
|
||||
<a> <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id> <span class=u>④</span></a>
|
||||
<a> <updated>2009-03-27T21:56:07Z</updated> <span class=u>⑤</span></a>
|
||||
<published>2009-03-27T17:20:42Z</published>
|
||||
<a> <category scheme='http://diveintomark.org' term='diveintopython'/> <span>⑥</span></a>
|
||||
<a> <category scheme='http://diveintomark.org' term='diveintopython'/> <span class=u>⑥</span></a>
|
||||
<category scheme='http://diveintomark.org' term='docbook'/>
|
||||
<category scheme='http://diveintomark.org' term='html'/>
|
||||
<a> <summary type='html'>Putting an entire chapter on one page sounds <span>⑦</span></a>
|
||||
<a> <summary type='html'>Putting an entire chapter on one page sounds <span class=u>⑦</span></a>
|
||||
bloated, but consider this &amp;mdash; my longest chapter so far
|
||||
would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
|
||||
On dialup.</summary>
|
||||
<a></entry> <span>⑧</span></a></code></pre>
|
||||
<a></entry> <span class=u>⑧</span></a></code></pre>
|
||||
<ol>
|
||||
<li>The <code>author</code> element tells who wrote this article: some guy named Mark, whom you can find loafing at <code>http://diveintomark.org/</code>. (This is the same as the alternate link in the feed metadata, but it doesn’t have to be. Many weblogs have multiple authors, each with their own personal website.)
|
||||
<li>The <code>title</code> element gives the title of the article, “Dive into history, 2009 edition”.
|
||||
@@ -254,10 +254,10 @@ mark{display:inline}
|
||||
|
||||
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span class=u>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span class=u>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root</kbd> <span class=u>④</span></a>
|
||||
<samp><Element {http://www.w3.org/2005/Atom}feed at cd1eb0></samp></pre>
|
||||
<ol>
|
||||
<li>The ElementTree library is part of the Python standard library, in <code>xml.etree.ElementTree</code>.
|
||||
@@ -267,7 +267,7 @@ mark{display:inline}
|
||||
</ol>
|
||||
|
||||
<blockquote class=note>
|
||||
<p><span>☞</span>ElementTree represents <abbr>XML</abbr> elements as <code>{<var>namespace</var>}<var>localname</var></code>. You’ll see and use this format in multiple places in the ElementTree <abbr>API</abbr>.
|
||||
<p><span class=u>☞</span>ElementTree represents <abbr>XML</abbr> elements as <code>{<var>namespace</var>}<var>localname</var></code>. You’ll see and use this format in multiple places in the ElementTree <abbr>API</abbr>.
|
||||
</blockquote>
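<p>Here is a minimal sketch of the <code>{<var>namespace</var>}<var>localname</var></code> format in action. It parses a tiny inline feed instead of <code>examples/feed.xml</code> so that it runs on its own; the <code>qname</code> helper and the <code>ATOM_NS</code> constant are hypothetical names introduced only for this illustration.

```python
import xml.etree.ElementTree as etree

ATOM_NS = 'http://www.w3.org/2005/Atom'

feed = etree.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>"
    "<title>dive into mark</title>"
    "</feed>"
)


def qname(namespace, localname):
    """Build a name in ElementTree's {namespace}localname format."""
    return '{%s}%s' % (namespace, localname)


# Element names carry the namespace, even though the markup had no prefix.
print(feed.tag)

# Attribute names use the same format; xml:lang lives in the built-in XML namespace.
print(feed.attrib)

# The same format is what you pass in when searching.
titles = feed.findall(qname(ATOM_NS, 'title'))
print(titles[0].text)
```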
|
||||
|
||||
<h3 id=xml-elements>Elements Are Lists</h3>
|
||||
@@ -276,12 +276,12 @@ mark{display:inline}
|
||||
|
||||
<pre class=screen>
|
||||
# continued from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>root.tag</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.tag</kbd> <span class=u>①</span></a>
|
||||
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>len(root)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>len(root)</kbd> <span class=u>②</span></a>
|
||||
<samp>8</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>for child in root:</kbd> <span>③</span></a>
|
||||
<a><samp class=p>... </samp><kbd> print(child)</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>for child in root:</kbd> <span class=u>③</span></a>
|
||||
<a><samp class=p>... </samp><kbd> print(child)</kbd> <span class=u>④</span></a>
|
||||
<samp class=p>... </samp>
|
||||
<samp><Element {http://www.w3.org/2005/Atom}title at e2b5d0>
|
||||
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
|
||||
@@ -306,17 +306,17 @@ mark{display:inline}
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>root.attrib</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.attrib</kbd> <span class=u>①</span></a>
|
||||
<samp>{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root[4]</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root[4]</kbd> <span class=u>②</span></a>
|
||||
<samp><Element {http://www.w3.org/2005/Atom}link at e181b0></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root[4].attrib</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root[4].attrib</kbd> <span class=u>③</span></a>
|
||||
<samp>{'href': 'http://diveintomark.org/',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root[3]</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root[3]</kbd> <span class=u>④</span></a>
|
||||
<samp><Element {http://www.w3.org/2005/Atom}updated at e2b4e0></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root[3].attrib</kbd> <span>⑤</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root[3].attrib</kbd> <span class=u>⑤</span></a>
|
||||
<samp>{}</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>attrib</code> property is a dictionary of the element’s attributes. The original markup here was <code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'></code>. The <code>xml:</code> prefix refers to a built-in namespace that every <abbr>XML</abbr> document can use without declaring it.
|
||||
@@ -336,15 +336,15 @@ mark{display:inline}
|
||||
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd>
|
||||
<samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span class=u>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>root.tag</kbd>
|
||||
<samp>'{http://www.w3.org/2005/Atom}feed'</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}feed')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}feed')</kbd> <span class=u>②</span></a>
|
||||
<samp>[]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span class=u>③</span></a>
|
||||
<samp>[]</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>findall()</code> method finds child elements that match a specific query. (More on the query format in a minute.)
|
||||
@@ -353,11 +353,11 @@ mark{display:inline}
|
||||
</ol>
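<p>The children-only behavior of <code>findall()</code> on an element can be reproduced with a small inline feed. This is a sketch under the assumption that a two-entry document stands in for <code>examples/feed.xml</code>; the <code>NS</code> constant is just a convenience for the illustration.

```python
import xml.etree.ElementTree as etree

NS = '{http://www.w3.org/2005/Atom}'

feed = etree.fromstring(
    "<feed xmlns='http://www.w3.org/2005/Atom'>"
    "<title>dive into mark</title>"
    "<entry><author><name>Mark</name></author></entry>"
    "<entry><author><name>Mark</name></author></entry>"
    "</feed>"
)

# entry elements are direct children of feed, so they are found.
entries = feed.findall(NS + 'entry')

# author elements are grandchildren of feed, so this search comes back empty.
authors = feed.findall(NS + 'author')

# Searching from an entry finds its own author child.
nested_authors = entries[0].findall(NS + 'author')

# An element is not its own child, so feed does not find itself.
self_match = feed.findall(NS + 'feed')

print(len(entries), authors, len(nested_authors), self_match)
```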
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span class=u>①</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}author')</kbd> <span class=u>②</span></a>
|
||||
<samp>[]</samp>
|
||||
</pre>
|
||||
<ol>
|
||||
@@ -368,17 +368,17 @@ mark{display:inline}
|
||||
<p>There <em>is</em> a way to search for <em>descendant</em> elements, <i>i.e.</i> children, grandchildren, and any element at any nesting level.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links = tree.findall('//{http://www.w3.org/2005/Atom}link')</kbd> <span class=u>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>all_links</kbd>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at e181b0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b570>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b480>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at e2b5a0>]</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[0].attrib</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[0].attrib</kbd> <span class=u>②</span></a>
|
||||
<samp>{'href': 'http://diveintomark.org/',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[1].attrib</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>all_links[1].attrib</kbd> <span class=u>③</span></a>
|
||||
<samp>{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition',
|
||||
'type': 'text/html',
|
||||
'rel': 'alternate'}</samp>
|
||||
@@ -400,8 +400,8 @@ mark{display:inline}
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>it = tree.iter('{http://www.w3.org/2005/Atom}link')</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>next(it)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>it = tree.iter('{http://www.w3.org/2005/Atom}link')</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>next(it)</kbd> <span class=u>②</span></a>
|
||||
<Element {http://www.w3.org/2005/Atom}link at 122f1b0>
|
||||
<samp class=p>>>> </samp><kbd>next(it)</kbd>
|
||||
<Element {http://www.w3.org/2005/Atom}link at 122f1e0>
|
||||
@@ -427,10 +427,10 @@ StopIteration</samp></pre>
|
||||
<p><a href=http://codespeak.net/lxml/><code>lxml</code></a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you’ll need to <a href=http://codespeak.net/lxml/installation.html>install <code>lxml</code> manually</a>.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>from lxml import etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>from lxml import etree</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree = etree.parse('examples/feed.xml')</kbd> <span class=u>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root = tree.getroot()</kbd> <span class=u>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>root.findall('{http://www.w3.org/2005/Atom}entry')</kbd> <span class=u>④</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp></pre>
|
||||
@@ -443,7 +443,7 @@ StopIteration</samp></pre>
|
||||
|
||||
<p>For large <abbr>XML</abbr> documents, <code>lxml</code> is significantly faster than the built-in ElementTree library. If you’re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import <code>lxml</code> and fall back to the built-in ElementTree.
|
||||
|
||||
<pre><code>try:
|
||||
<pre><code class=pp>try:
|
||||
from lxml import etree
|
||||
except ImportError:
|
||||
import xml.etree.ElementTree as etree</code></pre>
|
||||
@@ -451,17 +451,17 @@ except ImportError:
|
||||
<p>But <code>lxml</code> is more than just a faster ElementTree. Its <code>findall()</code> method includes support for more complicated expressions.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span class=u>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed.xml')</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{http://www.w3.org/2005/Atom}*[@href]')</kbd> <span class=u>②</span></a>
|
||||
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span class=u>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>NS = '{http://www.w3.org/2005/Atom}'</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{NS}author[{NS}uri]'.format(NS=NS))</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall('//{NS}author[{NS}uri]'.format(NS=NS))</kbd> <span class=u>④</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
||||
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
||||
<ol>
|
||||
@@ -476,13 +476,13 @@ except ImportError:
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed.xml')</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {'atom': 'http://www.w3.org/2005/Atom'}</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span class=u>②</span></a>
|
||||
<samp class=p>... </samp><kbd> namespaces=NSMAP)</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span class=u>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>entry = entries[0]</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath('./atom:title/text()', namespaces=NSMAP)</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath('./atom:title/text()', namespaces=NSMAP)</kbd> <span class=u>④</span></a>
|
||||
<samp>['Accessibility is a harsh mistress']</samp></pre>
|
||||
<ol>
|
||||
<li>To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is just a Python dictionary.
|
||||
@@ -499,9 +499,9 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import xml.etree.ElementTree as etree</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',</kbd> <span>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(etree.tostring(new_feed))</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = etree.Element('{http://www.w3.org/2005/Atom}feed',</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>... </samp><kbd> attrib={'{http://www.w3.org/XML/1998/namespace}lang': 'en'})</kbd> <span class=u>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(etree.tostring(new_feed))</kbd> <span class=u>③</span></a>
|
||||
<samp><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></samp></pre>
|
||||
<ol>
|
||||
<li>To create a new element, instantiate the <code>Element</code> class. You pass the element name (namespace + local name) as the first argument. This statement creates a <code>feed</code> element in the Atom namespace. This will be our new document’s root element.
|
||||
@@ -513,11 +513,11 @@ except ImportError:
|
||||
|
||||
<p>An <abbr>XML</abbr> parser won’t “see” any difference between an <abbr>XML</abbr> document with a default namespace and an <abbr>XML</abbr> document with a prefixed namespace. The resulting <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
<pre class=nd><code class=pp><ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>is identical to the <abbr>DOM</abbr> of this serialization:
|
||||
|
||||
<pre class=nd><code><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
<pre class=nd><code class=pp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></code></pre>
|
||||
|
||||
<p>The only practical difference is that the second serialization is several characters shorter. If we were to recast our entire sample feed with a <code>ns0:</code> prefix in every start and end tag, it would add 4 characters per start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters. Assuming <a href=strings.html#byte-arrays>UTF-8 encoding</a>, that’s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
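<p>Both claims, that the two serializations parse to the same thing and that the prefixed one is slightly longer, can be checked with a short sketch. It assumes the standard library's <code>xml.etree.ElementTree</code> module and uses only the single-element serializations shown above, not the whole sample feed.

```python
import xml.etree.ElementTree as etree

prefixed = "<ns0:feed xmlns:ns0='http://www.w3.org/2005/Atom' xml:lang='en'/>"
default = "<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/>"

a = etree.fromstring(prefixed)
b = etree.fromstring(default)

# After parsing, the prefix is gone: both elements have the same
# {namespace}localname tag and the same attributes.
same = (a.tag == b.tag) and (a.attrib == b.attrib)

# For this one self-closing element the prefix costs 8 characters:
# 4 for "ns0:" in the tag and 4 for ":ns0" in the declaration.
extra = len(prefixed.encode('utf-8')) - len(default.encode('utf-8'))

print(same, extra)
```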
|
||||
|
||||
@@ -525,11 +525,11 @@ except ImportError:
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {None: 'http://www.w3.org/2005/Atom'}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = lxml.etree.Element('feed', nsmap=NSMAP)</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {None: 'http://www.w3.org/2005/Atom'}</kbd> <span class=u>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed = lxml.etree.Element('feed', nsmap=NSMAP)</kbd> <span class=u>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span class=u>③</span></a>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom'/></samp>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>new_feed.set('{http://www.w3.org/XML/1998/namespace}lang', 'en')</kbd> <span class=u>④</span></a>
|
||||
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
|
||||
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'/></samp></pre>
|
||||
<ol>
|
||||
@@ -542,14 +542,14 @@ except ImportError:
<p>Are <abbr>XML</abbr> documents limited to one element per document? No, of course not. You can easily create child elements, too.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>title = lxml.etree.SubElement(new_feed, 'title',</kbd> <span>①</span></a>
<a><samp class=p>... </samp><kbd> attrib={'type':'html'})</kbd> <span>②</span></a>
<a><samp class=p>>>> </samp><kbd>title = lxml.etree.SubElement(new_feed, 'title',</kbd> <span class=u>①</span></a>
<a><samp class=p>... </samp><kbd> attrib={'type':'html'})</kbd> <span class=u>②</span></a>
<samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd>
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'/></feed></samp>
<a><samp class=p>>>> </samp><kbd>title.text = 'dive into &hellip;'</kbd> <span>③</span></a>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span>④</span></a>
<a><samp class=p>>>> </samp><kbd>title.text = 'dive into &hellip;'</kbd> <span class=u>③</span></a>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed))</kbd> <span class=u>④</span></a>
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'><title type='html'>dive into &amp;hellip;</title></feed></samp>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed, pretty_print=True))</kbd> <span>⑤</span></a>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(new_feed, pretty_print=True))</kbd> <span class=u>⑤</span></a>
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title type='html'>dive into &amp;hellip;</title>
</feed></samp></pre>
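<p>The same escaping behavior can be sketched with the standard library’s <code>xml.etree.ElementTree</code> instead of <code>lxml</code>. This is a swapped-in illustration, not the chapter’s own example: it is assumed to behave the same way only for the steps shown (creating a child element with an attribute, setting its text, and serializing), and its serialized output differs in prefix and quoting style.

```python
import xml.etree.ElementTree as ET

ATOM = 'http://www.w3.org/2005/Atom'

# Build <feed> and a <title type='html'> child, both in the Atom namespace.
feed = ET.Element('{%s}feed' % ATOM)
title = ET.SubElement(feed, '{%s}title' % ATOM, attrib={'type': 'html'})

# The text is stored as-is; the ampersand is escaped only on serialization.
title.text = 'dive into &hellip;'
serialized = ET.tostring(feed, encoding='unicode')
print(serialized)

# Parsing the serialized form gives back the original text.
round_tripped = ET.fromstring(serialized)
print(round_tripped[0].text)
```

<p>The point of the round trip is that escaping is the serializer’s job: you assign plain text, and the ampersand comes back intact after parsing.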
@@ -574,9 +574,9 @@ except ImportError:
<p>Here is a fragment of a broken <abbr>XML</abbr> document. I’ve highlighted the wellformedness error.
<pre class=nd><code><?xml version='1.0' encoding='utf-8'?>
<pre class=nd><code class=pp><?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into <mark>…</mark></title>
<title>dive into <mark>&hellip;</mark></title>
...
</feed></code></pre>
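<p>To see why an undefined entity is a wellformedness error and not a cosmetic one, here is a minimal sketch using the standard library’s <code>xml.etree.ElementTree</code> as a stand-in for <code>lxml</code>. The document below is a hypothetical reconstruction of the fragment, with the elided part simply left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the broken fragment; &hellip; is an HTML
# entity that is never declared in this XML document.
BROKEN = b"""<?xml version='1.0' encoding='utf-8'?>
<feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
  <title>dive into &hellip;</title>
</feed>"""

try:
    ET.fromstring(BROKEN)
    error = None
except ET.ParseError as exc:
    # A conforming parser must stop at the first wellformedness error.
    error = exc

print(error)
```

<p>Unlike <code>lxml</code>, the standard library parser has no recovery mode, so the exception is the only outcome here.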
@@ -600,16 +600,16 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
<p>To parse this broken <abbr>XML</abbr> document, despite its wellformedness error, you need to create a custom <abbr>XML</abbr> parser.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>parser = lxml.etree.XMLParser(recover=True)</kbd> <span>①</span></a>
<a><samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed-broken.xml', parser)</kbd> <span>②</span></a>
<a><samp class=p>>>> </samp><kbd>parser.error_log</kbd> <span>③</span></a>
<a><samp class=p>>>> </samp><kbd>parser = lxml.etree.XMLParser(recover=True)</kbd> <span class=u>①</span></a>
<a><samp class=p>>>> </samp><kbd>tree = lxml.etree.parse('examples/feed-broken.xml', parser)</kbd> <span class=u>②</span></a>
<a><samp class=p>>>> </samp><kbd>parser.error_log</kbd> <span class=u>③</span></a>
<samp>examples/feed-broken.xml:3:28:FATAL:PARSER:ERR_UNDECLARED_ENTITY: Entity 'hellip' not defined</samp>
<samp class=p>>>> </samp><kbd>tree.findall('{http://www.w3.org/2005/Atom}title')</kbd>
<samp>[<Element {http://www.w3.org/2005/Atom}title at ead510>]</samp>
<samp class=p>>>> </samp><kbd>title = tree.findall('{http://www.w3.org/2005/Atom}title')[0]</kbd>
<a><samp class=p>>>> </samp><kbd>title.text</kbd> <span>④</span></a>
<a><samp class=p>>>> </samp><kbd>title.text</kbd> <span class=u>④</span></a>
<samp>'dive into '</samp>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(tree.getroot()))</kbd> <span>⑤</span></a>
<a><samp class=p>>>> </samp><kbd>print(lxml.etree.tounicode(tree.getroot()))</kbd> <span class=u>⑤</span></a>
<samp><feed xmlns='http://www.w3.org/2005/Atom' xml:lang='en'>
<title>dive into </title>
.
@@ -640,7 +640,8 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with <code>lxml</code></a>
</ul>
<p class=v><a rel=prev class=todo><span>☜</span></a> <a rel=next class=todo><span>☞</span></a>
<p class=v><a rel=prev class=todo><span class=u>☜</span></a> <a rel=next class=todo><span class=u>☞</span></a>
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
<script src=j/jquery.js></script>
<script src=j/prettify.js></script>
<script src=j/dip3.js></script>