moved an example with lxml-specific syntax to lxml section [thanks G.P.]

Mark Pilgrim
2009-06-01 19:51:28 -07:00
parent 851fd27d7f
commit de717f0830
+45 -44
@@ -396,27 +396,6 @@ mark{display:inline}
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
</ol>
<p>The <code>findall()</code> method has a few other tricks up its sleeve.
<pre class=screen>
# continuing from the previous example
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>&#x2460;</span></a>
[&lt;Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb990>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb960>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>&#x2461;</span></a>
<samp>[&lt;Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>&#x2462;</span></a>
<samp>[&lt;Element {http://www.w3.org/2005/Atom}author at eeba80>,
&lt;Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
<ol>
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means &#8220;elements anywhere (not just as children of the root element).&#8221; <code>{http://www.w3.org/2005/Atom}</code> means &#8220;only elements in the Atom namespace.&#8221; <code>*</code> means &#8220;elements with any local name.&#8221; And <code>[@href]</code> means &#8220;has an <code>href</code> attribute.&#8221;
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
</ol>
<p>What&#8217;s that? You say you want the power of the <code>findall()</code> method, but you want to work with an iterator instead of building a complete list? ElementTree can do that too.
<pre class=screen>
@@ -445,7 +424,7 @@ StopIteration</samp></pre>
<h2 id=xml-lxml>Going Further With lxml</h2>
<p><a href=http://codespeak.net/lxml/>lxml</a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you&#8217;ll need to <a href=http://codespeak.net/lxml/installation.html>install lxml manually</a>.
<p><a href=http://codespeak.net/lxml/><code>lxml</code></a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you&#8217;ll need to <a href=http://codespeak.net/lxml/installation.html>install <code>lxml</code> manually</a>.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>from lxml import etree</kbd> <span>&#x2460;</span></a>
@@ -456,34 +435,56 @@ StopIteration</samp></pre>
&lt;Element {http://www.w3.org/2005/Atom}entry at e2b510>,
&lt;Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp></pre>
<ol>
<li>Once imported, lxml provides the same <abbr>API</abbr> as the built-in ElementTree library.
<li>Once imported, <code>lxml</code> provides the same <abbr>API</abbr> as the built-in ElementTree library.
<li><code>parse()</code> function: same as ElementTree.
<li><code>getroot()</code> method: also the same.
<li><code>findall()</code> method: exactly the same.
</ol>
<p>For large <abbr>XML</abbr> documents, lxml is significantly faster than the built-in ElementTree library. If you&#8217;re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import lxml and fall back to the built-in ElementTree.
<p>For large <abbr>XML</abbr> documents, <code>lxml</code> is significantly faster than the built-in ElementTree library. If you&#8217;re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import <code>lxml</code> and fall back to the built-in ElementTree.
<pre><code>try:
from lxml import etree
except ImportError:
import xml.etree.ElementTree as etree</code></pre>
<p>But lxml is more than just a faster ElementTree. It also integrates support for arbitrary XPath expressions. I&#8217;m not going to go into depth about XPath syntax. (That could be a whole book unto itself!) But I will show you how it integrates into lxml.
<p>But <code>lxml</code> is more than just a faster ElementTree. Its <code>findall()</code> method includes support for more complicated expressions.
<pre class=screen>
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>&#x2460;</span></a>
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
<a><samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd> <span>&#x2461;</span></a>
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>&#x2462;</span></a>
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>&#x2461;</span></a>
[&lt;Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb990>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb960>,
&lt;Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>&#x2462;</span></a>
<samp>[&lt;Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>&#x2463;</span></a>
<samp>[&lt;Element {http://www.w3.org/2005/Atom}author at eeba80>,
&lt;Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
<ol>
<li>In this example, I&#8217;m going to <code>import lxml.etree</code> (instead of, say, <code>from lxml import etree</code>), to emphasize that these features are specific to <code>lxml</code>.
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means &#8220;elements anywhere (not just as children of the root element).&#8221; <code>{http://www.w3.org/2005/Atom}</code> means &#8220;only elements in the Atom namespace.&#8221; <code>*</code> means &#8220;elements with any local name.&#8221; And <code>[@href]</code> means &#8220;has an <code>href</code> attribute.&#8221;
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
</ol>
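<p>The three query shapes above can be tried without the sample feed. The following is only a sketch: the tiny inline feed is made up for illustration, and it uses the built-in ElementTree on the assumption that you are running a current Python 3, where the built-in <code>findall()</code> has since gained these predicates. The same calls should also work on an <code>lxml</code> element. Note the leading <code>.//</code>, which the built-in library expects when searching from an element.

```python
import xml.etree.ElementTree as etree

# Hypothetical miniature feed standing in for examples/feed.xml
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" href="http://diveintomark.org/"/>
  <entry>
    <author><name>Mark</name><uri>http://diveintomark.org/</uri></author>
    <link rel="alternate" href="http://example.org/first"/>
  </entry>
  <entry>
    <author><name>Mark</name></author>
    <link rel="alternate" href="http://example.org/second"/>
  </entry>
</feed>"""

NS = "{http://www.w3.org/2005/Atom}"
root = etree.fromstring(FEED)

# Atom link elements, anywhere in the document, that have an href attribute
links = root.findall(".//{NS}link[@href]".format(NS=NS))

# Atom link elements whose href has one specific value
home = root.findall(
    ".//{NS}link[@href='http://diveintomark.org/']".format(NS=NS))

# Atom author elements that have an Atom uri element as a child
authors = root.findall(".//{NS}author[{NS}uri]".format(NS=NS))

print(len(links), len(home), len(authors))
```

<p>Only the first <code>author</code> in this made-up feed has a <code>uri</code> child, so the last query skips the second one, just as the real feed&#8217;s last <code>entry</code> is skipped above.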
<p>Not enough for you? <code>lxml</code> also integrates support for arbitrary XPath expressions. I&#8217;m not going to go into depth about XPath syntax; that could be a whole book unto itself! But I will show you how it integrates into <code>lxml</code>.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
<a><samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd> <span>&#x2460;</span></a>
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>&#x2461;</span></a>
<samp class=p>... </samp><kbd> namespaces=NSMAP)</kbd>
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>&#x2463;</span></a>
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>&#x2462;</span></a>
<samp>[&lt;Element {http://www.w3.org/2005/Atom}entry at e2b630>]</samp>
<samp class=p>>>> </samp><kbd>entry = entries[0]</kbd>
<a><samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=NSMAP)</kbd> <span>&#x2464;</span></a>
<a><samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=NSMAP)</kbd> <span>&#x2463;</span></a>
<samp>['Accessibility is a harsh mistress']</samp></pre>
<ol>
<li>In this example, I&#8217;m going to <code>import lxml.etree</code> (instead of, say, <code>from lxml import etree</code>), to emphasize that these features are specific to lxml.
<li>To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is just a Python dictionary.
<li>Here is an XPath query. The XPath expression searches for <code>category</code> elements (in the Atom namespace) that contain a <code>term</code> attribute with the value <code>accessibility</code>. But that&#8217;s not actually the query result. Look at the very end of the query string; did you notice the <code>/..</code> bit? That means &#8220;and then return the parent element of the <code>category</code> element you just found.&#8221; So this single XPath query will find all entries with a child element of <code>&lt;category term="accessibility"></code>.
<li>The <code>xpath()</code> method returns a list of ElementTree element objects. In this document, there is only one entry with a <code>category</code> whose <code>term</code> is <code>accessibility</code>.
@@ -520,7 +521,7 @@ except ImportError:
<p>The only practical difference is that the second serialization is several characters shorter. If we were to recast our entire sample feed with a <code>ns0:</code> prefix in every start and end tag, it would add 4 characters per start tag &times; 79 tags + 4 characters for the namespace declaration itself, for a total of 320 characters. Assuming <a href=strings.html#byte-arrays>UTF-8 encoding</a>, that&#8217;s 320 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn&#8217;t matter to you, but for something like an Atom feed, which may be downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
<p>The built-in ElementTree library does not offer this fine-grained control over serializing namespaced elements, but lxml does.
<p>The built-in ElementTree library does not offer this fine-grained control over serializing namespaced elements, but <code>lxml</code> does.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
@@ -533,9 +534,9 @@ except ImportError:
<samp>&lt;feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
<ol>
<li>To start, define a namespace mapping as a dictionary. Dictionary values are namespaces; dictionary keys are the desired prefix. Using <code>None</code> as a prefix effectively declares a default namespace.
<li>Now you can pass the lxml-specific <var>nsmap</var> argument when you create an element, and lxml will respect the namespace prefixes you&#8217;ve defined.
<li>Now you can pass the <code>lxml</code>-specific <var>nsmap</var> argument when you create an element, and <code>lxml</code> will respect the namespace prefixes you&#8217;ve defined.
<li>As expected, this serialization defines the Atom namespace as the default namespace and declares the <code>feed</code> element without a namespace prefix.
<li>Oops, we forgot to add the <code>xml:lang</code> attribute. You can always add attributes to any element with the <code>set()</code> method. It takes two arguments: the attribute name in standard ElementTree format, then the attribute value. (This method is not lxml-specific. The only lxml-specific part of this example was the <var>nsmap</var> argument to control the namespace prefixes in the serialized output.)
<li>Oops, we forgot to add the <code>xml:lang</code> attribute. You can always add attributes to any element with the <code>set()</code> method. It takes two arguments: the attribute name in standard ElementTree format, then the attribute value. (This method is not <code>lxml</code>-specific. The only <code>lxml</code>-specific part of this example was the <var>nsmap</var> argument to control the namespace prefixes in the serialized output.)
</ol>
<p>Are <abbr>XML</abbr> documents limited to one element per document? No, of course not. You can easily create child elements, too.
@@ -555,21 +556,21 @@ except ImportError:
<ol>
<li>To create a child element of an existing element, instantiate the <code>SubElement</code> class. The only required arguments are the parent element (<var>new_feed</var> in this case) and the new element&#8217;s name. Since this child element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or prefix here.
<li>You can also pass in an attribute dictionary. Keys are attribute names; values are attribute values.
<li>As expected, the new <code>title</code> element was created in the Atom namespace, and it was inserted as a child of the <code>feed</code> element. Since the <code>title</code> element has no text content and no children of its own, lxml serializes it as an empty element (with the <code>/></code> shortcut).
<li>As expected, the new <code>title</code> element was created in the Atom namespace, and it was inserted as a child of the <code>feed</code> element. Since the <code>title</code> element has no text content and no children of its own, <code>lxml</code> serializes it as an empty element (with the <code>/></code> shortcut).
<li>To set the text content of an element, simply set its <code>.text</code> property.
<li>Now the <code>title</code> element is serialized with its text content. Any text content that contains less-than signs or ampersands needs to be escaped when serialized. lxml handles this escaping automatically.
<li>You can also apply &#8220;pretty printing&#8221; to the serialization, which inserts line breaks after end tags, and after start tags of elements that contain child elements but no text content. In technical terms, lxml adds &#8220;insignificant whitespace&#8221; to make the output more readable.
<li>Now the <code>title</code> element is serialized with its text content. Any text content that contains less-than signs or ampersands needs to be escaped when serialized. <code>lxml</code> handles this escaping automatically.
<li>You can also apply &#8220;pretty printing&#8221; to the serialization, which inserts line breaks after end tags, and after start tags of elements that contain child elements but no text content. In technical terms, <code>lxml</code> adds &#8220;insignificant whitespace&#8221; to make the output more readable.
</ol>
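<p>Most of these steps are part of the shared ElementTree <abbr>API</abbr>, so here is a rough sketch using only the built-in library. The element names and text are made up for illustration. The <var>nsmap</var> and <var>pretty_print</var> arguments are <code>lxml</code>-specific and are left out, so expect the built-in serializer to pick its own namespace prefix.

```python
import xml.etree.ElementTree as etree

ATOM = "{http://www.w3.org/2005/Atom}"

# Parent element, named in standard ElementTree {namespace}localname format
feed = etree.Element(ATOM + "feed")

# Child element: the parent, the new element's name, and an attribute dictionary
title = etree.SubElement(feed, ATOM + "title", {"type": "html"})

# Text content is stored unescaped on the .text property...
title.text = "dive into <ampersands> & more"

# ...and escaped automatically when the tree is serialized
xml_text = etree.tostring(feed, encoding="unicode")
print(xml_text)

# Parsing the serialization gives back the original, unescaped text
round_trip = etree.fromstring(xml_text)
print(round_trip[0].text)
```

<p>The point of the round trip is that you never escape text by hand; you set <code>.text</code> to the real string and let the serializer and parser handle the rest.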
<p class=a>&#x2042;
<h2 id=xml-custom-parser>Customizing Your XML Parser</h2>
<h2 id=xml-custom-parser>Parsing Broken XML</h2>
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ &#8220;draconian error handling.&#8221; That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr> &mdash; your browser doesn&#8217;t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it&#8217;s significantly more complicated than &#8220;halt and catch fire on first error.&#8221;)
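<p>To make &#8220;halt and catch fire&#8221; concrete, here is a small sketch using the built-in ElementTree parser, which follows the draconian rule. The helper name <code>is_well_formed()</code> and the sample strings are made up for illustration; the only library behavior relied on is that <code>fromstring()</code> raises <code>ParseError</code> on a wellformedness error.

```python
import xml.etree.ElementTree as etree

def is_well_formed(text):
    """Return True if the built-in parser accepts text, False otherwise.

    Hypothetical helper for illustration only.
    """
    try:
        etree.fromstring(text)
    except etree.ParseError:
        # The parser stopped at the first wellformedness error
        return False
    return True

# An entity that is not defined in XML
undefined_entity = "<feed><title>dive into &hellip;</title></feed>"

# Mismatched start and end tags
mismatched_tags = "<feed><title>dive into mark</feed>"

# The same character written as a numeric reference is fine
numeric_reference = "<feed><title>dive into &#8230;</title></feed>"

for sample in (undefined_entity, mismatched_tags, numeric_reference):
    print(is_well_formed(sample))
```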
<p>Some people (myself included) believe that it was a mistake for the inventors of <abbr>XML</abbr> to mandate draconian error handling. Don&#8217;t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of &#8220;wellformedness&#8221; is trickier than it sounds, especially for <code>XML</code> documents (like Atom feeds) that are published on the web and served over <abbr>HTTP</abbr>. Despite the maturity of <abbr>XML</abbr>, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.
<p>So, I have both theoretical and practical reasons to parse <code>XML</code> documents &#8220;at any cost,&#8221; that is, <em>not</em> to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.
<p>So, I have both theoretical and practical reasons to parse <code>XML</code> documents &#8220;at any cost,&#8221; that is, <em>not</em> to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, <code>lxml</code> can help.
<p>Here is a fragment of a broken <abbr>XML</abbr> document. I&#8217;ve highlighted the wellformedness error.
@@ -579,7 +580,7 @@ except ImportError:
...
&lt;/feed></code></pre>
<p>That&#8217;s an error, because the <code>&amp;hellip;</code> entity is not defined in <abbr>XML</abbr>. (It is defined in <abbr>HTML</abbr>.) If you try to parse this broken feed with the default settings, lxml will choke on the undefined entity.
<p>That&#8217;s an error, because the <code>&amp;hellip;</code> entity is not defined in <abbr>XML</abbr>. (It is defined in <abbr>HTML</abbr>.) If you try to parse this broken feed with the default settings, <code>lxml</code> will choke on the undefined entity.
<pre class=screen>
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
@@ -616,7 +617,7 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
.</samp></pre>
<ol>
<li>To create a custom parser, instantiate the <code>lxml.etree.XMLParser</code> class. It can take <a href=http://codespeak.net/lxml/parsing.html#parser-options>a number of different named arguments</a>. The one we&#8217;re interested in here is the <var>recover</var> argument. When set to <code>True</code>, the <abbr>XML</abbr> parser will try its best to &#8220;recover&#8221; from wellformedness errors.
<li>To parse an <code>XML</code> document with your custom parser, pass the <var>parser</var> object as the second argument to the <code>parse()</code> function. Note that lxml does not raise an exception about the undefined <code>&amp;hellip;</code> entity.
<li>To parse an <code>XML</code> document with your custom parser, pass the <var>parser</var> object as the second argument to the <code>parse()</code> function. Note that <code>lxml</code> does not raise an exception about the undefined <code>&amp;hellip;</code> entity.
<li>The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless of whether it is set to recover from those errors or not.)
<li>Since it didn&#8217;t know what to do with the undefined <code>&amp;hellip;</code> entity, the parser just silently dropped it. The text content of the <code>title</code> element becomes <code>"dive into "</code>.
<li>As you can see from the serialization, the <code>&amp;hellip;</code> entity didn&#8217;t get moved; it was simply dropped.
@@ -634,9 +635,9 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
<li><a href=http://effbot.org/zone/element.htm>Elements and Element Trees</a>
<li><a href=http://effbot.org/zone/element-xpath.htm>XPath Support in ElementTree</a>
<li><a href=http://effbot.org/zone/element-iterparse.htm>The ElementTree iterparse Function</a>
<li><a href=http://codespeak.net/lxml/>lxml</a>
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with lxml</a>
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with lxml</a>
<li><a href=http://codespeak.net/lxml/><code>lxml</code></a>
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with <code>lxml</code></a>
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with <code>lxml</code></a>
</ul>
<p class=c>&copy; 2001&ndash;9 <a href=about.html>Mark Pilgrim</a>