mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
moved an example with lxml-specific syntax to lxml section [thanks G.P.]
This commit is contained in:
@@ -396,27 +396,6 @@ mark{display:inline}
|
||||
<li>The other three results are each entry-level alternate links. Each <code>entry</code> has a single <code>link</code> child element, and because of the double slash at the beginning of the query, this query finds all of them.
|
||||
</ol>
|
||||
|
||||
<p>The <code>findall()</code> method has a few other tricks up its sleeve.
|
||||
|
||||
<pre class=screen>
|
||||
# continuing from the previous example
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>①</span></a>
|
||||
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>②</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
||||
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
||||
<ol>
|
||||
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means “elements anywhere (not just as children of the root element).” <code>{http://www.w3.org/2005/Atom}</code> means “only elements in the Atom namespace.” <code>*</code> means “elements with any local name.” And <code>[@href]</code> means “has an <code>href</code> attribute.”
|
||||
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
|
||||
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
|
||||
</ol>
|
||||
|
||||
<p>What’s that? You say you want the power of the <code>findall()</code> method, but you want to work with an iterator instead of building a complete list? ElementTree can do that too.
|
||||
|
||||
<pre class=screen>
|
||||
@@ -445,7 +424,7 @@ StopIteration</samp></pre>
|
||||
|
||||
<h2 id=xml-lxml>Going Further With lxml</h2>
|
||||
|
||||
<p><a href=http://codespeak.net/lxml/>lxml</a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you’ll need to <a href=http://codespeak.net/lxml/installation.html>install lxml manually</a>.
|
||||
<p><a href=http://codespeak.net/lxml/><code>lxml</code></a> is an open source third-party library that builds on the popular <a href=http://www.xmlsoft.org/>libxml2 parser</a>. It provides a 100% compatible ElementTree <abbr>API</abbr>, then extends it with full XPath support and a few other niceties. There are <a href=http://pypi.python.org/pypi/lxml/>installers available for Windows</a>; Linux users should always try to use distribution-specific tools like <code>yum</code> or <code>apt-get</code> to install precompiled binaries from their repositories. Otherwise you’ll need to <a href=http://codespeak.net/lxml/installation.html>install <code>lxml</code> manually</a>.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>from lxml import etree</kbd> <span>①</span></a>
|
||||
@@ -456,34 +435,56 @@ StopIteration</samp></pre>
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b510>,
|
||||
<Element {http://www.w3.org/2005/Atom}entry at e2b540>]</samp></pre>
|
||||
<ol>
|
||||
<li>Once imported, lxml provides the same <abbr>API</abbr> as the built-in ElementTree libary.
|
||||
<li>Once imported, <code>lxml</code> provides the same <abbr>API</abbr> as the built-in ElementTree libary.
|
||||
<li><code>parse()</code> function: same as ElementTree.
|
||||
<li><code>getroot()</code> method: also the same.
|
||||
<li><code>findall()</code> method: exactly the same.
|
||||
</ol>
|
||||
|
||||
<p>For large <abbr>XML</abbr> documents, lxml is significantly faster than the built-in ElementTree libary. If you’re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import lxml and fall back to the built-in ElementTree.
|
||||
<p>For large <abbr>XML</abbr> documents, <code>lxml</code> is significantly faster than the built-in ElementTree libary. If you’re only using the ElementTree <abbr>API</abbr> and want to use the fastest available implementation, you can try to import <code>lxml</code> and fall back to the built-in ElementTree.
|
||||
|
||||
<pre><code>try:
|
||||
from lxml import etree
|
||||
except ImportError:
|
||||
import xml.etree.ElementTree as etree</code></pre>
|
||||
|
||||
<p>But lxml is more than just a faster ElementTree. It also integrates support for arbitrary XPath expressions. I’m not going to go into depth about XPath syntax. (That could be a whole book unto itself!) But I will show you how it integrates into lxml.
|
||||
<p>But <code>lxml</code> is more than just a faster ElementTree. Its <code>findall()</code> method includes support for more complicated expressions.
|
||||
|
||||
<pre class=screen>
|
||||
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>import lxml.etree</kbd> <span>①</span></a>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd> <span>②</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>③</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href]")</kbd> <span>②</span></a>
|
||||
[<Element {http://www.w3.org/2005/Atom}link at eeb8a0>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb990>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb960>,
|
||||
<Element {http://www.w3.org/2005/Atom}link at eeb9c0>]
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{http://www.w3.org/2005/Atom}*[@href='http://diveintomark.org/']")</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}link at eeb930>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>NS = "{http://www.w3.org/2005/Atom}"</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>tree.findall("//{NS}author[{NS}uri]".format(NS=NS))</kbd> <span>④</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}author at eeba80>,
|
||||
<Element {http://www.w3.org/2005/Atom}author at eebba0>]</samp></pre>
|
||||
<ol>
|
||||
<li>In this example, I’m going to <code>import lxml.etree</code> (instead of, say, <code>from lxml import etree</code>), to emphasize that these features are specific to <code>lxml</code>.
|
||||
<li>This query finds all elements in the Atom namespace, anywhere in the document, that have an <code>href</code> attribute. The <code>//</code> at the beginning of the query means “elements anywhere (not just as children of the root element).” <code>{http://www.w3.org/2005/Atom}</code> means “only elements in the Atom namespace.” <code>*</code> means “elements with any local name.” And <code>[@href]</code> means “has an <code>href</code> attribute.”
|
||||
<li>The query finds all Atom elements with an <code>href</code> whose value is <code>http://diveintomark.org/</code>.
|
||||
<li>After doing some quick <a href=strings.html#formatting-strings>string formatting</a> (because otherwise these compound queries get ridiculously long), this query searches for Atom <code>author</code> elements that have an Atom <code>uri</code> element as a child. This only returns two <code>author</code> elements, the ones in the first and second <code>entry</code>. The <code>author</code> in the last <code>entry</code> contains only a <code>name</code>, not a <code>uri</code>.
|
||||
</ol>
|
||||
|
||||
<p>Not enough for you? <code>lxml</code> also integrates support for arbitrary XPath expressions. I’m not going to go into depth about XPath syntax; that could be a whole book unto itself! But I will show you how it integrates into <code>lxml</code>.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
<samp class=p>>>> </samp><kbd>tree = lxml.etree.parse("examples/feed.xml")</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>NSMAP = {"atom": "http://www.w3.org/2005/Atom"}</kbd> <span>①</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries = tree.xpath("//atom:category[@term='accessibility']/..",</kbd> <span>②</span></a>
|
||||
<samp class=p>... </samp><kbd> namespaces=NSMAP)</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>④</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entries</kbd> <span>③</span></a>
|
||||
<samp>[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]</samp>
|
||||
<samp class=p>>>> </samp><kbd>entry = entries[0]</kbd>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=nsmap)</kbd> <span>⑤</span></a>
|
||||
<a><samp class=p>>>> </samp><kbd>entry.xpath("./atom:title/text()", namespaces=nsmap)</kbd> <span>④</span></a>
|
||||
<samp>['Accessibility is a harsh mistress']</samp></pre>
|
||||
<ol>
|
||||
<li>In this example, I’m going to <code>import lxml.etree</code> (instead of, say, <code>from lxml import etree</code>), to emphasize that these features are specific to lxml.
|
||||
<li>To perform XPath queries on namespaced elements, you need to define a namespace prefix mapping. This is just a Python dictionary.
|
||||
<li>Here is an XPath query. The XPath expression searches for <code>category</code> elements (in the Atom namespace) that contain a <code>term</code> attribute with the value <code>accessibility</code>. But that’s not actually the query result. Look at the very end of the query string; did you notice the <code>/..</code> bit? That means “and then return the parent element of the <code>category</code> element you just found.” So this single XPath query will find all entries with a child element of <code><category term="accessibility"></code>.
|
||||
<li>The <code>xpath()</code> function returns a list of ElementTree objects. In this document, there is only one entry with a <code>category</code> whose <code>term</code> is <code>accessibility</code>.
|
||||
@@ -520,7 +521,7 @@ except ImportError:
|
||||
|
||||
<p>The only practical difference is that the second serialization is several characters shorter. If we were to recast our entire sample feed with a <code>ns0:</code> prefix in every start and end tag, it would add 4 characters per start tag × 79 tags + 4 characters for the namespace declaration itself, for a total of 316 characters. Assuming <a href=strings.html#byte-arrays>UTF-8 encoding</a>, that’s 316 extra bytes. (After gzipping, the difference drops to 21 bytes, but still, 21 bytes is 21 bytes.) Maybe that doesn’t matter to you, but for something like an Atom feed, which may be downloaded several thousand times whenever it changes, saving a few bytes per request can quickly add up.
|
||||
|
||||
<p>The built-in ElementTree library does not offer this fine-grained control over serializing namespaced elements, but lxml does.
|
||||
<p>The built-in ElementTree library does not offer this fine-grained control over serializing namespaced elements, but <code>lxml</code> does.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
@@ -533,9 +534,9 @@ except ImportError:
|
||||
<samp><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/></samp></pre>
|
||||
<ol>
|
||||
<li>To start, define a namespace mapping as a dictionary. Dictionary values are namespaces; dictionary keys are the desired prefix. Using <code>None</code> as a prefix effectively declares a default namespace.
|
||||
<li>Now you can pass the lxml-specific <var>nsmap</var> argument when you create an element, and lxml will respect the namespace prefixes you’ve defined.
|
||||
<li>Now you can pass the <code>lxml</code>-specific <var>nsmap</var> argument when you create an element, and <code>lxml</code> will respect the namespace prefixes you’ve defined.
|
||||
<li>As expected, this serialization defines the Atom namespace as the default namespace and declares the <code>feed</code> element without a namespace prefix.
|
||||
<li>Oops, we forgot to add the <code>xml:lang</code> attribute. You can always add attributes to any element with the <code>set()</code> method. It takes two arguments: the attribute name in standard ElementTree format, then the attribute value. (This method is not lxml-specific. The only lxml-specific part of this example was the <var>nsmap</var> argument to control the namespace prefixes in the serialized output.)
|
||||
<li>Oops, we forgot to add the <code>xml:lang</code> attribute. You can always add attributes to any element with the <code>set()</code> method. It takes two arguments: the attribute name in standard ElementTree format, then the attribute value. (This method is not <code>lxml</code>-specific. The only <code>lxml</code>-specific part of this example was the <var>nsmap</var> argument to control the namespace prefixes in the serialized output.)
|
||||
</ol>
|
||||
|
||||
<p>Are <abbr>XML</abbr> documents limited to one element per document? No, of course not. You can easily create child elements, too.
|
||||
@@ -555,21 +556,21 @@ except ImportError:
|
||||
<ol>
|
||||
<li>To create a child element of an existing element, instantiate the <code>SubElement</code> class. The only required arguments are the parent element (<var>new_feed</var> in this case) and the new element’s name. Since this child element will inherit the namespace mapping of its parent, there is no need to redeclare the namespace or prefix here.
|
||||
<li>You can also pass in an attribute dictionary. Keys are attribute names; values are attribute values.
|
||||
<li>As expected, the new <code>title</code> element was created in the Atom namespace, and it was inserted as a child of the <code>feed</code> element. Since the <code>title</code> element has no text content and no children of its own, lxml serializes it as an empty element (with the <code>/></code> shortcut).
|
||||
<li>As expected, the new <code>title</code> element was created in the Atom namespace, and it was inserted as a child of the <code>feed</code> element. Since the <code>title</code> element has no text content and no children of its own, <code>lxml</code> serializes it as an empty element (with the <code>/></code> shortcut).
|
||||
<li>To set the text content of an element, simply set its <code>.text</code> property.
|
||||
<li>Now the <code>title</code> element is serialized with its text content. Any text content that contains less-than signs or ampersands needs to be escaped when serialized. lxml handles this escaping automatically.
|
||||
<li>You can also apply “pretty printing” to the serialization, which inserts line breaks after end tags, and after start tags of elements that contain child elements but no text content. In technical terms, lxml adds “insignificant whitespace” to make the output more readable.
|
||||
<li>Now the <code>title</code> element is serialized with its text content. Any text content that contains less-than signs or ampersands needs to be escaped when serialized. <code>lxml</code> handles this escaping automatically.
|
||||
<li>You can also apply “pretty printing” to the serialization, which inserts line breaks after end tags, and after start tags of elements that contain child elements but no text content. In technical terms, <code>lxml</code> adds “insignificant whitespace” to make the output more readable.
|
||||
</ol>
|
||||
|
||||
<p class=a>⁂
|
||||
|
||||
<h2 id=xml-custom-parser>Customizing Your XML Parser</h2>
|
||||
<h2 id=xml-custom-parser>Parsing Broken XML</h2>
|
||||
|
||||
<p>The <abbr>XML</abbr> specification mandates that all conforming <abbr>XML</abbr> parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the <abbr>XML</abbr> document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like <abbr>HTML</abbr> — your browser doesn’t stop rendering a web page if you forget to close an <abbr>HTML</abbr> tag or escape an ampersand in an attribute value. (It is a common misconception that <abbr>HTML</abbr> has no defined error handling. <a href=http://www.whatwg.org/specs/web-apps/current-work/multipage/syntax.html#parsing><abbr>HTML</abbr> error handling</a> is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)
|
||||
|
||||
<p>Some people (myself included) believe that it was a mistake for the inventors of <abbr>XML</abbr> to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for <code>XML</code> documents (like Atom feeds) that are published on the web and served over <abbr>HTTP</abbr>. Despite the maturity of <abbr>XML</abbr>, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.
|
||||
|
||||
<p>So, I have both theoretical and practical reasons to parse <code>XML</code> documents “at any cost,” that is, <em>not</em> to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.
|
||||
<p>So, I have both theoretical and practical reasons to parse <code>XML</code> documents “at any cost,” that is, <em>not</em> to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, <code>lxml</code> can help.
|
||||
|
||||
<p>Here is a fragment of a broken <abbr>XML</abbr> document. I’ve highlighted the wellformedness error.
|
||||
|
||||
@@ -579,7 +580,7 @@ except ImportError:
|
||||
...
|
||||
</feed></code></pre>
|
||||
|
||||
<p>That’s an error, because the <code>&hellip;</code> entity is not defined in <abbr>XML</abbr>. (It is defined in <abbr>HTML</abbr>.) If you try to parse this broken feed with the default settings, lxml will choke on the undefined entity.
|
||||
<p>That’s an error, because the <code>&hellip;</code> entity is not defined in <abbr>XML</abbr>. (It is defined in <abbr>HTML</abbr>.) If you try to parse this broken feed with the default settings, <code>lxml</code> will choke on the undefined entity.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=p>>>> </samp><kbd>import lxml.etree</kbd>
|
||||
@@ -616,7 +617,7 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
|
||||
.</samp></pre>
|
||||
<ol>
|
||||
<li>To create a custom parser, instantiate the <code>lxml.etree.XMLParser</code> class. It can take <a href=http://codespeak.net/lxml/parsing.html#parser-options>a number of different named arguments</a>. The one we’re interested in here is the <var>recover</var> argument. When set to <code>True</code>, the <abbr>XML</abbr> parser will try its best to “recover” from wellformedness errors.
|
||||
<li>To parse an <code>XML</code> document with your custom parser, pass the <var>parser</var> object as the second argument to the <code>parse()</code> function. Note that lxml does not raise an exception about the undefined <code>&hellip;</code> entity.
|
||||
<li>To parse an <code>XML</code> document with your custom parser, pass the <var>parser</var> object as the second argument to the <code>parse()</code> function. Note that <code>lxml</code> does not raise an exception about the undefined <code>&hellip;</code> entity.
|
||||
<li>The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless of whether it is set to recover from those errors or not.)
|
||||
<li>Since it didn’t know what to do with the undefined <code>&hellip;</code> entity, the parser just silently dropped it. The text content of the <code>title</code> element becomes <code>"dive into "</code>.
|
||||
<li>As you can see from the serialization, the <code>&hellip;</code> entity didn’t get moved; it was simply dropped.
|
||||
@@ -634,9 +635,9 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28</samp>
|
||||
<li><a href=http://effbot.org/zone/element.htm>Elements and Element Trees</a>
|
||||
<li><a href=http://effbot.org/zone/element-xpath.htm>XPath Support in ElementTree</a>
|
||||
<li><a href=http://effbot.org/zone/element-iterparse.htm>The ElementTree iterparse Function</a>
|
||||
<li><a href=http://codespeak.net/lxml/>lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with lxml</a>
|
||||
<li><a href=http://codespeak.net/lxml/><code>lxml</code></a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing <abbr>XML</abbr> and <abbr>HTML</abbr> with <code>lxml</code></a>
|
||||
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and <abbr>XSLT</abbr> with <code>lxml</code></a>
|
||||
</ul>
|
||||
|
||||
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
||||
|
||||
Reference in New Issue
Block a user