<!DOCTYPE html>
|
|
<head>
|
|
<meta charset=utf-8>
|
|
<title>XML - Dive into Python 3</title>
|
|
<link rel=stylesheet type=text/css href=dip3.css>
|
|
<style>
|
|
body{counter-reset:h1 13}
|
|
mark{display:inline}
|
|
</style>
|
|
<link rel=stylesheet type=text/css media='only screen and (max-device-width: 480px)' href=mobile.css>
|
|
</head>
|
|
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=25> <input type=submit name=root value=Search></div></form>
|
|
<p>You are here: <a href=index.html>Home</a> <span>‣</span> <a href=table-of-contents.html#xml>Dive Into Python 3</a> <span>‣</span>
|
|
<p id=level>Difficulty level: <span title=beginner>♦♦♦♢♢</span>
|
|
<h1>XML</h1>
|
|
<blockquote class=q>
|
|
<p><span>❝</span> FIXME <span>❞</span><br>— FIXME
|
|
</blockquote>
|
|
<p id=toc>
|
|
<h2 id=divingin>Diving In</h2>
|
|
<p class=f>Most of the chapters in this book have centered around a piece of sample code. But XML isn’t about code; it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “<a href=http://en.wikipedia.org/wiki/List_of_feed_aggregators>feed aggregator</a>” like <a href=http://www.google.com/reader/>Google Reader</a>.
|
|
|
|
<p>Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an <a href=http://atompub.org/rfc4287.html>Atom syndication feed</a>.
|
|
|
|
<p class=d>[<a href=examples/feed.xml>download <code>feed.xml</code></a>]
|
|
<pre><code><?xml version="1.0" encoding="utf-8"?>
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
|
<title type="text">dive into mark</title>
|
|
<subtitle type="text">currently between addictions</subtitle>
|
|
<id>tag:diveintomark.org,2001-07-29:/</id>
|
|
<updated>2009-03-27T21:56:07Z</updated>
|
|
<link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
|
|
<link rel="self" href="http://diveintomark.org/feed/" type="application/atom+xml"/>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<title type="html"><![CDATA[Dive into history, 2009 edition]]></title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
|
|
<id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
|
|
<updated>2009-03-27T21:56:07Z</updated>
|
|
<published>2009-03-27T17:20:42Z</published>
|
|
<category scheme="http://diveintomark.org" term="diveintopython"/>
|
|
<category scheme="http://diveintomark.org" term="docbook"/>
|
|
<category scheme="http://diveintomark.org" term="html"/>
|
|
<summary type="html">Putting an entire chapter on one page sounds bloated, but
|
|
consider this: my longest chapter so far would be 75 printed pages, and it
|
|
loads in under 5 seconds. On dialup.</summary>
|
|
</entry>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<title type="html"><![CDATA[Accessibility is a harsh mistress]]></title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
|
|
<id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
|
|
<updated>2009-03-22T01:05:37Z</updated>
|
|
<published>2009-03-21T20:09:28Z</published>
|
|
<category scheme="http://diveintomark.org" term="accessibility"/>
|
|
<summary type="html">The accessibility orthodoxy does not permit people to
|
|
question the value of features that are rarely useful and rarely used.</summary>
|
|
</entry>
|
|
<entry>
|
|
<author>
|
|
<name>Mark</name>
|
|
<uri>http://diveintomark.org/</uri>
|
|
</author>
|
|
<title type="html"><![CDATA[A gentle introduction to video encoding,
|
|
part 1: container formats]]></title>
|
|
<link rel="alternate" type="text/html"
|
|
href="http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats"/>
|
|
<id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
|
|
<updated>2009-01-11T19:39:22Z</updated>
|
|
<published>2008-12-18T15:54:22Z</published>
|
|
<category scheme="http://diveintomark.org" term="asf"/>
|
|
<category scheme="http://diveintomark.org" term="avi"/>
|
|
<category scheme="http://diveintomark.org" term="encoding"/>
|
|
<category scheme="http://diveintomark.org" term="flv"/>
|
|
<category scheme="http://diveintomark.org" term="GIVE"/>
|
|
<category scheme="http://diveintomark.org" term="mp4"/>
|
|
<category scheme="http://diveintomark.org" term="ogg"/>
|
|
<category scheme="http://diveintomark.org" term="video"/>
|
|
<summary type="html">These notes will eventually become part of a
|
|
tech talk on video encoding.</summary>
|
|
</entry>
|
|
</feed></code></pre>
|
|
|
|
<h2 id=xml-intro>A 5-Minute Crash Course in XML</h2>
|
|
|
|
<p>If you already know about XML, you can skip this section.
|
|
|
|
<p>XML is a generalized way of describing hierarchical structured data. An XML <i>document</i> contains one or more <i>elements</i>, which are delimited by <i>start and end tags</i>. This is a complete (albeit boring) XML document:
|
|
|
|
<pre class=nd><code><a><foo> <span>①</span></a>
|
|
<a></foo> <span>②</span></a></code></pre>
|
|
<ol>
|
|
<li>This is the <i>start tag</i> of the <code>foo</code> element.
|
|
<li>This is the matching <i>end tag</i> of the <code>foo</code> element. Like balancing parentheses in writing or mathematics or code, every start tag must be <i>closed</i> (matched) by a corresponding end tag.
|
|
</ol>
|
|
|
|
<p>Elements can be <i>nested</i>. An element <code>bar</code> inside an element <code>foo</code> is said to be a <i>subelement</i> or <i>child</i> of <code>foo</code>.
|
|
|
|
<pre class=nd><code><foo>
|
|
<mark><bar></bar></mark>
|
|
</foo>
|
|
</code></pre>
|
|
|
|
<p>Elements can have <i>attributes</i>, which are name-value pairs. The order of attributes is not significant; an element’s attributes form an unordered set of keys and values, like a Python dictionary. Attributes are listed within the start tag of an element. <i>Attribute names</i> cannot be repeated on the same element (although they can appear on different elements). <i>Attribute values</i> must be quoted.
|
|
|
|
<pre class=nd><code><a><foo <mark>lang="en"</mark>> <span>①</span></a>
|
|
<a> <bar <mark>lang="fr"</mark>></bar> <span>②</span></a>
|
|
</foo>
|
|
</code></pre>
|
|
<ol>
|
|
<li>The <code>foo</code> element has one attribute, named <code>lang</code>. The value of its <code>lang</code> attribute is <code>en</code>.
|
|
<li>The <code>bar</code> element has one attribute, named <code>lang</code>. The value of its <code>lang</code> attribute is <code>fr</code>. This doesn’t conflict with the <code>foo</code> element in any way. Each element has its own set of attributes.
|
|
</ol>
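<p>To see that each element really does keep its own attributes, here is a minimal sketch that parses the same two-element document with the standard library’s <code>xml.etree.ElementTree</code> module. The variable names are only illustrative.

```python
import xml.etree.ElementTree as etree

# The two-element document shown above, as a Python string.
doc = '<foo lang="en"><bar lang="fr"></bar></foo>'

foo = etree.fromstring(doc)   # the root element
bar = foo[0]                  # the first (and only) child of foo

# Each element carries its own, independent dictionary of attributes.
print(foo.attrib)   # {'lang': 'en'}
print(bar.attrib)   # {'lang': 'fr'}
```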
|
|
|
|
<p>Elements can have <i>text content</i>.
|
|
|
|
<pre class=nd><code><foo lang="en">
|
|
<bar lang="fr"><mark>PapayaWhip</mark></bar>
|
|
</foo>
|
|
</code></pre>
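<p>As a small sketch of how text content shows up once the document is parsed, the standard library’s ElementTree exposes it as the <code>text</code> attribute of an element. Text belongs to the element that directly contains it, so here only <code>bar</code> has any.

```python
import xml.etree.ElementTree as etree

foo = etree.fromstring('<foo lang="en"><bar lang="fr">PapayaWhip</bar></foo>')
bar = foo[0]

print(bar.text)   # PapayaWhip
print(foo.text)   # None, because foo contains a child element but no text of its own
```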
|
|
|
|
<p>Elements that contain no text and no children are <i>empty</i>.
|
|
|
|
<pre class=nd><code><foo></foo></code></pre>
|
|
|
|
<p>There is a shorthand for writing empty elements. By putting a <code>/</code> character in the start tag, you can skip the end tag altogether. The XML document in the previous example could be written like this instead:
|
|
|
|
<pre class=nd><code><foo<mark>/</mark>></code></pre>
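<p>One way to convince yourself that the two spellings are equivalent is to parse both and compare the results. This is only a sketch using the standard library’s ElementTree; the names <code>long_form</code> and <code>short_form</code> are made up for the example.

```python
import xml.etree.ElementTree as etree

long_form = etree.fromstring("<foo></foo>")
short_form = etree.fromstring("<foo/>")

for elem in (long_form, short_form):
    # Either way: same tag, no children, no text, no attributes.
    print(elem.tag, len(elem), elem.text, elem.attrib)
```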
|
|
|
|
<p>Just as Python functions can be declared in different <i>modules</i>, XML elements can be declared in different <i>namespaces</i>. Namespaces usually look like URLs. You use an <code>xmlns</code> declaration to define a <i>default namespace</i>. A namespace declaration looks similar to an attribute, but it has a different purpose.
|
|
|
|
<pre class=nd><a><code><feed <mark>xmlns="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
|
<a> <title>dive into mark</title> <span>②</span></a>
|
|
</feed>
|
|
</code></pre>
|
|
<ol>
|
|
<li>The <code>feed</code> element is in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
<li>The <code>title</code> element is also in the <code>http://www.w3.org/2005/Atom</code> namespace. The namespace declaration affects the element where it’s declared, plus all child elements.
|
|
</ol>
|
|
|
|
<p>You can also use an <code>xmlns:<var>prefix</var></code> declaration to define a namespace and associate it with a <i>prefix</i>. Then each element in that namespace must be explicitly declared with the prefix.
|
|
|
|
<pre class=nd><a><code><atom:feed <mark>xmlns:atom="http://www.w3.org/2005/Atom"</mark>> <span>①</span></a>
|
|
<a> <atom:title>dive into mark</atom:title> <span>②</span></a>
|
|
</atom:feed>
|
|
</code></pre>
|
|
<ol>
|
|
<li>The <code>feed</code> element is in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
<li>The <code>title</code> element is also in the <code>http://www.w3.org/2005/Atom</code> namespace.
|
|
</ol>
|
|
|
|
<p>As far as an XML parser is concerned, the previous two XML documents are <em>identical</em>. Namespace + element name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (<code>atom:</code>) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the XML documents are the same.
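<p>A quick way to check this claim is to parse both documents and compare what the parser reports. The sketch below uses the standard library’s ElementTree, which writes a namespaced name as <code>{namespace}localname</code>; the variable names are assumptions made for the example.

```python
import xml.etree.ElementTree as etree

# The same feed, once with a default namespace and once with a prefix.
with_default = etree.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<title>dive into mark</title></feed>')
with_prefix = etree.fromstring(
    '<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'
    '<atom:title>dive into mark</atom:title></atom:feed>')

# The prefix is gone after parsing; only namespace + local name remain.
print(with_default.tag)                          # {http://www.w3.org/2005/Atom}feed
print(with_default.tag == with_prefix.tag)       # True
print(with_default[0].tag == with_prefix[0].tag) # True
print(with_default[0].text == with_prefix[0].text)  # True
```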
|
|
|
|
<h2 id=xml-structure>The Structure Of An Atom Feed</h2>
|
|
|
|
<p>Think of a weblog, or in fact any website with frequently updated content, like <a href=http://www.cnn.com/>CNN.com</a>. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment <i class=baa>&</i> Video News”), a last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published a correction or fixed a typo), and a unique URL.
|
|
|
|
<p>The Atom syndication format is designed to capture all of this information in a standard format. My weblog and CNN.com are wildly different in design, scope, and audience, but they both have the same basic structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles.
|
|
|
|
<p>At the top level is the “root” element, which every Atom feed shares: the <code><feed></code> element in the Atom namespace (<code>http://www.w3.org/2005/Atom</code>). ... FIXME
|
|
|
|
<h2 id=xml-parse>Parsing XML</h2>
|
|
|
|
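<p>Python’s standard library includes the ElementTree API in the <code>xml.etree.ElementTree</code> module. The <code>parse()</code> function reads an XML file and returns an object that represents the whole document; its <code>getroot()</code> method returns the root element. Notice that the root element is reported as a namespace in curly braces followed by the element name.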
|
|
|
|
<pre class=screen>
|
|
>>> import xml.etree.ElementTree as etree
|
|
>>> tree = etree.parse("examples/feed.xml")
|
|
>>> root = tree.getroot()
|
|
>>> root
|
|
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>
|
|
</pre>
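<p>If you don’t have <code>examples/feed.xml</code> at hand, you can try the same calls on an in-memory document. The sketch below uses a trimmed-down stand-in for the feed (an assumption for illustration, not the real file) and shows both <code>parse()</code>, which accepts a filename or a file-like object, and <code>fromstring()</code>, which returns the root element directly.

```python
import io
import xml.etree.ElementTree as etree

# A trimmed-down stand-in for examples/feed.xml (hypothetical sample data).
FEED = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title type="text">dive into mark</title>
  <updated>2009-03-27T21:56:07Z</updated>
</feed>"""

# parse() takes a filename or a file-like object and returns an ElementTree.
tree = etree.parse(io.BytesIO(FEED.encode("utf-8")))
root = tree.getroot()
print(root.tag)                    # {http://www.w3.org/2005/Atom}feed

# fromstring() skips the ElementTree wrapper and returns the root element.
same_root = etree.fromstring(FEED.encode("utf-8"))
print(same_root.tag == root.tag)   # True
```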
|
|
|
|
<h3 id=xml-elements>Elements Are Lists</h3>
|
|
|
|
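<p>An element acts like a list of its child elements. The <code>tag</code> attribute holds the element’s name, qualified by its namespace in curly braces; <code>len()</code> counts the direct children, and iterating over the element yields them in document order.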
|
|
|
|
<pre class=screen>
|
|
>>> root.tag
|
|
'{http://www.w3.org/2005/Atom}feed'
|
|
>>> len(root)
|
|
9
|
|
>>> for child in root:
|
|
... print(child)
|
|
...
|
|
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
|
|
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
|
|
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
|
|
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
|
|
<Element {http://www.w3.org/2005/Atom}link at e181b0>
|
|
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
|
|
<Element {http://www.w3.org/2005/Atom}entry at e2b750>
|
|
</pre>
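<p>Because an element behaves like a list, the usual list idioms apply: indexing, negative indexes, and comprehensions. Here is a minimal sketch on a shortened stand-in feed (the sample data and the variable names are assumptions for illustration).

```python
import xml.etree.ElementTree as etree

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <link rel="alternate" href="http://diveintomark.org/"/>
  <entry><title>Dive into history, 2009 edition</title></entry>
  <entry><title>Accessibility is a harsh mistress</title></entry>
</feed>"""
root = etree.fromstring(FEED)

print(len(root))          # 4: only direct children are counted
first_entry = root[2]     # indexing works like a list
last_entry = root[-1]     # so do negative indexes

# Strip the namespace part to get the bare element names.
names = [child.tag[len(ATOM):] for child in root]
print(names)              # ['title', 'link', 'entry', 'entry']

# Grandchildren belong to their own parent, not to the root.
print(len(first_entry))   # 1
```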
|
|
|
|
<h3 id=xml-attributes>Attributes Are Dictionaries</h3>
|
|
|
|
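<p>An element’s attributes are available as a dictionary named <code>attrib</code>. Attributes in a namespace, such as <code>xml:lang</code>, use the same <code>{namespace}name</code> form as element tags. An element with no attributes has an empty dictionary.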
|
|
|
|
<pre class=screen>
|
|
>>> root.attrib
|
|
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
|
|
>>> root[4]
|
|
<Element {http://www.w3.org/2005/Atom}link at e181b0>
|
|
>>> root[4].attrib
|
|
{'href': 'http://diveintomark.org/', 'type': 'text/html', 'rel': 'alternate'}
|
|
>>> root[3]
|
|
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
|
|
>>> root[3].attrib
|
|
{}
|
|
</pre>
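<p>Besides the <code>attrib</code> dictionary itself, elements offer a few dictionary-style shortcuts such as <code>get()</code> and <code>keys()</code>. The following sketch runs them against a shortened stand-in feed; the sample data and the <code>hreflang</code> lookup are assumptions chosen only to show the default value.

```python
import xml.etree.ElementTree as etree

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
FEED = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
</feed>"""
root = etree.fromstring(FEED)
updated, link = root[0], root[1]

# The xmlns declaration is not an attribute; xml:lang is.
print(root.attrib[XML_LANG])             # en

# get() returns None, or a default you supply, when the attribute is missing.
print(link.get("href"))                  # http://diveintomark.org/
print(link.get("hreflang", "unknown"))   # unknown

print(sorted(link.keys()))               # ['href', 'rel', 'type']
print(updated.attrib)                    # {}
```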
|
|
|
|
<h2 id=xml-find>Searching For Nodes Within An XML Document</h2>
|
|
|
|
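<p>The <code>findall()</code> method returns a list of the elements that match a query. A bare tag name matches only direct children of the starting element; a query containing <code>//</code> matches descendants at any depth.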
|
|
|
|
<pre class=screen>
|
|
>>> tree.findall("{http://www.w3.org/2005/Atom}entry")
|
|
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>, <Element {http://www.w3.org/2005/Atom}entry at e2b510>, <Element {http://www.w3.org/2005/Atom}entry at e2b540>]
|
|
|
|
>>> feed_links = tree.findall("{http://www.w3.org/2005/Atom}link")
|
|
>>> feed_links
|
|
[<Element {http://www.w3.org/2005/Atom}link at e181b0>, <Element {http://www.w3.org/2005/Atom}link at e2b4b0>]
|
|
>>> feed_links[0].attrib
|
|
{'href': 'http://diveintomark.org/', 'type': 'text/html', 'rel': 'alternate'}
|
|
>>> feed_links[1].attrib
|
|
{'href': 'http://diveintomark.org/feed/', 'type': 'application/atom+xml', 'rel': 'self'}
|
|
|
|
>>> all_links = tree.findall(".//{http://www.w3.org/2005/Atom}link")
|
|
>>> all_links
|
|
[<Element {http://www.w3.org/2005/Atom}link at e181b0>, <Element {http://www.w3.org/2005/Atom}link at e2b4b0>, <Element {http://www.w3.org/2005/Atom}link at e2b570>, <Element {http://www.w3.org/2005/Atom}link at e2b480>, <Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
|
|
>>> all_links[0].attrib
|
|
{'href': 'http://diveintomark.org/', 'type': 'text/html', 'rel': 'alternate'}
|
|
>>> all_links[1].attrib
|
|
{'href': 'http://diveintomark.org/feed/', 'type': 'application/atom+xml', 'rel': 'self'}
|
|
>>> all_links[2].attrib
|
|
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition', 'type': 'text/html', 'rel': 'alternate'}
|
|
>>> all_links[3].attrib
|
|
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress', 'type': 'text/html', 'rel': 'alternate'}
|
|
>>> all_links[4].attrib
|
|
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats', 'type': 'text/html', 'rel': 'alternate'}
|
|
</pre>
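<p>Spelling out the full namespace in every query gets tedious. ElementTree’s <code>find()</code> and <code>findall()</code> also accept a dictionary that maps prefixes to namespaces, so queries can use short names. This is a sketch on a shortened stand-in feed; the <code>atom</code> prefix and the sample data are choices made for the example.

```python
import xml.etree.ElementTree as etree

# The prefix is arbitrary; only the namespace it maps to matters.
NS = {"atom": "http://www.w3.org/2005/Atom"}
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://diveintomark.org/feed/"/>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <link rel="alternate" href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
  </entry>
</feed>"""
root = etree.fromstring(FEED)

feed_links = root.findall("atom:link", NS)      # direct children only
all_links = root.findall(".//atom:link", NS)    # descendants at any depth
print(len(feed_links), len(all_links))          # 1 2

# find() returns the first match, or None when nothing matches.
first_title = root.find("atom:entry/atom:title", NS)
missing = root.find("atom:subtitle", NS)
print(first_title.text)                         # Accessibility is a harsh mistress
print(missing is None)                          # True
```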
|
|
|
|
<h2 id=xml-lxml>Going Further With lxml</h2>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
>>> from lxml import etree
|
|
.
|
|
. FIXME (show how it's a drop-in replacement for everything we've done so far)
|
|
.
|
|
|
|
# From here on, use lxml.etree explicitly, because these functions are specific to lxml.
|
|
>>> import lxml.etree
|
|
>>> nsmap = {"atom": "http://www.w3.org/2005/Atom"}
|
|
>>> tree = lxml.etree.parse("examples/feed.xml")
|
|
>>> entries = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=nsmap)
|
|
>>> entries
|
|
[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]
|
|
>>> entry = entries[0]
|
|
>>> entry.xpath("./atom:title/text()", namespaces=nsmap)
|
|
['Accessibility is a harsh mistress']
|
|
</pre>
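<p>For comparison, the standard library’s ElementTree supports only a limited subset of XPath, but that subset is enough for a query of this shape. The sketch below is an approximation, not a drop-in equivalent: ElementTree has no <code>text()</code> function, so it reads the title with <code>findtext()</code> instead. The sample feed is a shortened stand-in.

```python
import xml.etree.ElementTree as etree

NS = {"atom": "http://www.w3.org/2005/Atom"}
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Dive into history, 2009 edition</title>
    <category term="docbook"/>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category term="accessibility"/>
  </entry>
</feed>"""
root = etree.fromstring(FEED)

# Find the matching category, then step up to its parent entry with "..".
entries = root.findall(".//atom:category[@term='accessibility']/..", NS)

# ElementTree has no text() step; findtext() returns the child's text instead.
titles = [entry.findtext("atom:title", namespaces=NS) for entry in entries]
print(titles)   # ['Accessibility is a harsh mistress']
```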
|
|
|
|
<h3 id=xml-custom-parser>Customizing Your XML Parser</h3>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
>>> import lxml.etree
|
|
>>> parser = lxml.etree.XMLParser(no_network=True, ns_clean=True, recover=True, remove_blank_text=True, remove_comments=True)
|
|
>>> tree = lxml.etree.parse("examples/feed.xml", parser)
|
|
</pre>
|
|
|
|
<h3 id=xml-incremental>Incremental Parsing</h3>
|
|
|
|
<p>FIXME
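<p>As one possible illustration of incremental parsing, the standard library’s <code>iterparse()</code> function hands you elements as the parser finishes reading them, so you can process a large feed without holding all of it in memory. This is a sketch under assumptions: the sample feed is a shortened stand-in, and clearing each entry after use is one common pattern rather than the only one.

```python
import io
import xml.etree.ElementTree as etree

ATOM = "{http://www.w3.org/2005/Atom}"
FEED = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Dive into history, 2009 edition</title></entry>
  <entry><title>Accessibility is a harsh mistress</title></entry>
</feed>"""

titles = []
# iterparse() yields (event, element) pairs while it reads the source.
# An "end" event means the element and all of its children are complete.
for event, elem in etree.iterparse(io.BytesIO(FEED), events=("end",)):
    if elem.tag == ATOM + "entry":
        titles.append(elem.findtext(ATOM + "title"))
        elem.clear()   # drop the entry's children once they have been used

print(titles)
```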
|
|
|
|
<h2 id=xml-generate>Generating XML</h2>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
>>> import lxml.etree
|
|
>>> new_feed = lxml.etree.Element("{http://www.w3.org/2005/Atom}feed", attrib={"{http://www.w3.org/XML/1998/namespace}lang": "en"})
|
|
>>> print(lxml.etree.tounicode(new_feed))
|
|
<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/>
|
|
</pre>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
>>> import lxml.etree
|
|
>>> NSMAP = {None: "http://www.w3.org/2005/Atom"}
>>> new_feed = lxml.etree.Element("feed", nsmap=NSMAP)
|
|
>>> print(lxml.etree.tounicode(new_feed))
|
|
<feed xmlns="http://www.w3.org/2005/Atom"/>
|
|
>>> new_feed.set("{http://www.w3.org/XML/1998/namespace}lang", "en")
|
|
>>> print(lxml.etree.tounicode(new_feed))
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/>
|
|
</pre>
|
|
|
|
<p>FIXME
|
|
|
|
<pre class=screen>
|
|
>>> title = lxml.etree.SubElement(new_feed, "title", attrib={"type":"html"})
|
|
>>> print(lxml.etree.tounicode(new_feed))
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></feed>
|
|
>>> title.text = "dive into mark"
|
|
>>> print(lxml.etree.tounicode(new_feed))
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into mark</title></feed>
|
|
>>> print(lxml.etree.tounicode(new_feed, pretty_print=True))
|
|
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
|
|
<title type="html">dive into mark</title>
|
|
</feed>
|
|
</pre>
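<p>The standard library’s ElementTree can build the same structure, although it has no <code>nsmap</code> argument and chooses namespace prefixes on its own when serializing. The sketch below therefore checks the result by parsing it back rather than by comparing strings; the variable names are illustrative.

```python
import xml.etree.ElementTree as etree

ATOM = "http://www.w3.org/2005/Atom"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Namespaced names are written as {namespace}localname.
new_feed = etree.Element("{%s}feed" % ATOM, attrib={XML_LANG: "en"})
title = etree.SubElement(new_feed, "{%s}title" % ATOM, attrib={"type": "html"})
title.text = "dive into mark"

# encoding="unicode" returns a str instead of bytes.
serialized = etree.tostring(new_feed, encoding="unicode")
print(serialized)

# Whatever prefix the serializer picked, the parsed result is the same.
round_trip = etree.fromstring(serialized)
print(round_trip.tag)       # {http://www.w3.org/2005/Atom}feed
print(round_trip[0].text)   # dive into mark
```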
|
|
|
|
<h2 id=furtherreading>Further Reading</h2>
|
|
|
|
<ul>
|
|
<li><a href=http://en.wikipedia.org/wiki/XML>XML on Wikipedia.org</a>
|
|
<li><a href=http://docs.python.org/3.0/library/xml.etree.elementtree.html>The ElementTree XML API</a>
|
|
<li><a href=http://effbot.org/zone/element.htm>Elements and Element Trees</a>
|
|
<li><a href=http://effbot.org/zone/element-iterparse.htm>The ElementTree iterparse Function</a>
|
|
<li><a href=http://codespeak.net/lxml/1.3/parsing.html>Parsing XML and HTML with lxml</a>
|
|
<li><a href=http://codespeak.net/lxml/1.3/xpathxslt.html>XPath and XSLT with lxml</a>
|
|
</ul>
|
|
|
|
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
|
<script src=jquery.js></script>
|
|
<script src=dip3.js></script>
|