
Difficulty level: ♦♦♦♢♢

XML

❝ In the archonship of Aristaechmus, Draco enacted his ordinances. ❞
— Aristotle

Diving In

Most of the chapters in this book have centered around a piece of sample code. But XML isn’t about code; it’s about data. One common use of XML is “syndication feeds” that list the latest articles on a blog, forum, or other frequently-updated website. Most popular blogging software can produce a feed and update it whenever new articles, discussion threads, or blog posts are published. You can follow a blog by “subscribing” to its feed, and you can follow multiple blogs with a dedicated “feed aggregator” like Google Reader.

Here, then, is the XML data we’ll be working with in this chapter. It’s a feed — specifically, an Atom syndication feed.

[download feed.xml]

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Dive into history, 2009 edition</title>
    <link rel="alternate" type="text/html"
      href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
    <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
    <updated>2009-03-27T21:56:07Z</updated>
    <published>2009-03-27T17:20:42Z</published>
    <category scheme="http://diveintomark.org" term="diveintopython"/>
    <category scheme="http://diveintomark.org" term="docbook"/>
    <category scheme="http://diveintomark.org" term="html"/>
    <summary type="html">Putting an entire chapter on one page sounds
      bloated, but consider this &amp;mdash; my longest chapter so far
      would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
      On dialup.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>Accessibility is a harsh mistress</title>
    <link rel="alternate" type="text/html"
      href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
    <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
    <updated>2009-03-22T01:05:37Z</updated>
    <published>2009-03-21T20:09:28Z</published>
    <category scheme="http://diveintomark.org" term="accessibility"/>
    <summary type="html">The accessibility orthodoxy does not permit people to
      question the value of features that are rarely useful and rarely used.</summary>
  </entry>
  <entry>
    <author>
      <name>Mark</name>
      <uri>http://diveintomark.org/</uri>
    </author>
    <title>A gentle introduction to video encoding, part 1: container formats</title>
    <link rel="alternate" type="text/html"
      href="http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats"/>
    <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
    <updated>2009-01-11T19:39:22Z</updated>
    <published>2008-12-18T15:54:22Z</published>
    <category scheme="http://diveintomark.org" term="asf"/>
    <category scheme="http://diveintomark.org" term="avi"/>
    <category scheme="http://diveintomark.org" term="encoding"/>
    <category scheme="http://diveintomark.org" term="flv"/>
    <category scheme="http://diveintomark.org" term="GIVE"/>
    <category scheme="http://diveintomark.org" term="mp4"/>
    <category scheme="http://diveintomark.org" term="ogg"/>
    <category scheme="http://diveintomark.org" term="video"/>
    <summary type="html">These notes will eventually become part of a
      tech talk on video encoding.</summary>
  </entry>
</feed>

A 5-Minute Crash Course in XML

If you already know about XML, you can skip this section.

XML is a generalized way of describing hierarchical structured data. An XML document contains one or more elements, which are delimited by start and end tags. This is a complete (albeit boring) XML document:

<foo>   ①
</foo>  ②

This is the start tag of the foo element.
This is the matching end tag of the foo element. Like balancing parentheses in writing or mathematics or code, every start tag must be closed (matched) by a corresponding end tag.

Elements can be nested to any depth. An element bar inside an element foo is said to be a subelement or child of foo.

<foo>
  <bar></bar>
</foo>

The first element in every XML document is called the root element. An XML document can only have one root element. The following is not an XML document, because it has two root elements:

<foo></foo>
<bar></bar>
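
One way to see the single-root rule in action is to hand both snippets to a parser. This is only a minimal sketch using the standard library's xml.etree.ElementTree module (introduced later in this chapter); the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as one XML document (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

one_root = "<foo><bar></bar></foo>"
two_roots = "<foo></foo>\n<bar></bar>"

print(is_well_formed(one_root))   # a single root element: the parser accepts it
print(is_well_formed(two_roots))  # two root elements: the parser rejects it
```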

Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element. Attribute values must be quoted.

<foo lang="en">          ①
  <bar lang="fr"></bar>  ②
</foo>

The foo element has one attribute, named lang. The value of its lang attribute is en.
The bar element has one attribute, named lang. The value of its lang attribute is fr. This doesn’t conflict with the foo element in any way. Each element has its own set of attributes.

If an element has more than one attribute, the ordering of the attributes is not significant. An element’s attributes form an unordered set of keys and values, like a Python dictionary.
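
As a rough illustration of both rules (ordering doesn't matter, repeats aren't allowed), here is a sketch using the standard library's xml.etree.ElementTree module. The dir attribute is an assumption added purely so there are two attributes to reorder.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring('<foo lang="en" dir="ltr"/>')
b = ET.fromstring('<foo dir="ltr" lang="en"/>')

# Attributes arrive as a dictionary, and dictionaries compare equal
# regardless of the order in which their keys were written.
print(a.attrib == b.attrib)

# Repeating an attribute name within one element is a parse error.
try:
    ET.fromstring('<foo lang="en" lang="fr"/>')
    duplicate_rejected = False
except ET.ParseError:
    duplicate_rejected = True
print(duplicate_rejected)
```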

Elements can have text content.

<foo lang="en">
  <bar lang="fr">PapayaWhip</bar>
</foo>

Elements that contain no text and no children are empty.

<foo></foo>

There is a shorthand for writing empty elements. By putting a / character in the start tag, you can skip the end tag altogether. The XML document in the previous example could be written like this instead:

<foo/>
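
If you want to convince yourself that the two spellings are equivalent, a small sketch like this one (using the standard library's xml.etree.ElementTree module, covered later in this chapter) parses both and compares what comes out.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring("<foo></foo>")
short_form = ET.fromstring("<foo/>")

for elem in (long_form, short_form):
    # Either way, the parser reports the same tag, no children, and no text.
    print(elem.tag, len(elem), repr(elem.text))
```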

Just as Python functions can be declared in different modules, XML elements can be declared in different namespaces. Namespaces usually look like URLs. You use an xmlns declaration to define a default namespace. A namespace declaration looks similar to an attribute, but it has a different purpose.

<feed xmlns="http://www.w3.org/2005/Atom">  ①
  <title>dive into mark</title>             ②
</feed>

The feed element is in the http://www.w3.org/2005/Atom namespace.
The title element is also in the http://www.w3.org/2005/Atom namespace. The namespace declaration affects the element where it’s declared, plus all child elements.

You can also use an xmlns:prefix declaration to define a namespace and associate it with a prefix. Then each element in that namespace must be explicitly declared with the prefix.

<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">  ①
  <atom:title>dive into mark</atom:title>             ②
</atom:feed>

The feed element is in the http://www.w3.org/2005/Atom namespace.
The title element is also in the http://www.w3.org/2005/Atom namespace.

As far as an XML parser is concerned, the previous two XML documents are identical. Namespace + element name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (atom:) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the XML documents are the same.
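
To make “namespace + element name = XML identity” concrete, here is a sketch that parses both documents with the standard library's xml.etree.ElementTree module and compares them. The summarize() helper is hypothetical; it simply reduces an element to the pieces listed above.

```python
import xml.etree.ElementTree as ET

default_ns = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
</feed>"""

prefixed = """<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:title>dive into mark</atom:title>
</atom:feed>"""

def summarize(elem):
    """Reduce an element to (name, attributes, text, children) (hypothetical helper)."""
    return (
        elem.tag,
        dict(elem.attrib),
        (elem.text or "").strip(),
        [summarize(child) for child in elem],
    )

first = summarize(ET.fromstring(default_ns))
second = summarize(ET.fromstring(prefixed))

print(first[0])         # the parsed name carries the namespace, not the prefix
print(first == second)  # the two documents reduce to the same structure
```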

Finally, XML documents can contain character encoding information on the first line, before the root element. (If you’re curious how a document can contain information which needs to be known before the document can be parsed, Appendix F of the XML specification details how to resolve this Catch-22.)

<?xml version="1.0" encoding="utf-8"?>

And now you know just enough XML to be dangerous!

The Structure Of An Atom Feed

Think of a weblog, or in fact any website with frequently updated content, like CNN.com. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment & Video News”), a last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published a correction or fixed a typo), and a unique URL.

The Atom syndication format is designed to capture all of this information in a standard format. My weblog and CNN.com are wildly different in design, scope, and audience, but they both have the same basic structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles.

At the top level is the root element, which every Atom feed shares: the feed element in the http://www.w3.org/2005/Atom namespace.


<feed xmlns="http://www.w3.org/2005/Atom"  ①
      xml:lang="en">                       ②

http://www.w3.org/2005/Atom is the Atom namespace.
Any element can contain an xml:lang attribute, which declares the language of the element and its children. In this case, the xml:lang attribute is declared once on the root element, which means the entire feed is in English.

An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level feed element.

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>                                             ①
  <subtitle>currently between addictions</subtitle>                         ②
  <id>tag:diveintomark.org,2001-07-29:/</id>                                ③
  <updated>2009-03-27T21:56:07Z</updated>                                   ④
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>  ⑤

The title of this feed is dive into mark.
The subtitle of this feed is currently between addictions.
Every feed needs a globally unique identifier. See RFC 4151 for how to create one.
This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified date of the most recent article.
Now things start to get interesting. This link element has no text content, but it has three attributes: rel, type, and href. The rel value tells you what kind of link this is; rel="alternate" means that this is a link to an alternate representation of this feed. The type="text/html" attribute means that this is a link to an HTML page. And the link target is given in the href attribute.

Now we know that this is a feed for a site named “dive into mark” which is available at http://diveintomark.org/ and was last updated on March 27, 2009.
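
Those same three facts can be pulled out of the markup programmatically. The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree module (covered later in this chapter) and a timestamp that always ends in Z; the variable names are mine.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
</feed>"""

feed = ET.fromstring(feed_xml)
title = feed.find(ATOM + "title").text

# Assumption: the timestamp uses the "Z" (UTC) form, as it does in this feed.
updated = datetime.strptime(feed.find(ATOM + "updated").text, "%Y-%m-%dT%H:%M:%SZ")

# Pick out the link whose rel attribute says it is the alternate representation.
site_url = next(
    link.get("href")
    for link in feed.findall(ATOM + "link")
    if link.get("rel") == "alternate"
)

print(title, site_url, updated.date())
```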

☞Although the order of elements can be relevant in some XML documents, it is not relevant in an Atom feed.

After the feed-level metadata is the list of the most recent articles. An article looks like this:

<entry>
  <author>                                                                 ①
    <name>Mark</name>
    <uri>http://diveintomark.org/</uri>
  </author>
  <title>Dive into history, 2009 edition</title>                           ②
  <link rel="alternate" type="text/html"                                   ③
    href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
  <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>        ④
  <updated>2009-03-27T21:56:07Z</updated>                                  ⑤
  <published>2009-03-27T17:20:42Z</published>        
  <category scheme="http://diveintomark.org" term="diveintopython"/>       ⑥
  <category scheme="http://diveintomark.org" term="docbook"/>
  <category scheme="http://diveintomark.org" term="html"/>
  <summary type="html">Putting an entire chapter on one page sounds        ⑦
    bloated, but consider this &amp;mdash; my longest chapter so far
    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
    On dialup.</summary>
</entry>                                                                   ⑧

The author element tells who wrote this article: some guy named Mark, whom you can find loafing at http://diveintomark.org/. (This is the same as the alternate link in the feed metadata, but it doesn’t have to be. Many weblogs have multiple authors, each with their own personal website.)
The title element gives the title of the article, “Dive into history, 2009 edition”.
As with the feed-level alternate link, this link element gives the address of the HTML version of this article.
Entries, like feeds, need a unique identifier.
Entries have two dates: a first-published date (published) and a last-modified date (updated).
Entries can have an arbitrary number of categories. This article is filed under diveintopython, docbook, and html.
The summary element gives a brief summary of the article. (There is also a content element, not shown here, if you want to include the complete article text in your feed.) This summary element has the Atom-specific type="html" attribute, which specifies that this summary is a snippet of HTML, not plain text. This is important, since it has HTML-specific entities in it (&mdash; and &hellip;) which should be rendered as “—” and “…” rather than displayed directly.
Finally, the end tag for the entry element, signaling the end of the metadata for this article.
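
The type="html" detail is easier to appreciate with two decoding steps side by side. This sketch uses the standard library's xml.etree.ElementTree and html modules; the entry is abbreviated, and the summary text is shortened for the example.

```python
import html
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Dive into history, 2009 edition</title>
  <category scheme="http://diveintomark.org" term="diveintopython"/>
  <category scheme="http://diveintomark.org" term="docbook"/>
  <category scheme="http://diveintomark.org" term="html"/>
  <summary type="html">consider this &amp;mdash; it loads in under 5 seconds&amp;hellip;</summary>
</entry>"""

entry = ET.fromstring(entry_xml)
terms = [category.get("term") for category in entry.findall(ATOM + "category")]
summary = entry.find(ATOM + "summary")

# The XML parser undoes one level of escaping: &amp;mdash; becomes &mdash;
raw = summary.text

# Because type="html", that text is itself HTML and needs a second decoding step.
# html.unescape() only resolves character references; it does not strip tags.
rendered = html.unescape(raw) if summary.get("type") == "html" else raw

print(terms)
print(raw)
print(rendered)
```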

Parsing XML

Python can parse XML documents in several ways. It has traditional DOM and SAX parsers, but I will focus on a different library called Etree.


>>> import xml.etree.ElementTree as etree    ①
>>> tree = etree.parse("examples/feed.xml")  ②
>>> root = tree.getroot()                    ③
>>> root                                     ④
<Element {http://www.w3.org/2005/Atom}feed at cd1eb0>

The Etree library is part of the Python standard library, in xml.etree.ElementTree.
The primary entry point for the Etree library is the parse() function, which can take a filename or a file-like object. This function parses the entire document at once. If memory is tight, there are ways to parse an XML document incrementally instead.
The parse() function returns an object which represents the entire document. This is not the root element. To get a reference to the root element, call the getroot() method.
As expected, the root element is the feed element in the http://www.w3.org/2005/Atom namespace. The string representation of this object reinforces an important point: an XML element is a combination of its namespace and its tag name (also called the local name). Every element in this document is in the Atom namespace, so the root element is represented as {http://www.w3.org/2005/Atom}feed.

☞Etree represents XML elements as {namespace}localname. You’ll see and use this format in multiple places in the Etree library.
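
Here is a small sketch of two points from above: parse() accepting a file-like object instead of a filename, and the {namespace}localname format. The split_tag() helper is hypothetical, not part of the library, and the in-memory document stands in for examples/feed.xml.

```python
import io
import xml.etree.ElementTree as ET

xml_text = """<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
</feed>"""

# parse() accepts a filename or any object with a read() method.
tree = ET.parse(io.BytesIO(xml_text.encode("utf-8")))
root = tree.getroot()

def split_tag(tag):
    """Split '{namespace}localname' into its two parts (hypothetical helper)."""
    if tag.startswith("{"):
        namespace, _, localname = tag[1:].partition("}")
        return namespace, localname
    return None, tag

print(split_tag(root.tag))
```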

Elements Are Lists

In Etree, an element acts like a list. The items of the list are the element’s children.

# continued from the previous example
>>> root.tag                        ①
'{http://www.w3.org/2005/Atom}feed'
>>> len(root)                       ②
9
>>> for child in root:              ③
...   print(child)                  ④
... 
<Element {http://www.w3.org/2005/Atom}title at e2b5d0>
<Element {http://www.w3.org/2005/Atom}subtitle at e2b4e0>
<Element {http://www.w3.org/2005/Atom}id at e2b6c0>
<Element {http://www.w3.org/2005/Atom}updated at e2b6f0>
<Element {http://www.w3.org/2005/Atom}link at e181b0>
<Element {http://www.w3.org/2005/Atom}link at e2b4b0>
<Element {http://www.w3.org/2005/Atom}entry at e2b720>
<Element {http://www.w3.org/2005/Atom}entry at e2b510>
<Element {http://www.w3.org/2005/Atom}entry at e2b750>

Continuing from the previous example, the root element is {http://www.w3.org/2005/Atom}feed.
The “length” of the root element is the number of child elements.
You can use the element itself as an iterator to loop through all of its child elements.
As you can see from the output, there are indeed 9 child elements: all of the feed-level metadata (title, subtitle, id, updated, and two link elements) followed by the three entry elements.

You may have guessed this already, but I want to point it out explicitly: the list of child elements only includes direct children. Each of the entry elements contains its own children, but those are not included in the list. They would be included in the list of each entry’s children, but they are not included in the list of the feed’s children. There are ways to find elements no matter how deeply nested they are; we’ll look at two such ways later in this chapter.
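
The difference between direct children and deeper descendants shows up clearly on a cut-down feed. This is a sketch only: the document below is abbreviated, and iter() is just one option for walking the whole subtree, not necessarily the approach used later in the chapter.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

root = ET.fromstring("""<feed xmlns="http://www.w3.org/2005/Atom">
  <title>dive into mark</title>
  <entry>
    <title>Dive into history, 2009 edition</title>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
  </entry>
</feed>""")

# List-style access only sees direct children: one title and two entries.
child_names = [child.tag.replace(ATOM, "") for child in root]
print(len(root), child_names)

# Each entry is itself a list, holding its own children.
print(len(root[1]))

# iter() walks the whole subtree, so it also finds the nested titles.
all_titles = [elem.text for elem in root.iter(ATOM + "title")]
print(all_titles)
```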

Attributes Are Dictionaries

XML isn’t just a collection of elements; each element can also have its own set of attributes. Once you have a reference to a specific element, you can easily get its attributes as a Python dictionary.

To refresh your memory, here are the first few lines of feed.xml, the XML document we’re working with.

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
...

# continuing from the previous example
>>> root.attrib                           ①
{'{http://www.w3.org/XML/1998/namespace}lang': 'en'}
>>> root[4]                               ②
<Element {http://www.w3.org/2005/Atom}link at e181b0>
>>> root[4].attrib                        ③
{'href': 'http://diveintomark.org/',
 'type': 'text/html',
 'rel': 'alternate'}
>>> root[3]                               ④
<Element {http://www.w3.org/2005/Atom}updated at e2b4e0>
>>> root[3].attrib                        ⑤
{}

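The attrib property is a dictionary of the element’s attributes. The original markup here was <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">. The xml: prefix refers to a built-in namespace that every XML document can use without declaring it, which is why the key appears as {http://www.w3.org/XML/1998/namespace}lang.
The fifth child, [4] in a 0-based list, is the link element.
The link element has three attributes: href, type, and rel.
The fourth child, [3] in a 0-based list, is the updated element.
The updated element has no attributes, so its attrib property is just an empty dictionary.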
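
Since attrib is an ordinary dictionary, the usual dictionary habits apply. Here is a brief sketch on a trimmed-down feed; the hreflang lookup is an assumption, chosen only to show what happens when an attribute is missing.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

root = ET.fromstring("""<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
</feed>""")

updated, link = root[0], root[1]

print(root.attrib[XML_LANG])          # namespaced attributes use a {namespace}name key
print(link.get("href"))               # get() on an element works like dict.get()
print(link.get("hreflang", "unset"))  # a missing attribute falls back to the default
print(updated.attrib)                 # no attributes at all: an empty dictionary
```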



Searching For Nodes Within An XML Document

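The findall() method returns a list of every element that matches a query. The first two queries below match only direct children of the root element; the third searches the whole document, so it also finds the link elements nested inside each entry.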

>>> tree.findall("{http://www.w3.org/2005/Atom}entry")
[<Element {http://www.w3.org/2005/Atom}entry at e2b4e0>, <Element {http://www.w3.org/2005/Atom}entry at e2b510>, <Element {http://www.w3.org/2005/Atom}entry at e2b540>]

>>> feed_links = tree.findall("{http://www.w3.org/2005/Atom}link")
>>> feed_links
[<Element {http://www.w3.org/2005/Atom}link at e181b0>, <Element {http://www.w3.org/2005/Atom}link at e2b4b0>]
>>> feed_links[0].attrib
{'href': 'http://diveintomark.org/', 'type': 'text/html', 'rel': 'alternate'}
>>> feed_links[1].attrib
{'href': 'http://diveintomark.org/feed/', 'type': 'application/atom+xml', 'rel': 'self'}

>>> all_links = tree.findall(".//{http://www.w3.org/2005/Atom}link")
>>> all_links
[<Element {http://www.w3.org/2005/Atom}link at e181b0>, <Element {http://www.w3.org/2005/Atom}link at e2b4b0>, <Element {http://www.w3.org/2005/Atom}link at e2b570>, <Element {http://www.w3.org/2005/Atom}link at e2b480>, <Element {http://www.w3.org/2005/Atom}link at e2b5a0>]
>>> all_links[0].attrib
{'href': 'http://diveintomark.org/', 'type': 'text/html', 'rel': 'alternate'}
>>> all_links[1].attrib
{'href': 'http://diveintomark.org/feed/', 'type': 'application/atom+xml', 'rel': 'self'}
>>> all_links[2].attrib
{'href': 'http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition', 'type': 'text/html', 'rel': 'alternate'}
>>> all_links[3].attrib
{'href': 'http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress', 'type': 'text/html', 'rel': 'alternate'}
>>> all_links[4].attrib
{'href': 'http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats', 'type': 'text/html', 'rel': 'alternate'}
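
Spelling out the full namespace in every query gets tedious. As a sketch of one alternative, findall() and find() in the standard library also accept a dictionary that maps prefixes to namespaces; the atom prefix and the NS name here are my own choices, and the feed is abbreviated.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring("""<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
  <link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
  <entry>
    <link rel="alternate" type="text/html"
      href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
  </entry>
</feed>""")

# "atom:link" matches direct children only; ".//atom:link" matches at any depth.
feed_links = root.findall("atom:link", NS)
all_links = root.findall(".//atom:link", NS)
print(len(feed_links), len(all_links))

# find() returns the first match, or None when nothing matches.
self_link = root.find("atom:link[@rel='self']", NS)
print(self_link.get("href"))
print(root.find("atom:author", NS))
```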


Going Further With lxml

lxml is a third-party library whose etree module is designed as a drop-in replacement for the ElementTree code we’ve used so far.

>>> from lxml import etree

From here on out, the examples use lxml.etree explicitly, because the functions they call are specific to lxml.
>>> import lxml.etree
>>> nsmap = {"atom": "http://www.w3.org/2005/Atom"}
>>> tree = lxml.etree.parse("examples/feed.xml")
>>> entries = tree.xpath("//atom:category[@term='accessibility']/..", namespaces=nsmap)
>>> entries
[<Element {http://www.w3.org/2005/Atom}entry at e2b630>]
>>> entry = entries[0]
>>> entry.xpath("./atom:title/text()", namespaces=nsmap)
['Accessibility is a harsh mistress']
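
If lxml isn’t installed, this particular query happens to fall within the limited XPath subset that the standard library's ElementTree supports. The sketch below is a standard-library stand-in, not lxml code: findall() takes the place of xpath(), findtext() takes the place of the text() step, and the feed is abbreviated.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

root = ET.fromstring("""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>Dive into history, 2009 edition</title>
    <category scheme="http://diveintomark.org" term="docbook"/>
  </entry>
  <entry>
    <title>Accessibility is a harsh mistress</title>
    <category scheme="http://diveintomark.org" term="accessibility"/>
  </entry>
</feed>""")

# Find matching category elements at any depth, then step up to their parents.
entries = root.findall(".//atom:category[@term='accessibility']/..", NS)
titles = [entry.findtext("atom:title", namespaces=NS) for entry in entries]
print(titles)
```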


Customizing Your XML Parser

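lxml lets you create a parser object with custom options and pass it to parse(). The parser below refuses network access while parsing, cleans up redundant namespace declarations, tries to recover from malformed markup, and discards ignorable whitespace and comments.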

>>> import lxml.etree
>>> parser = lxml.etree.XMLParser(no_network=True, ns_clean=True, recover=True, remove_blank_text=True, remove_comments=True)
>>> tree = lxml.etree.parse("examples/feed.xml", parser)


Incremental Parsing

FIXME
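
As a rough sketch of what incremental parsing can look like, the standard library's ElementTree offers iterparse(), which hands you elements as the parser finishes reading them instead of building the whole tree first. The in-memory bytes stand in for a large file, and the feed is abbreviated.

```python
import io
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed_bytes = b"""<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>Dive into history, 2009 edition</title></entry>
  <entry><title>Accessibility is a harsh mistress</title></entry>
</feed>"""

titles = []
# iterparse() yields (event, element) pairs while it reads the source.
# An "end" event means the element and all of its children are complete.
for event, elem in ET.iterparse(io.BytesIO(feed_bytes), events=("end",)):
    if elem.tag == ATOM + "entry":
        titles.append(elem.findtext(ATOM + "title"))
        # Discard the entry's contents once handled, to keep memory use down.
        elem.clear()

print(titles)
```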

Generating XML

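You can also build XML documents from scratch. Element and attribute names use the same {namespace}localname format you saw when parsing. Without a namespace map, lxml invents a prefix (ns0) when it serializes the element.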

>>> import lxml.etree
>>> new_feed = lxml.etree.Element("{http://www.w3.org/2005/Atom}feed", attrib={"{http://www.w3.org/XML/1998/namespace}lang": "en"})
>>> print(lxml.etree.tounicode(new_feed))
<ns0:feed xmlns:ns0="http://www.w3.org/2005/Atom" xml:lang="en"/>


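To serialize the element with a default namespace instead of a generated prefix, pass a namespace map in which the key None stands for the default namespace.

>>> import lxml.etree
>>> NSMAP = {None: "http://www.w3.org/2005/Atom"}
>>> new_feed = lxml.etree.Element("feed", nsmap=NSMAP)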
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns="http://www.w3.org/2005/Atom"/>
>>> new_feed.set("{http://www.w3.org/XML/1998/namespace}lang", "en")
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"/>


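Child elements are created with SubElement(), text content is assigned through the text attribute, and passing pretty_print=True adds line breaks to the serialized output.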

>>> title = lxml.etree.SubElement(new_feed, "title", attrib={"type":"html"})
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html"/></feed>
>>> title.text = "dive into mark"
>>> print(lxml.etree.tounicode(new_feed))
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into mark</title></feed>
>>> print(lxml.etree.tounicode(new_feed, pretty_print=True))
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title type="html">dive into mark</title>
</feed>
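
For comparison, here is a sketch of building the same small feed with the standard library's ElementTree instead of lxml. There is no nsmap argument in this version, so the element names are fully qualified, and I check the result by parsing it again rather than relying on the exact prefix the serializer picks.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Names are written in the same {namespace}localname form used when parsing.
new_feed = ET.Element(f"{{{ATOM}}}feed", {XML_LANG: "en"})
title = ET.SubElement(new_feed, f"{{{ATOM}}}title", {"type": "html"})
title.text = "dive into mark"

xml_text = ET.tostring(new_feed, encoding="unicode")
print(xml_text)

# Round trip: parsing the output gives back the same names and content.
reparsed = ET.fromstring(xml_text)
print(reparsed.tag, reparsed[0].text)
```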


Further Reading


XML on Wikipedia.org
The ElementTree XML API
Elements and Element Trees
The ElementTree iterparse Function
Parsing XML and HTML with lxml
XPath and XSLT with lxml


© 2001–9 Mark Pilgrim