From 524c8d2a47f1f9afae0592ae903c157e97fcf45b Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Tue, 19 May 2009 11:42:08 -0400
Subject: [PATCH] structure of an atom feed in xml chapter

---
 examples/feed.xml | 19 ++++-----
 xml.html | 103 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 98 insertions(+), 24 deletions(-)

diff --git a/examples/feed.xml b/examples/feed.xml
index 4332135..13a94e6 100644
--- a/examples/feed.xml
+++ b/examples/feed.xml
@@ -1,17 +1,16 @@
- dive into mark - currently between addictions + dive into mark + currently between addictions tag:diveintomark.org,2001-07-29:/ 2009-03-27T21:56:07Z - Mark http://diveintomark.org/ - <![CDATA[Dive into history, 2009 edition]]> + Dive into history, 2009 edition tag:diveintomark.org,2009-03-27:/archives/20090327172042
@@ -20,16 +19,17 @@
- Putting an entire chapter on one page sounds bloated, but - consider this: my longest chapter so far would be 75 printed pages, and it - loads in under 5 seconds. On dialup. + Putting an entire chapter on one page sounds + bloated, but consider this &mdash; my longest chapter so far + would be 75 printed pages, and it loads in under 5 seconds&hellip; + On dialup.</summary> Mark http://diveintomark.org/ - <![CDATA[Accessibility is a harsh mistress]]> + Accessibility is a harsh mistress tag:diveintomark.org,2009-03-21:/archives/20090321200928
@@ -44,8 +44,7 @@
Mark http://diveintomark.org/ - <![CDATA[A gentle introduction to video encoding, - part 1: container formats]]> + A gentle introduction to video encoding, part 1: container formats tag:diveintomark.org,2008-12-18:/archives/20081218155422

diff --git a/xml.html b/xml.html
index 3f89a9b..bc5ca02 100644
--- a/xml.html
+++ b/xml.html
@@ -25,18 +25,18 @@ mark{display:inline}

[download feed.xml]

<?xml version="1.0" encoding="utf-8"?>
 <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
-  <title type="text">dive into mark</title>
-  <subtitle type="text">currently between addictions</subtitle>
+  <title>dive into mark</title>
+  <subtitle>currently between addictions</subtitle>
   <id>tag:diveintomark.org,2001-07-29:/</id>
   <updated>2009-03-27T21:56:07Z</updated>
   <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
-  <link rel="self" href="http://diveintomark.org/feed/" type="application/atom+xml"/>
+  <link rel="self" type="application/atom+xml" href="http://diveintomark.org/feed/"/>
   <entry>
     <author>
       <name>Mark</name>
       <uri>http://diveintomark.org/</uri>
     </author>
-    <title type="html"><![CDATA[Dive into history, 2009 edition]]></title>
+    <title>Dive into history, 2009 edition</title>
     <link rel="alternate" type="text/html"
       href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
     <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>
@@ -45,16 +45,17 @@ mark{display:inline}
     <category scheme="http://diveintomark.org" term="diveintopython"/>
     <category scheme="http://diveintomark.org" term="docbook"/>
     <category scheme="http://diveintomark.org" term="html"/>
-    <summary type="html">Putting an entire chapter on one page sounds bloated, but
-      consider this: my longest chapter so far would be 75 printed pages, and it
-      loads in under 5 seconds. On dialup.</summary>
+  <summary type="html">Putting an entire chapter on one page sounds
+    bloated, but consider this &amp;mdash; my longest chapter so far
+    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
+    On dialup.</summary>
   </entry>
   <entry>
     <author>
       <name>Mark</name>
       <uri>http://diveintomark.org/</uri>
     </author>
-    <title type="html"><![CDATA[Accessibility is a harsh mistress]]></title>
+    <title>Accessibility is a harsh mistress</title>
     <link rel="alternate" type="text/html"
       href="http://diveintomark.org/archives/2009/03/21/accessibility-is-a-harsh-mistress"/>
     <id>tag:diveintomark.org,2009-03-21:/archives/20090321200928</id>
@@ -69,8 +70,7 @@ mark{display:inline}
       <name>Mark</name>
       <uri>http://diveintomark.org/</uri>
     </author>
-    <title type="html"><![CDATA[A gentle introduction to video encoding,
-      part 1: container formats]]></title>
+    <title>A gentle introduction to video encoding, part 1: container formats</title>
     <link rel="alternate" type="text/html"
       href="http://diveintomark.org/archives/2008/12/18/give-part-1-container-formats"/>
     <id>tag:diveintomark.org,2008-12-18:/archives/20081218155422</id>
@@ -102,14 +102,19 @@ mark{display:inline}
 
 • This is the matching end tag of the foo element. Like balancing parentheses in writing or mathematics or code, every start tag must be closed (matched) by a corresponding end tag. -

    Elements can be nested. An element bar inside an element foo is said to be a subelement or child of foo. +

    Elements can be nested to any depth. An element bar inside an element foo is said to be a subelement or child of foo.

    <foo>
       <bar></bar>
     </foo>
     
    -

    Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element. Attribute names can not be repeated within an element. Attribute values must be quoted. +

    The first element in every XML document is called the root element. An XML document can only have one root element. The following is not an XML document, because it has two root elements: + +

    <foo></foo>
    +<bar></bar>
    + +
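The one-root rule is easy to check for yourself. The following is a minimal sketch using the standard library's xml.etree.ElementTree; the helper name is_well_formed is made up for illustration and is not part of any library.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as a single XML document (hypothetical helper)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        # The parser rejects anything that is not one well-formed document,
        # including content that follows the end of the root element.
        return False
    return True

one_root = "<foo><bar></bar></foo>"
two_roots = "<foo></foo>\n<bar></bar>"

print(is_well_formed(one_root))   # a single root element with one child
print(is_well_formed(two_roots))  # a second root element after the first
```

The first call should report a well-formed document and the second should not, since the parser stops with an error once it sees a second top-level element.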

    Elements can have attributes, which are name-value pairs. Attributes are listed within the start tag of an element and separated by whitespace. Attribute names cannot be repeated within an element. Attribute values must be quoted.

    <foo lang="en">          
       <bar lang="fr"></bar>  
    @@ -161,13 +166,83 @@ mark{display:inline}
     
     

    As far as an XML parser is concerned, the previous two XML documents are identical. Namespace + element name = XML identity. Prefixes only exist to refer to namespaces, so the actual prefix name (atom:) is irrelevant. The namespaces match, the element names match, the attributes (or lack of attributes) match, and each element’s text content matches, therefore the XML documents are the same. +
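To see that the prefix really is irrelevant, here is a small sketch with xml.etree.ElementTree. The two documents below are illustrative stand-ins written for this example (one uses a default namespace, the other an atom: prefix), and summarize is a hypothetical helper, not a library function.

```python
import xml.etree.ElementTree as ET

# Same namespace, declared once as the default and once with a prefix.
default_ns = ('<feed xmlns="http://www.w3.org/2005/Atom">'
              '<title>dive into mark</title></feed>')
prefixed = ('<atom:feed xmlns:atom="http://www.w3.org/2005/Atom">'
            '<atom:title>dive into mark</atom:title></atom:feed>')

def summarize(root):
    """List (qualified name, attributes, text) for every element, in document order."""
    return [(e.tag, dict(e.attrib), (e.text or "").strip()) for e in root.iter()]

a = summarize(ET.fromstring(default_ns))
b = summarize(ET.fromstring(prefixed))

print(a[0][0])  # the root's name as the parser sees it: {namespace}localname
print(a == b)   # the prefix does not survive parsing, so the summaries match
```

ElementTree reports each element as {namespace}name, which is exactly the "namespace + element name" identity described above.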

    Finally, XML documents can contain character encoding information on the first line, before the root element. (If you’re curious how a document can contain information that needs to be known before the document can be parsed, Appendix F of the XML specification details how to resolve this Catch-22.) + +

    <?xml version="1.0" encoding="utf-8"?>
    + +

    And now you know just enough XML to be dangerous! +

    The Structure Of An Atom Feed

    Think of a weblog, or in fact any website with frequently updated content, like CNN.com. The site itself has a title (“CNN.com”), a subtitle (“Breaking News, U.S., World, Weather, Entertainment & Video News”), a last-updated date (“updated 12:43 p.m. EDT, Sat May 16, 2009”), and a list of articles posted at different times. Each article also has a title, a first-published date (and maybe also a last-updated date, if they published a correction or fixed a typo), and a unique URL.

    The Atom syndication format is designed to capture all of this information in a standard format. My weblog and CNN.com are wildly different in design, scope, and audience, but they both have the same basic structure. CNN.com has a title; my blog has a title. CNN.com publishes articles; I publish articles. -

    At the top level is the “root” element, which every Atom feed shares: the <feed> element in the Atom namespace (http://www.w3.org/2005/Atom). ... FIXME +

    At the top level is the root element, which every Atom feed shares: the feed element in the http://www.w3.org/2005/Atom namespace. + +

    
    +<feed xmlns="http://www.w3.org/2005/Atom"  
    +      xml:lang="en">                       
    +
      +
    1. http://www.w3.org/2005/Atom is the Atom namespace. +
    2. Any element can contain an xml:lang attribute, which declares the language of the element and its children. In this case, the xml:lang attribute is declared once on the root element, which means the entire feed is in English. +
    + +
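As a rough sketch of how a parser exposes these two pieces of information, the snippet below reads them back with xml.etree.ElementTree. The constant name XML_NS is my own; the xml: prefix is bound by the XML specification to the namespace URI it holds.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined and always refers to this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"></feed>')

print(root.tag)                      # element name, qualified by the Atom namespace
print(root.get(f"{{{XML_NS}}}lang"))  # value of the xml:lang attribute
```

Note that the xmlns declaration itself does not show up as an attribute; the parser folds it into the element's qualified name.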

    An Atom feed contains several pieces of information about the feed itself. These are declared as children of the root-level feed element. + +

    <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    +  <title>dive into mark</title>                                             
    +  <subtitle>currently between addictions</subtitle>                         
    +  <id>tag:diveintomark.org,2001-07-29:/</id>                                
    +  <updated>2009-03-27T21:56:07Z</updated>                                   
    +  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>  
    +
      +
    1. The title of this feed is dive into mark. +
    2. The subtitle of this feed is currently between addictions. +
    3. Every feed needs a globally unique identifier. See RFC 4151 for how to create one. +
    4. This feed was last updated on March 27, 2009, at 21:56 GMT. This is usually equivalent to the last-modified date of the most recent article. +
    5. Now things start to get interesting. This link element has no text content, but it has three attributes: rel, type, and href. The rel value tells you what kind of link this is; rel="alternate" means that this is a link to an alternate representation of this feed. The type="text/html" attribute means that this is a link to an HTML page. And the link target is given in the href attribute. +
    + +

    Now we know that this is a feed for a site named “dive into mark”, which is available at http://diveintomark.org/ and was last updated on March 27, 2009. + +
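Here is one way those three facts could be pulled out of the feed-level metadata, sketched with xml.etree.ElementTree. The variable names and the ATOM constant are assumptions for illustration; the closing feed tag is added so the fragment parses on its own.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace in ElementTree's {uri} form

feed_xml = """<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
  <title>dive into mark</title>
  <subtitle>currently between addictions</subtitle>
  <id>tag:diveintomark.org,2001-07-29:/</id>
  <updated>2009-03-27T21:56:07Z</updated>
  <link rel="alternate" type="text/html" href="http://diveintomark.org/"/>
</feed>"""

feed = ET.fromstring(feed_xml)

title = feed.findtext(ATOM + "title")
# The trailing Z marks UTC; this simple format string yields a naive datetime.
updated = datetime.strptime(feed.findtext(ATOM + "updated"), "%Y-%m-%dT%H:%M:%SZ")
# Pick the link whose rel attribute is "alternate" and read its href.
site_url = next(link.get("href")
                for link in feed.iter(ATOM + "link")
                if link.get("rel") == "alternate")

print(title, site_url, updated.date())
```

Each lookup finds its element by qualified name rather than by position, so the sketch does not rely on the children appearing in any particular order.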

    +

    Although the order of elements can be relevant in some XML documents, it is not relevant in an Atom feed. +

    + +

    After the feed-level metadata is the list of the most recent articles. An article looks like this: + +

    <entry>
    +  <author>                                                                 
    +    <name>Mark</name>
    +    <uri>http://diveintomark.org/</uri>
    +  </author>
    +  <title>Dive into history, 2009 edition</title>                           
    +  <link rel="alternate" type="text/html"                                   
    +    href="http://diveintomark.org/archives/2009/03/27/dive-into-history-2009-edition"/>
    +  <id>tag:diveintomark.org,2009-03-27:/archives/20090327172042</id>        
    +  <updated>2009-03-27T21:56:07Z</updated>                                  
    +  <published>2009-03-27T17:20:42Z</published>        
    +  <category scheme="http://diveintomark.org" term="diveintopython"/>       
    +  <category scheme="http://diveintomark.org" term="docbook"/>
    +  <category scheme="http://diveintomark.org" term="html"/>
    +  <summary type="html">Putting an entire chapter on one page sounds        
    +    bloated, but consider this &amp;mdash; my longest chapter so far
    +    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
    +    On dialup.</summary>
    +</entry>                                                                   
    +
      +
    1. The author element tells who wrote this article: some guy named Mark, whom you can find loafing at http://diveintomark.org/. (This is the same as the alternate link in the feed metadata, but it doesn’t have to be. Many weblogs have multiple authors, each with their own personal website.) +
    2. The title element gives the title of the article, “Dive into history, 2009 edition”. +
    3. As with the feed-level alternate link, this link element gives the address of the HTML version of this article. +
    4. Entries, like feeds, need a unique identifier. +
    5. Entries have two dates: a first-published date (published) and a last-modified date (updated). +
    6. Entries can have an arbitrary number of categories. This article is filed under diveintopython, docbook, and html. +
    7. The summary element gives a brief summary of the article. (There is also a content element, not shown here, if you want to include the complete article text in your feed.) This summary element has the Atom-specific type="html" attribute, which specifies that this summary is a snippet of HTML, not plain text. This is important, since it has HTML-specific entities in it (&mdash; and &hellip;) which should be rendered as “—” and “…” rather than displayed directly. +
    8. Finally, the end tag for the entry element, signaling the end of the metadata for this article. +
    + +
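The type="html" point is worth seeing in action. The sketch below is an abbreviated copy of the entry above, parsed with xml.etree.ElementTree; the xmlns declaration on entry is added only so the fragment stands alone (in the real feed it is inherited from the root feed element). Using html.unescape is a simplification on my part: it decodes entities but does nothing about HTML tags, which a real feed reader would also have to handle.

```python
import html
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # namespace in ElementTree's {uri} form

entry_xml = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Dive into history, 2009 edition</title>
  <updated>2009-03-27T21:56:07Z</updated>
  <published>2009-03-27T17:20:42Z</published>
  <category scheme="http://diveintomark.org" term="diveintopython"/>
  <category scheme="http://diveintomark.org" term="docbook"/>
  <category scheme="http://diveintomark.org" term="html"/>
  <summary type="html">Putting an entire chapter on one page sounds
    bloated, but consider this &amp;mdash; my longest chapter so far
    would be 75 printed pages, and it loads in under 5 seconds&amp;hellip;
    On dialup.</summary>
</entry>"""

entry = ET.fromstring(entry_xml)

# An entry can carry any number of category elements.
categories = [c.get("term") for c in entry.findall(ATOM + "category")]

summary = entry.find(ATOM + "summary")
raw = summary.text  # the XML parser has already turned &amp;mdash; into &mdash;
# Only treat the text as HTML when the feed says so.
rendered = html.unescape(raw) if summary.get("type") == "html" else raw

print(categories)
print("&mdash;" in raw, "&mdash;" in rendered)
```

After XML parsing, the summary still contains the literal text &mdash; and &hellip;. It is the second, HTML-aware step that turns them into the dash and the ellipsis, which is why the type attribute matters.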

    Parsing XML

    @@ -322,7 +397,7 @@ from here on out, use lxml.etree explicitly because these functions are specific
     <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><title type="html">dive into mark</title></feed>
     >>> print(lxml.etree.tounicode(new_feed, pretty_print=True))
     <feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    - <title type="html">dive into mark</title>
    +<title type="html">dive into mark</title>
     </feed>