diff --git a/xml.html b/xml.html
index 78bb9ee..564adeb 100755
--- a/xml.html
+++ b/xml.html
@@ -596,9 +596,9 @@ except ImportError:

 The XML specification mandates that all conforming XML parsers employ “draconian error handling.” That is, they must halt and catch fire as soon as they detect any sort of wellformedness error in the XML document. Wellformedness errors include mismatched start and end tags, undefined entities, illegal Unicode characters, and a number of other esoteric rules. This is in stark contrast to other common formats like HTML: your browser doesn’t stop rendering a web page if you forget to close an HTML tag or escape an ampersand in an attribute value. (It is a common misconception that HTML has no defined error handling. HTML error handling is actually quite well-defined, but it’s significantly more complicated than “halt and catch fire on first error.”)

-Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.

+Some people (myself included) believe that it was a mistake for the inventors of XML to mandate draconian error handling. Don’t get me wrong; I can certainly see the allure of simplifying the error handling rules. But in practice, the concept of “wellformedness” is trickier than it sounds, especially for XML documents (like Atom feeds) that are published on the web and served over HTTP. Despite the maturity of XML, which standardized on draconian error handling in 1997, surveys continually show a significant fraction of Atom feeds on the web are plagued with wellformedness errors.

-So, I have both theoretical and practical reasons to parse XML documents “at any cost,” that is, not to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.

+So, I have both theoretical and practical reasons to parse XML documents “at any cost,” that is, not to halt and catch fire at the first wellformedness error. If you find yourself wanting to do this too, lxml can help.
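
To make “halt and catch fire” concrete, here is a minimal sketch using the standard library’s xml.etree.ElementTree rather than lxml. The sample document and the try_strict_parse() helper are hypothetical, made up for illustration; the only assumption is that a strict parser rejects the undefined &hellip; entity.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: &hellip; is not one of XML's five predefined entities,
# and this document declares no DTD that would define it.
BROKEN_FEED = "<feed><title>dive into &hellip;</title></feed>"


def try_strict_parse(text):
    """Return the root element, or the ParseError the strict parser raised."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as err:
        # Draconian error handling: no partial tree is handed back.
        return err


result = try_strict_parse(BROKEN_FEED)
print(type(result).__name__, result)
```

The parser gives up entirely; there is no partially built tree to fall back on.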

 Here is a fragment of a broken XML document. I’ve highlighted the wellformedness error.
@@ -645,7 +645,7 @@ lxml.etree.XMLSyntaxError: Entity 'hellip' not defined, line 3, column 28 .

   1. To create a custom parser, instantiate the lxml.etree.XMLParser class. It can take a number of different named arguments. The one we’re interested in here is the recover argument. When set to True, the XML parser will try its best to “recover” from wellformedness errors.
-  2. To parse an XML document with your custom parser, pass the parser object as the second argument to the parse() function. Note that lxml does not raise an exception about the undefined &hellip; entity.
+  2. To parse an XML document with your custom parser, pass the parser object as the second argument to the parse() function. Note that lxml does not raise an exception about the undefined &hellip; entity.
   3. The parser keeps a log of the wellformedness errors that it has encountered. (This is actually true regardless of whether it is set to recover from those errors or not.)
   4. Since it didn’t know what to do with the undefined &hellip; entity, the parser just silently dropped it. The text content of the title element becomes 'dive into '.
   5. As you can see from the serialization, the &hellip; entity didn’t get moved; it was simply dropped.
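
The steps above can be sketched roughly as follows. This is an illustration, not the listing these notes annotate: the sample document and the parse_leniently() helper are hypothetical, and it uses lxml.etree.fromstring() on an in-memory byte string instead of parse() on a file so that the example is self-contained. It assumes lxml is installed.

```python
from lxml import etree

# Hypothetical sample; the original broken document is not shown in this excerpt.
BROKEN_FEED = b"<feed><title>dive into &hellip;</title></feed>"


def parse_leniently(data):
    """Parse with a recovering parser; return the root and the parser's error log."""
    parser = etree.XMLParser(recover=True)   # try to recover instead of raising
    root = etree.fromstring(data, parser)    # the parser is the second argument
    return root, parser.error_log


root, errors = parse_leniently(BROKEN_FEED)
print(repr(root.findtext("title")))
for entry in errors:
    # Each log entry records where the problem was found and what it was.
    print(entry.line, entry.column, entry.message)
print(etree.tostring(root))
```

Exactly what ends up in the recovered tree may depend on the libxml2 version underneath lxml, so check the error log before you trust the result.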