more validation fiddling

Mark Pilgrim
2009-02-05 15:25:11 -05:00
parent 13f50a79da
commit 7afb38878f
5 changed files with 19 additions and 14 deletions
+6 -4
@@ -8,12 +8,14 @@
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
-body{counter-reset:h1 19}
+body{counter-reset:h1 20}
</style>
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
-<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div><p>You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span></p> <h1>Case study: porting <code>chardet</code> to Python 3</h1></form>
+<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
+<p class="nav">You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span>
+<h1>Case study: porting <code>chardet</code> to Python 3</h1>
<blockquote class="q">
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <cite>Rosencrantz and Guildenstern are Dead</cite>
</blockquote>
@@ -26,7 +28,7 @@ body{counter-reset:h1 19}
<li><a href="#faq.yippie">Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</a>
<li><a href="#faq.why">Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</a>
</ol>
-<li><a href="#divingin">Diving in</a>
+<li><a href="#divingin2">Diving in</a>
<ol>
<li><a href="#how.bom"><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></a>
<li><a href="#how.esc">Escaped encodings</a>
@@ -67,7 +69,7 @@ body{counter-reset:h1 19}
<h3 id="faq.why">Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn&#8217;t work. There are also some poorly designed standards that have no way to specify encoding at all.
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href="http://feedparser.org/">Universal Feed Parser</a>, which calls this auto-detection library <a href="http://feedparser.org/docs/character-encoding.html">only after exhausting all other options</a>.
-<h2 id="divingin">Diving in</h2>
+<h2 id="divingin2">Diving in</h2>
<p>This is a brief guide to navigating the code itself.
<p>The main entry point for the detection algorithm is <code class="filename">universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code class="filename">chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
+5 -5
@@ -7,10 +7,10 @@ a:link{color:#1b67c9}
a:visited{color:darkorchid}
h1 a,h2 a,h3 a,#nav a{color:inherit !important}
abbr,acronym{letter-spacing:0.1em;text-transform:lowercase;font-variant:small-caps}
-h1,h2,h3,p,ul,ol,#search{margin:1.75em 0;font-size:medium}
+h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
+h1,.nav{display:inline}
h2,h3{clear:both}
form div{float:right}
-form p,form h1{display:inline}
pre{white-space:pre-wrap;padding-left:2.154em;line-height:1.75;border-left:1px dotted}
pre,kbd,code,samp{font-family:Consolas,Inconsolata,Monaco,monospace;font-size:medium;word-spacing:0}
pre a{padding:0.4375em 0;border:0}
@@ -20,7 +20,7 @@ pre a:hover{border:0}
kbd{font-weight:bold}
.prompt{color:#667}/*the neighbor of the beast*/
td pre{margin:0;padding:0;border:0}
-li ol{margin:0 inherit}
+li ol{margin:0}
.c{text-align:center;font-size:small}
p.fancy:first-letter{float:left;background:transparent;color:gainsboro;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
blockquote.q{margin:auto;text-align:right;font-style:oblique}
@@ -38,9 +38,9 @@ span,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style
table.simple th{font-family:inherit !important}
.fr{width:100%;border:1px dotted}
.fr h4{margin-top:-1.2em;margin-left:-1em;width:8.5em;border:1px dotted;padding: 3px 3px 3px 13px;background:#fff;color:inherit;position:relative}
-.hover{background:#eee !important;color:inherit !important;cursor:default !important}
+.hover{background:#eee;color:inherit;cursor:default}
body{counter-reset:h1}
-h1:before{counter-increment:h1;content:"Chapter " counter(h1) ". "}
+h1:before{content:"Chapter " counter(h1) ". "}
h1{counter-reset:h2}
h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
h2{counter-reset:h3}
+1 -1
@@ -8,7 +8,7 @@
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
body{counter-reset:h1 -1}
-h1{background:papayawhip}
+h1{background:papayawhip;display:block}
h2{margin-left:1.75em}
h3{margin-left:3.5em}
.appendix h1:before{content:""}
+3 -2
@@ -15,7 +15,9 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
-<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div><p>You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span></p> <h1>Porting code to Python 3 with <code>2to3</code></h1></form>
+<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
+<p class="nav">You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span>
+<h1>Porting code to Python 3 with <code>2to3</code></h1>
<blockquote class="q">
<p><span>&#x275D;</span> Life is pleasant. Death is peaceful. It&#8217;s the transition that&#8217;s troublesome. <span>&#x275E;</span><br>&mdash; Isaac Asimov (attributed)
</blockquote>
@@ -46,7 +48,6 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
<li><a href="#exec"><code>exec</code> statement</a>
<li><a href="#execfile"><code>execfile</code> statement</a> (3.1+)
<li><a href="#repr"><code>repr</code> literals (backticks)</a>
-<li><a href="#exceptions">Exceptions</a>
<li><a href="#except"><code>try...except</code> statement</a>
<li><a href="#raise"><code>raise</code> statement</a>
<li><a href="#throw"><code>throw</code> method on generators</a>
+4 -2
@@ -8,12 +8,14 @@
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
-body{counter-reset:h1 0}
+body{counter-reset:h1 1}
</style>
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
-<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div><p>You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span></p> <h1>Your first Python program</h1></form>
+<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
+<p class="nav">You are here: <a href="/">Dive Into Python 3</a> <span>&#8227;</span>
+<h1>Your first Python program</h1>
<blockquote class="q">
<p><span>&#x275D;</span> Don&#8217;t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. <span>&#x275E;</span><br>&mdash; <cite>Ven. Henepola Gunararatana</cite>
</blockquote>