From 7afb38878fc88589a7d6d0111aece5614ee3478e Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Thu, 5 Feb 2009 15:25:11 -0500
Subject: [PATCH] more validation fiddling

---
 case-study-porting-chardet-to-python-3.html | 10 ++++++----
 dip3.css                                    | 10 +++++-----
 index.html                                  |  2 +-
 porting-code-to-python-3-with-2to3.html     |  5 +++--
 your-first-python-program.html              |  6 ++++--
 5 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 0437de1..0cbb231 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -8,12 +8,14 @@

skip to main content -

+ +

Case study: porting chardet to Python 3

Words, words. They’re all we have to go on.
Rosencrantz and Guildenstern are Dead

@@ -26,7 +28,7 @@ body{counter-reset:h1 19}
  • Yippie! Screw the standards, I’ll just auto-detect everything!
  • Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
-  • Diving in
+  • Diving in
    1. UTF-n with a BOM
    2. Escaped encodings
@@ -67,7 +69,7 @@ body{counter-reset:h1 19}

      Why bother with auto-detection if it’s slow, inaccurate, and non-standard?

      Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.

      If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.
-      Diving in
+      Diving in

      This is a brief guide to navigating the code itself.

      The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
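The relationship described above, a detector class plus a thin convenience function, can be sketched roughly as follows. This is not chardet's code: `TinyDetector`, the `_BOMS` table, and the shape of the result dictionary are hypothetical names chosen for illustration, and the toy class only recognizes byte order marks.

```python
import codecs

# Hypothetical BOM table for illustration. UTF-32 entries come first because
# the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


class TinyDetector:
    """Toy stand-in for a detector class (not chardet's UniversalDetector)."""

    def __init__(self):
        self._buffer = b""
        self.result = {"encoding": None, "confidence": 0.0}

    def feed(self, data):
        # Accumulate bytes; a real detector would analyze them incrementally.
        self._buffer += data

    def close(self):
        # Decide once all input has been seen; only BOMs are recognized here.
        for bom, name in _BOMS:
            if self._buffer.startswith(bom):
                self.result = {"encoding": name, "confidence": 1.0}
                break
        return self.result


def detect(data):
    """Convenience wrapper: create a detector, run it, return its result."""
    detector = TinyDetector()
    detector.feed(data)
    return detector.close()
```

Calling `detect(codecs.BOM_UTF8 + b"hello")` goes through the same create, feed, close sequence a caller would otherwise write by hand, which is the only job of the wrapper.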

      There are 5 categories of encodings that UniversalDetector handles:
diff --git a/dip3.css b/dip3.css
index 4de2f62..54fa7d5 100644
--- a/dip3.css
+++ b/dip3.css
@@ -7,10 +7,10 @@ a:link{color:#1b67c9}
 a:visited{color:darkorchid}
 h1 a,h2 a,h3 a,#nav a{color:inherit !important}
 abbr,acronym{letter-spacing:0.1em;text-transform:lowercase;font-variant:small-caps}
-h1,h2,h3,p,ul,ol,#search{margin:1.75em 0;font-size:medium}
+h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
+h1,.nav{display:inline}
 h2,h3{clear:both}
 form div{float:right}
-form p,form h1{display:inline}
 pre{white-space:pre-wrap;padding-left:2.154em;line-height:1.75;border-left:1px dotted}
 pre,kbd,code,samp{font-family:Consolas,Inconsolata,Monaco,monospace;font-size:medium;word-spacing:0}
 pre a{padding:0.4375em 0;border:0}
@@ -20,7 +20,7 @@ pre a:hover{border:0}
 kbd{font-weight:bold}
 .prompt{color:#667}/*the neighbor of the beast*/
 td pre{margin:0;padding:0;border:0}
-li ol{margin:0 inherit}
+li ol{margin:0}
 .c{text-align:center;font-size:small}
 p.fancy:first-letter{float:left;background:transparent;color:gainsboro;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
 blockquote.q{margin:auto;text-align:right;font-style:oblique}
@@ -38,9 +38,9 @@ span,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style
 table.simple th{font-family:inherit !important}
 .fr{width:100%;border:1px dotted}
 .fr h4{margin-top:-1.2em;margin-left:-1em;width:8.5em;border:1px dotted;padding: 3px 3px 3px 13px;background:#fff;color:inherit;position:relative}
-.hover{background:#eee !important;color:inherit !important;cursor:default !important}
+.hover{background:#eee;color:inherit;cursor:default}
 body{counter-reset:h1}
-h1:before{counter-increment:h1;content:"Chapter " counter(h1) ". "}
+h1:before{content:"Chapter " counter(h1) ". "}
 h1{counter-reset:h2}
 h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
 h2{counter-reset:h3}
diff --git a/index.html b/index.html
index eb572c8..a04e0d1 100644
--- a/index.html
+++ b/index.html
@@ -8,7 +8,7 @@

      skip to main content -

      + +

      Your first Python program

      Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate.
      Ven. Henepola Gunaratana