diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 0437de1..0cbb231 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -8,12 +8,14 @@
+
+chardet to Python 3
@@ -26,7 +28,7 @@ body{counter-reset:h1 19}
❝ Words, words. They’re all we have to go on. ❞
— Rosencrantz and Guildenstern are Dead
UTF-n with a BOM
Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.
-
This is a brief guide to navigating the code itself.
The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
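The "convenience function" pattern described above can be sketched as follows. This is a minimal illustration, not chardet's actual code: `TinyDetector` is a hypothetical stand-in that only recognizes byte order marks, and its `feed`/`close`/`result` names are assumptions chosen for the example. The point is the shape of `detect`: create a detector object, hand it the data, and return its result.

```python
import codecs


class TinyDetector:
    """Hypothetical stand-in for a detector class; it only recognizes BOMs."""

    # UTF-32 must be checked before UTF-16, because the UTF-32LE BOM
    # (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
    _BOMS = [
        (codecs.BOM_UTF8, "UTF-8-SIG"),
        (codecs.BOM_UTF32_LE, "UTF-32LE"),
        (codecs.BOM_UTF32_BE, "UTF-32BE"),
        (codecs.BOM_UTF16_LE, "UTF-16LE"),
        (codecs.BOM_UTF16_BE, "UTF-16BE"),
    ]

    def __init__(self):
        self.result = {"encoding": None, "confidence": 0.0}
        self._buffer = b""

    def feed(self, data):
        # Accumulate bytes; the decision is deferred until close().
        self._buffer += data

    def close(self):
        # Inspect the start of the buffered data for a known BOM.
        for bom, name in self._BOMS:
            if self._buffer.startswith(bom):
                self.result = {"encoding": name, "confidence": 1.0}
                break
        return self.result


def detect(data):
    """Convenience wrapper: create a detector, run it, return its result."""
    detector = TinyDetector()
    detector.feed(data)
    detector.close()
    return detector.result
```

Calling `detect(codecs.BOM_UTF8 + b"hello")` on this stand-in reports `UTF-8-SIG`, while input without a BOM leaves the encoding as `None`. A real detector would go on to try the other categories of encodings at that point.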
There are 5 categories of encodings that UniversalDetector handles:
diff --git a/dip3.css b/dip3.css
index 4de2f62..54fa7d5 100644
--- a/dip3.css
+++ b/dip3.css
@@ -7,10 +7,10 @@ a:link{color:#1b67c9}
a:visited{color:darkorchid}
h1 a,h2 a,h3 a,#nav a{color:inherit !important}
abbr,acronym{letter-spacing:0.1em;text-transform:lowercase;font-variant:small-caps}
-h1,h2,h3,p,ul,ol,#search{margin:1.75em 0;font-size:medium}
+h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
+h1,.nav{display:inline}
h2,h3{clear:both}
form div{float:right}
-form p,form h1{display:inline}
pre{white-space:pre-wrap;padding-left:2.154em;line-height:1.75;border-left:1px dotted}
pre,kbd,code,samp{font-family:Consolas,Inconsolata,Monaco,monospace;font-size:medium;word-spacing:0}
pre a{padding:0.4375em 0;border:0}
@@ -20,7 +20,7 @@ pre a:hover{border:0}
kbd{font-weight:bold}
.prompt{color:#667}/*the neighbor of the beast*/
td pre{margin:0;padding:0;border:0}
-li ol{margin:0 inherit}
+li ol{margin:0}
.c{text-align:center;font-size:small}
p.fancy:first-letter{float:left;background:transparent;color:gainsboro;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
blockquote.q{margin:auto;text-align:right;font-style:oblique}
@@ -38,9 +38,9 @@ span,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style
table.simple th{font-family:inherit !important}
.fr{width:100%;border:1px dotted}
.fr h4{margin-top:-1.2em;margin-left:-1em;width:8.5em;border:1px dotted;padding: 3px 3px 3px 13px;background:#fff;color:inherit;position:relative}
-.hover{background:#eee !important;color:inherit !important;cursor:default !important}
+.hover{background:#eee;color:inherit;cursor:default}
body{counter-reset:h1}
-h1:before{counter-increment:h1;content:"Chapter " counter(h1) ". "}
+h1:before{content:"Chapter " counter(h1) ". "}
h1{counter-reset:h2}
h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
h2{counter-reset:h3}
diff --git a/index.html b/index.html
index eb572c8..a04e0d1 100644
--- a/index.html
+++ b/index.html
@@ -8,7 +8,7 @@
❝ Don’t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. ❞
— Ven. Henepola Gunaratana