<!DOCTYPE html>
<head>
<meta charset=utf-8>
<title>Case study: porting chardet to Python 3 - Dive into Python 3</title>
<!--[if IE]><script src=html5.js></script><![endif]-->
<link rel=stylesheet href=dip3.css>
<style>
body{counter-reset:h1 20}
ins,del{line-height:2.154;text-decoration:none;font-style:normal;display:inline-block;width:100%}
ins{background:#9f9}
del{background:#f87}
</style>
<link rel=stylesheet media='only screen and (max-device-width: 480px)' href=mobile.css>
<link rel=stylesheet media=print href=print.css>
<meta name=viewport content='initial-scale=1.0'>
</head>
<p id=level>Difficulty level: <span title=pro>♦♦♦♦♦</span>
<h1>Case Study: Porting <code>chardet</code> to Python 3</h1>
<blockquote class=q>
<p><span>❝</span> Words, words. They’re all we have to go on. <span>❞</span><br>— <a href=http://www.imdb.com/title/tt0100519/quotes>Rosencrantz and Guildenstern are Dead</a>
</blockquote>
<p id=toc>
<h2 id=divingin>Diving In</h2>
<p class=f>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
<p>I’d also like a pony.
<p>A Unicode pony.
<p>A Unipony, as it were.
<p>I’ll settle for character encoding auto-detection.

<h2 id=faq.what>What is Character Encoding Auto-Detection?</h2>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.

<h3 id=faq.impossible>Isn’t That Impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed almost entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.

<h3 id=faq.who>Does Such An Algorithm Exist?</h3>
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.

<h2 id=divingin2>Introducing The <code>chardet</code> Module</h2>
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
<p>Before we set off porting the code, it would help if you understood how the code works! This is a brief guide to navigating the code itself.
<aside>Encoding detection is really language detection in drag.</aside>
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that’s really just a convenience function that creates a <code>UniversalDetector</code> object, feeds it the input, and returns its result.)
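<p>As a rough illustration of that relationship, here is a minimal sketch of a convenience wrapper around a detector object. <code>ToyDetector</code> is a hypothetical stand-in invented for this example, not the real <code>UniversalDetector</code>; its <code>close()</code> method and <code>result</code> attribute are assumptions about the interface, and its “detection” rule is deliberately trivial.

```python
class ToyDetector:
    """Hypothetical stand-in for UniversalDetector; not the real class."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Start from a clean slate so the object can be reused.
        self.result = {"encoding": None, "confidence": 0.0}
        self._seen_high_bit = False

    def feed(self, data):
        # Toy rule: remember whether any byte has its high bit set.
        if any(byte >= 0x80 for byte in data):
            self._seen_high_bit = True

    def close(self):
        # Toy rule: pure 7-bit input is reported as ASCII,
        # anything else is left as "unknown".
        if not self._seen_high_bit:
            self.result = {"encoding": "ascii", "confidence": 1.0}
        return self.result


def detect(data):
    """Convenience wrapper: create a detector, feed it once, return its result."""
    detector = ToyDetector()
    detector.reset()
    detector.feed(data)
    detector.close()
    return detector.result
```

<p>The wrapper adds no logic of its own; all the interesting work happens inside the detector class, which is why the rest of this section concentrates on it.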
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
<li><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr>. This includes <code>UTF-8</code>, both <abbr title="Big Endian">BE</abbr> and <abbr title="Little Endian">LE</abbr> variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr title="Byte Order Mark">BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
</ol>
<h3 id=how.bom><code>UTF-n</code> With A <abbr title="Byte Order Mark">BOM</abbr></h3>
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that’s what it’s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
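<p>A <abbr title="Byte Order Mark">BOM</abbr> check can be sketched in a few lines using the constants in Python’s standard <code>codecs</code> module. This is an illustration rather than the code <code>UniversalDetector</code> actually runs: the function name <code>sniff_bom</code> and the returned labels are made up for this example, and only the two standard byte orders of <code>UTF-32</code> are covered.

```python
import codecs

# Check the longest signatures first: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE), so order matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None
```

<p>Because the answer depends only on the first few bytes, a check like this can return immediately without looking at the rest of the text.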
<h3 id=how.esc>Escaped Encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
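<p>The “feed every machine one byte at a time, drop the ones that fail” loop can be sketched as follows. Everything here is a simplification invented for illustration: <code>SignatureMachine</code> and <code>probe_escaped</code> are hypothetical names, each machine looks for a single designator sequence instead of modelling the whole encoding as the tables in <code>escsm.py</code> do, and the matching is deliberately naive.

```python
ESC = b"\x1b"

# One well-known designator sequence per encoding (abbreviated on purpose).
SIGNATURES = {
    "ISO-2022-JP": ESC + b"$B",
    "ISO-2022-KR": ESC + b"$)C",
    "HZ-GB-2312": b"~{",
}


class SignatureMachine:
    """Toy state machine: advances while the input matches one signature."""

    def __init__(self, name, signature):
        self.name = name
        self.signature = signature
        self.matched = 0      # number of signature bytes matched so far
        self.active = True    # set to False once the machine is dropped

    def next_state(self, byte):
        if byte >= 0x80:
            return "error"    # escaped encodings are pure 7-bit
        if byte == self.signature[self.matched]:
            self.matched += 1
            if self.matched == len(self.signature):
                return "its_me"
        else:
            # Naive restart: this byte may itself begin a new match.
            self.matched = 1 if byte == self.signature[0] else 0
        return "start"


def probe_escaped(data):
    """Return the first encoding whose machine reports a match, else None."""
    machines = [SignatureMachine(n, s) for n, s in SIGNATURES.items()]
    for byte in data:
        for machine in machines:
            if not machine.active:
                continue      # dropped earlier; skip it from now on
            state = machine.next_state(byte)
            if state == "error":
                machine.active = False
            elif state == "its_me":
                return machine.name
    return None
```

<p>For example, <code>probe_escaped(b"abc\x1b$B8l\x1b(B")</code> stops as soon as the <code>ISO-2022-JP</code> machine sees its designator, while any input containing a high-bit byte knocks out all three machines.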
<h3 id=how.mb>Multi-Byte Encodings</h3>
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
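<p>A “high-bit character” is simply a byte whose value is <code>0x80</code> or above, that is, anything outside 7-bit <abbr>ASCII</abbr>. One way to test for that, shown here as a sketch with a made-up function name, is a regular expression over bytes:

```python
import re

# Matches any single byte in the range 0x80-0xFF.
HIGH_BIT = re.compile(rb"[\x80-\xff]")


def has_high_bit(data):
    """Return True if data contains at least one non-ASCII byte."""
    return HIGH_BIT.search(data) is not None
```

<p>Text for which this returns <code>False</code> is pure 7-bit, so there is nothing for the multi-byte and single-byte probers to examine.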
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
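<p>To make the confidence calculation concrete, here is one plausible way to combine those three quantities. Treat it as a sketch under stated assumptions: the function name, the <code>SURE_NO</code> floor, and the minimum-data cutoff are chosen for this example, and the exact formula in <code>chardistribution.py</code> may differ. Only the <code>SURE_YES = 0.99</code> ceiling is taken from that file.

```python
SURE_YES = 0.99             # ceiling, as defined in chardistribution.py
SURE_NO = 0.01              # assumed floor for "almost certainly not"
MINIMUM_DATA_THRESHOLD = 3  # assumed cutoff: too few hits to judge


def distribution_confidence(freq_chars, total_chars, typical_ratio):
    """Estimate confidence from character-frequency counts.

    freq_chars    -- characters seen that the model marks as frequent
    total_chars   -- all characters seen so far
    typical_ratio -- language-specific ratio of frequent to infrequent
                     characters expected in typical text
    """
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO
    if total_chars != freq_chars:
        # Observed frequent/infrequent ratio, scaled by the expected one.
        ratio = freq_chars / ((total_chars - freq_chars) * typical_ratio)
        if ratio < SURE_YES:
            return ratio
    # Either every character was frequent or the ratio exceeded the cap.
    return SURE_YES
```

<p>The idea is that text in the right encoding should contain roughly as many frequent characters as the model predicts; text decoded with the wrong encoding tends to land on rare characters and so scores low.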
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, the context analyzer returns a confidence level to <code>SJISProber</code>, which compares it with the confidence from the distribution analyzer and returns the higher of the two to <code>MBCSGroupProber</code>.
<h3 id=how.sb>Single-Byte Encodings</h3>
<aside>Seriously, where’s my Unicode pony?</aside>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
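<p>The tallying step is easy to picture with a toy model. The sketch below works on decoded text and a hand-written set of “frequent” pairs so that it stays readable; the real prober works on raw bytes and a much larger per-language table, and it also factors in a distribution ratio that is left out here. The function name and the scoring rule are assumptions made for this example.

```python
from collections import Counter


def bigram_confidence(text, frequent_pairs):
    """Return the share of adjacent character pairs found in frequent_pairs."""
    # Tally every pair of neighbouring characters in the text.
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    hits = sum(
        count for pair, count in pairs.items()
        if "".join(pair) in frequent_pairs
    )
    return hits / total
```

<p>Run the same bytes through one such model per encoding and language, and the model whose frequent pairs show up most often is the best guess.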
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
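<p>The final-letter trick can be sketched on decoded text. Five Hebrew letters have a distinct final form, so in logically ordered text those forms cluster at the ends of words, and in visually ordered (reversed) text they cluster at the beginnings. The function below is a simplified illustration with a made-up name; the scoring in <code>hebrewprober.py</code> is more nuanced and works on bytes.

```python
# Final kaf, final mem, final nun, final pe, final tsadi.
FINAL_FORMS = set("\u05da\u05dd\u05df\u05e3\u05e5")


def guess_hebrew_order(text):
    """Guess the storage order of Hebrew text from final-letter positions."""
    logical = visual = 0
    for word in text.split():
        if word[-1] in FINAL_FORMS:
            logical += 1   # final form where a word ends: reading order
        if word[0] in FINAL_FORMS:
            visual += 1    # final form where a word starts: reversed order
    if logical > visual:
        return "windows-1255"   # Logical Hebrew
    if visual > logical:
        return "ISO-8859-8"     # Visual Hebrew
    return None                 # no evidence either way
```

<p>Reversing a line of logical text flips the verdict, which is exactly the asymmetry the prober relies on.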
<h3 id=how.windows1252><code>windows-1252</code></h3>
<p>If <code>UniversalDetector</code> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <code>Latin1Prober</code> (defined in <code>latin1prober.py</code>) to try to detect English text in a <code>windows-1252</code> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <code>windows-1252</code> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <code>Latin1Prober</code> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
<h2 id=running2to3>Running <code>2to3</code></h2>
<p>We’re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy (a function was renamed or moved to a different module), but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we’ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code>chardet</code> package is split across several different files, all in the same directory. The <code>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code>2to3</code> will convert each of the files in turn.
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- chardet\__init__.py (original)
+++ chardet\__init__.py (refactored)
@@ -18,7 +18,7 @@
__version__ = "1.0.1"

def detect(aBuf):
<del>- import universaldetector</del>
<ins>+ from . import universaldetector</ins>
u = universaldetector.UniversalDetector()
u.reset()
u.feed(aBuf)
--- chardet\big5prober.py (original)
+++ chardet\big5prober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import Big5DistributionAnalysis</del>
<del>-from mbcssm import Big5SMModel</del>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import Big5DistributionAnalysis</ins>
<ins>+from .mbcssm import Big5SMModel</ins>

class Big5Prober(MultiByteCharSetProber):
def __init__(self):
--- chardet\chardistribution.py (original)
+++ chardet\chardistribution.py (refactored)
@@ -25,12 +25,12 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<del>-from euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO</del>
<del>-from euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO</del>
<del>-from gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO</del>
<del>-from big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO</del>
<del>-from jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO</del>
<ins>+from . import constants</ins>
<ins>+from .euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO</ins>
<ins>+from .euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO</ins>
<ins>+from .gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO</ins>
<ins>+from .big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO</ins>
<ins>+from .jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO</ins>

ENOUGH_DATA_THRESHOLD = 1024
SURE_YES = 0.99
--- chardet\charsetgroupprober.py (original)
+++ chardet\charsetgroupprober.py (refactored)
@@ -26,7 +26,7 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from charsetprober import CharSetProber</del>
<ins>+from .charsetprober import CharSetProber</ins>

class CharSetGroupProber(CharSetProber):
def __init__(self):
--- chardet\codingstatemachine.py (original)
+++ chardet\codingstatemachine.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from constants import eStart, eError, eItsMe</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>

class CodingStateMachine:
def __init__(self, sm):
--- chardet\constants.py (original)
+++ chardet\constants.py (refactored)
@@ -38,10 +38,10 @@

SHORTCUT_THRESHOLD = 0.95

<del>-import __builtin__</del>
<ins>+import builtins</ins>
if not hasattr(__builtin__, 'False'):
False = 0
True = 1
else:
<del>- False = __builtin__.False</del>
<del>- True = __builtin__.True</del>
<ins>+ False = builtins.False</ins>
<ins>+ True = builtins.True</ins>
--- chardet\escprober.py (original)
+++ chardet\escprober.py (refactored)
@@ -26,9 +26,9 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from escsm import HZSMModel, ISO2022CNSMModel, ISO2022JPSMModel, ISO2022KRSMModel</del>
<del>-from charsetprober import CharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<ins>+from .escsm import HZSMModel, ISO2022CNSMModel, ISO2022JPSMModel, ISO2022KRSMModel</ins>
<ins>+from .charsetprober import CharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>

class EscCharSetProber(CharSetProber):
def __init__(self):
--- chardet\escsm.py (original)
+++ chardet\escsm.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from constants import eStart, eError, eItsMe</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>

HZ_cls = ( \
1,0,0,0,0,0,0,0, # 00 - 07
--- chardet\eucjpprober.py (original)
+++ chardet\eucjpprober.py (refactored)
@@ -26,12 +26,12 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from constants import eStart, eError, eItsMe</del>
<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import EUCJPDistributionAnalysis</del>
<del>-from jpcntx import EUCJPContextAnalysis</del>
<del>-from mbcssm import EUCJPSMModel</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import EUCJPDistributionAnalysis</ins>
<ins>+from .jpcntx import EUCJPContextAnalysis</ins>
<ins>+from .mbcssm import EUCJPSMModel</ins>

class EUCJPProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\euckrprober.py (original)
+++ chardet\euckrprober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import EUCKRDistributionAnalysis</del>
<del>-from mbcssm import EUCKRSMModel</del>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import EUCKRDistributionAnalysis</ins>
<ins>+from .mbcssm import EUCKRSMModel</ins>

class EUCKRProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\euctwprober.py (original)
+++ chardet\euctwprober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import EUCTWDistributionAnalysis</del>
<del>-from mbcssm import EUCTWSMModel</del>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import EUCTWDistributionAnalysis</ins>
<ins>+from .mbcssm import EUCTWSMModel</ins>

class EUCTWProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\gb2312prober.py (original)
+++ chardet\gb2312prober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import GB2312DistributionAnalysis</del>
<del>-from mbcssm import GB2312SMModel</del>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import GB2312DistributionAnalysis</ins>
<ins>+from .mbcssm import GB2312SMModel</ins>

class GB2312Prober(MultiByteCharSetProber):
def __init__(self):
--- chardet\hebrewprober.py (original)
+++ chardet\hebrewprober.py (refactored)
@@ -25,8 +25,8 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from charsetprober import CharSetProber</del>
<del>-import constants</del>
<ins>+from .charsetprober import CharSetProber</ins>
<ins>+from . import constants</ins>

# This prober doesn't actually recognize a language or a charset.
# It is a helper prober for the use of the Hebrew model probers
--- chardet\jpcntx.py (original)
+++ chardet\jpcntx.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

NUM_OF_CATEGORY = 6
DONT_KNOW = -1
--- chardet\langbulgarianmodel.py (original)
+++ chardet\langbulgarianmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langcyrillicmodel.py (original)
+++ chardet\langcyrillicmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# KOI8-R language model
# Character Mapping Table:
--- chardet\langgreekmodel.py (original)
+++ chardet\langgreekmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langhebrewmodel.py (original)
+++ chardet\langhebrewmodel.py (refactored)
@@ -27,7 +27,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langhungarianmodel.py (original)
+++ chardet\langhungarianmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langthaimodel.py (original)
+++ chardet\langthaimodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-import constants</del>
<ins>+from . import constants</ins>

# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\latin1prober.py (original)
+++ chardet\latin1prober.py (refactored)
@@ -26,8 +26,8 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from charsetprober import CharSetProber</del>
<del>-import constants</del>
<ins>+from .charsetprober import CharSetProber</ins>
<ins>+from . import constants</ins>
import operator

FREQ_CAT_NUM = 4
--- chardet\mbcharsetprober.py (original)
+++ chardet\mbcharsetprober.py (refactored)
@@ -28,8 +28,8 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from constants import eStart, eError, eItsMe</del>
<del>-from charsetprober import CharSetProber</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>
<ins>+from .charsetprober import CharSetProber</ins>

class MultiByteCharSetProber(CharSetProber):
def __init__(self):
--- chardet\mbcsgroupprober.py (original)
+++ chardet\mbcsgroupprober.py (refactored)
@@ -27,14 +27,14 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from charsetgroupprober import CharSetGroupProber</del>
<del>-from utf8prober import UTF8Prober</del>
<del>-from sjisprober import SJISProber</del>
<del>-from eucjpprober import EUCJPProber</del>
<del>-from gb2312prober import GB2312Prober</del>
<del>-from euckrprober import EUCKRProber</del>
<del>-from big5prober import Big5Prober</del>
<del>-from euctwprober import EUCTWProber</del>
<ins>+from .charsetgroupprober import CharSetGroupProber</ins>
<ins>+from .utf8prober import UTF8Prober</ins>
<ins>+from .sjisprober import SJISProber</ins>
<ins>+from .eucjpprober import EUCJPProber</ins>
<ins>+from .gb2312prober import GB2312Prober</ins>
<ins>+from .euckrprober import EUCKRProber</ins>
<ins>+from .big5prober import Big5Prober</ins>
<ins>+from .euctwprober import EUCTWProber</ins>

class MBCSGroupProber(CharSetGroupProber):
def __init__(self):
--- chardet\mbcssm.py (original)
+++ chardet\mbcssm.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from constants import eStart, eError, eItsMe</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>

# BIG5

--- chardet\sbcharsetprober.py (original)
+++ chardet\sbcharsetprober.py (refactored)
@@ -27,7 +27,7 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from charsetprober import CharSetProber</del>
<ins>+from .charsetprober import CharSetProber</ins>

SAMPLE_SIZE = 64
SB_ENOUGH_REL_THRESHOLD = 1024
--- chardet\sbcsgroupprober.py (original)
+++ chardet\sbcsgroupprober.py (refactored)
@@ -27,15 +27,15 @@
######################### END LICENSE BLOCK #########################

import constants, sys
<del>-from charsetgroupprober import CharSetGroupProber</del>
<del>-from sbcharsetprober import SingleByteCharSetProber</del>
<del>-from langcyrillicmodel import Win1251CyrillicModel, Koi8rModel, Latin5CyrillicModel, MacCyrillicModel, Ibm866Model, Ibm855Model</del>
<del>-from langgreekmodel import Latin7GreekModel, Win1253GreekModel</del>
<del>-from langbulgarianmodel import Latin5BulgarianModel, Win1251BulgarianModel</del>
<del>-from langhungarianmodel import Latin2HungarianModel, Win1250HungarianModel</del>
<del>-from langthaimodel import TIS620ThaiModel</del>
<del>-from langhebrewmodel import Win1255HebrewModel</del>
<del>-from hebrewprober import HebrewProber</del>
<ins>+from .charsetgroupprober import CharSetGroupProber</ins>
<ins>+from .sbcharsetprober import SingleByteCharSetProber</ins>
<ins>+from .langcyrillicmodel import Win1251CyrillicModel, Koi8rModel, Latin5CyrillicModel, MacCyrillicModel, Ibm866Model, Ibm855Model</ins>
<ins>+from .langgreekmodel import Latin7GreekModel, Win1253GreekModel</ins>
<ins>+from .langbulgarianmodel import Latin5BulgarianModel, Win1251BulgarianModel</ins>
<ins>+from .langhungarianmodel import Latin2HungarianModel, Win1250HungarianModel</ins>
<ins>+from .langthaimodel import TIS620ThaiModel</ins>
<ins>+from .langhebrewmodel import Win1255HebrewModel</ins>
<ins>+from .hebrewprober import HebrewProber</ins>

class SBCSGroupProber(CharSetGroupProber):
def __init__(self):
--- chardet\sjisprober.py (original)
+++ chardet\sjisprober.py (refactored)
@@ -25,13 +25,13 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################

<del>-from mbcharsetprober import MultiByteCharSetProber</del>
<del>-from codingstatemachine import CodingStateMachine</del>
<del>-from chardistribution import SJISDistributionAnalysis</del>
<del>-from jpcntx import SJISContextAnalysis</del>
<del>-from mbcssm import SJISSMModel</del>
<ins>+from .mbcharsetprober import MultiByteCharSetProber</ins>
<ins>+from .codingstatemachine import CodingStateMachine</ins>
<ins>+from .chardistribution import SJISDistributionAnalysis</ins>
<ins>+from .jpcntx import SJISContextAnalysis</ins>
<ins>+from .mbcssm import SJISSMModel</ins>
import constants, sys
<del>-from constants import eStart, eError, eItsMe</del>
<ins>+from .constants import eStart, eError, eItsMe</ins>

class SJISProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\universaldetector.py (original)
|
|
+++ chardet\universaldetector.py (refactored)
|
|
@@ -27,10 +27,10 @@
|
|
######################### END LICENSE BLOCK #########################
|
|
|
|
import constants, sys
|
|
<del>-from latin1prober import Latin1Prober # windows-1252</del>
|
|
<del>-from mbcsgroupprober import MBCSGroupProber # multi-byte character sets</del>
|
|
<del>-from sbcsgroupprober import SBCSGroupProber # single-byte character sets</del>
|
|
<del>-from escprober import EscCharSetProber # ISO-2122, etc.</del>
|
|
<ins>+from .latin1prober import Latin1Prober # windows-1252</ins>
|
|
<ins>+from .mbcsgroupprober import MBCSGroupProber # multi-byte character sets</ins>
|
|
<ins>+from .sbcsgroupprober import SBCSGroupProber # single-byte character sets</ins>
|
|
<ins>+from .escprober import EscCharSetProber # ISO-2122, etc.</ins>
|
|
import re
|
|
|
|
MINIMUM_THRESHOLD = 0.20
|
|
--- chardet\utf8prober.py (original)
|
|
+++ chardet\utf8prober.py (refactored)
|
|
@@ -26,10 +26,10 @@
|
|
######################### END LICENSE BLOCK #########################
|
|
|
|
import constants, sys
|
|
<del>-from constants import eStart, eError, eItsMe</del>
|
|
<del>-from charsetprober import CharSetProber</del>
|
|
<del>-from codingstatemachine import CodingStateMachine</del>
|
|
<del>-from mbcssm import UTF8SMModel</del>
|
|
<ins>+from .constants import eStart, eError, eItsMe</ins>
|
|
<ins>+from .charsetprober import CharSetProber</ins>
|
|
<ins>+from .codingstatemachine import CodingStateMachine</ins>
|
|
<ins>+from .mbcssm import UTF8SMModel</ins>
|
|
|
|
ONE_CHAR_PROB = 0.5
|
|
|
|
RefactoringTool: Files that were modified:
|
|
RefactoringTool: chardet\__init__.py
|
|
RefactoringTool: chardet\big5prober.py
|
|
RefactoringTool: chardet\chardistribution.py
|
|
RefactoringTool: chardet\charsetgroupprober.py
|
|
RefactoringTool: chardet\codingstatemachine.py
|
|
RefactoringTool: chardet\constants.py
|
|
RefactoringTool: chardet\escprober.py
|
|
RefactoringTool: chardet\escsm.py
|
|
RefactoringTool: chardet\eucjpprober.py
|
|
RefactoringTool: chardet\euckrprober.py
|
|
RefactoringTool: chardet\euctwprober.py
|
|
RefactoringTool: chardet\gb2312prober.py
|
|
RefactoringTool: chardet\hebrewprober.py
|
|
RefactoringTool: chardet\jpcntx.py
|
|
RefactoringTool: chardet\langbulgarianmodel.py
|
|
RefactoringTool: chardet\langcyrillicmodel.py
|
|
RefactoringTool: chardet\langgreekmodel.py
|
|
RefactoringTool: chardet\langhebrewmodel.py
|
|
RefactoringTool: chardet\langhungarianmodel.py
|
|
RefactoringTool: chardet\langthaimodel.py
|
|
RefactoringTool: chardet\latin1prober.py
|
|
RefactoringTool: chardet\mbcharsetprober.py
|
|
RefactoringTool: chardet\mbcsgroupprober.py
|
|
RefactoringTool: chardet\mbcssm.py
|
|
RefactoringTool: chardet\sbcharsetprober.py
|
|
RefactoringTool: chardet\sbcsgroupprober.py
|
|
RefactoringTool: chardet\sjisprober.py
|
|
RefactoringTool: chardet\universaldetector.py
|
|
RefactoringTool: chardet\utf8prober.py</samp></pre>
|
|
<p>Now run the <code>2to3</code> script on the testing harness, <code>test.py</code>.
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
|
|
<samp>RefactoringTool: Skipping implicit fixer: buffer
|
|
RefactoringTool: Skipping implicit fixer: idioms
|
|
RefactoringTool: Skipping implicit fixer: set_literal
|
|
RefactoringTool: Skipping implicit fixer: ws_comma
|
|
--- test.py (original)
|
|
+++ test.py (refactored)
|
|
@@ -4,7 +4,7 @@
|
|
count = 0
|
|
u = UniversalDetector()
|
|
for f in glob.glob(sys.argv[1]):
|
|
<del>- print f.ljust(60),</del>
|
|
<ins>+ print(f.ljust(60), end=' ')</ins>
|
|
u.reset()
|
|
for line in file(f, 'rb'):
|
|
u.feed(line)
|
|
@@ -12,8 +12,8 @@
|
|
u.close()
|
|
result = u.result
|
|
if result['encoding']:
|
|
<del>- print result['encoding'], 'with confidence', result['confidence']</del>
|
|
<ins>+ print(result['encoding'], 'with confidence', result['confidence'])</ins>
|
|
else:
|
|
<del>- print '******** no result'</del>
|
|
<ins>+ print('******** no result')</ins>
|
|
count += 1
|
|
<del>-print count, 'tests'</del>
|
|
<ins>+print(count, 'tests')</ins>
|
|
RefactoringTool: Files that were modified:
|
|
RefactoringTool: test.py</samp></pre>
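<p>The <code>print</code> changes in this diff follow a simple pattern: a trailing comma in Python 2 becomes <code>end=' '</code>, and everything else just gains parentheses. Here is a minimal sketch of that pattern. The <code>report()</code> function and the file names are made up for illustration; they are not part of <code>test.py</code>.

```python
import io
from contextlib import redirect_stdout


def report(names):
    """Print each name padded to 12 columns, then a status, on one line."""
    for name in names:
        # Python 2 spelling: print name.ljust(12),
        print(name.ljust(12), end=' ')
        # Python 2 spelling: print 'ok'
        print('ok')


# Capture the output so the exact characters are visible.
buffer = io.StringIO()
with redirect_stdout(buffer):
    report(['a.xml', 'b.xml'])
captured = buffer.getvalue()
print(repr(captured))
```

<p>The <code>end=' '</code> argument replaces the newline that <code>print()</code> would normally append, which is what the trailing comma did in Python 2.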
|
|
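<p>Look closely at the import lines in the library diffs above. In Python 2, <code>from constants import eStart</code> inside a package would quietly find the sibling module <code>constants.py</code>. In Python 3, an import of a sibling module has to say so with a leading dot: <code>from .constants import eStart</code>. That is the change the <code>2to3</code> script made, over and over.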
|
|
<p>Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
|
|
<h2 id=manual>Fixing What <code>2to3</code> Can’t</h2>
|
|
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
|
|
<aside>You do have tests, right?</aside>
|
|
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 1, in <module>
|
|
from chardet.universaldetector import UniversalDetector
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 51
|
|
self.done = constants.False
|
|
^
|
|
SyntaxError: invalid syntax</samp></pre>
|
|
<p>Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can’t use it as a variable name. Let’s look at <code>constants.py</code> to see where it’s defined. Here’s the original version from <code>constants.py</code>, before the <code>2to3</code> script changed it:
|
|
<pre><code>import __builtin__
|
|
if not hasattr(__builtin__, 'False'):
|
|
False = 0
|
|
True = 1
|
|
else:
|
|
False = __builtin__.False
|
|
True = __builtin__.True</code></pre>
|
|
<p>This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3, Python had no built-in Boolean type (<code>bool</code> was added by PEP 285). This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
|
|
<p>However, Python 3 always has a <code>bool</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code>constants.py</code>.
|
|
<p>So this line in <code>universaldetector.py</code>:
|
|
<pre><code>self.done = constants.False</code></pre>
|
|
<p>Becomes
|
|
<pre><code>self.done = False</code></pre>
|
|
<p>Ah, wasn’t that satisfying? The code is shorter and more readable already.
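<p>If you want to convince yourself that this is a parser-level problem and not something specific to <code>chardet</code>, a small experiment will do. This is only a sketch; the <code>compiles()</code> helper is my own invention, and the source strings are the two versions of the line discussed above.

```python
import keyword


def compiles(source):
    """Return True if `source` parses as Python 3, False otherwise."""
    try:
        compile(source, '<example>', 'exec')
        return True
    except SyntaxError:
        return False


# True and False are keywords in Python 3, so the parser refuses to
# treat them as attribute names or as assignment targets.
print(keyword.iskeyword('False'))

old_style = 'self.done = constants.False'
new_style = 'self.done = False'
print(compiles(old_style), compiles(new_style))
```

<p>Because the failure happens at compile time, the module can’t even be imported until every <code>constants.True</code> and <code>constants.False</code> is gone.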
|
|
<h3 id=nomodulenamedconstants>No module named <code>constants</code></h3>
|
|
<p>Time to run <code>test.py</code> again and see how far it gets.
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 1, in <module>
|
|
from chardet.universaldetector import UniversalDetector
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
|
|
import constants, sys
|
|
ImportError: No module named constants</samp></pre>
|
|
<p>What’s that you say? No module named <code>constants</code>? Of course there’s a module named <code>constants</code>. …Oh wait, no there isn’t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports (that is, modules that import other modules within the library). In Python 3, all import statements are absolute by default (see PEP 328). To do relative imports, you need to do something like this instead:
|
|
<pre><code>from . import constants</code></pre>
|
|
<p>But wait. Wasn’t the <code>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code>constants</code> module within the library, and an absolute import of the <code>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the <code>2to3</code> script is not smart enough to split the import statement into two.
|
|
<p>The solution is to split the import statement manually. So this two-in-one import:
|
|
<pre><code>import constants, sys</code></pre>
|
|
<p>Needs to become two separate imports:
|
|
<pre><code>from . import constants
|
|
import sys</code></pre>
|
|
<p>There are variations of this problem scattered throughout the <code>chardet</code> library. In some places it’s “<code>import constants, sys</code>”; in other places, it’s “<code>import constants, re</code>”. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
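<p>To watch the difference between the two spellings outside of <code>chardet</code>, you can build a throwaway package and import from it. This is a sketch under some assumptions: the names <code>demo_pkg</code>, <code>demo_constants</code>, <code>detector</code>, and <code>broken</code> are hypothetical, and the example assumes no top-level module called <code>demo_constants</code> is installed on your system.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny package on disk. All of these names are made up.
root = Path(tempfile.mkdtemp())
pkg = root / 'demo_pkg'
pkg.mkdir()
(pkg / '__init__.py').write_text('')
(pkg / 'demo_constants.py').write_text('eStart = 0\n')

# The fixed form: one relative import, one absolute import.
(pkg / 'detector.py').write_text(
    'from . import demo_constants\n'
    'import sys\n'
    'start_state = demo_constants.eStart\n'
)

# The Python 2 form: both modules named in one absolute import.
(pkg / 'broken.py').write_text('import demo_constants, sys\n')

sys.path.insert(0, str(root))
importlib.invalidate_caches()

detector = importlib.import_module('demo_pkg.detector')
print(detector.start_state)

try:
    importlib.import_module('demo_pkg.broken')
    broken_imported = True
except ImportError as exc:
    broken_imported = False
    print('ImportError:', exc)
```

<p>The second import fails for the same reason <code>import constants, sys</code> did: Python 3 looks for a top-level module and never checks the package’s own directory.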
|
|
<p>Onward!
|
|
<h3 id=namefileisnotdefined>Name <var>'file'</var> is not defined</h3>
|
|
<aside>open() is the new file(). PapayaWhip is the new black.</aside>
|
|
<p>And here we go again, running <code>test.py</code> to try to execute our test cases…
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 9, in <module>
|
|
for line in file(f, 'rb'):
|
|
NameError: name 'file' is not defined</samp></pre>
|
|
<p>This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code>io</code> module (see PEP 3116). I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it’s an alias for <var>io.open()</var>, but never mind that right now.)
|
|
<p>Thus, the simplest solution to the problem of the missing <var>file()</var> is to call <var>open()</var> instead:
|
|
<pre><code>for line in open(f, 'rb'):</code></pre>
|
|
<p>And that’s all I have to say about that.
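<p>Well, almost all. If you’d like to check the claims above for yourself, here is a small sketch. The sample file and its contents are made up, and the <code>with</code> blocks are my addition; <code>test.py</code> itself just loops over the result of <code>open()</code>.

```python
import builtins
import io
import os
import tempfile

# The Python 2 builtin file() is gone; open() remains and is io.open().
print(hasattr(builtins, 'file'))
print(open is io.open)

# Write a small sample file, then iterate over it the way test.py does.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as handle:
    handle.write(b'first\nsecond\n')

with open(path, 'rb') as handle:   # open(), not file()
    lines = [line for line in handle]
os.remove(path)
print(lines)
```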
|
|
<h3 id=cantuseastringpattern>Can’t use a string pattern on a bytes-like object</h3>
|
|
<p>Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 98, in feed
|
|
if self._highBitDetector.search(aBuf):
|
|
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
|
<p>To debug this, let’s see what <var>self._highBitDetector</var> is. It’s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
|
|
<pre><code>class UniversalDetector:
|
|
def __init__(self):
|
|
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
|
|
<p>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128–255.
|
|
<p>And therein lies the problem.
|
|
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
|
|
<pre><code>def feed(self, aBuf):
|
|
.
|
|
.
|
|
.
|
|
if self._mInputState == ePureAscii:
|
|
if self._highBitDetector.search(aBuf):</code></pre>
|
|
<p>And what is <var>aBuf</var>? Let’s backtrack further to a place that calls <code>UniversalDetector.feed()</code>. One place that calls it is the test harness, <code>test.py</code>.
|
|
<pre><code>u = UniversalDetector()
|
|
.
|
|
.
|
|
.
|
|
for line in open(f, 'rb'):
|
|
u.feed(line)</code></pre>
|
|
<aside>Not an array of characters, but an array of bytes.</aside>
|
|
<p>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <code>'b'</code> is for “binary.” Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
|
|
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
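<p>The difference between the two modes is easy to see with a file that contains a single non-<abbr>ASCII</abbr> character. This is a sketch with a made-up sample file; the explicit <code>encoding</code> and <code>newline</code> arguments are there only so the result doesn’t depend on your platform’s defaults.

```python
import os
import tempfile

# "café" plus a newline: five characters, but six bytes in UTF-8,
# because the accented letter takes two bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as handle:
    handle.write('caf\u00e9\n'.encode('utf-8'))

with open(path, 'r', encoding='utf-8', newline='') as handle:
    text_line = handle.readline()    # decoded into a str
with open(path, 'rb') as handle:
    byte_line = handle.readline()    # left alone, as bytes
os.remove(path)

print(type(text_line).__name__, len(text_line))
print(type(byte_line).__name__, len(byte_line))
```

<p>A character encoding detector has to work on the second form, because decoding is exactly the step it is trying to figure out.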
|
|
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
|
|
<pre><code> class UniversalDetector:
|
|
def __init__(self):
|
|
<del>- self._highBitDetector = re.compile(r'[\x80-\xFF]')</del>
|
|
<del>- self._escDetector = re.compile(r'(\033|~{)')</del>
|
|
<ins>+ self._highBitDetector = re.compile(b'[\x80-\xFF]')</ins>
|
|
<ins>+ self._escDetector = re.compile(b'(\033|~{)')</ins>
|
|
self._mEscCharSetProber = None
|
|
self._mCharSetProbers = []
|
|
self.reset()</code></pre>
|
|
<p>Searching the entire codebase for other uses of the <code>re</code> module turns up two more instances, in <code>charsetprober.py</code>. Again, the code is defining regular expressions as strings but executing them on <var>aBuf</var>, which is a byte array. The solution is the same: define the regular expression patterns as byte arrays.
|
|
<pre><code> class CharSetProber:
|
|
.
|
|
.
|
|
.
|
|
def filter_high_bit_only(self, aBuf):
|
|
<del>- aBuf = re.sub(r'([\x00-\x7F])+', ' ', aBuf)</del>
|
|
<ins>+ aBuf = re.sub(b'([\x00-\x7F])+', b' ', aBuf)</ins>
|
|
return aBuf
|
|
|
|
def filter_without_english_letters(self, aBuf):
|
|
<del>- aBuf = re.sub(r'([A-Za-z])+', ' ', aBuf)</del>
|
|
<ins>+ aBuf = re.sub(b'([A-Za-z])+', b' ', aBuf)</ins>
|
|
return aBuf</code></pre>
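<p>You can reproduce both the error and the fix in isolation. The following is a sketch with made-up sample data; the patterns are the same ones shown in the diffs above.

```python
import re

data = b'plain ascii \xe9\xe8 more'

str_pattern = re.compile(r'[\x80-\xFF]')     # compiled from a str
bytes_pattern = re.compile(b'[\x80-\xFF]')   # compiled from bytes

# A str pattern refuses to search a bytes object.
try:
    str_pattern.search(data)
    str_pattern_error = None
except TypeError as exc:
    str_pattern_error = str(exc)
print(str_pattern_error)

# A bytes pattern searches bytes and returns bytes.
match = bytes_pattern.search(data)
print(match.group(), match.start())

# The replacement passed to re.sub() must be bytes as well.
filtered = re.sub(b'([\x00-\x7F])+', b' ', data)
print(filtered)
```

<p>Note that the pattern, the replacement, and the subject all have to agree: mixing a bytes pattern with a string replacement fails in the same way.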
|
|
|
|
<h3 id=cantconvertbytesobject>Can't convert <code>'bytes'</code> object to <code>str</code> implicitly</h3>
|
|
<p>Curiouser and curiouser…
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
|
|
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
|
|
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
|
<p>There’s an unfortunate clash of coding style and Python interpreter here. The <code>TypeError</code> could be anywhere on that line, but the traceback doesn’t tell you exactly where it is. It could be in the first conditional or the second, and the traceback would look the same. To narrow it down, you should split the line in half, like this:
|
|
<pre><code>elif (self._mInputState == ePureAscii) and \
|
|
self._escDetector.search(self._mLastChar + aBuf):</code></pre>
|
|
<p>And re-run the test:
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
|
|
self._escDetector.search(self._mLastChar + aBuf):
|
|
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
|
<p>Aha! The problem was not in the first conditional (<code>self._mInputState == ePureAscii</code>) but in the second one. So what could cause a <code>TypeError</code> there? Perhaps you’re thinking that the <code>search()</code> method is expecting a value of a different type, but that wouldn’t generate this traceback. Python functions can take any value; if you pass the right number of arguments, the function will execute. It may <em>crash</em> if you pass it a value of a different type than it’s expecting, but if that happened, the traceback would point to somewhere inside the function. But this traceback says it never got as far as calling the <code>search()</code> method. So the problem must be in that <code>+</code> operation, as it’s trying to construct the value that it will eventually pass to the <code>search()</code> method.
|
|
<p>We know from <a href=#cantuseastringpattern>previous debugging</a> that <var>aBuf</var> is a byte array. So what is <code>self._mLastChar</code>? It’s an instance variable, defined in the <code>reset()</code> method, which is actually called from the <code>__init__()</code> method.
|
|
<pre><code>class UniversalDetector:
|
|
def __init__(self):
|
|
self._highBitDetector = re.compile(b'[\x80-\xFF]')
|
|
self._escDetector = re.compile(b'(\033|~{)')
|
|
self._mEscCharSetProber = None
|
|
self._mCharSetProbers = []
|
|
<mark> self.reset()</mark>
|
|
|
|
def reset(self):
|
|
self.result = {'encoding': None, 'confidence': 0.0}
|
|
self.done = False
|
|
self._mStart = True
|
|
self._mGotData = False
|
|
self._mInputState = ePureAscii
|
|
<mark> self._mLastChar = ''</mark></code></pre>
|
|
<p>And now we have our answer. Do you see it? <var>self._mLastChar</var> is a string, but <var>aBuf</var> is a byte array. And you can’t concatenate a string to a byte array — not even a zero-length string.
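<p>If the “not even a zero-length string” part sounds too strict to be true, here is a quick sketch that tries both combinations. The <code>concat_or_none()</code> helper is hypothetical and exists only to keep the example short.

```python
a_buf = b'\xef\xbb\xbf'


def concat_or_none(prefix, buf):
    """Return prefix + buf, or None if Python refuses to mix the types."""
    try:
        return prefix + buf
    except TypeError:
        return None


print(concat_or_none('', a_buf))    # str + bytes: refused, even when empty
print(concat_or_none(b'', a_buf))   # bytes + bytes: fine
```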
|
|
<p>So what is <var>self._mLastChar</var> anyway? The answer is in the <code>feed()</code> method, just a few lines down from where the traceback occurred.
|
|
<pre><code>if self._mInputState == ePureAscii:
|
|
if self._highBitDetector.search(aBuf):
|
|
self._mInputState = eHighbyte
|
|
elif (self._mInputState == ePureAscii) and \
|
|
self._escDetector.search(self._mLastChar + aBuf):
|
|
self._mInputState = eEscAscii
|
|
|
|
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
|
|
<p>The calling function calls this <code>feed()</code> method over and over again with a few bytes at a time. The method processes the bytes it was given (passed in as <var>aBuf</var>), then stores the last byte in <var>self._mLastChar</var> in case it’s needed during the next call. (In a multi-byte encoding, the <code>feed()</code> method might get called with half of a character, then called again with the other half.) But because <var>aBuf</var> is now a byte array instead of a string, <var>self._mLastChar</var> needs to be a byte array as well. Thus:
|
|
<pre><code> def reset(self):
|
|
.
|
|
.
|
|
.
|
|
<del>- self._mLastChar = ''</del>
|
|
<ins>+ self._mLastChar = b''</ins></code></pre>
|
|
<p>Searching the entire codebase for “<code>mLastChar</code>” turns up a similar problem in <code>mbcharsetprober.py</code>, but instead of tracking the last character, it tracks the last <em>two</em> characters. The <code>MultiByteCharSetProber</code> class uses a list of 1-character strings to track the last two characters; in Python 3, it needs to use a list of integers.
|
|
<pre><code>
|
|
class MultiByteCharSetProber(CharSetProber):
|
|
def __init__(self):
|
|
CharSetProber.__init__(self)
|
|
self._mDistributionAnalyzer = None
|
|
self._mCodingSM = None
|
|
<del>- self._mLastChar = ['\x00', '\x00']</del>
|
|
<ins>+ self._mLastChar = [0, 0]</ins>
|
|
|
|
def reset(self):
|
|
CharSetProber.reset(self)
|
|
if self._mCodingSM:
|
|
self._mCodingSM.reset()
|
|
if self._mDistributionAnalyzer:
|
|
self._mDistributionAnalyzer.reset()
|
|
<del>- self._mLastChar = ['\x00', '\x00']</del>
|
|
<ins>+ self._mLastChar = [0, 0]</ins></code></pre>
|
|
<h3 id=unsupportedoperandtypeforplus>Unsupported operand type(s) for +: <code>'int'</code> and <code>'bytes'</code></h3>
|
|
<p>I have good news, and I have bad news. The good news is we’re making progress…
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 101, in feed
|
|
self._escDetector.search(self._mLastChar + aBuf):
|
|
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp></pre>
|
|
<p>…The bad news is it doesn’t always feel like progress.
|
|
<p>But this is progress! Really! Even though the traceback calls out the same line of code, it’s a different error than it used to be. Progress! So what’s the problem now? The last time I checked, this line of code didn’t try to concatenate an <code>int</code> with a byte array (<code>bytes</code>). In fact, you just spent a lot of time <a href=#cantconvertbytesobject>ensuring that <var>self._mLastChar</var> was a byte array</a>. How did it turn into an <code>int</code>?
|
|
<p>The answer lies not in the previous lines of code, but in the following lines.
|
|
<pre><code>if self._mInputState == ePureAscii:
|
|
if self._highBitDetector.search(aBuf):
|
|
self._mInputState = eHighbyte
|
|
elif (self._mInputState == ePureAscii) and \
|
|
self._escDetector.search(self._mLastChar + aBuf):
|
|
self._mInputState = eEscAscii
|
|
|
|
<mark>self._mLastChar = aBuf[-1]</mark></code></pre>
|
|
<aside>Each item in a string is a string. Each item in a byte array is an integer.</aside>
|
|
<p>This error doesn’t occur the first time the <code>feed()</code> method gets called; it occurs the <em>second time</em>, after <var>self._mLastChar</var> has been set to the last byte of <var>aBuf</var>. Well, what’s the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:
|
|
<pre class=screen>
|
|
<a><samp class=p>>>> </samp><kbd>aBuf = b'\xEF\xBB\xBF'</kbd> <span>①</span></a>
|
|
<samp class=p>>>> </samp><kbd>len(aBuf)</kbd>
|
|
<samp>3</samp>
|
|
<samp class=p>>>> </samp><kbd>mLastChar = aBuf[-1]</kbd>
|
|
<a><samp class=p>>>> </samp><kbd>mLastChar</kbd> <span>②</span></a>
|
|
<samp>191</samp>
|
|
<a><samp class=p>>>> </samp><kbd>type(mLastChar)</kbd> <span>③</span></a>
|
|
<samp><class 'int'></samp>
|
|
<a><samp class=p>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>④</span></a>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "<stdin>", line 1, in <module>
|
|
TypeError: unsupported operand type(s) for +: 'int' and 'bytes'</samp>
|
|
<a><samp class=p>>>> </samp><kbd>mLastChar = aBuf[-1:]</kbd> <span>⑤</span></a>
|
|
<samp class=p>>>> </samp><kbd>mLastChar</kbd>
|
|
<samp>b'\xbf'</samp>
|
|
<a><samp class=p>>>> </samp><kbd>mLastChar + aBuf</kbd> <span>⑥</span></a>
|
|
<samp>b'\xbf\xef\xbb\xbf'</samp></pre>
|
|
<ol>
|
|
<li>Define a byte array of length 3.
|
|
<li>The last element of the byte array is 191.
|
|
<li>That’s an integer.
|
|
<li>Concatenating an integer with a byte array doesn’t work. You’ve now replicated the error you just found in <code>universaldetector.py</code>.
|
|
<li>Ah, here’s the fix. Instead of taking the last element of the byte array, use <a href=native-datatypes.html#slicinglists>list slicing</a> to create a new byte array containing just the last element. That is, start with the last element and continue the slice until the end of the byte array. Now <var>mLastChar</var> is a byte array of length 1.
|
|
<li>Concatenating a byte array of length 1 with a byte array of length 3 returns a new byte array of length 4.
|
|
</ol>
|
|
<p>So, to ensure that the <code>feed()</code> method in <code>universaldetector.py</code> continues to work no matter how often it’s called, you need to <a href=#cantconvertbytesobject>initialize <var>self._mLastChar</var> as a 0-length byte array</a>, then <em>make sure it stays a byte array</em>.
|
|
<pre><code> self._escDetector.search(self._mLastChar + aBuf):
|
|
self._mInputState = eEscAscii
|
|
|
|
<del>- self._mLastChar = aBuf[-1]</del>
|
|
<ins>+ self._mLastChar = aBuf[-1:]</ins></code></pre>
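<p>Put together, the carry-over logic looks something like the sketch below. The <code>EscapeSpotter</code> class is a hypothetical, stripped-down stand-in that I wrote for illustration; it is not the real <code>UniversalDetector</code>, and it keeps only the escape-sequence check.

```python
import re


class EscapeSpotter:
    """Hypothetical, stripped-down stand-in for the escape check in feed()."""

    def __init__(self):
        self._esc_detector = re.compile(b'(\033|~{)')
        self._last_byte = b''      # zero-length bytes, not ''
        self.found_escape = False

    def feed(self, buf):
        # Prepend the byte carried over from the previous call, so a
        # two-byte marker split across two calls is still seen.
        if self._esc_detector.search(self._last_byte + buf):
            self.found_escape = True
        # A slice keeps the type: buf[-1:] is bytes, buf[-1] would be an int.
        self._last_byte = buf[-1:]


spotter = EscapeSpotter()
spotter.feed(b'hello ~')   # ends halfway through the "~{" marker
spotter.feed(b'{ world')   # starts with the other half
print(spotter.found_escape)
```

<p>A side benefit of the slice is that it also copes with an empty buffer: <code>b''[-1:]</code> is just <code>b''</code>, whereas <code>b''[-1]</code> would raise an <code>IndexError</code>.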
|
|
<h3 id=ordexpectedstring><code>ord()</code> expected string of length 1, but <code>int</code> found</h3>
|
|
<p>Tired yet? You’re almost there…
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
|
|
tests\Big5\0804.blogspot.com.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 116, in feed
|
|
if prober.feed(aBuf) == constants.eFoundIt:
|
|
File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
|
|
st = prober.feed(aBuf)
|
|
File "C:\home\chardet\chardet\utf8prober.py", line 53, in feed
|
|
codingState = self._mCodingSM.next_state(c)
|
|
File "C:\home\chardet\chardet\codingstatemachine.py", line 43, in next_state
|
|
byteCls = self._mModel['classTable'][ord(c)]
|
|
TypeError: ord() expected string of length 1, but int found</samp></pre>
|
|
<p>OK, so <var>c</var> is an <code>int</code>, but the <code>ord()</code> function was expecting a 1-character string. Fair enough. Where is <var>c</var> defined?
|
|
<pre><code># codingstatemachine.py
|
|
def next_state(self, c):
|
|
# for each byte we get its class
|
|
# if it is first byte, we also get byte length
|
|
byteCls = self._mModel['classTable'][ord(c)]</code></pre>
|
|
<p>That’s no help; it’s just passed into the function. Let’s pop the stack.
|
|
<pre><code># utf8prober.py
|
|
def feed(self, aBuf):
|
|
for c in aBuf:
|
|
codingState = self._mCodingSM.next_state(c)</code></pre>
|
|
<p>And now we have the answer. Do you see it? In Python 2, <var>aBuf</var> was a string, so <var>c</var> was a 1-character string. (That’s what you get when you iterate over a string — all the characters, one by one.) But now, <var>aBuf</var> is a byte array, so <var>c</var> is an <code>int</code>, not a 1-character string. In other words, there’s no need to call the <code>ord()</code> function because <var>c</var> is already an <code>int</code>!
|
|
<p>Thus:
|
|
<pre><code> def next_state(self, c):
|
|
# for each byte we get its class
|
|
# if it is first byte, we also get byte length
|
|
<del>- byteCls = self._mModel['classTable'][ord(c)]</del>
|
|
<ins>+ byteCls = self._mModel['classTable'][c]</ins></code></pre>
|
|
<p>Searching the entire codebase for instances of “<code>ord(c)</code>” uncovers similar problems in <code>sbcharsetprober.py</code>…
|
|
<pre><code># sbcharsetprober.py
|
|
def feed(self, aBuf):
|
|
if not self._mModel['keepEnglishLetter']:
|
|
aBuf = self.filter_without_english_letters(aBuf)
|
|
aLen = len(aBuf)
|
|
if not aLen:
|
|
return self.get_state()
|
|
for c in aBuf:
|
|
<mark> order = self._mModel['charToOrderMap'][ord(c)]</mark></code></pre>
|
|
<p>…and <code>latin1prober.py</code>…
|
|
<pre><code># latin1prober.py
|
|
def feed(self, aBuf):
|
|
aBuf = self.filter_with_english_letters(aBuf)
|
|
for c in aBuf:
|
|
<mark> charClass = Latin1_CharToClass[ord(c)]</mark></code></pre>
|
|
<p><var>c</var> is iterating over <var>aBuf</var>, which means it is an integer, not a 1-character string. The solution is the same: change <code>ord(c)</code> to just plain <code>c</code>.
|
|
<pre><code> # sbcharsetprober.py
|
|
def feed(self, aBuf):
|
|
if not self._mModel['keepEnglishLetter']:
|
|
aBuf = self.filter_without_english_letters(aBuf)
|
|
aLen = len(aBuf)
|
|
if not aLen:
|
|
return self.get_state()
|
|
for c in aBuf:
|
|
<del>- order = self._mModel['charToOrderMap'][ord(c)]</del>
|
|
<ins>+ order = self._mModel['charToOrderMap'][c]</ins>
|
|
|
|
# latin1prober.py
|
|
def feed(self, aBuf):
|
|
aBuf = self.filter_with_english_letters(aBuf)
|
|
for c in aBuf:
|
|
<del>- charClass = Latin1_CharToClass[ord(c)]</del>
|
|
<ins>+ charClass = Latin1_CharToClass[c]</ins>
|
|
</code></pre>
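<p>Here is the same behavior in miniature. This is a sketch with made-up data, and <code>class_table</code> is a hypothetical 256-entry lookup table standing in for things like <code>Latin1_CharToClass</code>.

```python
data = b'A\xe4'

# Iterating over bytes yields ints, one per byte.
items = [c for c in data]
print(items)

# Calling ord() on something that is already an int is an error.
try:
    ord(data[0])
    ord_failed = False
except TypeError:
    ord_failed = True
print(ord_failed)

# So the int can index a lookup table directly.
class_table = [0] * 128 + [1] * 128   # hypothetical: 0 = ASCII, 1 = high bit
classes = [class_table[c] for c in data]
print(classes)
```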
|
|
<h3 id=unorderabletypes>Unorderable types: <code>int()</code> >= <code>str()</code></h3>
|
|
<p>Let’s go again.
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
|
|
tests\Big5\0804.blogspot.com.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in <module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 116, in feed
|
|
if prober.feed(aBuf) == constants.eFoundIt:
|
|
File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
|
|
st = prober.feed(aBuf)
|
|
File "C:\home\chardet\chardet\sjisprober.py", line 68, in feed
|
|
self._mContextAnalyzer.feed(self._mLastChar[2 - charLen :], charLen)
|
|
File "C:\home\chardet\chardet\jpcntx.py", line 145, in feed
|
|
order, charLen = self.get_order(aBuf[i:i+2])
|
|
File "C:\home\chardet\chardet\jpcntx.py", line 176, in get_order
|
|
if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \
|
|
TypeError: unorderable types: int() >= str()</samp></pre>
|
|
<p>Did you notice? This time around, the code passed the first test case (<code>tests\ascii\howto.diveintomark.org.xml</code>). You’re making real progress here.
|
|
<p>So what’s this all about? “Unorderable types”? Once again, the difference between byte arrays and strings is rearing its ugly head. Take a look at the code:
|
|
<pre><code>class SJISContextAnalysis(JapaneseContextAnalysis):
|
|
def get_order(self, aStr):
|
|
if not aStr: return -1, 1
|
|
# find out current char's byte length
|
|
<mark> if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \</mark>
|
|
((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):
|
|
charLen = 2
|
|
else:
|
|
charLen = 1</code></pre>
|
|
<p>And where does <var>aStr</var> come from? Let’s pop the stack:
|
|
<pre><code>def feed(self, aBuf, aLen):
|
|
.
|
|
.
|
|
.
|
|
i = self._mNeedToSkipCharNum
|
|
while i < aLen:
|
|
<mark> order, charLen = self.get_order(aBuf[i:i+2])</mark></code></pre>
|
|
<p>Oh look, it’s our old friend, <var>aBuf</var>. As you might have guessed from every other issue we’ve encountered in this chapter, <var>aBuf</var> is a byte array. Here, the <code>feed()</code> method isn’t just passing it on wholesale; it’s slicing it. But as you saw <a href=#unsupportedoperandtypeforplus>earlier in this chapter</a>, slicing a byte array returns a byte array, so the <var>aStr</var> parameter that gets passed to the <code>get_order()</code> method is still a byte array.
|
|
<p>And what is this code trying to do with <var>aStr</var>? It’s taking the first element of the byte array and comparing it to a string of length 1. In Python 2, that worked, because <var>aStr</var> and <var>aBuf</var> were strings, and <var>aStr[0]</var> would be a string, and you can compare two strings with <code>&gt;=</code> and <code>&lt;=</code>. But in Python 3, <var>aStr</var> and <var>aBuf</var> are byte arrays, <var>aStr[0]</var> is an integer, and you can’t use ordering comparisons between an integer and a string without explicitly coercing one of them.
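<p>If you want to see the difference for yourself, here is a minimal sketch you can run under Python 3. The two-byte sample value and the variable names are assumptions chosen for illustration; they are not taken from the <code>chardet</code> test suite.

```python
# A two-byte sample; the value is illustrative only.
a_buf = b'\x82\xa0'

# Slicing a bytes object returns another bytes object...
a_str = a_buf[0:2]
slice_type = type(a_str)

# ...but indexing it returns an integer between 0 and 255.
first = a_str[0]
index_type = type(first)

# An ordering comparison between that integer and a
# 1-character string raises TypeError in Python 3.
try:
    first >= '\x81'
    comparison_failed = False
except TypeError:
    comparison_failed = True

# Comparing the integer to integer constants works fine.
in_range = 0x81 <= first <= 0x9F
```

<p>The slice keeps its type and the index does not. That asymmetry is what turns a comparison that was harmless in Python 2 into a <code>TypeError</code>.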
|
|
<p>In this case, there’s no need to make the code more complicated by adding an explicit coercion. <var>aStr[0]</var> yields an integer; the things you’re comparing to are all constants. Let’s change them from 1-character strings to integers.
|
|
<pre><code> class SJISContextAnalysis(JapaneseContextAnalysis):
|
|
def get_order(self, aStr):
|
|
if not aStr: return -1, 1
|
|
# find out current char's byte length
|
|
<del>- if ((aStr[0] >= '\x81') and (aStr[0] <= '\x9F')) or \</del>
|
|
<del>- ((aStr[0] >= '\xE0') and (aStr[0] <= '\xFC')):</del>
|
|
<ins>+ if ((aStr[0] >= 0x81) and (aStr[0] <= 0x9F)) or \</ins>
|
|
<ins>+ ((aStr[0] >= 0xE0) and (aStr[0] <= 0xFC)):</ins>
|
|
charLen = 2
|
|
else:
|
|
charLen = 1
|
|
|
|
# return its order if it is hiragana
|
|
if len(aStr) > 1:
|
|
<del>- if (aStr[0] == '\202') and \</del>
|
|
<del>- (aStr[1] >= '\x9F') and \</del>
|
|
<del>- (aStr[1] <= '\xF1'):</del>
|
|
<del>- return ord(aStr[1]) - 0x9F, charLen</del>
|
|
<ins>+ if (aStr[0] == 0x82) and \</ins>
|
|
<ins>+ (aStr[1] >= 0x9F) and \</ins>
|
|
<ins>+ (aStr[1] <= 0xF1):</ins>
|
|
<ins>+ return aStr[1] - 0x9F, charLen</ins>
|
|
|
|
return -1, charLen
|
|
|
|
class EUCJPContextAnalysis(JapaneseContextAnalysis):
|
|
def get_order(self, aStr):
|
|
if not aStr: return -1, 1
|
|
# find out current char's byte length
|
|
<del>- if (aStr[0] == '\x8E') or \</del>
|
|
<del>- ((aStr[0] >= '\xA1') and (aStr[0] <= '\xFE')):</del>
|
|
<ins>+ if (aStr[0] == 0x8E) or \</ins>
|
|
<ins>+ ((aStr[0] >= 0xA1) and (aStr[0] <= 0xFE)):</ins>
|
|
charLen = 2
|
|
<del>- elif aStr[0] == '\x8F':</del>
|
|
<ins>+ elif aStr[0] == 0x8F:</ins>
|
|
charLen = 3
|
|
else:
|
|
charLen = 1
|
|
|
|
# return its order if it is hiragana
|
|
if len(aStr) > 1:
|
|
<del>- if (aStr[0] == '\xA4') and \</del>
|
|
<del>- (aStr[1] >= '\xA1') and \</del>
|
|
<del>- (aStr[1] <= '\xF3'):</del>
|
|
<del>- return ord(aStr[1]) - 0xA1, charLen</del>
|
|
<ins>+ if (aStr[0] == 0xA4) and \</ins>
|
|
<ins>+ (aStr[1] >= 0xA1) and \</ins>
|
|
<ins>+ (aStr[1] <= 0xF3):</ins>
|
|
<ins>+ return aStr[1] - 0xA1, charLen</ins>
|
|
|
|
return -1, charLen</code></pre>
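<p>One detail is worth checking when you convert constants like these by hand: <code>'\202'</code> is an <em>octal</em> escape, not a hexadecimal one, so its integer value is <code>0x82</code>. The sketch below verifies that, and also shows the lead-byte test working on integers. The function <code>sjis_char_len()</code> is a hypothetical standalone helper written for this illustration; it mirrors the byte ranges in the diff, but it is not part of <code>chardet</code>.

```python
def sjis_char_len(a_str):
    """Return 2 if a_str starts with a Shift-JIS lead byte, else 1.

    Hypothetical helper for illustration only; not a chardet function.
    """
    first = a_str[0]  # indexing bytes yields an int in Python 3
    if 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xFC:
        return 2
    return 1


# '\202' is octal: 2*64 + 0*8 + 2 = 130, which is 0x82 in hex.
octal_value = ord('\202')

two_byte = sjis_char_len(b'\x82\xa0')  # lead byte 0x82 is in range
one_byte = sjis_char_len(b'A')         # 0x41 is plain ASCII
```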
|
|
<p>Searching the entire codebase for occurrences of the <code>ord()</code> function uncovers the same problem in <code>chardistribution.py</code>, and the next test run fails there:
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
|
|
tests\Big5\0804.blogspot.com.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 10, in &lt;module>
|
|
u.feed(line)
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 117, in feed
|
|
if prober.feed(aBuf) == constants.eFoundIt:
|
|
File "C:\home\chardet\chardet\charsetgroupprober.py", line 60, in feed
|
|
st = prober.feed(aBuf)
|
|
File "C:\home\chardet\chardet\sjisprober.py", line 72, in feed
|
|
self._mDistributionAnalyzer.feed(aBuf[i - 1 : i + 1], charLen)
|
|
File "C:\home\chardet\chardet\chardistribution.py", line 56, in feed
|
|
order = self.get_order(aStr)
|
|
File "C:\home\chardet\chardet\chardistribution.py", line 174, in get_order
|
|
if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):
|
|
TypeError: unorderable types: int() >= str()</samp></pre>
|
|
<p>The fix is the same:
|
|
<pre><code> class EUCTWDistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = EUCTWCharToFreqOrder
|
|
self._mTableSize = EUCTW_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = EUCTW_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if aStr[0] >= '\xC4':</del>
|
|
<del>- return 94 * (ord(aStr[0]) - 0xC4) + ord(aStr[1]) - 0xA1</del>
|
|
<ins>+ if aStr[0] >= 0xC4:</ins>
|
|
<ins>+ return 94 * (aStr[0] - 0xC4) + aStr[1] - 0xA1</ins>
|
|
else:
|
|
return -1
|
|
|
|
class EUCKRDistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = EUCKRCharToFreqOrder
|
|
self._mTableSize = EUCKR_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = EUCKR_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if aStr[0] >= '\xB0':</del>
|
|
<del>- return 94 * (ord(aStr[0]) - 0xB0) + ord(aStr[1]) - 0xA1</del>
|
|
<ins>+ if aStr[0] >= 0xB0:</ins>
|
|
<ins>+ return 94 * (aStr[0] - 0xB0) + aStr[1] - 0xA1</ins>
|
|
else:
|
|
return -1;
|
|
|
|
class GB2312DistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = GB2312CharToFreqOrder
|
|
self._mTableSize = GB2312_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = GB2312_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if (aStr[0] >= '\xB0') and (aStr[1] >= '\xA1'):</del>
|
|
<del>- return 94 * (ord(aStr[0]) - 0xB0) + ord(aStr[1]) - 0xA1</del>
|
|
<ins>+ if (aStr[0] >= 0xB0) and (aStr[1] >= 0xA1):</ins>
|
|
<ins>+ return 94 * (aStr[0] - 0xB0) + aStr[1] - 0xA1</ins>
|
|
else:
|
|
return -1;
|
|
|
|
class Big5DistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = Big5CharToFreqOrder
|
|
self._mTableSize = BIG5_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = BIG5_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if aStr[0] >= '\xA4':</del>
|
|
<del>- if aStr[1] >= '\xA1':</del>
|
|
<del>- return 157 * (ord(aStr[0]) - 0xA4) + ord(aStr[1]) - 0xA1 + 63</del>
|
|
<ins>+ if aStr[0] >= 0xA4:</ins>
|
|
<ins>+ if aStr[1] >= 0xA1:</ins>
|
|
<ins>+ return 157 * (aStr[0] - 0xA4) + aStr[1] - 0xA1 + 63</ins>
|
|
else:
|
|
<del>- return 157 * (ord(aStr[0]) - 0xA4) + ord(aStr[1]) - 0x40</del>
|
|
<ins>+ return 157 * (aStr[0] - 0xA4) + aStr[1] - 0x40</ins>
|
|
else:
|
|
return -1
|
|
|
|
class SJISDistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = JISCharToFreqOrder
|
|
self._mTableSize = JIS_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if (aStr[0] >= '\x81') and (aStr[0] <= '\x9F'):</del>
|
|
<del>- order = 188 * (ord(aStr[0]) - 0x81)</del>
|
|
<del>- elif (aStr[0] >= '\xE0') and (aStr[0] <= '\xEF'):</del>
|
|
<del>- order = 188 * (ord(aStr[0]) - 0xE0 + 31)</del>
|
|
<ins>+ if (aStr[0] >= 0x81) and (aStr[0] <= 0x9F):</ins>
|
|
<ins>+ order = 188 * (aStr[0] - 0x81)</ins>
|
|
<ins>+ elif (aStr[0] >= 0xE0) and (aStr[0] <= 0xEF):</ins>
|
|
<ins>+ order = 188 * (aStr[0] - 0xE0 + 31)</ins>
|
|
else:
|
|
return -1;
|
|
<del>- order = order + ord(aStr[1]) - 0x40</del>
|
|
<del>- if aStr[1] > '\x7F':</del>
|
|
<ins>+ order = order + aStr[1] - 0x40</ins>
|
|
<ins>+ if aStr[1] > 0x7F:</ins>
|
|
order =- 1
|
|
return order
|
|
|
|
class EUCJPDistributionAnalysis(CharDistributionAnalysis):
|
|
def __init__(self):
|
|
CharDistributionAnalysis.__init__(self)
|
|
self._mCharToFreqOrder = JISCharToFreqOrder
|
|
self._mTableSize = JIS_TABLE_SIZE
|
|
self._mTypicalDistributionRatio = JIS_TYPICAL_DISTRIBUTION_RATIO
|
|
|
|
def get_order(self, aStr):
|
|
<del>- if aStr[0] >= '\xA0':</del>
|
|
<del>- return 94 * (ord(aStr[0]) - 0xA1) + ord(aStr[1]) - 0xA1</del>
|
|
<ins>+ if aStr[0] >= 0xA0:</ins>
|
|
<ins>+ return 94 * (aStr[0] - 0xA1) + aStr[1] - 0xA1</ins>
|
|
else:
|
|
return -1</code></pre>
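<p>Why drop the <code>ord()</code> calls instead of leaving them in place? Because <code>ord()</code> wants a 1-character string (or a <code>bytes</code> object of length 1), and an indexed byte is already an integer. Here is a minimal sketch of one of these calculations on its own. The function name <code>euckr_order()</code> and the sample bytes are assumptions made for illustration; in <code>chardet</code> the logic lives in a <code>get_order()</code> method.

```python
def euckr_order(a_str):
    """Hypothetical standalone version of the EUC-KR order calculation."""
    if a_str[0] >= 0xB0:
        return 94 * (a_str[0] - 0xB0) + a_str[1] - 0xA1
    return -1


first_cell = euckr_order(b'\xb0\xa1')  # 94 * 0 + 0
next_row = euckr_order(b'\xb1\xa1')    # 94 * 1 + 0
too_low = euckr_order(b'\xa1\xa1')     # lead byte below 0xB0

# ord() rejects an int, so a leftover ord(aStr[0]) would fail
# as soon as aStr is a bytes object.
try:
    ord(b'\xb0\xa1'[0])
    ord_failed = False
except TypeError:
    ord_failed = True
```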
|
|
<h3 id=reduceisnotdefined>Global name <code>'reduce'</code> is not defined</h3>
|
|
<p>Once more into the breach…
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
|
|
tests\Big5\0804.blogspot.com.xml</samp>
|
|
<samp class=traceback>Traceback (most recent call last):
|
|
File "test.py", line 12, in &lt;module>
|
|
u.close()
|
|
File "C:\home\chardet\chardet\universaldetector.py", line 141, in close
|
|
proberConfidence = prober.get_confidence()
|
|
File "C:\home\chardet\chardet\latin1prober.py", line 126, in get_confidence
|
|
total = reduce(operator.add, self._mFreqCounter)
|
|
NameError: global name 'reduce' is not defined</samp></pre>
|
|
<p>According to the official <a href=http://docs.python.org/3.0/whatsnew/3.0.html#builtins>What’s New In Python 3.0</a> guide, the <code>reduce()</code> function was moved out of the global namespace and into the <code>functools</code> module. Quoting the guide: “Use <code>functools.reduce()</code> if you really need it; however, 99 percent of the time an explicit <code>for</code> loop is more readable.” You can read more about the decision on Guido van Rossum’s weblog: <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=98196">The fate of reduce() in Python 3000</a>.
|
|
<pre><code>def get_confidence(self):
    if self.get_state() == constants.eNotMe:
        return 0.01

<mark>    total = reduce(operator.add, self._mFreqCounter)</mark></code></pre>
|
|
<p>The <code>reduce()</code> function is called here with two arguments, a function and a list (strictly speaking, any iterable object will do). It applies the function cumulatively to the items of the list, from left to right, until only a single value remains. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result.
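<p>To make “cumulatively” concrete, here is a small sketch that runs the same fold two ways: with <code>functools.reduce()</code>, and with the explicit <code>for</code> loop the What’s New guide recommends. The list <code>freq_counter</code> holds made-up values for illustration; it is not real <code>chardet</code> data.

```python
import functools
import operator

freq_counter = [3, 0, 5, 2]  # illustrative values only

# reduce() folds the list from left to right: ((3 + 0) + 5) + 2
total_reduce = functools.reduce(operator.add, freq_counter)

# The same computation written as an explicit loop.
total_loop = 0
for count in freq_counter:
    total_loop += count
```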
|
|
<p>This monstrosity was so common that Python added a global <code>sum()</code> function.
|
|
<pre><code>  def get_confidence(self):
      if self.get_state() == constants.eNotMe:
          return 0.01

<del>-     total = reduce(operator.add, self._mFreqCounter)</del>
<ins>+     total = sum(self._mFreqCounter)</ins></code></pre>
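<p>Is the swap really safe? For a non-empty list of numbers the two spellings give the same answer. The one behavioral difference shows up on an empty list: <code>sum()</code> returns <code>0</code>, while <code>functools.reduce()</code> without an initial value raises a <code>TypeError</code>. The sketch below checks both cases, again with made-up values.

```python
import functools
import operator

freq_counter = [3, 0, 5, 2]  # illustrative values only

# Same result on a non-empty list of numbers.
same = sum(freq_counter) == functools.reduce(operator.add, freq_counter)

# On an empty list, sum() quietly returns 0...
empty_sum = sum([])

# ...while reduce() with no initial value raises TypeError.
try:
    functools.reduce(operator.add, [])
    reduce_failed = False
except TypeError:
    reduce_failed = True
```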
|
|
<p>Since you’re no longer using the <code>operator</code> module, you can remove that <code>import</code> from the top of the file as well.
|
|
<pre><code>  from .charsetprober import CharSetProber
  from . import constants
<del>- import operator</del></code></pre>
|
|
<p>I CAN HAZ TESTZ?
|
|
<pre class=screen><samp class=p>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
|
|
<samp>tests\ascii\howto.diveintomark.org.xml ascii with confidence 1.0
|
|
tests\Big5\0804.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\blog.worren.net.xml Big5 with confidence 0.99
|
|
tests\Big5\carbonxiv.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\catshadow.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\coolloud.org.tw.xml Big5 with confidence 0.99
|
|
tests\Big5\digitalwall.com.xml Big5 with confidence 0.99
|
|
tests\Big5\ebao.us.xml Big5 with confidence 0.99
|
|
tests\Big5\fudesign.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\kafkatseng.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\ke207.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\leavesth.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\letterlego.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\linyijen.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\marilynwu.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\myblog.pchome.com.tw.xml Big5 with confidence 0.99
|
|
tests\Big5\oui-design.com.xml Big5 with confidence 0.99
|
|
tests\Big5\sanwenji.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\sinica.edu.tw.xml Big5 with confidence 0.99
|
|
tests\Big5\sylvia1976.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\tlkkuo.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\tw.blog.xubg.com.xml Big5 with confidence 0.99
|
|
tests\Big5\unoriginalblog.com.xml Big5 with confidence 0.99
|
|
tests\Big5\upsaid.com.xml Big5 with confidence 0.99
|
|
tests\Big5\willythecop.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\Big5\ytc.blogspot.com.xml Big5 with confidence 0.99
|
|
tests\EUC-JP\aivy.co.jp.xml EUC-JP with confidence 0.99
|
|
tests\EUC-JP\akaname.main.jp.xml EUC-JP with confidence 0.99
|
|
tests\EUC-JP\arclamp.jp.xml EUC-JP with confidence 0.99
|
|
.
|
|
.
|
|
.
|
|
316 tests</samp></pre>
|
|
<p>Holy crap, it actually works! <em><a href=http://www.hampsterdance.com/>/me does a little dance</a></em>
|
|
<h2 id=summary>Summary</h2>
|
|
<p>What have we learned?
|
|
<ol>
|
|
<li>Porting any non-trivial amount of code from Python 2 to Python 3 is going to be a pain. There’s no way around it. It’s hard.
|
|
<li>The <a href=porting-code-to-python-3-with-2to3.html>automated <code>2to3</code> tool</a> is helpful as far as it goes, but it will only do the easy parts — function renames, module renames, syntax changes. It’s an impressive piece of engineering, but in the end it’s just an intelligent search-and-replace bot.
|
|
<li>The #1 porting problem in this library was the difference between strings and bytes. In this case that seems obvious, since the whole point of the <code>chardet</code> library is to convert a stream of bytes into a string. But “a stream of bytes” comes up more often than you might think. Reading a file in “binary” mode? You’ll get a stream of bytes. Fetching a web page? Calling a web <abbr>API</abbr>? They return a stream of bytes, too.
|
|
<li><em>You</em> need to understand your program. Thoroughly. Preferably because you wrote it, but at the very least, you need to be comfortable with all its quirks and musty corners. The bugs are everywhere.
|
|
<li>Test cases are essential. Don’t port anything without them. Don’t even try. The <em>only</em> reason I have any confidence at all that <code>chardet</code> works in Python 3 is that I had a test suite that exercised every line of code in the entire library. I <em>never</em> would have found half of these problems with manual spot-checking.
|
|
</ol>
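<p>The point about “a stream of bytes” is easy to check. This sketch writes a small UTF-8 file to a temporary location, then reads it back in binary mode and in text mode. The file contents and variable names are assumptions made for illustration.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location.
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as f:
    f.write('caf\u00e9'.encode('utf-8'))
    path = f.name

try:
    # Binary mode gives you bytes: no decoding has happened yet.
    with open(path, 'rb') as f:
        raw = f.read()
    # Text mode gives you a string, decoded with the encoding you name.
    with open(path, encoding='utf-8') as f:
        text = f.read()
finally:
    os.remove(path)

# Decoding the bytes yourself produces the same string.
decoded = raw.decode('utf-8')
```

<p>The byte count and the character count differ (5 bytes, 4 characters), which is exactly the kind of distinction Python 2 let you ignore and Python 3 does not.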
|
|
|
|
<p class=nav><a rel=prev class=todo><span>☜</span></a> <a rel=next href=porting-code-to-python-3-with-2to3.html title="onward to “Porting Code to Python 3 with 2to3”"><span>☞</span></a>
|
|
<p class=c>© 2001–9 <a href=about.html>Mark Pilgrim</a>
|
|
<script src=jquery.js></script>
|
|
<script src=dip3.js></script>
|