dive-into-python3/case-study-porting-chardet-to-python-3.html
2009-01-26 17:10:42 -05:00

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Case study: porting chardet to Python 3 - Dive into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
</head>
<body>
<h1>Case study: porting chardet to Python 3</h1>
<ol class="toc">
<li><a href="#divingin">Diving in</a></li>
<li><a href="#running2to3">Running <code class="filename">2to3</code></a></li>
<li><a href="#falseisinvalidsyntax"><code>False</code> is invalid syntax</a></li>
<li><a href="#nomodulenamedconstants">No module named <code class="filename">constants</code></a></li>
<li><a href="#namefileisnotdefined">Name '<var>file</var>' is not defined</a></li>
<li><a href="#cantuseastringpattern">Can't use a string pattern on a bytes-like object</a></li>
<li><a href="#cantconvertbytesobject">Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</a></li>
</ol>
<section id="divingin">
<h2>Diving in</h2>
<p class="fancy">FIXME intro</p>
<p>...</p>
</section>
<section id="running2to3">
<h2>Running <code class="filename">2to3</code></h2>
<p>We're going to migrate the <code class="filename">chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script to help with this, called <code class="filename">2to3</code>. <code class="filename">2to3</code> takes your actual Python 2 source code as input, and auto-converts as much as it can to Python 3. [FIXME reference 2to3 chapter once it's done]</p>
<p>The <code class="filename">chardet</code> library is split across several different files, all in the same directory. The <code class="filename">2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code class="filename">2to3</code> will convert each of the files in turn.</p>
<p><a href="#skip2to3output" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- chardet\__init__.py (original)
+++ chardet\__init__.py (refactored)
@@ -18,7 +18,7 @@
__version__ = "1.0.1"
def detect(aBuf):
- import universaldetector
+ from . import universaldetector
u = universaldetector.UniversalDetector()
u.reset()
u.feed(aBuf)
--- chardet\big5prober.py (original)
+++ chardet\big5prober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import Big5DistributionAnalysis
-from mbcssm import Big5SMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import Big5DistributionAnalysis
+from .mbcssm import Big5SMModel
class Big5Prober(MultiByteCharSetProber):
def __init__(self):
--- chardet\chardistribution.py (original)
+++ chardet\chardistribution.py (refactored)
@@ -25,12 +25,12 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
-from euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO
-from euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO
-from gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO
-from big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO
-from jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO
+from . import constants
+from .euctwfreq import EUCTWCharToFreqOrder, EUCTW_TABLE_SIZE, EUCTW_TYPICAL_DISTRIBUTION_RATIO
+from .euckrfreq import EUCKRCharToFreqOrder, EUCKR_TABLE_SIZE, EUCKR_TYPICAL_DISTRIBUTION_RATIO
+from .gb2312freq import GB2312CharToFreqOrder, GB2312_TABLE_SIZE, GB2312_TYPICAL_DISTRIBUTION_RATIO
+from .big5freq import Big5CharToFreqOrder, BIG5_TABLE_SIZE, BIG5_TYPICAL_DISTRIBUTION_RATIO
+from .jisfreq import JISCharToFreqOrder, JIS_TABLE_SIZE, JIS_TYPICAL_DISTRIBUTION_RATIO
ENOUGH_DATA_THRESHOLD = 1024
SURE_YES = 0.99
--- chardet\charsetgroupprober.py (original)
+++ chardet\charsetgroupprober.py (refactored)
@@ -26,7 +26,7 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from charsetprober import CharSetProber
+from .charsetprober import CharSetProber
class CharSetGroupProber(CharSetProber):
def __init__(self):
--- chardet\codingstatemachine.py (original)
+++ chardet\codingstatemachine.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from constants import eStart, eError, eItsMe
+from .constants import eStart, eError, eItsMe
class CodingStateMachine:
def __init__(self, sm):
--- chardet\constants.py (original)
+++ chardet\constants.py (refactored)
@@ -38,10 +38,10 @@
SHORTCUT_THRESHOLD = 0.95
-import __builtin__
+import builtins
if not hasattr(__builtin__, 'False'):
False = 0
True = 1
else:
- False = __builtin__.False
- True = __builtin__.True
+ False = builtins.False
+ True = builtins.True
--- chardet\escprober.py (original)
+++ chardet\escprober.py (refactored)
@@ -26,9 +26,9 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from escsm import HZSMModel, ISO2022CNSMModel, ISO2022JPSMModel, ISO2022KRSMModel
-from charsetprober import CharSetProber
-from codingstatemachine import CodingStateMachine
+from .escsm import HZSMModel, ISO2022CNSMModel, ISO2022JPSMModel, ISO2022KRSMModel
+from .charsetprober import CharSetProber
+from .codingstatemachine import CodingStateMachine
class EscCharSetProber(CharSetProber):
def __init__(self):
--- chardet\escsm.py (original)
+++ chardet\escsm.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from constants import eStart, eError, eItsMe
+from .constants import eStart, eError, eItsMe
HZ_cls = ( \
1,0,0,0,0,0,0,0, # 00 - 07
--- chardet\eucjpprober.py (original)
+++ chardet\eucjpprober.py (refactored)
@@ -26,12 +26,12 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from constants import eStart, eError, eItsMe
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import EUCJPDistributionAnalysis
-from jpcntx import EUCJPContextAnalysis
-from mbcssm import EUCJPSMModel
+from .constants import eStart, eError, eItsMe
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import EUCJPDistributionAnalysis
+from .jpcntx import EUCJPContextAnalysis
+from .mbcssm import EUCJPSMModel
class EUCJPProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\euckrprober.py (original)
+++ chardet\euckrprober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import EUCKRDistributionAnalysis
-from mbcssm import EUCKRSMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import EUCKRDistributionAnalysis
+from .mbcssm import EUCKRSMModel
class EUCKRProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\euctwprober.py (original)
+++ chardet\euctwprober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import EUCTWDistributionAnalysis
-from mbcssm import EUCTWSMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import EUCTWDistributionAnalysis
+from .mbcssm import EUCTWSMModel
class EUCTWProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\gb2312prober.py (original)
+++ chardet\gb2312prober.py (refactored)
@@ -25,10 +25,10 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import GB2312DistributionAnalysis
-from mbcssm import GB2312SMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import GB2312DistributionAnalysis
+from .mbcssm import GB2312SMModel
class GB2312Prober(MultiByteCharSetProber):
def __init__(self):
--- chardet\hebrewprober.py (original)
+++ chardet\hebrewprober.py (refactored)
@@ -25,8 +25,8 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from charsetprober import CharSetProber
-import constants
+from .charsetprober import CharSetProber
+from . import constants
# This prober doesn't actually recognize a language or a charset.
# It is a helper prober for the use of the Hebrew model probers
--- chardet\jpcntx.py (original)
+++ chardet\jpcntx.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
NUM_OF_CATEGORY = 6
DONT_KNOW = -1
--- chardet\langbulgarianmodel.py (original)
+++ chardet\langbulgarianmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langcyrillicmodel.py (original)
+++ chardet\langcyrillicmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# KOI8-R language model
# Character Mapping Table:
--- chardet\langgreekmodel.py (original)
+++ chardet\langgreekmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langhebrewmodel.py (original)
+++ chardet\langhebrewmodel.py (refactored)
@@ -27,7 +27,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langhungarianmodel.py (original)
+++ chardet\langhungarianmodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\langthaimodel.py (original)
+++ chardet\langthaimodel.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-import constants
+from . import constants
# 255: Control characters that usually does not exist in any text
# 254: Carriage/Return
--- chardet\latin1prober.py (original)
+++ chardet\latin1prober.py (refactored)
@@ -26,8 +26,8 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from charsetprober import CharSetProber
-import constants
+from .charsetprober import CharSetProber
+from . import constants
import operator
FREQ_CAT_NUM = 4
--- chardet\mbcharsetprober.py (original)
+++ chardet\mbcharsetprober.py (refactored)
@@ -28,8 +28,8 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from constants import eStart, eError, eItsMe
-from charsetprober import CharSetProber
+from .constants import eStart, eError, eItsMe
+from .charsetprober import CharSetProber
class MultiByteCharSetProber(CharSetProber):
def __init__(self):
--- chardet\mbcsgroupprober.py (original)
+++ chardet\mbcsgroupprober.py (refactored)
@@ -27,14 +27,14 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from charsetgroupprober import CharSetGroupProber
-from utf8prober import UTF8Prober
-from sjisprober import SJISProber
-from eucjpprober import EUCJPProber
-from gb2312prober import GB2312Prober
-from euckrprober import EUCKRProber
-from big5prober import Big5Prober
-from euctwprober import EUCTWProber
+from .charsetgroupprober import CharSetGroupProber
+from .utf8prober import UTF8Prober
+from .sjisprober import SJISProber
+from .eucjpprober import EUCJPProber
+from .gb2312prober import GB2312Prober
+from .euckrprober import EUCKRProber
+from .big5prober import Big5Prober
+from .euctwprober import EUCTWProber
class MBCSGroupProber(CharSetGroupProber):
def __init__(self):
--- chardet\mbcssm.py (original)
+++ chardet\mbcssm.py (refactored)
@@ -25,7 +25,7 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from constants import eStart, eError, eItsMe
+from .constants import eStart, eError, eItsMe
# BIG5
--- chardet\sbcharsetprober.py (original)
+++ chardet\sbcharsetprober.py (refactored)
@@ -27,7 +27,7 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from charsetprober import CharSetProber
+from .charsetprober import CharSetProber
SAMPLE_SIZE = 64
SB_ENOUGH_REL_THRESHOLD = 1024
--- chardet\sbcsgroupprober.py (original)
+++ chardet\sbcsgroupprober.py (refactored)
@@ -27,15 +27,15 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from charsetgroupprober import CharSetGroupProber
-from sbcharsetprober import SingleByteCharSetProber
-from langcyrillicmodel import Win1251CyrillicModel, Koi8rModel, Latin5CyrillicModel, MacCyrillicModel, Ibm866Model, Ibm855Model
-from langgreekmodel import Latin7GreekModel, Win1253GreekModel
-from langbulgarianmodel import Latin5BulgarianModel, Win1251BulgarianModel
-from langhungarianmodel import Latin2HungarianModel, Win1250HungarianModel
-from langthaimodel import TIS620ThaiModel
-from langhebrewmodel import Win1255HebrewModel
-from hebrewprober import HebrewProber
+from .charsetgroupprober import CharSetGroupProber
+from .sbcharsetprober import SingleByteCharSetProber
+from .langcyrillicmodel import Win1251CyrillicModel, Koi8rModel, Latin5CyrillicModel, MacCyrillicModel, Ibm866Model, Ibm855Model
+from .langgreekmodel import Latin7GreekModel, Win1253GreekModel
+from .langbulgarianmodel import Latin5BulgarianModel, Win1251BulgarianModel
+from .langhungarianmodel import Latin2HungarianModel, Win1250HungarianModel
+from .langthaimodel import TIS620ThaiModel
+from .langhebrewmodel import Win1255HebrewModel
+from .hebrewprober import HebrewProber
class SBCSGroupProber(CharSetGroupProber):
def __init__(self):
--- chardet\sjisprober.py (original)
+++ chardet\sjisprober.py (refactored)
@@ -25,13 +25,13 @@
# 02110-1301 USA
######################### END LICENSE BLOCK #########################
-from mbcharsetprober import MultiByteCharSetProber
-from codingstatemachine import CodingStateMachine
-from chardistribution import SJISDistributionAnalysis
-from jpcntx import SJISContextAnalysis
-from mbcssm import SJISSMModel
+from .mbcharsetprober import MultiByteCharSetProber
+from .codingstatemachine import CodingStateMachine
+from .chardistribution import SJISDistributionAnalysis
+from .jpcntx import SJISContextAnalysis
+from .mbcssm import SJISSMModel
import constants, sys
-from constants import eStart, eError, eItsMe
+from .constants import eStart, eError, eItsMe
class SJISProber(MultiByteCharSetProber):
def __init__(self):
--- chardet\universaldetector.py (original)
+++ chardet\universaldetector.py (refactored)
@@ -27,10 +27,10 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from latin1prober import Latin1Prober # windows-1252
-from mbcsgroupprober import MBCSGroupProber # multi-byte character sets
-from sbcsgroupprober import SBCSGroupProber # single-byte character sets
-from escprober import EscCharSetProber # ISO-2122, etc.
+from .latin1prober import Latin1Prober # windows-1252
+from .mbcsgroupprober import MBCSGroupProber # multi-byte character sets
+from .sbcsgroupprober import SBCSGroupProber # single-byte character sets
+from .escprober import EscCharSetProber # ISO-2122, etc.
import re
MINIMUM_THRESHOLD = 0.20
--- chardet\utf8prober.py (original)
+++ chardet\utf8prober.py (refactored)
@@ -26,10 +26,10 @@
######################### END LICENSE BLOCK #########################
import constants, sys
-from constants import eStart, eError, eItsMe
-from charsetprober import CharSetProber
-from codingstatemachine import CodingStateMachine
-from mbcssm import UTF8SMModel
+from .constants import eStart, eError, eItsMe
+from .charsetprober import CharSetProber
+from .codingstatemachine import CodingStateMachine
+from .mbcssm import UTF8SMModel
ONE_CHAR_PROB = 0.5
RefactoringTool: Files that were modified:
RefactoringTool: chardet\__init__.py
RefactoringTool: chardet\big5prober.py
RefactoringTool: chardet\chardistribution.py
RefactoringTool: chardet\charsetgroupprober.py
RefactoringTool: chardet\codingstatemachine.py
RefactoringTool: chardet\constants.py
RefactoringTool: chardet\escprober.py
RefactoringTool: chardet\escsm.py
RefactoringTool: chardet\eucjpprober.py
RefactoringTool: chardet\euckrprober.py
RefactoringTool: chardet\euctwprober.py
RefactoringTool: chardet\gb2312prober.py
RefactoringTool: chardet\hebrewprober.py
RefactoringTool: chardet\jpcntx.py
RefactoringTool: chardet\langbulgarianmodel.py
RefactoringTool: chardet\langcyrillicmodel.py
RefactoringTool: chardet\langgreekmodel.py
RefactoringTool: chardet\langhebrewmodel.py
RefactoringTool: chardet\langhungarianmodel.py
RefactoringTool: chardet\langthaimodel.py
RefactoringTool: chardet\latin1prober.py
RefactoringTool: chardet\mbcharsetprober.py
RefactoringTool: chardet\mbcsgroupprober.py
RefactoringTool: chardet\mbcssm.py
RefactoringTool: chardet\sbcharsetprober.py
RefactoringTool: chardet\sbcsgroupprober.py
RefactoringTool: chardet\sjisprober.py
RefactoringTool: chardet\universaldetector.py
RefactoringTool: chardet\utf8prober.py</samp></pre>
<p id="skip2to3output">Now run the <code class="filename">2to3</code> script on the testing harness, <code class="filename">test.py</code>.</p>
<p><a href="#skip2to3outputtest" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
RefactoringTool: Skipping implicit fixer: ws_comma
--- test.py (original)
+++ test.py (refactored)
@@ -4,7 +4,7 @@
 count = 0
 u = UniversalDetector()
 for f in glob.glob(sys.argv[1]):
-    print f.ljust(60),
+    print(f.ljust(60), end=' ')
     u.reset()
     for line in file(f, 'rb'):
         u.feed(line)
@@ -12,8 +12,8 @@
     u.close()
     result = u.result
     if result['encoding']:
-        print result['encoding'], 'with confidence', result['confidence']
+        print(result['encoding'], 'with confidence', result['confidence'])
     else:
-        print '******** no result'
+        print('******** no result')
     count += 1
-print count, 'tests'
+print(count, 'tests')
RefactoringTool: Files that were modified:
RefactoringTool: test.py</samp></pre>
<p id="skip2to3outputtest">Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work?</p>
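<p>For reference, here is a minimal sketch of what the converted <code>print()</code> calls do. The <code>report_line</code> helper, its <code>out</code> parameter, and the sample values are illustrative assumptions, not part of <code class="filename">test.py</code>; they exist only so the output can be captured and inspected.</p>

```python
import io


def report_line(filename, encoding, confidence, out):
    """Write one result line the way the converted test.py prints it."""
    # Python 2: print filename.ljust(60),
    # The trailing comma suppressed the newline; Python 3 uses end=' ' instead.
    print(filename.ljust(60), end=' ', file=out)
    # Python 2: print encoding, 'with confidence', confidence
    print(encoding, 'with confidence', confidence, file=out)


# Capture the output in memory instead of writing to the console.
buffer = io.StringIO()
report_line('howto.xml', 'ascii', 1.0, buffer)
output = buffer.getvalue()
print(output, end='')
```

<p>Both calls end up on a single line of output, because the first one replaces the default newline with a space.</p>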
</section>
<section id="falseisinvalidsyntax">
<h2><code>False</code> is invalid syntax</h2>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere.</p>
<p><a href="#skipinvalidsyntax" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
<samp class="traceback">Traceback (most recent call last):
  File "test.py", line 1, in &lt;module>
    from chardet.universaldetector import UniversalDetector
  File "C:\home\chardet\chardet\universaldetector.py", line 51
    self.done = constants.False
                              ^
SyntaxError: invalid syntax</samp></pre>
<p id="skipinvalidsyntax">Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can't use it as a variable name or, as here, as an attribute name: <code>constants.False</code> no longer even parses. Let's look at <code class="filename">constants.py</code> to see where it's defined. Here's the original version from <code class="filename">constants.py</code>, before the <code class="filename">2to3</code> script changed it:</p>
<p><a href="#skipbuiltincode" class="skip">skip over this</a></p>
<pre><code>import __builtin__
if not hasattr(__builtin__, 'False'):
    False = 0
    True = 1
else:
    False = __builtin__.False
    True = __builtin__.True</code></pre>
<p id="skipbuiltincode">This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>bool</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.</p>
<p>However, Python 3 will always have a <code>bool</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "<code>constants.True</code>" and "<code>constants.False</code>" with "<code>True</code>" and "<code>False</code>", respectively, then delete this dead code from <code class="filename">constants.py</code>.</p>
<p>So this line in <code class="filename">universaldetector.py</code>:</p>
<pre><code>self.done = constants.False</code></pre>
<p>Becomes</p>
<pre><code>self.done = False</code></pre>
<p>Ah, wasn't that satisfying? The code is shorter and more readable already.</p>
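<p>You can check this from Python 3 itself. The sketch below is not part of <code class="filename">chardet</code>; the <code>parses</code> helper and the one-line source strings are made up for illustration. It only compiles the strings, so no <code>constants</code> module needs to exist.</p>

```python
import keyword

# In Python 3, False is a keyword rather than an ordinary name.
false_is_keyword = keyword.iskeyword('False')


def parses(source):
    """Return True if the source compiles under the running interpreter."""
    try:
        compile(source, 'example', 'exec')
        return True
    except SyntaxError:
        return False


# An attribute named after a keyword is rejected by the parser.
old_line_parses = parses('done = constants.False')
# The plain built-in constant is fine.
new_line_parses = parses('done = False')

print(false_is_keyword, old_line_parses, new_line_parses)  # prints: True False True
```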
</section>
<section id="nomodulenamedconstants">
<h2>No module named <code class="filename">constants</code></h2>
<p>Time to run <code class="filename">test.py</code> again and see how far it gets.</p>
<p><a href="#skipnomodulenamedconstants" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
<samp class="traceback">Traceback (most recent call last):
  File "test.py", line 1, in &lt;module>
    from chardet.universaldetector import UniversalDetector
  File "C:\home\chardet\chardet\universaldetector.py", line 29, in &lt;module>
    import constants, sys
ImportError: No module named constants</samp></pre>
<p id="skipnomodulenamedconstants">What's that you say? No module named <code class="filename">constants</code>? Of course there's a module named <code class="filename">constants</code>. ... Oh wait, no there isn't. Remember when the <code class="filename">2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:</p>
<pre><code>from . import constants</code></pre>
<p>But wait. Wasn't the <code class="filename">2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code class="filename">constants</code> module within the library, and an absolute import of the <code class="filename">sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can't: a relative import has to use the <code>from . import</code> form, which can't share a line with a plain <code>import sys</code>. The <code class="filename">2to3</code> script is not smart enough to split the import statement into two.</p>
<p>The solution is to split the import statement manually. So this two-in-one import:</p>
<pre><code>import constants, sys</code></pre>
<p>Needs to become two separate imports:</p>
<pre><code>from . import constants
import sys</code></pre>
<p>There are variations of this problem scattered throughout the <code class="filename">chardet</code> library. In some places it's "<code>import constants, sys</code>"; in other places, it's "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.</p>
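<p>To see the difference in isolation, the following sketch builds a throwaway package in a temporary directory and imports two of its modules under Python 3. Everything in it is hypothetical (the package name <code>demo_pkg_relimport</code>, the module names, and the helper functions); it is not <code class="filename">chardet</code> code, just the same import pattern in miniature.</p>

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Create a tiny hypothetical package with one old-style and one new-style import."""
    package_dir = os.path.join(root, 'demo_pkg_relimport')
    os.mkdir(package_dir)
    files = {
        '__init__.py': '',
        'demo_constants.py': 'SHORTCUT_THRESHOLD = 0.95\n',
        # Python 2 style: implicit relative import combined with an absolute one.
        'old_style.py': 'import demo_constants, sys\n',
        # Python 3 style: explicit relative import, absolute import on its own line.
        'new_style.py': 'from . import demo_constants\nimport sys\n',
    }
    for name, source in files.items():
        with open(os.path.join(package_dir, name), 'w') as handle:
            handle.write(source)


def try_import(module_name):
    """Return 'ok' if the module imports cleanly, 'ImportError' otherwise."""
    try:
        importlib.import_module(module_name)
        return 'ok'
    except ImportError:
        return 'ImportError'


with tempfile.TemporaryDirectory() as root:
    build_demo_package(root)
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        old_result = try_import('demo_pkg_relimport.old_style')
        new_result = try_import('demo_pkg_relimport.new_style')
    finally:
        sys.path.remove(root)

print(old_result, new_result)
```

<p>The old-style module fails because Python 3 looks for <code>demo_constants</code> as a top-level module and never checks inside the package; the new-style module says explicitly where to look.</p>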
<p>Onward!</p>
</section>
<section id="namefileisnotdefined">
<h2>Name '<var>file</var>' is not defined</h2>
<p>FIXME intro</p>
<p><a href="#skipnamefileisnotdefined" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
  File "test.py", line 9, in &lt;module>
    for line in file(f, 'rb'):
NameError: name 'file' is not defined</samp></pre>
<p id="skipnamefileisnotdefined">This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code class="filename">io</code> module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it's an alias for <var>io.open()</var>, but never mind that right now.)</p>
<p>Thus, the simplest solution to the problem of the missing <var>file()</var> is to call <var>open()</var> instead:</p>
<pre><code>for line in open(f, 'rb'):</code></pre>
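<p>A quick way to confirm this on your own interpreter is sketched below. The temporary file and its contents are made up for the example; the loop itself mirrors the one in <code class="filename">test.py</code>.</p>

```python
import builtins
import os
import tempfile

# The Python 2 built-in file() is gone in Python 3; open() remains.
has_file_builtin = hasattr(builtins, 'file')
has_open_builtin = hasattr(builtins, 'open')

# Write a small throwaway file, then iterate over it line by line.
handle, path = tempfile.mkstemp()
os.close(handle)
try:
    with open(path, 'wb') as out:
        out.write(b'first line\nsecond line\n')
    with open(path, 'rb') as source:
        lines = [line for line in source]
finally:
    os.remove(path)

print(has_file_builtin, has_open_builtin, lines)
```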
<p>And that's all I have to say about that.</p>
</section>
<section id="cantuseastringpattern">
<h2>Can't use a string pattern on a bytes-like object</h2>
<p>FIXME intro</p>
<p><a href="#skipcantuseastringpattern" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
  File "test.py", line 10, in &lt;module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 98, in feed
    if self._highBitDetector.search(aBuf):
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
<p id="skipcantuseastringpattern">Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell."</p>
<p>First, let's see what <var>self._highBitDetector</var> is. It's defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:</p>
<p><a href="#skiphighbitdetectorcode" class="skip">skip over this</a></p>
<pre><code>class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p id="skiphighbitdetectorcode">This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.</p>
<p>And therein lies the problem.</p>
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in <code class="filename">universaldetector.py</code>:</p>
<p><a href="#skipfeedhighbitdetectorcode" class="skip">skip over this</a></p>
<pre><code>def feed(self, aBuf):
    .
    .
    .
    if self._mInputState == ePureAscii:
        if self._highBitDetector.search(aBuf):</code></pre>
<p id="skipfeedhighbitdetectorcode">And what is <var>aBuf</var>? Let's backtrack further to a place that calls <var>UniversalDetector.feed()</var>. One place that calls it is the test harness, <code class="filename">test.py</code>.</p>
<p><a href="#skiptestharnessfeedcode" class="skip">skip over this</a></p>
<pre><code>u = UniversalDetector()
.
.
.
for line in open(f, 'rb'):
    u.feed(line)</code></pre>
<p id="skiptestharnessfeedcode">And here we find our answer: in the <var>UniversalDetector.feed()</var> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for "read"; OK, big deal, we're reading the file. Ah, but <code>'b'</code> is for "binary." Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <var>UniversalDetector.feed()</var>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit... characters. But we don't have characters; we have bytes. Oops.</p>
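<p>The difference between the two modes is easy to see on a small file. In this sketch the sample text and the temporary file are assumptions made for the demonstration, and the text-mode read names its encoding explicitly instead of relying on the system default.</p>

```python
import os
import tempfile

# Hypothetical sample: "café" plus a newline; the é is two bytes in UTF-8.
text = 'caf\u00e9\n'

handle, path = tempfile.mkstemp()
os.close(handle)
try:
    with open(path, 'wb') as out:
        out.write(text.encode('utf-8'))
    # Binary mode: each line comes back untouched, as bytes.
    with open(path, 'rb') as source:
        raw_line = source.readline()
    # Text mode: each line is decoded into a str.
    with open(path, 'r', encoding='utf-8') as source:
        text_line = source.readline()
finally:
    os.remove(path)

print(type(raw_line).__name__, len(raw_line), raw_line)
print(type(text_line).__name__, len(text_line))
```

<p>The binary line is six bytes long and the text line is five characters long, which is exactly the bytes-versus-characters distinction the regular expression trips over.</p>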
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.</p>
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:</p>
<pre><code>self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p>We now have this:</p>
<pre><code>self._highBitDetector = re.compile(b'[\x80-\xFF]')</code></pre>
<p>There is one other case of this same problem, on the very next line:</p>
<pre><code>self._escDetector = re.compile(r'(\033|~{)')</code></pre>
<p>Again, this is going to be used to search a byte array (the same <var>aBuf</var> variable, in fact), so the regular expression pattern needs to be defined as a byte array:</p>
<pre><code>self._escDetector = re.compile(b'(\033|~{)')</code></pre>
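<p>Here is a small standalone check of both patterns. The sample inputs are invented for the example; only the three pattern definitions come from the code shown above.</p>

```python
import re

# Hypothetical sample input: "café" encoded as Latin-1, so the last byte is 0xE9.
sample = 'caf\u00e9'.encode('latin-1')

str_pattern = re.compile(r'[\x80-\xFF]')     # str pattern, as in the original code
bytes_pattern = re.compile(b'[\x80-\xFF]')   # bytes pattern, as in the fix

# Searching bytes with a str pattern raises TypeError in Python 3.
try:
    str_pattern.search(sample)
    str_outcome = 'matched'
except TypeError:
    str_outcome = 'TypeError'

# The bytes pattern finds the high-bit byte and ignores pure ASCII input.
high_bit_match = bytes_pattern.search(sample)
ascii_match = bytes_pattern.search(b'plain ascii')

# The escape detector, also defined as a bytes pattern.
esc_pattern = re.compile(b'(\033|~{)')
esc_match = esc_pattern.search(b'abc ~{ def')

print(str_outcome, high_bit_match.group(), ascii_match, esc_match.group())
```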
</section>
<section id="cantconvertbytesobject">
<h2>Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</h2>
<p>Curiouser and curiouser...</p>
<p><a href="#skipcantconvertbytesobject" class="skip">skip over this</a></p>
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
  File "test.py", line 10, in &lt;module>
    u.feed(line)
  File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
    elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p id="skipcantconvertbytesobject">...</p>
</section>
</body>
</html>