diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index f5ca1f6..2be9e58 100755
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -198,7 +198,60 @@ RefactoringTool: Skipping implicit fixer: ws_comma
 +print(count, 'tests')
 RefactoringTool: Files that were modified:
 RefactoringTool: test.py
-
Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work? +
Well, that wasn’t so hard. Just a few imports and print statements to convert. Speaking of which, what was the problem with all those import statements? To answer that, you need to understand how the chardet module is split into multiple files.
+
+
⁂ + +
chardet is a multi-file module. I could have chosen to put all the code in one file (named chardet.py), but I didn’t. Instead, I made a directory (named chardet), then I made an __init__.py file in that directory. If Python sees an __init__.py file in a directory, it assumes that all of the files in that directory are part of the same module. The module’s name is the name of the directory. Files within the directory can reference other files within the same directory, or even within subdirectories. (More on that in a minute.) But the entire collection of files is presented to other Python code as a single module — as if all the functions and classes were in a single .py file.
+
+
What goes in the __init__.py file? Nothing. Everything. Something in between. The __init__.py file doesn’t need to define anything; it can literally be an empty file. Or you can use it to define your main entry point functions. Or you can put all your functions in it. Or all but one.
+
+
++ +☞A directory with an __init__.py file is always treated as a multi-file module. Without an __init__.py file, a directory is just a directory of unrelated .py files. +
Let’s see how that works in practice. + +
+>>> import chardet
+>>> dir(chardet)             ①
+['__builtins__', '__doc__', '__file__', '__name__',
+ '__package__', '__path__', '__version__', 'detect']
+>>> chardet                  ②
+<module 'chardet' from 'C:\Python31\lib\site-packages\chardet\__init__.py'>
+
① Other than the usual module attributes, the only thing in the chardet module is a detect() function.
+② Here is the first clue that the chardet module is more than just a file: the “module” is listed as the __init__.py file within the chardet/ directory.
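The same behavior can be reproduced without chardet installed. The following is a minimal sketch, assuming a throwaway package with the made-up name demo_pkg written to a temporary directory; none of these names come from chardet itself. It shows that the imported "module" is really the package's __init__.py file, and that a sibling file stays hidden until something imports it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk. "demo_pkg" and "helper" are
# hypothetical names chosen for this illustration only.
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_pkg"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("def detect(data):\n    return len(data)\n")
(pkg_dir / "helper.py").write_text("VALUE = 42\n")

# Make the temporary directory importable, then import the package.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
demo_pkg = importlib.import_module("demo_pkg")

# Only names defined in __init__.py are visible; helper.py has not been imported.
public_names = [name for name in dir(demo_pkg) if not name.startswith("_")]
init_file = Path(demo_pkg.__file__).name

print(public_names)
print(init_file)
print(hasattr(demo_pkg, "__path__"))
```

The __path__ attribute is what marks demo_pkg as a directory-backed module rather than a single file.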
+Let’s take a peek in that __init__.py file.
+
+
def detect(aBuf): ①
+ from . import universaldetector ②
+ u = universaldetector.UniversalDetector()
+ u.reset()
+ u.feed(aBuf)
+ u.close()
+ return u.result
+① The __init__.py file defines the detect() function, which is the main entry point into the chardet library.
+② The detect() function hardly has any code! In fact, all it really does is import the universaldetector module and start using it. But where is universaldetector defined?
+The answer lies in that odd-looking import statement:
+
+
from . import universaldetector
+
+Translated into English, that means “import the universaldetector module; that’s in the same directory I am,” where “I” is the chardet/__init__.py file. This is called a relative import. It’s a way for the files within a multi-file module to reference each other, without worrying about naming conflicts with other modules you may have installed in your import search path. This import statement will only look for the universaldetector module within the chardet/ directory itself.
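To make the point about naming conflicts concrete, here is a small sketch under stated assumptions: the package demo_rel_pkg and the module demo_shared are hypothetical names, created in a temporary directory. A module named demo_shared exists both at the top level and inside the package, and the relative import picks the one inside the package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# A top-level module that shares its name with a module inside the package.
(root / "demo_shared.py").write_text("ORIGIN = 'top-level'\n")

# The package: __init__.py uses a relative import inside a function,
# in the same style as chardet's detect().
pkg_dir = root / "demo_rel_pkg"
pkg_dir.mkdir()
(pkg_dir / "demo_shared.py").write_text("ORIGIN = 'inside package'\n")
(pkg_dir / "__init__.py").write_text(
    "def origin():\n"
    "    from . import demo_shared\n"
    "    return demo_shared.ORIGIN\n"
)

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import demo_shared    # absolute import: finds the top-level file
import demo_rel_pkg   # the package; its relative import stays inside it

top_level_origin = demo_shared.ORIGIN
package_origin = demo_rel_pkg.origin()

print(top_level_origin)
print(package_origin)
```

Both modules coexist because the package's copy is registered under the qualified name demo_rel_pkg.demo_shared, not under the bare name.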
+
+
These two concepts, __init__.py and relative imports, mean that you can break up your module into as many pieces as you like. The chardet module comprises 36 .py files. 36! Yet all you need to do to start using it is import chardet, then you can call the main chardet.detect() function. Unbeknownst to your code, the detect() function is actually defined in the chardet/__init__.py file. Also unbeknownst to you, the detect() function uses a relative import to reference a class defined in chardet/universaldetector.py, which in turn uses relative imports on five other files, all contained in the chardet/ directory.
+
+
++☞If you ever find yourself writing a large library in Python (or more likely, when you realize that your small library has grown into a large one), take the time to refactor it into a multi-file module. It’s one of the many things Python is good at, so take advantage of it. +
⁂
2to3 Can’t
self.done = constants.False
Becomes
self.done = False
-Ah, wasn’t that satisfying? The code is shorter and more readable already. +
Ah, wasn’t that satisfying? The code is shorter and more readable already.
constants
Time to run test.py again and see how far it gets.
C:\home\chardet> python test.py tests\*\*
@@ -237,9 +290,11 @@ else:
File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
import constants, sys
ImportError: No module named constants
-What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default. To do relative imports, you need to do something like this instead:
+
What’s that you say? No module named constants? Of course there’s a module named constants. It’s right there, in chardet/constants.py.
+
+
Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the same library — but the logic behind relative imports has changed in Python 3. In Python 2, you could just import constants and it would look in the chardet/ directory first. In Python 3, all import statements are absolute by default. If you want to do a relative import in Python 3, you need to be explicit about it:
from . import constants
-But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
+
But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
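The failure can be reproduced in isolation. The sketch below assumes two throwaway packages with hypothetical names (demo_old_style and demo_new_style), each containing a demo_constants.py sibling. One uses the combined Python 2 style statement; the other uses an explicit relative import plus a separate absolute import. Under Python 3, only the second one imports cleanly.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

def make_package(name, import_lines):
    """Create a package whose detector.py starts with the given import lines."""
    pkg_dir = root / name
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("")
    (pkg_dir / "demo_constants.py").write_text("DONE = False\n")
    (pkg_dir / "detector.py").write_text(
        import_lines + "\nDONE = demo_constants.DONE\n"
    )

# Python 2 style: sibling module and standard library module in one statement.
make_package("demo_old_style", "import demo_constants, sys")
# Python 3 style: explicit relative import, then a separate absolute import.
make_package("demo_new_style", "from . import demo_constants\nimport sys")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

try:
    importlib.import_module("demo_old_style.detector")
    old_style_error = None
except ImportError as exc:
    # Python 3 treats "import demo_constants" as absolute and cannot find it.
    old_style_error = str(exc)

new_style = importlib.import_module("demo_new_style.detector")

print(old_style_error)
print(new_style.DONE)
```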
The solution is to split the import statement manually. So this two-in-one import:
import constants, sys
Needs to become two separate imports:
@@ -283,7 +338,7 @@ TypeError: can't use a string pattern on a bytes-like object
 .
 if self._mInputState == ePureAscii:
     if self._highBitDetector.search(aBuf):
-
And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.
+
And what is aBuf? Let’s backtrack further to a place that calls UniversalDetector.feed(). One place that calls it is the test harness, test.py.
u = UniversalDetector()
.
.