The content of Dive Into Python 3 is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License.
diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index cda1ea5..33aac87 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -12,20 +12,18 @@ body{counter-reset:h1 20}
Usually, when people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
-
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
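To make the "decryption key" idea concrete, here is a minimal sketch using only built-in string methods. The sample word and the two "wrong" encodings (latin-1 and cp1251) are my own picks for illustration; nothing here comes from chardet.

```python
# The same bytes, read with three different "keys".
raw = "caf\u00e9".encode("utf-8")   # 'café' stored as UTF-8: b'caf\xc3\xa9'

as_utf8 = raw.decode("utf-8")       # the encoding that was actually used
as_latin1 = raw.decode("latin-1")   # wrong key: each byte becomes one character
as_cp1251 = raw.decode("cp1251")    # another wrong key, different gibberish

print(repr(raw))
print(as_utf8, as_latin1, as_cp1251)
```

None of the three calls fails. The bytes are identical every time; only the claimed encoding changes, and with it the characters you end up seeing.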
-
What is character encoding auto-detection?
+
Diving in
+
Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In Chapter 3, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
+
I’d also like a pony.
+
A Unicode pony.
+
A Unipony, as it were.
+
I’ll settle for character encoding auto-detection.
+
+
What is character encoding auto-detection?
It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.
+
Isn’t that impossible?
In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
-
Who wrote this detection algorithm?
-
This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative.
-
Yippie! Screw the standards, I’ll just auto-detect everything!
-
Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
-
-
HTTP can define a charset parameter in the Content-type header.
-
HTML documents can define a <meta http-equiv="content-type"> element in the <head> of a web page.
-
XML documents can define an encoding attribute in the XML prolog.
-
-
If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.)
-
Despite the complexity, it’s worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
-
Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
-
Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
-
If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.
-
Diving in
-
This is a brief guide to navigating the code itself.
+
+
Does such an algorithm exist?
+
As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. Mozilla Firefox contains an encoding auto-detection library which is open source. I ported the library to Python 2 and dubbed it the chardet module. This chapter will take you step-by-step through the process of porting the chardet module from Python 2 to Python 3.
+
+
Introducing the chardet module
+
[FIXME download link, possibly on chardet.feedparser.org, possibly local]
+
Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)
There are 5 categories of encodings that UniversalDetector handles:
@@ -98,11 +91,11 @@ body{counter-reset:h1 20}
Single-byte encodings
The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew).
SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
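The 2-character tallying idea can be sketched in a few lines. This is a toy under loud assumptions, not the real SingleByteCharSetProber: the function names, the tiny training sentence, and the scoring rule (fraction of pairs seen in the model) are all hypothetical, and the real prober also weighs a language-specific distribution ratio that is left out here.

```python
from collections import Counter

def bigram_counts(text):
    """Count every pair of adjacent characters in text."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def toy_confidence(sample, known_bigrams):
    """Fraction of the sample's 2-character sequences found in the model."""
    counts = bigram_counts(sample)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for pair, n in counts.items() if pair in known_bigrams)
    return hits / total

# Hypothetical "language model": every bigram seen in a small English sample.
training = "the quick brown fox jumps over the lazy dog and then the other thing"
model = set(bigram_counts(training))

english = toy_confidence("the other dog", model)
gibberish = toy_confidence("txzqJv 2!dasd0a QqdKjvz", model)
print(english, gibberish)
```

The English-looking sample scores higher than the gibberish one, which is the whole trick: familiar sequences raise confidence, unfamiliar ones lower it.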
-
Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
+
Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
windows-1252
If UniversalDetector detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a Latin1Prober (defined in latin1prober.py) to try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. Latin1Prober automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
Running 2to3
-
We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different modules -- but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
+
We’re going to migrate the chardet module from Python 2 to Python 3. Python 3 comes with a utility script called 2to3, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy — a function was renamed or moved to a different module — but in other cases it can get pretty complex. To get a sense of all that it can do, refer to the appendix, Porting code to Python 3 with 2to3. In this chapter, we’ll start by running 2to3 on the chardet package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
The main chardet package is split across several different files, all in the same directory. The 2to3 script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and 2to3 will convert each of the files in turn.
@@ -604,7 +597,8 @@ RefactoringTool: Skipping implicit fixer: ws_comma
+print(count, 'tests')
RefactoringTool: Files that were modified:
RefactoringTool: test.py
-
Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
+
[FIXME explain the difference in import syntax]
+
Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
Fixing what 2to3 can’t
False is invalid syntax
Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
@@ -643,7 +637,7 @@ else:
File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
import constants, sys
ImportError: No module named constants
-
What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
+
What’s that you say? No module named constants? Of course there’s a module named constants. …Oh wait, no there isn’t. Remember when the 2to3 script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
from . import constants
But wait. Wasn’t the 2to3 script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the constants module within the library, and an absolute import of the sys module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the 2to3 script is not smart enough to split the import statement into two.
The solution is to split the import statement manually. So this two-in-one import:
@@ -685,7 +679,7 @@ TypeError: can't use a string pattern on a bytes-like object
self._highBitDetector = re.compile(r'[\x80-\xFF]')
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128–255.
And therein lies the problem.
-
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
+
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
def feed(self, aBuf):
.
@@ -701,7 +695,7 @@ TypeError: can't use a string pattern on a bytes-like object
.
for line in open(f, 'rb'):
u.feed(line)
-
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
+
And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
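If you want to see the difference between the two modes for yourself, here is a small sketch. The temporary file and its contents are made up for illustration, and I pass an explicit encoding in text mode so the result doesn't depend on your system default.

```python
import os
import tempfile

# Write a few raw bytes to a temporary file (hypothetical sample data).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write("caf\u00e9\n".encode("utf-8"))
    path = f.name

try:
    # 'rb': each line comes back exactly as stored, as a bytes object.
    with open(path, "rb") as binary_file:
        binary_lines = list(binary_file)
    # 'r': each line is decoded into a string of Unicode characters.
    with open(path, "r", encoding="utf-8") as text_file:
        text_lines = list(text_file)
finally:
    os.remove(path)

print(type(binary_lines[0]), binary_lines[0])
print(type(text_lines[0]), text_lines[0])
```

Same file, same loop, two different types. In binary mode you get bytes, which is what the test harness hands to feed().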
What we need this regular expression to search is not an array of characters, but an array of bytes.
Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
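Here is a standalone sketch of that rule, outside of chardet. The sample data and variable names are mine; the two patterns differ only in the type of the literal passed to re.compile().

```python
import re

# Hypothetical sample: ASCII bytes followed by one high-bit byte.
data = b"plain ASCII, then a high-bit byte: \xe9"

# A pattern defined with a string cannot search a bytes object.
str_pattern = re.compile(r"[\x80-\xFF]")
try:
    str_pattern.search(data)
    str_pattern_error = None
except TypeError as e:
    str_pattern_error = str(e)

# A pattern defined with a byte array can.
bytes_pattern = re.compile(b"[\x80-\xFF]")
match = bytes_pattern.search(data)

print(str_pattern_error)
print(match.group())
```

The string pattern raises the same kind of TypeError as in the traceback, while the bytes pattern finds the high-bit byte.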
diff --git a/dip2 b/dip2
index 4c15027..f7572d8 100644
--- a/dip2
+++ b/dip2
@@ -23,52 +23,12 @@
You should now have a version of Python installed that works for you.
Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing python on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
Congratulations, and welcome to Python.
-
-
Chapter 2. Your First Python Program
-
You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program?
-Let's skip all that.
-
2.1. Diving in
-
Here is a complete, working Python program.
-
It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But
-read through it first and see what, if anything, you can make of it.
-
-def buildConnectionString(params):
- """Build a connection string from a dictionary of parameters.
- Returns string."""
- return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
-if __name__ == "__main__":
- myParams = {"server":"mpilgrim", \
- "database":"master", \
- "uid":"sa", \
- "pwd":"secret" \
- }
- print buildConnectionString(myParams)
Now run this program and see what happens.
-
-
In the ActivePython IDE on Windows, you can run the Python program you're editing by choosing
-File->Run... (Ctrl-R). Output is displayed in the interactive window.
-
-
-
In the Python IDE on Mac OS, you can run a Python program with
-Python->Run window... (Cmd-R), but there is an important option you must set first. Open the .py file in the IDE, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.
-
-
-
On UNIX-compatible systems (including Mac OS X), you can run a Python program from the command line: python odbchelper.py
The output of odbchelper.py will look like this:
server=mpilgrim;uid=sa;database=master;pwd=secret
2.2. Declaring Functions
-
Python has functions like most other languages, but it does not have separate header files like C++ or interface/implementation sections like Pascal. When you need a function, just declare it, like this:
-
-def buildConnectionString(params):
Note that the keyword def starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments
-(not shown here) are separated with commas.
-
Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value.
-In fact, every Python function returns a value; if the function ever executes a return statement, it will return that value, otherwise it will return None, the Python null value.
-
-
-
In Visual Basic, functions (that return a value) start with function, and subroutines (that do not return a value) start with sub. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's None), and all functions start with def.
-
The argument, params, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
-
-
-
In Java, C++, and other statically-typed languages, you must specify the datatype of the function return value and each function argument.
- In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
-
2.2.1. How Python's Datatypes Compare to Other Programming Languages
-
An erudite reader sent me this explanation of how Python compares to other programming languages:
-
-
-
statically typed language
-
A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare
- all variables with their datatypes before using them. Java and C are statically typed languages.
-
-
dynamically typed language
-
A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
-
-
strongly typed language
-
A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
-
-
weakly typed language
-
A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string '12' and the integer 3 to get the string '123', then treat that as the integer 123, all without any explicit conversion.
-
-
-
So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
2.3. Documenting Functions
You can document a Python function by giving it a docstring.
Example 2.2. Defining the buildConnectionString Function's docstring
Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from Chapter 2, odbchelper.py.
Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
@@ -795,65 +664,6 @@ NameError: There is no variable named 'x'
Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
- to insert values into a string with the %s placeholder.
-
-
-
String formatting in Python uses the same syntax as the sprintf function in C.
-
Example 3.21. Introducing String Formatting
>>> k = "uid"
->>> v = "sa"
->>> "%s=%s" % (k, v)①
-'uid=sa'
-
-
The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are.
-
Note that (k, v) is a tuple. I told you they were good for something.
-
You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
-string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
-
Example 3.22. String Formatting vs. Concatenating
>>> uid = "sa"
->>> pwd = "secret"
->>> print pwd + " is not a good password for " + uid①
-secret is not a good password for sa
->>> print "%s is not a good password for %s" % (pwd, uid)②
-secret is not a good password for sa
->>> userCount = 6
->>> print "Users connected: %d" % (userCount, )③④
-Users connected: 6
->>> print "Users connected: " + userCount⑤
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-TypeError: cannot concatenate 'str' and 'int' objects
-
-
+ is the string concatenation operator.
-
In this trivial case, string formatting accomplishes the same result as concatenation.
-
(userCount, ) is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a
- tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the
- comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether (userCount) was a tuple with one element or just the value of userCount.
-
String formatting works with integers by specifying %d instead of %s.
-
Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works
- only when everything is already a string.
-
As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
-
The %f string formatting option treats the value as a decimal, and prints it to six decimal places.
-
The ".2" modifier of the %f option truncates the value to two decimal places.
-
You can even combine modifiers. Adding the + modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding
- the value to exactly two decimal places.
-
One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each
@@ -909,75 +719,23 @@ as params.items(), but each element in the
You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.
-
Here is an example of joining a list from the buildConnectionString function:
- return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
-is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method.
-
The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't
-need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
-
-
join works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements
- will raise an exception.
-
Example 3.27. Output of odbchelper.py
>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
->>> ["%s=%s" % (k, v) for k, v in params.items()]
-['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
->>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
-'server=mpilgrim;uid=sa;database=master;pwd=secret'
This string is then returned from the odbchelper function and printed by the calling block, which gives you the output that you marveled at when you started reading this
-chapter.
-
You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's
-called split.
-
Example 3.28. Splitting a String
>>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
->>> s = ";".join(li)
->>> s
-'server=mpilgrim;uid=sa;database=master;pwd=secret'
->>> s.split(";")①
-['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
->>> s.split(";", 1)②
-['server=mpilgrim', 'uid=sa;database=master;pwd=secret']
-
-
split reverses join by splitting a string into a multi-element list. Note that the delimiter (“;”) is stripped out completely; it does not appear in any of the elements of the returned list.
-
split takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
-
-
anystring.split(delimiter, 1) is a useful technique when you want to search a string for a substring and then work with everything before the substring
- (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
-
When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story
- behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed
- important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of
- the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead.
-
3.8. Summary
-
The odbchelper.py program and its output should now make perfect sense.
-
-def buildConnectionString(params):
- """Build a connection string from a dictionary of parameters.
+(String splitting stuff was here)
+
+
+
+
+
+
+
- Returns string."""
- return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
-if __name__ == "__main__":
- myParams = {"server":"mpilgrim", \
- "database":"master", \
- "uid":"sa", \
- "pwd":"secret" \
- }
- print buildConnectionString(myParams)
-
Here is the output of odbchelper.py:
server=mpilgrim;uid=sa;database=master;pwd=secret
Before diving into the next chapter, make sure you're comfortable doing all of these things:
@@ -4162,53 +3920,21 @@ u'0'
You can even use the toxml method here, deeply nested within the document.
The p element has only one child node (you can't tell that from this example, but look at pNode.childNodes if you don't believe me), and it is a Text node for the single character '0'.
The .data attribute of a Text node gives you the actual string that the text node represents. But what is that 'u' in front of the string? The answer to that deserves its own section.
-
9.4. Unicode
-
Unicode is a system to represent characters from all the world's different languages. When Python parses an XML document, all data is stored in memory as unicode.
-
You'll get to all that in a minute, but first, some background.
-
Historical note. Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent
-that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the
-same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets.
-Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character
-encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things.
-Then think about trying to store these documents in the same place (like in the same database table); you would need to store
-the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around.
-Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used
-escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek
-mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
-
To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.
-[5] Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used
-in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number.
-Unicode data is never ambiguous.
-
Of course, there is still the matter of all these legacy encoding systems. 7-bit ASCII, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit ASCII. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit ASCII characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it
-(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit ASCII for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters
-for other languages with the remaining numbers, 256 through 65535.
-
When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding
-systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding
-scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an XML document which explicitly specifies the encoding scheme.
-
And on that note, let's get back to Python.
-
Python has had unicode support throughout the language since version 2.0. The XML package uses unicode to store all parsed XML data, but you can use unicode anywhere.
-
Example 9.13. Introducing unicode
->>> s = u'Dive in'①
->>> s
-u'Dive in'
->>> print s②
-Dive in
-
-
To create a unicode string instead of a regular ASCII string, add the letter “u” before the string. Note that this particular string doesn't have any non-ASCII characters. That's fine; unicode is a superset of ASCII (a very large superset at that), so any regular ASCII string can also be stored as unicode.
-
When printing a string, Python will attempt to convert it to your default encoding, which is usually ASCII. (More on this in a minute.) Since this unicode string is made up of characters that are also ASCII characters, printing it has the same result as printing a normal ASCII string; the conversion is seamless, and if you didn't know that s was a unicode string, you'd never notice the difference.
-
Example 9.14. Storing non-ASCII characters
->>> s = u'La Pe\xf1a'①
->>> print s②
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-UnicodeError: ASCII encoding error: ordinal not in range(128)
->>> print s.encode('latin-1')③
-La Peña
-
-
The real advantage of unicode, of course, is its ability to store non-ASCII characters, like the Spanish “ñ” (n with a tilde over it). The unicode character code for the tilde-n is 0xf1 in hexadecimal (241 in decimal), which you can type like this: \xf1.
-
Remember I said that the print function attempts to convert a unicode string to ASCII so it can print it? Well, that's not going to work here, because your unicode string contains non-ASCII characters, so Python raises a UnicodeError error.
-
Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. s is a unicode string, but print can only print a regular string. To solve this problem, you call the encode method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme,
- which you pass as a parameter. In this case, you're using latin-1 (also known as iso-8859-1), which includes the tilde-n (whereas the default ASCII encoding scheme did not, since it only includes characters numbered 0 through 127).
+
+
+
+
+
+
+(Unicode stuff was here)
+
+
+
+
+
+
+
+
Remember I said Python usually converted unicode to ASCII whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which
you can customize.
Example 9.15. sitecustomize.py
@@ -4233,57 +3959,19 @@ La Peña
This example assumes that you have made the changes listed in the previous example to your sitecustomize.py file, and restarted Python. If your default encoding still says 'ascii', you didn't set up your sitecustomize.py properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even
call sys.setdefaultencoding after Python has started up. Dig into site.py and search for “setdefaultencoding” to find out how.)
Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.
-
Example 9.17. Specifying encoding in .py files
-
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
-#!/usr/bin/env python
-# -*- coding: UTF-8 -*-
-
Now, what about XML? Well, every XML document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R
-is popular for Russian texts. The encoding, if specified, is in the header of the XML document.
-
Example 9.18. russiansample.xml
-<?xml version="1.0" encoding="koi8-r"?> ①
-<preface>
-<title>Предисловие</title> ②
-</preface>
-
-
This is a sample extract from a real Russian XML document; it's part of a Russian translation of this very book. Note the encoding, koi8-r, specified in the header.
-
These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded
- using the koi8-r encoding scheme, but they're being displayed in iso-8859-1.
-
Example 9.19. Parsing russiansample.xml
->>> from xml.dom import minidom
->>> xmldoc = minidom.parse('russiansample.xml')①
->>> title = xmldoc.getElementsByTagName('title')[0].firstChild.data
->>> title②
-u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
->>> print title③
-Traceback (innermost last):
- File "<interactive input>", line 1, in ?
-UnicodeError: ASCII encoding error: ordinal not in range(128)
->>> convertedtitle = title.encode('koi8-r')④
->>> convertedtitle
-'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
->>> print convertedtitle⑤
-Предисловие
-
-
I'm assuming here that you saved the previous example as russiansample.xml in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back
- to 'ascii' by removing your sitecustomize.py file, or at least commenting out the setdefaultencoding line.
-
Note that the text data of the title tag (now in the title variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section) -- the text data inside the
-XML document's title element is stored in unicode.
-
Printing the title is not possible, because this unicode string contains non-ASCII characters, so Python can't convert it to ASCII because that doesn't make sense.
-
You can, however, explicitly convert it to koi8-r, in which case you get a (regular, not unicode) string of single-byte characters (f0, d2, c5, and so forth) that are the koi8-r-encoded versions of the characters in the original unicode string.
-
Printing the koi8-r-encoded string will probably show gibberish on your screen, because your Python IDE is interpreting those characters as iso-8859-1, not koi8-r. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original
-XML document in a non-unicode-aware text editor. Python converted it from koi8-r into unicode when it parsed the XML document, and you've just converted it back.)
-
To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle
-in Python. If your XML documents are all 7-bit ASCII (like the examples in this chapter), you will literally never think about unicode. Python will convert the ASCII data in the XML documents into unicode while parsing, and auto-coerce it back to ASCII whenever necessary, and you'll never even notice. But if you need to deal with that in other languages, Python is ready.
-
Unicode Tutorial has some more examples of how to use Python's unicode functions, including how to force Python to coerce unicode into ASCII even when it doesn't really want to.
-
PEP 263 goes into more detail about how and when to define a character encoding in your .py files.
-
+
+
+(More Unicode stuff was here)
+
+
+
+
+
+
+
9.5. Searching for elements
Traversing XML documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within
your XML document, there is a shortcut you can use to find it quickly: getElementsByTagName.
diff --git a/index.html b/index.html
index 591707a..aec2105 100644
--- a/index.html
+++ b/index.html
@@ -8,20 +8,28 @@
-
+
+
+
+
+
Dive Into Python 3
+
Dive Into Python 3 will cover Python 3 and its differences from Python 2. Compared to the original Dive Into Python, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final version will be published on paper by Apress. The book will remain online under the CC-BY-SA-3.0 license.
+
You can see the full table of contents (not finalized), or read what I’ve written so far:
Chinese has thousands of characters. The Rotokas alphabet of Bougainville is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.
+
+
When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
+
+
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish.
+
+
Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.
+
+
As I mentioned, there are separate character encodings for each major language in the world, and a lot of minor ones. Since each language is different, and disk space has historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. ASCII, for instance, stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
+
+
Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), and so on. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
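
Here is a small sketch of those numbers in Python 3. It assumes your interpreter ships the standard ascii and cp1252 codecs (a stock install does); the variable names are made up for illustration.

```python
# ASCII: every English letter maps to a number below 128.
print(ord("A"), ord("a"))            # 65 97

# Encoding turns characters into bytes.
ascii_bytes = "Dive in".encode("ascii")
largest = max(ascii_bytes)           # highest byte value used in this string

# CP-1252 keeps ASCII for 0-127 and puts accented letters in 128-255.
n_tilde = "ñ".encode("cp1252")       # one byte
u_umlaut = "ü".encode("cp1252")      # one byte
print(list(n_tilde), list(u_umlaut))  # [241] [252]
```

Each accented letter still fits in a single byte, which is the whole point of a single-byte encoding.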
+
+
Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent.
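
To see the “same character, different bytes” problem for yourself, here is a rough sketch using two Chinese encodings that come with Python, gb2312 and big5. The choice of character and of these two codecs is mine, purely for illustration.

```python
ch = "中"  # one Chinese character

gb = ch.encode("gb2312")   # an encoding for simplified Chinese
big5 = ch.encode("big5")   # an encoding for traditional Chinese

print(len(gb), len(big5))  # each encoding spends two bytes on this character
print(gb == big5)          # but they do not agree on which two bytes

# Decoding with the matching codec gets the character back.
round_trip = gb.decode("gb2312")
```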
+
+
That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text” — your word processor had its own format with stored character encoding information, rich styling, and so on. Word processors were customized for each language, so they automatically used the most appropriate character encoding in the Russian edition and in the English edition and in the Spanish edition. People who read these documents were using the same word processing program as the original author, so everything worked, more or less.
+
+
Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of “plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse.
+
+
Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you’ve ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that’s hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn’t that sound fun?
+
+
Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means this character; poof, now you’re in Mac Greek mode, so 241 means some other character.) And of course you’ll want to search those documents, too.
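
The “241 means this, 241 means that” problem is easy to reproduce. This sketch decodes the same single byte twice, once as CP-1252 and once as KOI8-R; the variable names are hypothetical.

```python
raw = bytes([241])  # one byte, value 241 (0xF1)

as_western = raw.decode("cp1252")   # read it as Western European text
as_russian = raw.decode("koi8_r")   # read it as Russian text

print(as_western, as_russian)       # two entirely different letters
print(as_western == as_russian)     # False
```

The byte never changed. Only the key used to read it did.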
+
+
Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.”
+
+
+
+
Nothing below this line is really done yet. Thanks for reading this far! Stop now!
+
+
Unicode
+
+
Enter Unicode.
+
+
Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 2³²−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; Unicode data is never ambiguous.
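
A quick sketch of “one number per character” in Python 3, using the built-in ord and chr functions. The 4-byte storage shown here is the UTF-32 encoding, which is my choice of illustration; the sample characters are arbitrary.

```python
# ord() gives the Unicode number for a character; chr() goes the other way.
print(ord("A"))          # 65, same as ASCII
print(ord("ñ"))          # 241
cyrillic = chr(0x41F)    # a Cyrillic capital letter

# Some characters have numbers above 65535, so two bytes are not enough.
big = ord("\U0001F600")
print(big > 65535)       # True

# UTF-32 stores every character in exactly four bytes.
four = "A".encode("utf-32-be")
print(len(four))         # 4
```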
+
+
Right away, problems leap out at you. 4 bytes? For every single character‽ [FIXME incomplete paragraph]
+
+
Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph]
+
+
[FIXME stuff about UTF-32, UTF-16, and finally UTF-8]
+
+
Specifying character encoding in .py files
+
+
+
+
[FIXME this appears to be mostly the same in Python 3, except the default encoding is now UTF-8, not ASCII.]
+
+
If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual .py file by putting an encoding declaration at the top of each file. This declaration defines the .py file to be UTF-8:
+#!/usr/bin/env python
+# -*- coding: UTF-8 -*-
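
If you want to check which encoding Python will pick for a source file, the standard library's tokenize.detect_encoding function reads the declaration for you. The sketch below is only an illustration: the helper name declared_encoding and the three sample sources are made up, and it feeds bytes from memory instead of opening real files.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read this source code."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines_read = tokenize.detect_encoding(readline)
    return encoding

# Hypothetical file contents, as raw bytes.
with_utf8 = b"#!/usr/bin/env python\n# -*- coding: UTF-8 -*-\nprint('hi')\n"
with_koi8 = b"# -*- coding: koi8-r -*-\nprint('hi')\n"
undeclared = b"print('hi')\n"

print(declared_encoding(with_utf8))    # utf-8
print(declared_encoding(with_koi8))    # koi8-r
print(declared_encoding(undeclared))   # utf-8, the Python 3 default
```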
+
+
[FIXME maybe some examples here]
+
+
Formatting strings
+
+
[FIXME this is all completely different in Python 3. Cover the new way, then maybe show some examples from the old way? Or maybe not. Hey, maybe just point to the original "Dive Into Python".]
+
+
Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
+ to insert values into a string with the %s placeholder.
+
+
+>>> k = "uid"
+>>> v = "sa"
+>>> "%s=%s" % (k, v)①
+'uid=sa'
+
+
The whole expression evaluates to a string. The first %s is replaced by the value of k; the second %s is replaced by the value of v. All other characters in the string (in this case, the equal sign) stay as they are.
+
+
+
Note that (k, v) is a tuple. I told you they were good for something.
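
For comparison, here is a rough sketch of the same substitution using the str.format method, which is the newer style in Python 3. This is not a full tour of format, just the closest equivalent of the %s example; the variable names old_style, new_style, and by_name are mine.

```python
k = "uid"
v = "sa"

old_style = "%s=%s" % (k, v)                       # the % operator
new_style = "{0}={1}".format(k, v)                 # numbered replacement fields
by_name = "{key}={value}".format(key=k, value=v)   # named replacement fields

print(old_style, new_style, by_name)   # uid=sa uid=sa uid=sa
```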
+
+
You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
+string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
+
+
+>>> uid = "sa"
+>>> pwd = "secret"
+>>> print(pwd + " is not a good password for " + uid)①
+secret is not a good password for sa
+>>> print("%s is not a good password for %s" % (pwd, uid))②
+secret is not a good password for sa
+>>> userCount = 6
+>>> print("Users connected: %d" % (userCount, ))③④
+Users connected: 6
+>>> print("Users connected: " + userCount)⑤
+Traceback (most recent call last):
+ File "<stdin>", line 1, in <module>
+TypeError: can only concatenate str (not "int") to str
+
+
+ is the string concatenation operator.
+
In this trivial case, string formatting accomplishes the same result as concatenation.
+
(userCount, ) is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether (userCount) was a tuple with one element or just the value of userCount.
+
String formatting works with integers by specifying %d instead of %s.
+
Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works only when everything is already a string.
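
If you would rather stick with concatenation, you have to do the conversion yourself. A minimal sketch in Python 3 (the variable names are hypothetical):

```python
user_count = 6

# %s calls str() on whatever it is given, so an int is fine here.
formatted = "Users connected: %s" % (user_count,)

# Concatenation does no conversion; mixing str and int raises TypeError.
try:
    broken = "Users connected: " + user_count
except TypeError:
    broken = None

# Converting explicitly makes concatenation work.
joined = "Users connected: " + str(user_count)
print(formatted == joined)   # True
```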
+
+
+
As with printf in C, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
+
+
The %f string formatting option treats the value as a decimal, and prints it to six decimal places.
+
The ".2" modifier of the %f option rounds the value to two decimal places.
+
You can even combine modifiers. Adding the + modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding the value to exactly two decimal places.
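
Here is a short sketch of those three options in Python 3. The stock-price wording and the numbers are just sample data.

```python
price = 50.4625

six_places = "Today's stock price: %f" % price      # %f alone: six decimal places
two_places = "Today's stock price: %.2f" % price    # .2 modifier: two decimal places
signed = "Change since yesterday: %+.2f" % 1.5      # + modifier: explicit sign

print(six_places)   # Today's stock price: 50.462500
print(two_places)   # Today's stock price: 50.46
print(signed)       # Change since yesterday: +1.50

# The .2 modifier rounds; it does not simply chop off digits.
rounded = "%.2f" % 50.468
print(rounded)      # 50.47
```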
+
+
+
Common string operations
+
+
[FIXME is it worth keeping this section on joining lists / splitting strings? All the examples are from an old code sample that isn't used at all anymore.]
+
+
You have a list of key-value pairs in the form key=value, and you want to join them into a single string. To join any list of strings into a single string, use the join method of a string object.
+
+
Here is an example of joining a list from the buildConnectionString function:
+
+
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
+
+
One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
+is an object. You might have thought I meant that string variables are objects. But no, look closely at this example and you'll see that the string ";" itself is an object, and you are calling its join method.
+
The join method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
+
+
+
+
+>>> params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}
+>>> ["%s=%s" % (k, v) for k, v in params.items()]
+['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
+>>> ";".join(["%s=%s" % (k, v) for k, v in params.items()])
+'server=mpilgrim;uid=sa;database=master;pwd=secret'
+
+
This string is then returned from the buildConnectionString function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.
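
The delimiter really can be any string. A quick sketch, with made-up variable names, that also shows what to do when the list holds something other than strings:

```python
pairs = ["server=mpilgrim", "uid=sa", "database=master", "pwd=secret"]

with_semicolons = ";".join(pairs)
with_arrows = " -> ".join(pairs)           # a multi-character delimiter
with_nothing = "".join(["a", "b", "c"])    # an empty delimiter just glues the pieces together

# join only accepts strings, so convert other values first.
numbers = ", ".join(str(n) for n in [1, 2, 3])

print(with_arrows)
print(with_nothing, numbers)   # abc 1, 2, 3
```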
+
+
You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called split.
+
+
+>>> li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
+>>> s = ";".join(li)
+>>> s
+'server=mpilgrim;uid=sa;database=master;pwd=secret'
+>>> s.split(";")①
+['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
+>>> s.split(";", 1)②
+['server=mpilgrim', 'uid=sa;database=master;pwd=secret']
+
+
split reverses join by splitting a string into a multi-element list. Note that the delimiter (“;”) is stripped out completely; it does not appear in any of the elements of the returned list.
+
split takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
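
Put the two forms of split together and you can take a connection string apart again. This is only a sketch; the params dictionary and the loop are my own illustration, not part of odbchelper.py.

```python
s = "server=mpilgrim;uid=sa;database=master;pwd=secret"

params = {}
for pair in s.split(";"):
    key, value = pair.split("=", 1)   # split only on the first "="
    params[key] = value

print(params["server"], params["pwd"])   # mpilgrim secret

# The second argument matters when a value contains the delimiter itself.
tricky = "pwd=a=b".split("=", 1)
print(tricky)                            # ['pwd', 'a=b']
```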
+
+
+
+
+
The string module
+
+
[FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.]
+
+
When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see Python 2 code written either way, because Python 2 still offers the old string.join function. Python 3 removed that function from the string module, so there the method is the only spelling.
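
A small sketch of where things stand in Python 3, as far as I can tell: the string module is still there, but the old function versions of the string methods are not. If you really want join to look like a function, you can call it through the str type. The variable names are hypothetical.

```python
import string

li = ["server=mpilgrim", "uid=sa"]

as_method = ";".join(li)          # the usual spelling
as_function = str.join(";", li)   # the same method, called through the str type

has_old_join = hasattr(string, "join")   # the old string.join function
still_useful = string.ascii_lowercase    # one of the constants the module still provides

print(as_method == as_function)   # True
print(has_old_join)               # False on Python 3
```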
+
+