diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html index 7141079..3ce320c 100644 --- a/case-study-porting-chardet-to-python-3.html +++ b/case-study-porting-chardet-to-python-3.html @@ -19,26 +19,27 @@ mark{background:#ff8;font-weight:bold}

Words, words. They’re all we have to go on.
Rosencrantz and Guildenstern are Dead

  -

Diving in

+

Diving In

Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In Chapter 3, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.

I’d also like a pony.

A Unicode pony.

A Unipony, as it were.

I’ll settle for character encoding auto-detection. -

What is character encoding auto-detection?

+

What is Character Encoding Auto-Detection?

It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key. -

Isn’t that impossible?

+

Isn’t That Impossible?

In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.

In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings. -

Does such an algorithm exist?

+

Does Such An Algorithm Exist?

As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. Mozilla Firefox contains an encoding auto-detection library which is open source. I ported the library to Python 2 and dubbed it the chardet module. This chapter will take you step-by-step through the process of porting the chardet module from Python 2 to Python 3. -

Introducing the chardet module

+

Introducing The chardet Module

[FIXME download link, possibly on chardet.feedparser.org, possibly local]

Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself. +

The main entry point for the detection algorithm is universaldetector.py, which has one class, UniversalDetector. (You might think the main entry point is the detect function in chardet/__init__.py, but that’s really just a convenience function that creates a UniversalDetector object, calls it, and returns its result.)

There are 5 categories of encodings that UniversalDetector handles:

    @@ -48,18 +49,19 @@ mark{background:#ff8;font-weight:bold}
  1. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai).
  2. windows-1252, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
-

UTF-n with a BOM

+

UTF-n With A BOM

If the text starts with a BOM, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The BOM will tell us exactly which one; that’s what it’s for.) This is handled inline in UniversalDetector, which returns the result immediately without any further processing. -

Escaped encodings

+

Escaped Encodings

If the text contains a recognizable escape sequence that might indicate an escaped encoding, UniversalDetector creates an EscCharSetProber (defined in escprober.py) and feeds it the text.

EscCharSetProber creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in escsm.py). EscCharSetProber feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, EscCharSetProber immediately returns the positive result to UniversalDetector, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines. -

Multi-byte encodings

+

Multi-Byte Encodings

Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.
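As a rough illustration of the first decisions described so far, the sketch below checks for a BOM and then scans for high-bit bytes. It is not chardet's code: the function name first_pass, the result labels, and the encoding names are assumptions made for this example, and the escape-sequence check is left out entirely.

```python
import codecs
import re

# Longest BOMs first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# A bytes pattern, because the input is bytes, not str.
HIGH_BIT = re.compile(b"[\x80-\xff]")


def first_pass(data):
    """Hypothetical helper: classify raw bytes before any prober runs."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return ("bom", name)          # the BOM settles it immediately
    if HIGH_BIT.search(data):
        return ("high-bit", None)         # hand off to the probers
    return ("pure-ascii", None)           # nothing above 0x7F was seen


print(first_pass(b"\xef\xbb\xbfhello"))
print(first_pass(b"caf\xe9"))
```

The order of the BOM table matters. If UTF-16-LE were tested first, a UTF-32-LE file would be misreported, since both start with the bytes FF FE.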

The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.

Most of the multi-byte encoding probers are inherited from MultiByteCharSetProber (defined in mbcharsetprober.py), and simply hook up the appropriate state machine and distribution analyzer and let MultiByteCharSetProber do the rest of the work. MultiByteCharSetProber runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, MultiByteCharSetProber feeds the text to an encoding-specific distribution analyzer.

The distribution analyzers (each defined in chardistribution.py) use language-specific models of which characters are used most frequently. Once MultiByteCharSetProber has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, MultiByteCharSetProber returns the result to MBCSGroupProber, which returns it to UniversalDetector, which returns it to the caller.

The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the SJISProber (defined in sjisprober.py) also uses 2-character distribution analysis. SJISContextAnalysis and EUCJPContextAnalysis (both defined in jpcntx.py and both inheriting from a common JapaneseContextAnalysis class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to SJISProber, which checks both analyzers and returns the higher confidence level to MBCSGroupProber. -

Single-byte encodings

+

Single-Byte Encodings

+

The single-byte encoding prober, SBCSGroupProber (defined in sbcsgroupprober.py), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew).

SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
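To make the idea of tallying 2-character sequences concrete, here is a toy version. Everything in it is an assumption for illustration: the function name bigram_confidence, the tiny "model" (a set of frequent byte pairs), and the plain ratio used as the score. The real language models and the real confidence formula are more involved and include the language-specific distribution ratio mentioned above.

```python
def bigram_confidence(data, frequent_pairs):
    """Toy score: the share of adjacent byte pairs found in the model."""
    pairs = [data[i:i + 2] for i in range(len(data) - 1)]
    if not pairs:
        return 0.0                        # not enough text to judge
    hits = sum(1 for pair in pairs if pair in frequent_pairs)
    return hits / len(pairs)


# A made-up "language model": four byte pairs common in English text.
toy_model = {b"th", b"he", b"in", b"er"}

print(bigram_confidence(b"the", toy_model))
print(bigram_confidence(b"xq", toy_model))
```

Text made of pairs the model expects scores high; text made of pairs the model never sees scores low. That is the whole intuition, minus the tuning.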

Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew). @@ -567,8 +569,9 @@ RefactoringTool: Files that were modified: RefactoringTool: test.py

[FIXME explain the difference in import syntax]

Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work? -

Fixing what 2to3 can’t

+

Fixing What 2to3 Can’t

False is invalid syntax

+

Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.

C:\home\chardet> python test.py tests\*\*
 Traceback (most recent call last):
@@ -613,6 +616,7 @@ import sys

There are variations of this problem scattered throughout the chardet library. In some places it’s "import constants, sys"; in other places, it’s "import constants, re". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.

Onward!

Name 'file' is not defined

+

And here we go again, running test.py to try to execute our test cases…

C:\home\chardet> python test.py tests\*\*
 tests\ascii\howto.diveintomark.org.xml
@@ -654,6 +658,7 @@ TypeError: can't use a string pattern on a bytes-like object
. for line in open(f, 'rb'): u.feed(line) +

And here we find our answer: in the UniversalDetector.feed() method, aBuf is a line read from a file on disk. Look carefully at the parameters used to open the file: 'rb'. 'r' is for “read”; OK, big deal, we’re reading the file. Ah, but 'b' is for “binary.” Without the 'b' flag, this for loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to open(), but never mind that for now.) But with the 'b' flag, this for loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to UniversalDetector.feed(), and eventually gets passed to the pre-compiled regular expression, self._highBitDetector, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.

What we need this regular expression to search is not an array of characters, but an array of bytes.
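The mismatch is easy to reproduce outside chardet. In this sketch the sample line and the variable names are made up for illustration; only the behavior of the re module is the point.

```python
import re

line = b"caf\xe9\n"                       # a line read in 'rb' mode is bytes

# A pattern defined with a string can only search strings.
str_pattern = re.compile(r"[\x80-\xFF]")
error = None
try:
    str_pattern.search(line)
except TypeError as exc:
    error = exc                           # string pattern, bytes input

# A pattern defined with bytes can search bytes.
bytes_pattern = re.compile(b"[\x80-\xFF]")
match = bytes_pattern.search(line)

print(type(error).__name__)
print(match.group())
```

The only difference between the two patterns is the prefix on the literal, which is why the fix is so small once you see the problem.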

Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.) @@ -776,6 +781,7 @@ TypeError: unsupported operand type(s) for +: 'int' and 'bytes' self._mInputState = eEscAscii self._mLastChar = aBuf[-1] +

This error doesn't occur the first time the feed() method gets called; it occurs the second time, after self._mLastChar has been set to the last byte of aBuf. Well, what's the problem with that? Getting a single element from a byte array yields an integer, not a byte array. To see the difference, follow me to the interactive shell:

 >>> aBuf = b'\xEF\xBB\xBF'         
@@ -1115,7 +1121,7 @@ NameError: global name 'reduce' is not defined
total = reduce(operator.add, self._mFreqCounter)

The reduce() function takes two arguments — a function and a list (strictly speaking, any iterable object will do) — and applies the function cumulatively to each item of the list. In other words, this is a fancy and roundabout way of adding up all the items in a list and returning the result. -

This monstrosity was so common in Python 2 that Python 3 added a global sum() function. +

This monstrosity was so common that Python added a global sum() function.
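If you want to convince yourself that the two spellings agree, try them side by side. The list of counts here is invented for the example. Note that in Python 3, reduce() has to be imported from the functools module, which is why the original line raised a NameError.

```python
import operator
from functools import reduce              # no longer a builtin in Python 3

freq_counter = [3, 1, 4, 1, 5]            # made-up frequency counts

total_with_reduce = reduce(operator.add, freq_counter)
total_with_sum = sum(freq_counter)

print(total_with_reduce, total_with_sum)
```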

  def get_confidence(self):
       if self.get_state() == constants.eNotMe:
           return 0.01
diff --git a/dip3.css b/dip3.css
index f444c94..868b93d 100644
--- a/dip3.css
+++ b/dip3.css
@@ -62,11 +62,11 @@ abbr{font-variant:small-caps;text-transform:lowercase;letter-spacing:0.1em}
 .note{margin:3.5em 4.94em}
 .note span{display:block;float:left;font-size:xx-large;line-height:0.875;margin:0 0.22em 0 -1.22em}
 .c,pre,.w,.w a,.d{line-height:2.154}
-.f:first-letter{float:left;color:#ddd;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
-h1,h2,h3,p,ul,ol{margin:1.75em 0;font-size:medium}
+.f:first-letter{float:left;color:lightblue;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
+p,ul,ol{margin:1.75em 0;font-size:medium}
 
 /* basics */
-html{background:#fff;color:#000}
+html{background:#fff;color:#333}
 body{margin:1.75em 28px}
 form div{float:right}
 .c{text-align:center;margin:2.154em 0}
@@ -74,7 +74,7 @@ form div{float:right}
 /* links */
 a{text-decoration:none;border-bottom:1px dotted}
 a:hover{border-bottom:1px solid}
-a:link,.w a{color:#26c}
+a:link,.w a{color:steelblue}
 a:visited{color:#93c}
 .c a{color:inherit}
 
@@ -92,10 +92,19 @@ kbd{font-weight:bold}
 li ol,.q{margin:0}
 pre a,.w a,pre a:hover{border:0}
 
-/* headers */
-h1{background:PapayaWhip;width:100%} /* all hail PapayaWhip */
+/* headers and pullquotes */
+h1,h2,h3,aside{font-family:"Book Antiqua",Palatino,Georgia,serif}
+h1,h2,h3{font-variant:small-caps}
+h1,h2{letter-spacing:-1px}
+h1,h1 code{font-size:xx-large}
+h2,h2 code{font-size:x-large}
+h3,h3 code{font-size:large}
+h1{border-bottom:4px double;width:100%;margin:1em 0}
 h1:before{content:"Chapter " counter(h1) ". "}
 h1{counter-reset:h2}
 h2:before{counter-increment:h2;content:counter(h1) "." counter(h2) ". "}
-h2{counter-reset:h3}
+h2{counter-reset:h3;border-top:1px dotted;margin-top:1.75em;padding-top:1.75em}
+#toc + h2{border:0;margin:0;padding:0}
+#toc + h2:before{content:""}
 h3:before{counter-increment:h3;content:counter(h1) "." counter(h2) "." counter(h3) ". "}
+aside{display:block;float:right;font-style:oblique;font-size:xx-large;width:25%;margin:1.75em 0 .75em 1.75em;background:steelblue;color:white;padding:1.75em;border:1px solid;-moz-border-radius:1em;-webkit-border-radius:1em;border-radius:1em}
\ No newline at end of file
diff --git a/dip3.js b/dip3.js
index 465420f..76df991 100644
--- a/dip3.js
+++ b/dip3.js
@@ -100,6 +100,6 @@ function showTOC() {
 	toc += '';
 	level -= 1;
     }
-    toc = ' hide table of contents' + toc;
+    toc = ' hide table of contents
  1. Full table of contents
  2. ' + toc.substring(4); $("#toc").html(toc); } diff --git a/iterators-and-generators.html b/iterators-and-generators.html index 2ab0dca..a18154c 100644 --- a/iterators-and-generators.html +++ b/iterators-and-generators.html @@ -14,9 +14,10 @@ body{counter-reset:h1 11}

    East is East, and West is West, and never the twain shall meet.
    Rudyard Kipling

      -

    Diving in

    +

    Diving In

    Let’s talk about plural nouns. Also, functions that return other functions, advanced regular expressions, iterators, and generators. But first, let’s talk about how to make plural nouns. (If you haven’t read the chapter on regular expressions, now would be a good time. This chapter assumes you understand the basics of regular expressions, and quickly descends into more advanced uses.)

    English is a schizophrenic language that borrows from a lot of other languages, and the rules for making singular nouns into plural nouns are varied and complex. There are rules, and then there are exceptions to those rules, and then there are exceptions to the exceptions. +

    If you grew up in an English-speaking country or learned English in a formal school setting, you’re probably familiar with the basic rules:

    • If a word ends in S, X, or Z, add ES. Bass becomes basses, fax becomes faxes, and waltz becomes waltzes. @@ -27,7 +28,7 @@ body{counter-reset:h1 11}

      (I know, there are a lot of exceptions. Man becomes men and woman becomes women, but human becomes humans. Mouse becomes mice and louse becomes lice, but house becomes houses. Knife becomes knives and wife becomes wives, but lowlife becomes lowlifes. And don’t even get me started on words that are their own plural, like sheep, deer, and haiku.)

      Other languages, of course, are completely different.

      Let’s design a Python library that automatically pluralizes English nouns. We’ll start with just these four rules, but keep in mind that you’ll inevitably need to add more. -

      I know, let’s use regular expressions!

      +

      I Know, Let’s Use Regular Expressions!

      So you’re looking at words, which, at least in English, means you’re looking at strings of characters. You have rules that say you need to find different combinations of characters, then do different things to them. This sounds like a job for regular expressions!

      [download plural1.py]

      import re
      @@ -111,7 +112,7 @@ def plural(noun):
       

Regular expression substitutions are extremely powerful, and the \1 syntax makes them even more powerful. But combining the entire operation into one regular expression is also much harder to read, and it doesn’t directly map to the way you first described the pluralizing rules. You originally laid out rules like “if the word ends in S, X, or Z, then add ES”. If you look at this function, you have two lines of code that say “if the word ends in S, X, or Z, then add ES”. It doesn’t get much more direct than that. -

A list of functions

+

A List Of Functions

Now you’re going to add a level of abstraction. You started by defining a list of rules: if this, do that, otherwise go to the next rule. Let’s temporarily complicate part of the program so you can simplify another part. @@ -159,6 +160,7 @@ def plural(noun):

  • Since the rules have been broken out into a separate data structure, the new plural() function can be reduced to a few lines of code. Using a for loop, you can pull out the match and apply rules two at a time (one match, one apply) from the rules structure. On the first iteration of the for loop, matches_rule will get match_sxz, and apply_rule will get apply_sxz. On the second iteration (assuming you get that far), matches_rule will be assigned match_h, and apply_rule will be assigned apply_h. The function is guaranteed to return something eventually, because the final match rule (match_default) simply returns True, meaning the corresponding apply rule (apply_default) will always be applied. +

    The reason this technique works is that everything in Python is an object, including functions. The rules data structure contains functions — not names of functions, but actual function objects. When they get assigned in the for loop, then matches_rule and apply_rule are actual functions that you can call. On the first iteration of the for loop, this is equivalent to calling match_sxz(noun), and if it returns a match, calling apply_sxz(noun).
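    Here is a cut-down sketch of the idea, small enough to run on its own. It keeps only three of the rule pairs named above, and the regular expressions inside the functions are assumptions for this example rather than a copy of the chapter's listing.

```python
import re


def match_sxz(noun):
    return re.search('[sxz]$', noun)


def apply_sxz(noun):
    return re.sub('$', 'es', noun)


def match_h(noun):
    # Assumed pattern: an H that is not part of a silent or vowel ending.
    return re.search('[^aeioudgkprt]h$', noun)


def apply_h(noun):
    return re.sub('$', 'es', noun)


def match_default(noun):
    return True                           # the catch-all rule always matches


def apply_default(noun):
    return noun + 's'


# Function objects, not names of functions.
rules = ((match_sxz, apply_sxz),
         (match_h, apply_h),
         (match_default, apply_default))


def plural(noun):
    for matches_rule, apply_rule in rules:
        if matches_rule(noun):
            return apply_rule(noun)


print(plural('fax'), plural('coach'), plural('dog'))
```

    Adding a rule means adding one pair to the rules tuple; plural() itself never changes.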

    If this additional level of abstraction is confusing, try unrolling the function to see the equivalence. The entire for loop is equivalent to the following: @@ -188,7 +190,7 @@ def plural(noun):

    But this is really just a stepping stone to the next section. Let’s move on… -

    A list of patterns

    +

    A List Of Patterns

    Defining separate named functions for each match and apply rule isn’t really necessary. You never call them directly; you add them to the rules list and call them through there. Furthermore, each function follows one of two patterns. All the match functions call re.search(), and all the apply functions call re.sub(). Let’s factor out the patterns so that defining new rules can be easier. @@ -234,7 +236,7 @@ def build_match_and_apply_functions(pattern, search, replace):

  • Since the rules list is the same as the previous example (really, it is), it should come as no surprise that the plural() function hasn’t changed at all. It’s completely generic; it takes a list of rule functions and calls them in order. It doesn’t care how the rules are defined. In the previous example, they were defined as separate named functions. Now they are built dynamically by mapping the output of the build_match_and_apply_functions() function onto a list of raw strings. It doesn’t matter; the plural function still works the same way. -

    A file of patterns

    +

    A File Of Patterns

    You’ve factored out all the duplicate code and added enough abstractions so that the pluralization rules are defined in a list of strings. The next logical step is to take these strings and put them in a separate file, where they can be maintained separately from the code that uses them. @@ -325,7 +327,7 @@ def plural(noun):

    Since make_counter sets up an infinite loop, you could theoretically do this forever, and it would just keep incrementing x and spitting out values. But let’s look at more productive uses of generators instead. -

    A Fibonacci generator

    +

    A Fibonacci Generator

    [download fibonacci.py]

    def fib(max):
    @@ -339,6 +341,8 @@ def plural(noun):
     
  • b is the next number in the sequence, so assign that to a, but also calculate the next value (a + b) and assign that to b for later use. Note that this happens in parallel; if a is 3 and b is 5, then a, b = b, a + b will set a to 5 (the previous value of b) and b to 8 (the sum of the previous values of a and b). + +

    So you have a function that spits out successive Fibonacci numbers. Sure, you could do that with recursion, but this way is easier to read. Also, it works well with for loops.
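    For reference, a generator of this shape fits in a few lines. This is a sketch consistent with the description above (a and b advance in parallel, and the loop stops once a reaches max); treat the exact body as an assumption rather than a verbatim copy of fibonacci.py.

```python
def fib(max):
    a, b = 0, 1
    while a < max:
        yield a                           # hand one value back, then pause
        a, b = b, a + b                   # both sides use the old a and b


for n in fib(50):
    print(n, end=' ')
print()
```

    Because fib() is a generator, you can also collect its values with list(fib(50)) when you want them all at once.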

    @@ -351,7 +355,7 @@ def plural(noun):
     
  • Each time through the for loop, n gets a new value from the yield statement in fib(), and all you have to do is print it out. Once fib() runs out of numbers (a becomes bigger than max, which in this case is 1000), then the for loop exits gracefully. -

    A plural rule generator

    +

    A Plural Rule Generator

    Let’s go back to plural5.py and see how this version of the plural() function works. @@ -381,7 +385,7 @@ def plural(noun):

    In truth, generators are a special case of iterators. A function that yields values is a nice, compact way of building an iterator without building an iterator. Let me show you what I mean by that. -

    A Fibonacci iterator

    +

    A Fibonacci Iterator

    Remember the Fibonacci generator? Here it is as a built-from-scratch iterator: @@ -428,9 +432,10 @@ def plural(noun):

  • How does the for loop know when to stop? I’m glad you asked! When next(fib_iter) raises a StopIteration exception, the for loop will swallow the exception and gracefully exit. (Any other exception will pass through and be raised as usual.) And where have you seen a StopIteration exception? In the __next__() method, of course! -

    A plural rule iterator

    +

    A Plural Rule Iterator

    -

    Now it’s time for the finale… +

    +

    Now it’s time for the finale.

    [download plural6.py]

    class LazyRules:
    @@ -558,7 +563,7 @@ rules = LazyRules()
  • Separation of code and data. All the patterns are stored in a separate file. Code is code, and data is data, and never the twain shall meet. -

    Further reading

    +

    Further Reading

    • PEP 234: Iterators
    • PEP 255: Simple Generators diff --git a/native-datatypes.html b/native-datatypes.html index 8781014..82899f6 100644 --- a/native-datatypes.html +++ b/native-datatypes.html @@ -14,7 +14,7 @@ body{counter-reset:h1 2}

      Wonder is the foundation of all philosophy, inquiry its progress, ignorance its end.
      — Michel de Montaigne

        -

      Diving in

      +

      Diving In

      Cast aside your first Python program for just a minute, and let's talk about datatypes. In Python, every variable has a datatype, but you don't need to declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.

      Python has many native datatypes. Here are the important ones:

        @@ -29,6 +29,7 @@ body{counter-reset:h1 2}

        Of course, there are a lot more types than these seven. Everything is an object in Python, so there are types like module, function, class, method, file, and even compiled code. You've already seen some of these: modules have names, functions have docstrings, &c. You'll learn about classes in [FIXME xref] and files in [FIXME xref].

        Strings and bytes are important enough — and complicated enough — that they get their own chapter. Let's look at the others first.

        Booleans

        +

        Booleans are either true or false. Python has two constants, True and False, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like if statements), Python expects an expression to evaluate to a boolean value. These places are called boolean contexts. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)

        For example, take this snippet from humansize.py:

        if size < 0:
        @@ -60,7 +61,7 @@ body{counter-reset:h1 2}
         
      1. Adding an int to an int yields an int.
      2. Adding an int to a float yields a float. Python coerces the int into a float to perform the addition, then returns a float as the result.
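      The same two cases, written as a script instead of a shell session. The variable names are invented for this sketch.

```python
int_sum = 1 + 1                           # int + int stays an int
mixed_sum = 1 + 1.0                       # the int is coerced to a float

print(type(int_sum))
print(type(mixed_sum))
```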
      -

      Coercing integers to floats and vice-versa

      +

      Coercing Integers To Floats And Vice-Versa

      As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.

       >>> float(2)                
      @@ -86,7 +87,7 @@ body{counter-reset:h1 2}
       

      Python 2 had separate types for int and long. The int datatype was limited by sys.maxint, which varied by platform but was usually 2³¹-1 on 32-bit platforms. Python 3 has just one integer type, which behaves mostly like the old long type from Python 2. See PEP 237 for details.
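      A quick way to see the single integer type in action is to compute something far too big for a machine word. The variable names are made up for this sketch.

```python
import sys

big = 2 ** 64                             # would have been a long in Python 2
bigger = big * big                        # still just an int

print(type(big), type(bigger))
print(hasattr(sys, 'maxint'))             # sys.maxint is gone in Python 3
```

      There is still a sys.maxsize, but it describes the largest container size the platform supports, not a limit on integer values.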

      -

      Common numerical operations

      +

      Common Numerical Operations

      You can do all kinds of things with numbers.

       >>> 11 / 2      
      @@ -145,7 +146,8 @@ body{counter-reset:h1 2}
       
    • The math module has all the basic trigonometric functions, including sin(), cos(), tan(), and variants like asin().
    • Note, however, that Python does not have infinite precision. tan(π / 4) should return 1.0, not 0.99999999999999989. -

      Numbers in a boolean context

      +

      Numbers In A Boolean Context

      +

      You can use numbers in a boolean context, such as an if statement. Zero values are false, and non-zero values are true.

       >>> def is_it_true(anything):             
      @@ -183,7 +185,7 @@ body{counter-reset:h1 2}
       

      A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the ArrayList class, which can hold arbitrary objects and can expand dynamically as new items are added.

      -

      Creating a list

      +

      Creating A List

      Creating a list is easy: use square brackets to wrap a comma-separated list of values.

       >>> a_list = ['a', 'b', 'mpilgrim', 'z', 'example']  
      @@ -204,7 +206,8 @@ body{counter-reset:h1 2}
       
    • A negative index accesses items from the end of the list counting backwards. The last item of any non-empty list is always a_list[-1].
    • If the negative index is confusing to you, think of it this way: a_list[-n] == a_list[len(a_list) - n]. So in this list, a_list[-3] == a_list[5 - 3] == a_list[2]. -

      Slicing a list

      +

      Slicing A List

      +

      Once you've defined a list, you can get any part of it as a new list. This is called slicing the list.

       >>> a_list
      @@ -229,7 +232,7 @@ body{counter-reset:h1 2}
       
    • Similarly, if the right slice index is the length of the list, you can leave it out. So a_list[3:] is the same as a_list[3:5], because this list has five items. There is a pleasing symmetry here. In this five-item list, a_list[:3] returns the first 3 items, and a_list[3:] returns the last two items. In fact, a_list[:n] will always return the first n items, and a_list[n:] will return the rest, regardless of the length of the list.
    • If both slice indices are left out, all items of the list are included. But this is not the same as the original a_list variable. It is a new list that happens to have all the same items. a_list[:] is shorthand for making a complete copy of a list. -

      Adding items to a list

      +

      Adding Items To A List

      There are four ways to add items to a list.

       >>> a_list = ['a']
      @@ -265,7 +268,7 @@ body{counter-reset:h1 2}
       >>> a_list
       ['a', 'b', 'c', 'd', 'e', 'f', ['g', 'h', 'i']]
       >>> len(a_list)                     
      -4
      +7
       >>> a_list[-1]
       ['g', 'h', 'i']
        @@ -274,7 +277,7 @@ body{counter-reset:h1 2}
      1. On the other hand, the append() method takes a single argument, which can be any datatype. Here, you're calling the append() method with a single argument, a list of three items.
      2. If you start with a list of six items and append a list onto it, you end up with... a list of seven items. Why seven? Because the last item (which you just appended) is itself a list. Lists can contain any type of data, including other lists. That may be what you want, or it may not. But it's what you asked for, and it's what you got.
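      The difference is easiest to see when you run both methods against the same starting list. The list names here are invented for this sketch.

```python
appended = ['a', 'b', 'c', 'd', 'e', 'f']
appended.append(['g', 'h', 'i'])          # adds ONE item, which is a list

extended = ['a', 'b', 'c', 'd', 'e', 'f']
extended.extend(['g', 'h', 'i'])          # adds each of the three items

print(len(appended), appended[-1])
print(len(extended), extended[-1])
```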
      -

      Searching for values in a list

      +

      Searching For Values In A List

       >>> a_list = ['a', 'b', 'new', 'mpilgrim', 'new']
       >>> 'mpilgrim' in a_list      
      @@ -296,7 +299,8 @@ ValueError: list.index(x): x not in list
    • The index() method finds the first occurrence of a value in the list. In this case, 'new' occurs twice in the list, in a_list[2] and a_list[4], but the index() method will return only the index of the first occurrence.
    • As you might not expect, if the value is not found in the list, Python raises an exception. This is notably different from most languages, which will return some invalid index (like -1). While this may seem annoying at first, I think you will come to appreciate it. It means your program will crash at the source of the problem instead of failing strangely and silently later. -

      Lists in a boolean context

      +

      Lists In A Boolean Context

      +

      You can also use a list in a boolean context, such as an if statement.

       >>> def is_it_true(anything):
      @@ -322,7 +326,7 @@ ValueError: list.index(x): x not in list

      A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes always start with a % character. In Python, variables can be named anything, and Python keeps track of the datatype internally.

      -

      Creating a dictionary

      +

      Creating A Dictionary

      Creating a dictionary is easy. The syntax is similar to sets, but instead of values, you have key-value pairs. Once you have a dictionary, you can look up values by their key.

       >>> a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}  
      @@ -342,7 +346,7 @@ KeyError: 'db.diveintopython3.org'
    • 'database' is a key, and its associated value, referenced by a_dict["database"], is 'mysql'.
    • You can get values by key, but you can't get keys by value. So a_dict["server"] is 'db.diveintopython3.org', but a_dict["db.diveintopython3.org"] raises an exception, because 'db.diveintopython3.org' is not a key. -

      Modifying a dictionary

      +

      Modifying A Dictionary

      Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:

       >>> a_dict
      @@ -366,7 +370,7 @@ KeyError: 'db.diveintopython3.org'
    • Assigning a value to an existing dictionary key simply replaces the old value with the new one.
    • Will this change the value of the user key back to "mark"? No! Look at the key closely — that's a capital U in "User". Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it's completely different. -
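The three cases above (add, replace, and add-with-different-case) can be sketched in a few lines. The values assigned to the user key here are illustrative assumptions.

```python
a_dict = {'server': 'db.diveintopython3.org', 'database': 'mysql'}

a_dict['user'] = 'mark'   # new key: adds a key-value pair
a_dict['user'] = 'dora'   # existing key: replaces the old value
a_dict['User'] = 'mark'   # capital U: a different key, so another new pair

print(len(a_dict))
print(a_dict['user'], a_dict['User'])
```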

      Mixed-value dictionaries

      +

      Mixed-Value Dictionaries

      Dictionaries aren't just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.

      In fact, you've already seen a dictionary with non-string keys and values, in your first Python program.

      SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
      @@ -389,8 +393,9 @@ KeyError: 'db.diveintopython3.org'
    • Similarly, 1024 is a key in the SUFFIXES dictionary; its value is also a list of eight items.
    • Since SUFFIXES[1000] is a list, you can address individual items in the list by their 0-based index. -

      Dictionaries in a boolean context

      -

      You can also use a list in a boolean context, such as an if statement. +

      Dictionaries In A Boolean Context

      + +

      You can also use a dictionary in a boolean context, such as an if statement.

       >>> def is_it_true(anything):
       ...   if anything:
      @@ -427,7 +432,7 @@ KeyError: 'db.diveintopython3.org'
      >>> x == y
      True
    • -

      None in a boolean context

      +

      None In A Boolean Context

      In a boolean context, None is false and not None is true.

       >>> def is_it_true(anything):
      @@ -440,7 +445,7 @@ KeyError: 'db.diveintopython3.org'
      no, it's false
      >>> is_it_true(not None)
      yes, it's true
    • -

      Further reading

      +

      Further Reading

      • The fractions module
      • The math module
diff --git a/regular-expressions.html b/regular-expressions.html
index 518053f..6378c61 100644
--- a/regular-expressions.html
+++ b/regular-expressions.html
@@ -14,14 +14,14 @@ body{counter-reset:h1 4}

        Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.
        Jamie Zawinski

          -

        Diving in

        +

        Diving In

        Every modern programming language has built-in functions for working with strings. In Python, strings have methods for searching and replacing: index(), find(), split(), count(), replace(), &c. But these methods are limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace() and split() methods have the same limitations.

        If your goal can be accomplished with string methods, you should use them. They’re fast and simple and easy to read, and there’s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you’re chaining calls to split() and join() to slice-and-dice your strings, you may need to move up to regular expressions.

        Regular expressions are a powerful and (mostly) standardized way of searching, replacing, and parsing text with complex patterns of characters. Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.

        If you’ve used regular expressions in other languages (like Perl 5), Python’s syntax will be very familiar. Read the summary of the re module to get an overview of the available functions and their arguments.

        -

        Case study: street addresses

        +

        Case Study: Street Addresses

        This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don’t just make this stuff up; it’s actually useful.) This example shows how I approached the problem.

         >>> s = '100 NORTH MAIN ROAD'
        @@ -42,6 +42,7 @@ body{counter-reset:h1 4}
         
      • It’s time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the re module.
      • Take a look at the first parameter: 'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means “end of the string.” (There is a corresponding character, the caret ^, which means “beginning of the string.”) Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that’s part of the word BROAD, because that’s in the middle of s. +

        Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.

         >>> s = '100 BROAD'
        @@ -62,7 +63,7 @@ body{counter-reset:h1 4}
         
      • *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word 'ROAD' as a whole word by itself, but it wasn’t at the end, because the address had an apartment number after the street designation. Because 'ROAD' isn’t at the very end of the string, it doesn’t match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want.
      • To solve this problem, I removed the $ character and added another \b. Now the regular expression reads “match 'ROAD' when it’s a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle. -
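Here is a small sketch of the whole-word approach, run against the three kinds of address described in this section. The function name `abbreviate_road` is hypothetical, and the apartment-number address is an illustrative assumption.

```python
import re

def abbreviate_road(address):
    """Replace ROAD with RD. only where ROAD stands alone as a whole word."""
    # \b matches a word boundary; the raw string keeps the backslashes intact.
    return re.sub(r'\bROAD\b', 'RD.', address)

print(abbreviate_road('100 NORTH MAIN ROAD'))
print(abbreviate_road('100 BROAD'))
print(abbreviate_road('100 BROAD ROAD APT. 3'))
```

The ROAD inside BROAD is left alone because there is no word boundary between the B and the R.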

        Case study: Roman numerals

        +

        Case Study: Roman Numerals

        You’ve most likely seen Roman numerals, even if you didn’t recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It’s a system of representing numbers that really does date back to the ancient Roman empire (hence the name).

        In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.

          @@ -82,7 +83,7 @@ body{counter-reset:h1 4}
        • The fives characters cannot be repeated. The number 10 is always represented as X, never as VV. The number 100 is always C, never LL.
        • Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much. DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can’t subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
        -

        Checking for thousands

        +

        Checking For Thousands

        What would it take to validate that an arbitrary string is a valid Roman numeral? Let’s take it one digit at a time. Since Roman numerals are always written highest to lowest, let’s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.

         >>> import re
        @@ -104,7 +105,8 @@ body{counter-reset:h1 4}
         
      • 'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn’t end yet (because of the fourth M). So search() returns None.
      • Interestingly, an empty string also matches this regular expression, since all the M characters are optional. -

        Checking for hundreds

        +

        Checking For Hundreds

        +

        The hundreds place is more difficult than the thousands, because there are several mutually exclusive ways it could be expressed, depending on its value.

        • 100 = C @@ -150,7 +152,8 @@ body{counter-reset:h1 4}
        • Interestingly, an empty string still matches this pattern, because all the M characters are optional and ignored, and the empty string matches the D?C?C?C? pattern where all the characters are optional and ignored.

          Whew! See how quickly regular expressions can get nasty? And you’ve only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they’re exactly the same pattern. But let’s look at another way to express the pattern. -

          Using the {n,m} Syntax

          +

          Using The {n,m} Syntax

          +

          In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.

           >>> import re
          @@ -188,7 +191,7 @@ body{counter-reset:h1 4}
           
        • This matches the start of the string, then three M out of a possible three, then the end of the string.
        • This matches the start of the string, then three M out of a possible three, but then does not match the end of the string. The regular expression allows for up to only three M characters before the end of the string, but you have four, so the pattern does not match and returns None. -
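If you want to convince yourself that the two spellings behave the same, you can run both against the same inputs. This is a sketch; the helper `matches` and the sample list are assumptions made for illustration.

```python
import re

optional_form = '^M?M?M?$'   # three optional M characters
counted_form = '^M{0,3}$'    # anywhere from zero to three M characters

def matches(pattern, s):
    """Return True if pattern matches the string s."""
    return re.search(pattern, s) is not None

samples = ['', 'M', 'MM', 'MMM', 'MMMM']
for s in samples:
    print(repr(s), matches(optional_form, s), matches(counted_form, s))
```

Both patterns accept the empty string and reject four M characters.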

          Checking for tens and ones

          +

          Checking For Tens And Ones

          Now let’s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.

           >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
          @@ -209,6 +212,7 @@ body{counter-reset:h1 4}
           
        • This matches the start of the string, then the first optional M, then CM, then the optional L and all three optional X characters, then the end of the string. MCMLXXX is the Roman numeral representation of 1980.
        • This matches the start of the string, then the first optional M, then CM, then the optional L and all three optional X characters, then fails to match the end of the string because there is still one more X unaccounted for. So the entire pattern fails to match, and returns None. MCMLXXXX is not a valid Roman numeral. +

          The expression for the ones place follows the same pattern. I’ll spare you the details and show you the end result.

           >>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
          @@ -264,7 +268,8 @@ body{counter-reset:h1 4}
           
        • This matches the start of the string, then four of a possible four M, then D and three of a possible three C, then L and three of a possible three X, then V and three of a possible three I, then the end of the string.
        • This does not match. Why? Because it doesn’t have the re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can’t auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose. -
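The effect of the flag is easy to see with one verbose pattern searched twice. This sketch assumes the three-M thousands rule from the earlier sections; the inline comments in the pattern are my own wording.

```python
import re

# A verbose pattern: whitespace and # comments are ignored,
# but only when re.VERBOSE is passed.
pattern = '''
    ^                   # beginning of string
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
    '''

with_flag = re.search(pattern, 'MCMLXXXIX', re.VERBOSE)
without_flag = re.search(pattern, 'MCMLXXXIX')

print(with_flag is not None)
print(without_flag is not None)
```

Without the flag, the leading newline and spaces are treated as literal characters to match, so the search comes back empty.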

          Case study: parsing phone numbers

          +

          Case Study: Parsing Phone Numbers

          +

          So far you’ve concentrated on matching whole patterns. Either the pattern matches, or it doesn’t. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.

          This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company’s database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.

          Here are the phone numbers I needed to be able to accept:
diff --git a/strings.html b/strings.html
index d2ea691..b1de286 100644
--- a/strings.html
+++ b/strings.html
@@ -16,13 +16,15 @@ body{counter-reset:h1 3}
          My alphabet starts where your alphabet ends!
          — Dr. Seuss, On Beyond Zebra!

            -

          Some boring stuff you need to understand before you can dive in

          +

          Some Boring Stuff You Need To Understand Before You Can Dive In

          Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26 letters — 52 if you count uppercase and lowercase separately — plus a handful of !@#$%& punctuation marks.

          When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages.

          In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish. +

          +

          Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.

          There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, &c.) English has a very simple alphabet, so it can be completely expressed in less than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte. @@ -97,8 +99,9 @@ La Peña

        • -

          Diving in

          +

          Diving In

          +

          Let's take another look at humansize.py:

          [download humansize.py] @@ -135,7 +138,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

        • There's a… whoa, what the heck is that? -

          Formatting strings

          +

          Formatting Strings

          Python 3 supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is to insert a value into a string with a single placeholder. @@ -149,7 +152,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

        • There's a lot going on here. First, that's a method call on a string literal. Strings are objects, and objects have methods. Second, the whole expression evaluates to a string. Third, {0} and {1} are replacement fields, which are replaced by the arguments passed to the format() method. -

          Compound field names

          +

          Compound Field Names

          The previous example shows the simplest case, where the replacement fields are simply integers. Integer replacement fields are treated as positional indices into the argument list of the format() method. That means that {0} is replaced by the first argument (username in this case), {1} is replaced by the second argument (password), &c. You can have as many positional indices as you have arguments, and you can have as many arguments as you want. But replacement fields are much more powerful than that. @@ -166,6 +169,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

        • This looks complicated, but it's not. {0} would refer to the first argument passed to the format() method, si_suffixes. But si_suffixes is a list. So {0[0]} refers to the first item of the list which is the first argument passed to the format() method: 'KB'. Meanwhile, {0[1]} refers to the second item of the same list: 'MB'. Everything outside the curly braces — including 1000, the equals sign, and the spaces — is untouched. The final result is the string '1000KB = 1MB'. +
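Spelled out as a standalone snippet, the walkthrough above looks like this. The contents of `si_suffixes` are taken from the SUFFIXES dictionary shown earlier; the name `formatted` is only for illustration.

```python
si_suffixes = ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB']

# {0} is the first argument to format(); {0[0]} and {0[1]} index into it.
formatted = '1000{0[0]} = 1{0[1]}'.format(si_suffixes)

print(formatted)
```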

          What this example shows is that format specifiers can access items and properties of data structures using (almost) Python syntax. These are called compound field names. The following compound field names "just work":

            @@ -195,7 +199,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
          • sys.modules["humansize"].SUFFIXES[1000][0] is the first item of the list of SI suffixes: 'KB'. Therefore, the complete replacement field {0.modules[humansize].SUFFIXES[1000][0]} is replaced by the two-character string KB.
          -

          Format specifiers

          +

          Format Specifiers

          But wait! There's more! Let's take another look at that strange line of code from humansize.py: @@ -216,7 +220,7 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):

          For all the gory details on format specifiers, consult the Format Specification Mini-Language in the official Python documentation. -

          Other common string methods

          +

          Other Common String Methods

          Besides formatting, strings can do a number of other useful tricks. @@ -241,7 +245,7 @@ experience of years.

        • You can input multi-line strings in the Python interactive shell. Once you start a multi-line string with triple quotation marks, just hit ENTER and the interactive shell will prompt you to continue the string. Typing the closing triple quotation marks ends the string, and the next ENTER will execute the command (in this case, assigning the string to s).
        • The splitlines() method takes one multi-line string and returns a list of strings, one for each line of the original. Note that the carriage returns at the end of each line are not included.
        • The lower() method converts the entire string to lowercase. (Similarly, the upper() method converts a string to uppercase.) -
        • the count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence! +
        • The count() method counts the number of occurrences of a substring. Yes, there really are six “f”s in that sentence! -

          The string module

          +

          The string Module

          [FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.] @@ -326,9 +330,11 @@ is an object. You might have thought I meant that string variables are

          When I first learned Python, I expected join to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the join method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate string module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like lower, upper, and split. But many hard-core Python programmers objected to the new join method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old string module (which still has a lot of useful stuff in it). I use the new join method exclusively, but you will see code written either way, and if it really bothers you, you can use the old string.join function instead. -

          Strings vs. bytes

          +

          Strings vs. Bytes

          -

          Character encoding of Python source code

          +

          FIXME + +

          Character Encoding Of Python Source Code

          Python 3 assumes that your source code — i.e. each .py file — is encoded in UTF-8. @@ -347,7 +353,7 @@ is an object. You might have thought I meant that string variables are

          For more information, consult PEP 263: Defining Python Source Code Encodings. -

          Further reading

          +

          Further Reading

          On Unicode in Python:
diff --git a/unit-testing.html b/unit-testing.html
index 0a445d6..0523a81 100644
--- a/unit-testing.html
+++ b/unit-testing.html
@@ -14,7 +14,7 @@ body{counter-reset:h1 7}

          Certitude is not the test of certainty. We have been cocksure of many things that were not so.
          Oliver Wendell Holmes, Jr.

            -

          (Not) diving in

          +

          (Not) Diving In

          How do you know that the code you wrote yesterday still works after the changes you made today? Every seasoned programmer has war stories of an “innocent” change that couldn't possibly have affected that other “unrelated” module… If this sounds familiar, this chapter is for you.

          In this chapter, you're going to write and debug a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in “Case Study: Roman Numerals”. Now step back and consider what it would take to expand that into a two-way utility.

          The rules for Roman numerals lead to a number of interesting observations: @@ -37,7 +37,8 @@ body{counter-reset:h1 7}

        • When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (“But sir, all the unit tests passed when I checked it in...”)
        • When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.)
        -

        A single question

        +

        A Single Question

        +

        A test case answers a single question about the code it is testing. A test case should be able to...

        • ...run completely by itself, without any human input. Unit testing is about automation. @@ -126,6 +127,7 @@ if __name__ == "__main__":
        • Here you call the actual to_roman() function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the API for the to_roman() function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the API is different from that, this test is considered failed. Also notice that you are not trapping any exceptions when you call to_roman(). This is intentional. to_roman() shouldn't raise an exception when you call it with valid input, and these input values are all valid. If to_roman() raises an exception, this test is considered failed.
        • Assuming the to_roman() function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from to_roman() (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from to_roman() matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means to_roman() has passed this test. +

          Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong — your tests aren't testing your code at all! Write a test that fails, then code until it passes.

          # roman1.py
           
          @@ -215,7 +217,8 @@ OK
        • Hooray! The to_roman() function passes the “known values” test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.

          “Good” input? Hmm. What about bad input? -

          “Halt and catch fire”

          +

          “Halt And Catch Fire”

          +

          It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.

           >>> import roman1
          @@ -326,7 +329,7 @@ OK
          1. Hooray! Both tests pass. Because you worked iteratively, bouncing back and forth between testing and coding, you can be sure that the two lines of code you just wrote were the cause of that one test going from “fail” to “pass.” That kind of confidence doesn't come cheap, but it will pay for itself over the lifetime of your code.
          -

          More halting, more fire

          +

          More Halting, More Fire

          ... +

          Everything Is An Object

          In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.

          Run the interactive Python shell and follow along:

          @@ -145,7 +149,7 @@ if __name__ == "__main__":
           

          import in Python is like require in Perl. Once you import a Python module, you access its functions with module.function; once you require a Perl module, you access its functions with module::function.

          -

          The import search path

          +

          The import Search Path

          Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in sys.path. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)

           >>> import sys                       
          @@ -160,11 +164,11 @@ if __name__ == "__main__":
           
        • Actually, I lied; the truth is more complicated than that, because not all modules are stored as .py files. Some, like the sys module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The sys module is written in C.)
        • You can add a new directory to Python's search path at runtime by appending the directory name to sys.path, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about append() and other list methods in [FIXME xref-was-#datatypes].) -
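A minimal sketch of extending the search path at runtime. The directory name is a made-up example and does not need to exist for the append itself to work.

```python
import sys

# Hypothetical directory; substitute a real path on your own system.
extra_dir = '/home/mark/diveintopython3/examples'

# sys.path is an ordinary list, so the usual list methods apply.
if extra_dir not in sys.path:
    sys.path.append(extra_dir)

print(sys.path[-1])
```

The membership check keeps the same directory from piling up if this code runs more than once.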

          What's an object?

          +

          What's An Object?

          Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute __doc__, which returns the docstring defined in the function's source code. The sys module is an object which has (among other things) an attribute called path. And so forth.

          Still, this doesn't answer the more fundamental question: what is an object? Different programming languages define “object” in different ways. In some, it means that all objects must have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more on this in [FIXME xref-was-#apihelp]).
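As a rough illustration of the points above, here is a minimal sketch. The names `rough_size` and `call_with` are hypothetical, made up for this example and not taken from humansize.py; only `sys.path` and `__doc__` come from the text.

```python
import sys

def rough_size(size):
    """Return a rough size string (hypothetical stand-in function)."""
    return "{0:.1f} KB".format(size / 1000)

# A function is an object: it carries attributes such as __doc__ ...
doc = rough_size.__doc__

# ... it can be assigned to another variable ...
alias = rough_size

# ... and it can be passed as an argument to another function.
def call_with(func, value):
    return func(value)

result = call_with(alias, 4000)

# Modules are objects too; sys.path is an attribute that holds a list.
path_is_list = isinstance(sys.path, list)

print(doc)
print(result)
print(path_is_list)
```

Nothing here is special-cased for functions or modules: the same assignment and argument-passing syntax works for strings, lists, and every other value.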

          This is so important that I'm going to repeat it in case you missed it the first few times: everything in Python is an object. Strings are objects. Lists are objects. Functions are objects. Even modules are objects. -

          Indenting code

          +

          Indenting Code

          Python functions have no explicit begin or end, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (:) and the indentation of the code itself.

          
           def approximate_size(size, a_kilobyte_is_1024_bytes=True):  
          @@ -189,7 +193,8 @@ if __name__ == "__main__":
           

          Python uses line breaks to separate statements, and a colon plus indentation to separate code blocks. C++ and Java use semicolons to separate statements and curly braces to separate code blocks.
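To make the role of the colon and the indentation concrete, here is a small sketch. The function `describe` is a hypothetical example, not part of the chapter's code.

```python
def describe(number):
    # The colon ends the def line; the indented lines below form the body.
    if number < 0:
        # A deeper indent opens a nested block that belongs to the if.
        sign = "negative"
    else:
        sign = "non-negative"
    # Back at the function's indent level: still inside describe().
    return sign

# Back at column zero: the function body has ended.
labels = [describe(-5), describe(3)]
print(labels)
```

Moving the `return` line one level deeper would place it inside the `else` block, so the indentation is part of the program's meaning and not just its layout.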

          -

          Running scripts

          +

          Running Scripts

          +

          Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of humansize.py:

          
           if __name__ == "__main__":
          @@ -208,7 +213,7 @@ if __name__ == "__main__":
           c:\home\diveintopython3> c:\python30\python.exe humansize.py
           1.0 TB
           931.3 GiB
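The pattern described in this section, a block that runs only when the file is executed directly, can be sketched as follows. This is an illustrative stand-in under stated assumptions, not the actual contents of humansize.py; `rough_size` and `run_checks` are hypothetical names.

```python
def rough_size(size):
    """Hypothetical stand-in: convert a byte count to a rough size string."""
    return "{0:.1f} KB".format(size / 1000)

def run_checks():
    # Quick self-checks gathered in one place so they are easy to call.
    return [rough_size(1000), rough_size(2500)]

if __name__ == "__main__":
    # Runs only when this file is executed directly on the command line.
    # When the file is imported, __name__ holds the module's name instead.
    for line in run_checks():
        print(line)
```

Importing a file written this way gives you `rough_size` and `run_checks` without printing anything, while running it as a script prints the check results.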
          -

          Further reading

          +

          Further Reading