diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 20ff49f..3d4b7f3 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -44,7 +44,7 @@ body{counter-reset:h1 20}
bytes' object to str implicitly
-chardet: a mini-FAQ
When you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
charset parameter in the Content-type header.
-<meta http-equiv="content-type"> element in the <head> of a web page.
-encoding attribute in the XML prolog.
+charset parameter in the Content-type header.
+<meta http-equiv="content-type"> element in the <head> of a web page.
+encoding attribute in the XML prolog.
If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.) +
If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.)
Despite the complexity, it’s worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
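As a rough illustration of respecting explicit encoding information, here is a minimal sketch that reads the charset parameter from a Content-type header value using the standard library's email.message module. The helper name charset_from_content_type and the sample header values are made up for this example; a real HTTP client would also have to reconcile this value with whatever the document itself declares.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-type value, or default.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase,
    # or the fallback value when no charset parameter is present.
    return msg.get_content_charset(default)


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html", default="no explicit charset"))
```

Only when a lookup like this comes back empty, and the relevant standard offers no usable default, does auto-detection become worth considering.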
Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
@@ -676,7 +676,7 @@ TypeError: can't use a string pattern on a bytes-like object
class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(r'[\x80-\xFF]')
-This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255. +
This pre-compiles a regular expression designed to find non-ASCII characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII bytes in the range 128-255.
And therein lies the problem.
In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (u'') instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in universaldetector.py:
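The mismatch described above can be reproduced in a few lines. This is a small sketch under Python 3, not the chardet source itself: the variable names and the sample data are invented for the example, and compiling the pattern from a bytes literal is shown as one possible way to search a byte array.

```python
import re

# A pattern defined by a string can only search strings.
string_pattern = re.compile(r'[\x80-\xFF]')
# A pattern defined by a bytes literal can search byte arrays.
bytes_pattern = re.compile(b'[\x80-\xFF]')

data = b'caf\xc3\xa9'  # sample bytes containing two high-bit bytes

try:
    string_pattern.search(data)
    string_pattern_failed = False
except TypeError:
    # Python 3 refuses to mix a string pattern with bytes input.
    string_pattern_failed = True

match = bytes_pattern.search(data)
print(string_pattern_failed, match.group())
```

The two patterns look almost identical in source, which is why this kind of error is easy to miss until the code actually runs on bytes.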
diff --git a/dip2 b/dip2
index 933204f..fe207f6 100644
--- a/dip2
+++ b/dip2
@@ -6479,22 +6479,22 @@ numerals. You saw the mechanics of constructing and validating Roman numerals in
Given all of this, what would you expect out of a set of functions to convert to and from Roman numerals?
roman.py requirements
-toRoman should return the Roman numeral representation for all integers 1 to 3999.
+to_roman() should return the Roman numeral representation for all integers 1 to 3999.
-toRoman should fail when given an integer outside the range 1 to 3999.
+to_roman() should fail when given an integer outside the range 1 to 3999.
-toRoman should fail when given a non-integer number.
+to_roman() should fail when given a non-integer number.
-fromRoman should take a valid Roman numeral and return the number that it represents.
+from_roman() should take a valid Roman numeral and return the number that it represents.
-fromRoman should fail when given an invalid Roman numeral.
+from_roman() should fail when given an invalid Roman numeral.
fromRoman(toRoman(n)) == n for all n in 1..3999.
+ you started with. So from_roman(to_roman(n)) == n for all n in 1..3999.
-toRoman should always return a Roman numeral using uppercase letters.
+to_roman() should always return a Roman numeral using uppercase letters.
-fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
+from_roman() should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
The most fundamental part of unit testing is constructing individual test cases. A test case answers a single question about - the code it is testing. -
A test case should be able to... -
A test case answers a single question about the code it is testing. A test case should be able to...
Given that, let's build the first test case. You have the following requirement:
toRoman should return the Roman numeral representation for all integers 1 to 3999.
+to_roman() should return the Roman numeral representation for all integers 1 to 3999.
testToRomanKnownValues
@@ -6761,89 +6758,75 @@ class KnownValues(unittest.TestCase): ①
(3999, 'MMMCMXCIX')) ②
def testToRomanKnownValues(self): ③
- """toRoman should give known result with known input"""
+ """to_roman should give known result with known input"""
for integer, numeral in self.knownValues:
- result = roman.toRoman(integer) ④ ⑤
+ result = roman.to_roman(integer) ④ ⑤
self.assertEqual(numeral, result) ⑥
TestCase class of the unittest module. This class provides many useful methods which you can use in your test case to test specific conditions.
-toRoman function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you
- have now defined the API for the toRoman function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the
-API is different than that, this test is considered failed.
-toRoman. This is intentional. toRoman shouldn't raise an exception when you call it with valid input, and these input values are all valid. If toRoman raises an exception, this test is considered failed.
-toRoman function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check
- whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from toRoman (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from toRoman matches the known value you expect, assertEqual never raises an exception, so testToRomanKnownValues eventually exits normally, which means toRoman has passed this test.
It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. - And not just any sort of failure; they must fail in the way you expect. -
Remember the other requirements for toRoman:
+
It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect. +
Remember the other requirements for to_roman():
toRoman should fail when given an integer outside the range 1 to 3999.
+to_roman() should fail when given an integer outside the range 1 to 3999.
-toRoman should fail when given a non-integer number.
+to_roman() should fail when given a non-integer number.
In Python, functions indicate failure by raising exceptions, and the unittest module provides methods for testing whether a function raises a particular exception when given bad input.
-
toRoman
+Example 13.3. Testing bad input to to_roman()
class ToRomanBadInput(unittest.TestCase):
def testTooLarge(self):
- """toRoman should fail with large input"""
- self.assertRaises(roman.OutOfRangeError, roman.toRoman, 4000) ①
+ """to_roman should fail with large input"""
+ self.assertRaises(roman.OutOfRangeError, roman.to_roman, 4000) ①
def testZero(self):
- """toRoman should fail with 0 input"""
- self.assertRaises(roman.OutOfRangeError, roman.toRoman, 0) ②
+ """to_roman should fail with 0 input"""
+ self.assertRaises(roman.OutOfRangeError, roman.to_roman, 0) ②
def testNegative(self):
- """toRoman should fail with negative input"""
- self.assertRaises(roman.OutOfRangeError, roman.toRoman, -1)
+ """to_roman should fail with negative input"""
+ self.assertRaises(roman.OutOfRangeError, roman.to_roman, -1)
def testNonInteger(self):
- """toRoman should fail with non-integer input"""
- self.assertRaises(roman.NotIntegerError, roman.toRoman, 0.5) ③
+ """to_roman should fail with non-integer input"""
+ self.assertRaises(roman.NotIntegerError, roman.to_roman, 0.5) ③
- The
TestCase class of the unittest module provides the assertRaises method, which takes the following arguments: the exception you're expecting, the function you're testing, and the arguments
you're passing that function. (If the function you're testing takes more than one argument, pass them all to assertRaises, in order, and it will pass them right along to the function you're testing.) Pay close attention to what you're doing here:
- instead of calling toRoman directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us. All you do is give it the exception (roman.OutOfRangeError), the function (toRoman), and toRoman's arguments (4000), and assertRaises takes care of calling toRoman and checking to make sure that it raises roman.OutOfRangeError. (Also note that you're passing the toRoman function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned
+ instead of calling to_roman() directly and manually checking that it raises a particular exception (by wrapping it in a try...except block), assertRaises has encapsulated all of that for us. All you do is give it the exception (roman.OutOfRangeError), the function (to_roman()), and to_roman()'s arguments (4000), and assertRaises takes care of calling to_roman() and checking to make sure that it raises roman.OutOfRangeError. (Also note that you're passing the to_roman() function itself as an argument; you're not calling it, and you're not passing the name of it as a string. Have I mentioned
recently how handy it is that everything in Python is an object, including functions and exceptions?)
- Along with testing numbers that are too large, you need to test numbers that are too small. Remember, Roman numerals cannot
- express
0 or negative numbers, so you have a test case for each of those (testZero and testNegative). In testZero, you are testing that toRoman raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
- - Requirement #3 specifies that
toRoman cannot accept a non-integer number, so here you test to make sure that toRoman raises a roman.NotIntegerError exception when called with 0.5. If toRoman does not raise a roman.NotIntegerError, this test is considered failed.
-The next two requirements are similar to the first three, except they apply to fromRoman instead of toRoman:
+ express 0 or negative numbers, so you have a test case for each of those (testZero and testNegative). In testZero, you are testing that to_roman() raises a roman.OutOfRangeError exception when called with 0; if it does not raise a roman.OutOfRangeError (either because it returns an actual value, or because it raises some other exception), this test is considered failed.
+
- Requirement #3 specifies that
to_roman() cannot accept a non-integer number, so here you test to make sure that to_roman() raises a roman.NotIntegerError exception when called with 0.5. If to_roman() does not raise a roman.NotIntegerError, this test is considered failed.
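To make the comparison between assertRaises and a hand-written try...except block concrete, here is a self-contained sketch. Since roman.py has not been written at this point, the OutOfRangeError class and the to_roman() function below are hypothetical stand-ins that implement only the range check. The third test uses the context-manager form of assertRaises, which current versions of unittest also support.

```python
import unittest


class OutOfRangeError(ValueError):
    """Hypothetical stand-in for roman.OutOfRangeError."""


def to_roman(n):
    """Hypothetical stand-in: only the range check is implemented."""
    if not (0 < n < 4000):
        raise OutOfRangeError('number out of range (must be 1..3999)')
    return '(numeral)'  # placeholder; conversion is not the point here


class ToRomanBadInputSketch(unittest.TestCase):
    def test_callable_form(self):
        # Pass the function itself plus its arguments; do not call it.
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def test_manual_equivalent(self):
        # Roughly what assertRaises does on your behalf.
        try:
            to_roman(4000)
        except OutOfRangeError:
            pass
        else:
            self.fail('OutOfRangeError not raised')

    def test_context_manager_form(self):
        with self.assertRaises(OutOfRangeError):
            to_roman(0)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInputSketch)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

All three tests check the same behavior; the callable form is the one used throughout this chapter.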
+The next two requirements are similar to the first three, except they apply to from_roman() instead of to_roman():
-fromRoman should take a valid Roman numeral and return the number that it represents.
+from_roman() should take a valid Roman numeral and return the number that it represents.
-fromRoman should fail when given an invalid Roman numeral.
+from_roman() should fail when given an invalid Roman numeral.
Requirement #4 is handled in the same way as requirement #1, iterating through a sampling of known values and testing each in turn. Requirement #5 is handled in the same way as requirements
-#2 and #3, by testing a series of bad inputs and making sure fromRoman raises the appropriate exception.
-
Example 13.4. Testing bad input to fromRoman
+#2 and #3, by testing a series of bad inputs and making sure from_roman() raises the appropriate exception.
+Example 13.4. Testing bad input to from_roman()
class FromRomanBadInput(unittest.TestCase):
def testTooManyRepeatedNumerals(self):
- """fromRoman should fail with too many repeated numerals"""
+ """from_roman should fail with too many repeated numerals"""
for s in ('MMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'):
- self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s) ①
+ self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s) ①
def testRepeatedPairs(self):
- """fromRoman should fail with repeated pairs of numerals"""
+ """from_roman should fail with repeated pairs of numerals"""
for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
- self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)
+ self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s)
def testMalformedAntecedent(self):
- """fromRoman should fail with malformed antecedents"""
+ """from_roman should fail with malformed antecedents"""
for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
- self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, s)
+ self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, s)
-- Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to
toRoman. I will briefly note that you have another exception: roman.InvalidRomanNumeralError. That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError). You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter.
+ - Not much new to say about these; the pattern is exactly the same as the one you used to test bad input to
to_roman(). I will briefly note that you have another exception: roman.InvalidRomanNumeralError. That makes a total of three custom exceptions that will need to be defined in roman.py (along with roman.OutOfRangeError and roman.NotIntegerError). You'll see how to define these custom exceptions when you actually start writing roman.py, later in this chapter.
13.6. Testing for sanity
Often, you will find that a unit of code contains a set of reciprocal functions, usually in the form of conversion functions
where one converts A to B and the other converts B to A. In these cases, it is useful to create a “sanity check” to make sure that you can convert A to B and back to A without losing precision, incurring rounding errors, or triggering
@@ -6852,16 +6835,16 @@ class FromRomanBadInput(unittest.TestCase):
- If you take a number, convert it to Roman numerals, then convert that back to a number, you should end up with the number
- you started with. So
fromRoman(toRoman(n)) == n for all n in 1..3999.
+ you started with. So from_roman(to_roman(n)) == n for all n in 1..3999.
-Example 13.5. Testing toRoman against fromRoman
+Example 13.5. Testing to_roman() against from_roman()
class SanityCheck(unittest.TestCase):
def testSanity(self):
- """fromRoman(toRoman(n))==n for all n"""
+ """from_roman(to_roman(n))==n for all n"""
for integer in range(1, 4000): ① ②
- numeral = roman.toRoman(integer)
- result = roman.fromRoman(numeral)
+ numeral = roman.to_roman(integer)
+ result = roman.from_roman(numeral)
self.assertEqual(integer, result) ③
- You've seen the
range function before, but here it is called with two arguments, which returns a list of integers starting at the first argument (1) and counting consecutively up to but not including the second argument (4000). Thus, 1..3999, which is the valid range for converting to Roman numerals.
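The endpoints are easy to check interactively. One caveat if you run this under Python 3: there, range returns a lazy sequence object rather than a list, but it covers the same values and supports indexing, len() and membership tests. The variable name valid_range is made up for this sketch.

```python
valid_range = range(1, 4000)

# The first argument is included, the second is not.
print(valid_range[0], valid_range[-1], len(valid_range))
print(0 in valid_range, 4000 in valid_range)
```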
@@ -6870,41 +6853,41 @@ class SanityCheck(unittest.TestCase):
The last two requirements are different from the others because they seem both arbitrary and trivial:
-toRoman should always return a Roman numeral using uppercase letters.
+to_roman() should always return a Roman numeral using uppercase letters.
-fromRoman should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
+from_roman() should only accept uppercase Roman numerals (i.e. it should fail when given lowercase input).
-In fact, they are somewhat arbitrary. You could, for instance, have stipulated that fromRoman accept lowercase and mixed case input. But they are not completely arbitrary; if toRoman is always returning uppercase output, then fromRoman must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying
+
In fact, they are somewhat arbitrary. You could, for instance, have stipulated that from_roman() accept lowercase and mixed case input. But they are not completely arbitrary; if to_roman() is always returning uppercase output, then from_roman() must at least accept uppercase input, or the “sanity check” (requirement #6) would fail. The fact that it only accepts uppercase input is arbitrary, but as any systems integrator will tell you, case always matters, so it's worth specifying
the behavior up front. And if it's worth specifying, it's worth testing.
Example 13.6. Testing for case
class CaseCheck(unittest.TestCase):
def testToRomanCase(self):
- """toRoman should always return uppercase"""
+ """to_roman should always return uppercase"""
for integer in range(1, 4000):
- numeral = roman.toRoman(integer)
+ numeral = roman.to_roman(integer)
self.assertEqual(numeral, numeral.upper()) ①
def testFromRomanCase(self):
- """fromRoman should only accept uppercase input"""
+ """from_roman should only accept uppercase input"""
for integer in range(1, 4000):
- numeral = roman.toRoman(integer)
- roman.fromRoman(numeral.upper()) ② ③
+ numeral = roman.to_roman(integer)
+ roman.from_roman(numeral.upper()) ② ③
self.assertRaises(roman.InvalidRomanNumeralError,
- roman.fromRoman, numeral.lower()) ④
+ roman.from_roman, numeral.lower()) ④
- The most interesting thing about this test case is all the things it doesn't test. It doesn't test that the value returned
- from
toRoman is right or even consistent; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might
- be tempted to combine this with the sanity check, since both run through the entire range of values and call toRoman.
+ from to_roman() is right or even consistent; those questions are answered by separate test cases. You have a whole test case just to test for uppercase-ness. You might
+ be tempted to combine this with the sanity check, since both run through the entire range of values and call to_roman().
[6] But that would violate one of the fundamental rules: each test case should answer only a single question. Imagine that you combined this case check with the sanity check, and
then that test case failed. You would need to do further analysis to figure out which part of the test case failed to determine
what the problem was. If you need to analyze the results of your unit testing just to figure out what they mean, it's a sure
sign that you've mis-designed your test cases.
- - There's a similar lesson to be learned here: even though “you know” that
toRoman always returns uppercase, you are explicitly converting its return value to uppercase here to test that fromRoman accepts uppercase input. Why? Because the fact that toRoman always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always
+ - There's a similar lesson to be learned here: even though “you know” that
to_roman() always returns uppercase, you are explicitly converting its return value to uppercase here to test that from_roman() accepts uppercase input. Why? Because the fact that to_roman() always returns uppercase is an independent requirement. If you changed that requirement so that, for instance, it always
returned lowercase, the testToRomanCase test case would need to change, but this test case would still work. This was another of the fundamental rules: each test case must be able to work in isolation from any of the others. Every test case is an island.
- - Note that you're not assigning the return value of
fromRoman to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return
- value; it just tests that fromRoman accepts the uppercase input without raising an exception.
- - This is a complicated line, but it's very similar to what you did in the
ToRomanBadInput and FromRomanBadInput tests. You are testing to make sure that calling a particular function (roman.fromRoman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError). If it does (each time through the loop), the test passes; if even one time it does something else (like raises a different
+ - Note that you're not assigning the return value of
from_roman() to anything. This is legal syntax in Python; if a function returns a value but nobody's listening, Python just throws away the return value. In this case, that's what you want. This test case doesn't test anything about the return
+ value; it just tests that from_roman() accepts the uppercase input without raising an exception.
+ - This is a complicated line, but it's very similar to what you did in the
ToRomanBadInput and FromRomanBadInput tests. You are testing to make sure that calling a particular function (roman.from_roman) with a particular value (numeral.lower(), the lowercase version of the current Roman numeral in the loop) raises a particular exception (roman.InvalidRomanNumeralError). If it does (each time through the loop), the test passes; if even one time it does something else (like raising a different
exception, or returning a value without raising an exception at all), the test fails.
In the next chapter, you'll see how to write code that passes these tests.
@@ -6928,11 +6911,11 @@ class OutOfRangeError(RomanError): pass ②
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass ③
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
pass ④
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
pass
@@ -6940,69 +6923,70 @@ def fromRoman(s):
- This is how you define your own custom exceptions in Python. Exceptions are classes, and you create your own by subclassing existing exceptions. It is strongly recommended (but not
required) that you subclass
Exception, which is the base class that all built-in exceptions inherit from. Here I am defining RomanError (inherited from Exception) to act as the base class for all my other custom exceptions to follow. This is a matter of style; I could just as easily
have inherited each individual exception from the Exception class directly.
- - The
OutOfRangeError and NotIntegerError exceptions will eventually be used by toRoman to flag various forms of invalid input, as specified in ToRomanBadInput.
- - The
InvalidRomanNumeralError exception will eventually be used by fromRoman to flag invalid input, as specified in FromRomanBadInput.
+ - The
OutOfRangeError and NotIntegerError exceptions will eventually be used by to_roman() to flag various forms of invalid input, as specified in ToRomanBadInput.
+ - The
InvalidRomanNumeralError exception will eventually be used by from_roman() to flag invalid input, as specified in FromRomanBadInput.
- At this stage, you want to define the API of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word
pass.
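One practical consequence of the shared base class, sketched here under the assumption that the exception hierarchy matches the stub above: calling code can catch RomanError once instead of listing each custom exception. The describe() helper is invented for this illustration and is not part of roman.py.

```python
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass
class InvalidRomanNumeralError(RomanError): pass


def describe(exc_class):
    """Hypothetical helper: raise exc_class, catch it via the base class."""
    try:
        raise exc_class('demo')
    except RomanError as e:
        # Any of the three custom exceptions lands here.
        return type(e).__name__


for exc_class in (OutOfRangeError, NotIntegerError, InvalidRomanNumeralError):
    print(describe(exc_class))
```

Had each exception inherited from Exception directly, the except clause would need to name all three.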
Now for the big moment (drum roll please): you're finally going to run the unit test against this stubby little module. At
this point, every test case should fail. In fact, if any test case passes in stage 1, you should go back to romantest.py and re-evaluate why you coded a test so useless that it passes with do-nothing functions.
+
- At this stage, you want to define the API of each of your functions, but you don't want to code them yet, so you stub them out using the Python reserved word
pass.
Run romantest1.py with the -v command-line option, which will give more verbose output so you can see exactly what's going on as each test case runs.
With any luck, your output should look like this:
-
Example 14.2. Output of romantest1.py against roman1.py
fromRoman should only accept uppercase input ... ERROR
-toRoman should always return uppercase ... ERROR
-fromRoman should fail with malformed antecedents ... FAIL
-fromRoman should fail with repeated pairs of numerals ... FAIL
-fromRoman should fail with too many repeated numerals ... FAIL
-fromRoman should give known result with known input ... FAIL
-toRoman should give known result with known input ... FAIL
-fromRoman(toRoman(n))==n for all n ... FAIL
-toRoman should fail with non-integer input ... FAIL
-toRoman should fail with negative input ... FAIL
-toRoman should fail with large input ... FAIL
-toRoman should fail with 0 input ... FAIL
+Example 14.2. Output of romantest1.py against roman1.py
from_roman should only accept uppercase input ... ERROR
+to_roman should always return uppercase ... ERROR
+from_roman should fail with malformed antecedents ... FAIL
+from_roman should fail with repeated pairs of numerals ... FAIL
+from_roman should fail with too many repeated numerals ... FAIL
+from_roman should give known result with known input ... FAIL
+to_roman should give known result with known input ... FAIL
+from_roman(to_roman(n))==n for all n ... FAIL
+to_roman should fail with non-integer input ... FAIL
+to_roman should fail with negative input ... FAIL
+to_roman should fail with large input ... FAIL
+to_roman should fail with 0 input ... FAIL
======================================================================
-ERROR: fromRoman should only accept uppercase input
+ERROR: from_roman should only accept uppercase input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 154, in testFromRomanCase
- roman1.fromRoman(numeral.upper())
+ roman1.from_roman(numeral.upper())
AttributeError: 'None' object has no attribute 'upper'
======================================================================
-ERROR: toRoman should always return uppercase
+ERROR: to_roman should always return uppercase
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 148, in testToRomanCase
self.assertEqual(numeral, numeral.upper())
AttributeError: 'None' object has no attribute 'upper'
======================================================================
-FAIL: fromRoman should fail with malformed antecedents
+FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 133, in testMalformedAntecedent
- self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
+ self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with repeated pairs of numerals
+FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 127, in testRepeatedPairs
- self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
+ self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with too many repeated numerals
+FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 122, in testTooManyRepeatedNumerals
- self.assertRaises(roman1.InvalidRomanNumeralError, roman1.fromRoman, s)
+ self.assertRaises(roman1.InvalidRomanNumeralError, roman1.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should give known result with known input
+FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 99, in testFromRomanKnownValues
@@ -7011,7 +6995,7 @@ FAIL: fromRoman should give known result with known input
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None
======================================================================
-FAIL: toRoman should give known result with known input
+FAIL: to_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 93, in testToRomanKnownValues
@@ -7020,7 +7004,7 @@ FAIL: toRoman should give known result with known input
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: I != None
======================================================================
-FAIL: fromRoman(toRoman(n))==n for all n
+FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 141, in testSanity
@@ -7029,38 +7013,38 @@ FAIL: fromRoman(toRoman(n))==n for all n
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None
======================================================================
-FAIL: toRoman should fail with non-integer input
+FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 116, in testNonInteger
- self.assertRaises(roman1.NotIntegerError, roman1.toRoman, 0.5)
+ self.assertRaises(roman1.NotIntegerError, roman1.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError
======================================================================
-FAIL: toRoman should fail with negative input
+FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 112, in testNegative
- self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, -1)
+ self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError
======================================================================
-FAIL: toRoman should fail with large input
+FAIL: to_roman should fail with large input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 104, in testTooLarge
- self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 4000)
+ self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError
======================================================================
-FAIL: toRoman should fail with 0 input ①
+FAIL: to_roman should fail with 0 input ①
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage1\romantest1.py", line 108, in testZero
- self.assertRaises(roman1.OutOfRangeError, roman1.toRoman, 0)
+ self.assertRaises(roman1.OutOfRangeError, roman1.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError ②
@@ -7068,12 +7052,6 @@ AssertionError: OutOfRangeError ②
Ran 12 tests in 0.040s ③
FAILED (failures=10, errors=2) ④
-
-- Running the script runs
unittest.main(), which runs each test case, which is to say each method defined in each class within romantest.py. For each test case, it prints out the docstring of the method and whether that test passed or failed. As expected, none of the test cases passed.
- - For each failed test case,
unittest displays the trace information showing exactly what happened. In this case, the call to assertRaises (also called failUnlessRaises) raised an AssertionError because it was expecting toRoman to raise an OutOfRangeError and it didn't.
- - After the detail,
unittest displays a summary of how many tests were performed and how long it took.
- - Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass,
unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort
- of exception raised in the code you're testing or the unit test case itself. For instance, the testFromRomanCase method (“fromRoman should only accept uppercase input”) was an error, because the call to numeral.upper() raised an AttributeError exception, because toRoman was supposed to return a string but didn't. But testZero (“toRoman should fail with 0 input”) was a failure, because the call to fromRoman did not raise the InvalidRomanNumeral exception that assertRaises was looking for.
14.2. roman.py, stage 2
Now that you have the framework of the roman module laid out, it's time to start writing code and passing test cases.
Example 14.3. roman2.py
@@ -7103,7 +7081,7 @@ romanNumeralMap = (('M', 1000), ①
('IV', 4),
('I', 1))
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
result = ""
for numeral, integer in romanNumeralMap:
@@ -7112,7 +7090,7 @@ def toRoman(n):
n -= integer
return result
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
pass
@@ -7121,7 +7099,7 @@ def fromRoman(s):
- The character representations of the most basic Roman numerals. Note that this is not just the single-character Roman numerals;
- you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the toRoman code simpler later.
+ you're also defining two-character pairs like CM (“one hundred less than one thousand”); this will make the to_roman() code simpler later.
- The order of the Roman numerals. They are listed in descending value order, from
M all the way down to I.
@@ -7131,81 +7109,81 @@ def fromRoman(s):
- Here's where your rich data structure pays off, because you don't need any special logic to handle the subtraction rule.
To convert to Roman numerals, you simply iterate through romanNumeralMap looking for the largest integer value less than or equal to the input. Once found, you add the Roman numeral representation
to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
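The loop described above can be sketched as follows. This is a minimal sketch in Python 3 syntax (the chapter's own listings use Python 2), and it assumes the entries of romanNumeralMap that the diff elides hold the standard values.

```python
# Assumption: the map entries not visible in the diff are the standard ones.
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

def to_roman(n):
    """Convert an integer to a Roman numeral (no input checking yet)."""
    result = ""
    # The map is in descending order, so each pass consumes the largest
    # value that still fits into what is left of n.
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

print(to_roman(1424))  # MCDXXIV
```

Because CM, CD, XC and the other pairs are entries in the map, the subtraction rule needs no branch of its own: 400 is consumed as CD in one step, just like any single-character numeral.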
-Example 14.4. How toRoman works
-If you're not clear how toRoman works, add a print statement to the end of the while loop:
+Example 14.4. How to_roman() works
+If you're not clear how to_roman() works, add a print statement to the end of the while loop:
while n >= integer:
result += numeral
n -= integer
print 'subtracting', integer, 'from input, adding', numeral, 'to output'
>>> import roman2
->>> roman2.toRoman(1424)
+>>> roman2.to_roman(1424)
subtracting 1000 from input, adding M to output
subtracting 400 from input, adding CD to output
subtracting 10 from input, adding X to output
subtracting 10 from input, adding X to output
subtracting 4 from input, adding IV to output
'MCDXXIV'
-So toRoman appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely.
+So to_roman() appears to work, at least in this manual spot check. But will it pass the unit testing? Well no, not entirely.
Example 14.5. Output of romantest2.py against roman2.py
Remember to run romantest2.py with the -v command-line flag to enable verbose mode.
-fromRoman should only accept uppercase input ... FAIL
-toRoman should always return uppercase ... ok①
-fromRoman should fail with malformed antecedents ... FAIL
-fromRoman should fail with repeated pairs of numerals ... FAIL
-fromRoman should fail with too many repeated numerals ... FAIL
-fromRoman should give known result with known input ... FAIL
-toRoman should give known result with known input ... ok ②
-fromRoman(toRoman(n))==n for all n ... FAIL
-toRoman should fail with non-integer input ... FAIL ③
-toRoman should fail with negative input ... FAIL
-toRoman should fail with large input ... FAIL
-toRoman should fail with 0 input ... FAIL
+from_roman should only accept uppercase input ... FAIL
+to_roman should always return uppercase ... ok①
+from_roman should fail with malformed antecedents ... FAIL
+from_roman should fail with repeated pairs of numerals ... FAIL
+from_roman should fail with too many repeated numerals ... FAIL
+from_roman should give known result with known input ... FAIL
+to_roman should give known result with known input ... ok ②
+from_roman(to_roman(n))==n for all n ... FAIL
+to_roman should fail with non-integer input ... FAIL ③
+to_roman should fail with negative input ... FAIL
+to_roman should fail with large input ... FAIL
+to_roman should fail with 0 input ... FAIL
-toRoman does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already.
-- Here's the big news: this version of the
toRoman function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including
+ to_roman() does, in fact, always return uppercase, because romanNumeralMap defines the Roman numeral representations as uppercase. So this test passes already.
+- Here's the big news: this version of the
to_roman() function passes the known values test. Remember, it's not comprehensive, but it does put the function through its paces with a variety of good inputs, including
inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
- - However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to
+
- However, the function does not “work” for bad values; it fails every single bad input test. That makes sense, because you didn't include any checks for bad input. Those test cases look for specific exceptions to
be raised (via
assertRaises), and you're never raising them. You'll do that in the next stage.
Here's the rest of the output of the unit test, listing the details of all the failures. You're down to 10.
======================================================================
-FAIL: fromRoman should only accept uppercase input
+FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 156, in testFromRomanCase
- roman2.fromRoman, numeral.lower())
+ roman2.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with malformed antecedents
+FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 133, in testMalformedAntecedent
- self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
+ self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with repeated pairs of numerals
+FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 127, in testRepeatedPairs
- self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
+ self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with too many repeated numerals
+FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 122, in testTooManyRepeatedNumerals
- self.assertRaises(roman2.InvalidRomanNumeralError, roman2.fromRoman, s)
+ self.assertRaises(roman2.InvalidRomanNumeralError, roman2.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should give known result with known input
+FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 99, in testFromRomanKnownValues
@@ -7214,7 +7192,7 @@ FAIL: fromRoman should give known result with known input
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None
======================================================================
-FAIL: fromRoman(toRoman(n))==n for all n
+FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 141, in testSanity
@@ -7223,38 +7201,38 @@ FAIL: fromRoman(toRoman(n))==n for all n
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None
======================================================================
-FAIL: toRoman should fail with non-integer input
+FAIL: to_roman should fail with non-integer input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 116, in testNonInteger
- self.assertRaises(roman2.NotIntegerError, roman2.toRoman, 0.5)
+ self.assertRaises(roman2.NotIntegerError, roman2.to_roman, 0.5)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: NotIntegerError
======================================================================
-FAIL: toRoman should fail with negative input
+FAIL: to_roman should fail with negative input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 112, in testNegative
- self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, -1)
+ self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, -1)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError
======================================================================
-FAIL: toRoman should fail with large input
+FAIL: to_roman should fail with large input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 104, in testTooLarge
- self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 4000)
+ self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError
======================================================================
-FAIL: toRoman should fail with 0 input
+FAIL: to_roman should fail with 0 input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage2\romantest2.py", line 108, in testZero
- self.assertRaises(roman2.OutOfRangeError, roman2.toRoman, 0)
+ self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 0)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: OutOfRangeError
@@ -7262,7 +7240,7 @@ AssertionError: OutOfRangeError
Ran 12 tests in 0.320s
FAILED (failures=10)
14.3. roman.py, stage 3
-Now that toRoman behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else).
+Now that to_roman() behaves correctly with good input (integers from 1 to 3999), it's time to make it behave correctly with bad input (everything else).
Example 14.6. roman3.py
This file is available in py/roman/stage3/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
@@ -7290,7 +7268,7 @@ romanNumeralMap = (('M', 1000),
('IV', 4),
('I', 1))
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
if not (0 < n < 4000): ①
raise OutOfRangeError, "number out of range (must be 1..3999)" ②
@@ -7304,7 +7282,7 @@ def toRoman(n):
n -= integer
return result
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
pass
@@ -7315,75 +7293,75 @@ def fromRoman(s):
is never handled.
- This is the non-integer check. Non-integers can not be converted to Roman numerals.
- The rest of the function is unchanged.
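For readers following along on a current interpreter, the two checks can be sketched in Python 3 syntax, where `raise X, "msg"` becomes `raise X("msg")` and `<>` becomes `!=`. The exception hierarchy and the full contents of romanNumeralMap are not visible in this diff, so both are assumptions here.

```python
# Assumption: a small exception hierarchy; the real module may differ.
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

# Assumption: the elided map entries are the standard ones.
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

def to_roman(n):
    """Convert an integer to a Roman numeral, rejecting bad input."""
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

for bad in (4000, 0, -1, 1.5):
    try:
        to_roman(bad)
    except RomanError as exc:
        print(type(exc).__name__, "for", bad)
```

The range check runs first, so a value such as 4000.5 is reported as out of range rather than as a non-integer.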
-Example 14.7. Watching toRoman handle bad input
+Example 14.7. Watching to_roman() handle bad input
>>> import roman3
->>> roman3.toRoman(4000)
+>>> roman3.to_roman(4000)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
- File "roman3.py", line 27, in toRoman
+ File "roman3.py", line 27, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)
->>> roman3.toRoman(1.5)
+>>> roman3.to_roman(1.5)
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
- File "roman3.py", line 29, in toRoman
+ File "roman3.py", line 29, in to_roman
raise NotIntegerError, "non-integers can not be converted"
NotIntegerError: non-integers can not be converted
-Example 14.8. Output of romantest3.py against roman3.py
-fromRoman should only accept uppercase input ... FAIL
-toRoman should always return uppercase ... ok
-fromRoman should fail with malformed antecedents ... FAIL
-fromRoman should fail with repeated pairs of numerals ... FAIL
-fromRoman should fail with too many repeated numerals ... FAIL
-fromRoman should give known result with known input ... FAIL
-toRoman should give known result with known input ... ok ①
-fromRoman(toRoman(n))==n for all n ... FAIL
-toRoman should fail with non-integer input ... ok ②
-toRoman should fail with negative input ... ok ③
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+Example 14.8. Output of romantest3.py against roman3.py
+from_roman should only accept uppercase input ... FAIL
+to_roman should always return uppercase ... ok
+from_roman should fail with malformed antecedents ... FAIL
+from_roman should fail with repeated pairs of numerals ... FAIL
+from_roman should fail with too many repeated numerals ... FAIL
+from_roman should give known result with known input ... FAIL
+to_roman should give known result with known input ... ok ①
+from_roman(to_roman(n))==n for all n ... FAIL
+to_roman should fail with non-integer input ... ok ②
+to_roman should fail with negative input ... ok ③
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
-toRoman still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything.
-- More exciting is the fact that all of the bad input tests now pass. This test,
testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to toRoman, the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for.
+ to_roman() still passes the known values test, which is comforting. All the tests that passed in stage 2 still pass, so the latest code hasn't broken anything.
+- More exciting is the fact that all of the bad input tests now pass. This test,
testNonInteger, passes because of the int(n) <> n check. When a non-integer is passed to to_roman(), the int(n) <> n check notices it and raises the NotIntegerError exception, which is what testNonInteger is looking for.
- This test,
testNegative, passes because of the not (0 < n < 4000) check, which raises an OutOfRangeError exception, which is what testNegative is looking for.
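To see why these tests now pass, here is a hedged sketch of what the bad-input test cases might look like in Python 3. The test class name, the method names, and the exception hierarchy are illustrative assumptions, not the contents of romantest3.py.

```python
import unittest

# Assumptions: exception classes and map contents mirror the chapter's
# description; they are not copied from the real roman3.py.
class RomanError(Exception): pass
class OutOfRangeError(RomanError): pass
class NotIntegerError(RomanError): pass

romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

def to_roman(n):
    if not (0 < n < 4000):
        raise OutOfRangeError("number out of range (must be 1..3999)")
    if int(n) != n:
        raise NotIntegerError("non-integers can not be converted")
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        self.assertRaises(OutOfRangeError, to_roman, 4000)

    def test_zero(self):
        """to_roman should fail with 0 input"""
        self.assertRaises(OutOfRangeError, to_roman, 0)

    def test_negative(self):
        """to_roman should fail with negative input"""
        self.assertRaises(OutOfRangeError, to_roman, -1)

    def test_non_integer(self):
        """to_roman should fail with non-integer input"""
        self.assertRaises(NotIntegerError, to_roman, 0.5)

# Run the suite directly instead of calling unittest.main(), which exits.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

assertRaises takes the expected exception, the callable, and the arguments to pass to it; it calls the function itself, which is why the tests pass to_roman rather than to_roman(4000).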
======================================================================
-FAIL: fromRoman should only accept uppercase input
+FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 156, in testFromRomanCase
- roman3.fromRoman, numeral.lower())
+ roman3.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with malformed antecedents
+FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 133, in testMalformedAntecedent
- self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
+ self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with repeated pairs of numerals
+FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 127, in testRepeatedPairs
- self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
+ self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with too many repeated numerals
+FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 122, in testTooManyRepeatedNumerals
- self.assertRaises(roman3.InvalidRomanNumeralError, roman3.fromRoman, s)
+ self.assertRaises(roman3.InvalidRomanNumeralError, roman3.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should give known result with known input
+FAIL: from_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 99, in testFromRomanKnownValues
@@ -7392,7 +7370,7 @@ FAIL: fromRoman should give known result with known input
raise self.failureException, (msg or '%s != %s' % (first, second))
AssertionError: 1 != None
======================================================================
-FAIL: fromRoman(toRoman(n))==n for all n
+FAIL: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage3\romantest3.py", line 141, in testSanity
@@ -7405,14 +7383,14 @@ Ran 12 tests in 0.401s
FAILED (failures=6) ①
-- You're down to 6 failures, and all of them involve
fromRoman: the known values test, the three separate bad input tests, the case check, and the sanity check. That means that toRoman has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that fromRoman be written, which it isn't yet.) Which means that you must stop coding toRoman now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard.
+ - You're down to 6 failures, and all of them involve
from_roman(): the known values test, the three separate bad input tests, the case check, and the sanity check. That means that to_roman() has passed all the tests it can pass by itself. (It's involved in the sanity check, but that also requires that from_roman() be written, which it isn't yet.) Which means that you must stop coding to_roman() now. No tweaking, no twiddling, no extra checks “just in case”. Stop. Now. Back away from the keyboard.

The most important thing that comprehensive unit testing can tell you is when to stop coding. When all the unit tests for
a function pass, stop coding the function. When all the unit tests for an entire module pass, stop coding the module.
14.4. roman.py, stage 4
-Now that toRoman is done, it's time to start coding fromRoman. Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than
- the toRoman function.
+Now that to_roman() is done, it's time to start coding from_roman(). Thanks to the rich data structure that maps individual Roman numerals to integer values, this is no more difficult than
+ the to_roman() function.
Example 14.9. roman4.py
This file is available in py/roman/stage4/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
@@ -7440,9 +7418,9 @@ romanNumeralMap = (('M', 1000),
('IV', 4),
('I', 1))
-# toRoman function omitted for clarity (it hasn't changed)
+# to_roman function omitted for clarity (it hasn't changed)
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
result = 0
index = 0
@@ -7453,16 +7431,16 @@ def fromRoman(s):
return result
-- The pattern here is the same as
toRoman. You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
+ - The pattern here is the same as
to_roman(). You iterate through your Roman numeral data structure (a tuple of tuples), and instead of matching the highest integer
values as often as possible, you match the “highest” Roman numeral character strings as often as possible.
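The same pattern in reverse can be sketched like this, in Python 3 syntax. Only the while loop is visible in this diff, so the enclosing for loop and the full map contents are assumptions based on the description above.

```python
# Assumption: the elided map entries are the standard ones.
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

def from_roman(s):
    """Convert a Roman numeral to an integer (no validation yet)."""
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        # Consume this numeral as many times as it appears at the
        # current position, then move on to the next smaller one.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

print(from_roman("MCMLXXII"))  # 1972
```

Nothing here rejects malformed input: a string such as "IIII" is happily read as 4, which is consistent with the bad-input tests for from_roman still failing at this stage.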
-Example 14.10. How fromRoman works
-If you're not clear how fromRoman works, add a print statement to the end of the while loop:
+Example 14.10. How from_roman() works
+If you're not clear how from_roman() works, add a print statement to the end of the while loop:
while s[index:index+len(numeral)] == numeral:
result += integer
index += len(numeral)
print 'found', numeral, 'of length', len(numeral), ', adding', integer
>>> import roman4
->>> roman4.fromRoman('MCMLXXII')
+>>> roman4.from_roman('MCMLXXII')
found M , of length 1, adding 1000
found CM , of length 2, adding 900
found L , of length 1, adding 50
@@ -7470,56 +7448,56 @@ found X , of length 1, adding 10
found X , of length 1, adding 10
found I , of length 1, adding 1
found I , of length 1, adding 1
-1972
-Example 14.11. Output of romantest4.py against roman4.py
-fromRoman should only accept uppercase input ... FAIL
-toRoman should always return uppercase ... ok
-fromRoman should fail with malformed antecedents ... FAIL
-fromRoman should fail with repeated pairs of numerals ... FAIL
-fromRoman should fail with too many repeated numerals ... FAIL
-fromRoman should give known result with known input ... ok ①
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok②
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+1972
+Example 14.11. Output of romantest4.py against roman4.py
+from_roman should only accept uppercase input ... FAIL
+to_roman should always return uppercase ... ok
+from_roman should fail with malformed antecedents ... FAIL
+from_roman should fail with repeated pairs of numerals ... FAIL
+from_roman should fail with too many repeated numerals ... FAIL
+from_roman should give known result with known input ... ok ①
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok②
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
-- Two pieces of exciting news here. The first is that
fromRoman works for good input, at least for all the known values you test.
- - The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both
toRoman and fromRoman work properly for all possible good values. (This is not guaranteed; it is theoretically possible that toRoman has a bug that produces the wrong Roman numeral for some particular set of inputs, and that fromRoman has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that toRoman generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write
+ - Two pieces of exciting news here. The first is that
from_roman() works for good input, at least for all the known values you test.
+ - The second is that the sanity check also passed. Combined with the known values tests, you can be reasonably sure that both
to_roman() and from_roman() work properly for all possible good values. (This is not guaranteed; it is theoretically possible that to_roman() has a bug that produces the wrong Roman numeral for some particular set of inputs, and that from_roman() has a reciprocal bug that produces the same wrong integer values for exactly that set of Roman numerals that to_roman() generated incorrectly. Depending on your application and your requirements, this possibility may bother you; if so, write
more comprehensive test cases until it doesn't bother you.)
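One way to act on that advice is to run the round trip over every valid input and pair it with a few independently known values, since the round trip alone cannot catch a pair of reciprocal bugs. The sketch below uses Python 3 syntax and assumes the standard map contents. The known values are hand-checked examples (two taken from this chapter's sessions, two from the inputs it names as extremes), not the test data from romantest4.py.

```python
# Assumption: the elided map entries are the standard ones.
romanNumeralMap = (
    ("M", 1000), ("CM", 900), ("D", 500), ("CD", 400),
    ("C", 100), ("XC", 90), ("L", 50), ("XL", 40),
    ("X", 10), ("IX", 9), ("V", 5), ("IV", 4), ("I", 1),
)

def to_roman(n):
    result = ""
    for numeral, integer in romanNumeralMap:
        while n >= integer:
            result += numeral
            n -= integer
    return result

def from_roman(s):
    result = 0
    index = 0
    for numeral, integer in romanNumeralMap:
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result

# Sanity check: every valid integer must survive the round trip.
mismatches = [n for n in range(1, 4000) if from_roman(to_roman(n)) != n]

# Known values anchor the round trip to independently checked answers.
known_values = ((1424, "MCDXXIV"), (1972, "MCMLXXII"),
                (3888, "MMMDCCCLXXXVIII"), (3999, "MMMCMXCIX"))
wrong = [(n, s) for n, s in known_values
         if to_roman(n) != s or from_roman(s) != n]

print(len(mismatches), len(wrong))  # 0 0
```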
======================================================================
-FAIL: fromRoman should only accept uppercase input
+FAIL: from_roman should only accept uppercase input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 156, in testFromRomanCase
- roman4.fromRoman, numeral.lower())
+ roman4.from_roman, numeral.lower())
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with malformed antecedents
+FAIL: from_roman should fail with malformed antecedents
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 133, in testMalformedAntecedent
- self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
+ self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with repeated pairs of numerals
+FAIL: from_roman should fail with repeated pairs of numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 127, in testRepeatedPairs
- self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
+ self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
======================================================================
-FAIL: fromRoman should fail with too many repeated numerals
+FAIL: from_roman should fail with too many repeated numerals
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage4\romantest4.py", line 122, in testTooManyRepeatedNumerals
- self.assertRaises(roman4.InvalidRomanNumeralError, roman4.fromRoman, s)
+ self.assertRaises(roman4.InvalidRomanNumeralError, roman4.from_roman, s)
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
@@ -7527,9 +7505,9 @@ AssertionError: InvalidRomanNumeralError
Ran 12 tests in 1.222s
FAILED (failures=4)
14.5. roman.py, stage 5
-Now that fromRoman works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input.
+
Now that from_roman() works properly with good input, it's time to fit in the last piece of the puzzle: making it work properly with bad input.
That means finding a way to look at a string and determine if it's a valid Roman numeral. This is inherently more difficult
- than validating numeric input in toRoman, but you have a powerful tool at your disposal: regular expressions.
+ than validating numeric input in to_roman(), but you have a powerful tool at your disposal: regular expressions.
If you're not familiar with regular expressions and didn't read Chapter 7, Regular Expressions, now would be a good time.
As you saw in Section 7.3, “Case Study: Roman Numerals”, there are several simple rules for constructing a Roman numeral, using the letters M, D, C, L, X, V, and I. Let's review the rules:
@@ -7573,7 +7551,7 @@ romanNumeralMap = (('M', 1000),
('IV', 4),
('I', 1))
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
if not (0 < n < 4000):
raise OutOfRangeError, "number out of range (must be 1..3999)"
@@ -7590,7 +7568,7 @@ def toRoman(n):
#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' ①
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
if not re.search(romanNumeralPattern, s):②
raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
@@ -7610,18 +7588,18 @@ def fromRoman(s):
At this point, you are allowed to be skeptical that that big ugly regular expression could possibly catch all the types of
invalid Roman numerals. But don't take my word for it, look at the results:
Example 14.13. Output of romantest5.py against roman5.py
-fromRoman should only accept uppercase input ... ok ①
-toRoman should always return uppercase ... ok
-fromRoman should fail with malformed antecedents ... ok ②
-fromRoman should fail with repeated pairs of numerals ... ok ③
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ok
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+from_roman should only accept uppercase input ... ok ①
+to_roman should always return uppercase ... ok
+from_roman should fail with malformed antecedents ... ok ②
+from_roman should fail with repeated pairs of numerals ... ok ③
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ok
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 12 tests in 2.864s
@@ -7630,7 +7608,7 @@ OK ④
- One thing I didn't mention about regular expressions is that, by default, they are case-sensitive. Since the regular expression
romanNumeralPattern was expressed in uppercase characters, the
re.search check will reject any input that isn't completely uppercase. So the uppercase input test passes.
- - More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like
MCMC. As you've seen, this does not match the regular expression, so fromRoman raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes.
+ - More importantly, the bad input tests pass. For instance, the malformed antecedents test checks cases like
MCMC. As you've seen, this does not match the regular expression, so from_roman() raises an InvalidRomanNumeralError exception, which is what the malformed antecedents test case is looking for, so the test passes.
- In fact, all the bad input tests pass. This regular expression catches everything you could think of when you made your test
cases.
- And the anticlimax award of the year goes to the word “
OK”, which is printed by the unittest module when all the tests pass.
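If you would rather poke at the pattern yourself than trust the test output, here is a minimal sketch in Python 3 syntax. The pattern is copied from the stage 5 listing; the names ROMAN_PATTERN and looks_valid are hypothetical and exist only for this illustration. Note the last call: the pattern also accepts an empty string, because every piece of it is optional.

```python
import re

# Pattern copied from romanNumeralPattern in the stage 5 listing.
ROMAN_PATTERN = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'


def looks_valid(s):
    """Return True if s matches the stage 5 pattern (hypothetical helper)."""
    return re.search(ROMAN_PATTERN, s) is not None


if __name__ == '__main__':
    print(looks_valid('MCMXCIX'))  # True: a well-formed numeral
    print(looks_valid('mcmxcix'))  # False: matching is case-sensitive by default
    print(looks_valid('MCMC'))     # False: malformed antecedent
    print(looks_valid(''))         # True: every part of the pattern is optional
```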
@@ -7642,7 +7620,7 @@ OK ④
15.1. Handling bugs
Despite your best efforts to write comprehensive unit tests, bugs happen. What do I mean by “bug”? A bug is a test case you haven't written yet.
Example 15.1. The bug
>>> import roman5
->>> roman5.fromRoman("") ①
+>>> roman5.from_roman("") ①
0
- Remember in the previous section when you kept seeing that an empty string would match the regular expression you were using to check for valid Roman numerals?
@@ -7655,32 +7633,32 @@ class FromRomanBadInput(unittest.TestCase):
# previous test cases omitted for clarity (they haven't changed)
def testBlank(self):
- """fromRoman should fail with blank string"""
- self.assertRaises(roman.InvalidRomanNumeralError, roman.fromRoman, "") ①
+ """from_roman should fail with blank string"""
+ self.assertRaises(roman.InvalidRomanNumeralError, roman.from_roman, "") ①
-- Pretty simple stuff here. Call
fromRoman with an empty string and make sure it raises an InvalidRomanNumeralError exception. The hard part was finding the bug; now that you know about it, testing for it is the easy part.
+ - Pretty simple stuff here. Call
from_roman() with an empty string and make sure it raises an InvalidRomanNumeralError exception. The hard part was finding the bug; now that you know about it, testing for it is the easy part.
Since your code has a bug, and you now have a test case that tests this bug, the test case will fail:
-
Example 15.3. Output of romantest61.py against roman61.py
fromRoman should only accept uppercase input ... ok
-toRoman should always return uppercase ... ok
-fromRoman should fail with blank string ... FAIL
-fromRoman should fail with malformed antecedents ... ok
-fromRoman should fail with repeated pairs of numerals ... ok
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ok
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+Example 15.3. Output of romantest61.py against roman61.py
from_roman should only accept uppercase input ... ok
+to_roman should always return uppercase ... ok
+from_roman should fail with blank string ... FAIL
+from_roman should fail with malformed antecedents ... ok
+from_roman should fail with repeated pairs of numerals ... ok
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ok
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
======================================================================
-FAIL: fromRoman should fail with blank string
+FAIL: from_roman should fail with blank string
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage6\romantest61.py", line 137, in testBlank
- self.assertRaises(roman61.InvalidRomanNumeralError, roman61.fromRoman, "")
+ self.assertRaises(roman61.InvalidRomanNumeralError, roman61.from_roman, "")
File "c:\python21\lib\unittest.py", line 266, in failUnlessRaises
raise self.failureException, excName
AssertionError: InvalidRomanNumeralError
@@ -7691,7 +7669,7 @@ FAILED (failures=1)Now you can fix the bug.
Example 15.4. Fixing the bug (roman62.py)
This file is available in py/roman/stage6/ in the examples directory.
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
if not s: ①
raise InvalidRomanNumeralError, 'Input can not be blank'
@@ -7708,19 +7686,19 @@ def fromRoman(s):
- Only two lines of code are required: an explicit check for an empty string, and a
raise statement.
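For readers following along in Python 3, the same fix might look roughly like the sketch below. It is not the roman62.py listing: the exception base class, the constant names, and the body of the conversion loop are assumptions filled in so that the example runs on its own. Only the blank check and the regular expression check come from the text above.

```python
import re


class InvalidRomanNumeralError(ValueError):
    """Raised for malformed input (the base class is an assumption)."""


# Numeral/value pairs, largest first.
ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

ROMAN_PATTERN = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'


def from_roman(s):
    """Convert Roman numeral to integer."""
    if not s:
        # The explicit blank check: the pattern alone would accept ''.
        raise InvalidRomanNumeralError('Input can not be blank')
    if not re.search(ROMAN_PATTERN, s):
        raise InvalidRomanNumeralError('Invalid Roman numeral: %s' % s)
    result = 0
    index = 0
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # Consume each numeral as many times as it appears at this position.
        while s[index:index + len(numeral)] == numeral:
            result += integer
            index += len(numeral)
    return result
```

The order of the two checks matters: the blank check has to run first, because an empty string sails straight through the regular expression.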
-Example 15.5. Output of romantest62.py against roman62.py
fromRoman should only accept uppercase input ... ok
-toRoman should always return uppercase ... ok
-fromRoman should fail with blank string ... ok ①
-fromRoman should fail with malformed antecedents ... ok
-fromRoman should fail with repeated pairs of numerals ... ok
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ok
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+Example 15.5. Output of romantest62.py against roman62.py
from_roman should only accept uppercase input ... ok
+to_roman should always return uppercase ... ok
+from_roman should fail with blank string ... ok ①
+from_roman should fail with malformed antecedents ... ok
+from_roman should fail with repeated pairs of numerals ... ok
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ok
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 13 tests in 2.834s
@@ -7813,77 +7791,77 @@ class KnownValues(unittest.TestCase):
(4999, 'MMMMCMXCIX'))
def testToRomanKnownValues(self):
- """toRoman should give known result with known input"""
+ """to_roman should give known result with known input"""
for integer, numeral in self.knownValues:
- result = roman71.toRoman(integer)
+ result = roman71.to_roman(integer)
self.assertEqual(numeral, result)
def testFromRomanKnownValues(self):
- """fromRoman should give known result with known input"""
+ """from_roman should give known result with known input"""
for integer, numeral in self.knownValues:
- result = roman71.fromRoman(numeral)
+ result = roman71.from_roman(numeral)
self.assertEqual(integer, result)
class ToRomanBadInput(unittest.TestCase):
def testTooLarge(self):
- """toRoman should fail with large input"""
- self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 5000) ②
+ """to_roman should fail with large input"""
+ self.assertRaises(roman71.OutOfRangeError, roman71.to_roman, 5000) ②
def testZero(self):
- """toRoman should fail with 0 input"""
- self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, 0)
+ """to_roman should fail with 0 input"""
+ self.assertRaises(roman71.OutOfRangeError, roman71.to_roman, 0)
def testNegative(self):
- """toRoman should fail with negative input"""
- self.assertRaises(roman71.OutOfRangeError, roman71.toRoman, -1)
+ """to_roman should fail with negative input"""
+ self.assertRaises(roman71.OutOfRangeError, roman71.to_roman, -1)
def testNonInteger(self):
- """toRoman should fail with non-integer input"""
- self.assertRaises(roman71.NotIntegerError, roman71.toRoman, 0.5)
+ """to_roman should fail with non-integer input"""
+ self.assertRaises(roman71.NotIntegerError, roman71.to_roman, 0.5)
class FromRomanBadInput(unittest.TestCase):
def testTooManyRepeatedNumerals(self):
- """fromRoman should fail with too many repeated numerals"""
+ """from_roman should fail with too many repeated numerals"""
for s in ('MMMMM', 'DD', 'CCCC', 'LL', 'XXXX', 'VV', 'IIII'): ③
- self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)
+ self.assertRaises(roman71.InvalidRomanNumeralError, roman71.from_roman, s)
def testRepeatedPairs(self):
- """fromRoman should fail with repeated pairs of numerals"""
+ """from_roman should fail with repeated pairs of numerals"""
for s in ('CMCM', 'CDCD', 'XCXC', 'XLXL', 'IXIX', 'IVIV'):
- self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)
+ self.assertRaises(roman71.InvalidRomanNumeralError, roman71.from_roman, s)
def testMalformedAntecedent(self):
- """fromRoman should fail with malformed antecedents"""
+ """from_roman should fail with malformed antecedents"""
for s in ('IIMXCC', 'VX', 'DCM', 'CMM', 'IXIV',
'MCMC', 'XCX', 'IVI', 'LM', 'LD', 'LC'):
- self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, s)
+ self.assertRaises(roman71.InvalidRomanNumeralError, roman71.from_roman, s)
def testBlank(self):
- """fromRoman should fail with blank string"""
- self.assertRaises(roman71.InvalidRomanNumeralError, roman71.fromRoman, "")
+ """from_roman should fail with blank string"""
+ self.assertRaises(roman71.InvalidRomanNumeralError, roman71.from_roman, "")
class SanityCheck(unittest.TestCase):
def testSanity(self):
- """fromRoman(toRoman(n))==n for all n"""
+ """from_roman(to_roman(n))==n for all n"""
for integer in range(1, 5000):④
- numeral = roman71.toRoman(integer)
- result = roman71.fromRoman(numeral)
+ numeral = roman71.to_roman(integer)
+ result = roman71.from_roman(numeral)
self.assertEqual(integer, result)
class CaseCheck(unittest.TestCase):
def testToRomanCase(self):
- """toRoman should always return uppercase"""
+ """to_roman should always return uppercase"""
for integer in range(1, 5000):
- numeral = roman71.toRoman(integer)
+ numeral = roman71.to_roman(integer)
self.assertEqual(numeral, numeral.upper())
def testFromRomanCase(self):
- """fromRoman should only accept uppercase input"""
+ """from_roman should only accept uppercase input"""
for integer in range(1, 5000):
- numeral = roman71.toRoman(integer)
- roman71.fromRoman(numeral.upper())
+ numeral = roman71.to_roman(integer)
+ roman71.from_roman(numeral.upper())
self.assertRaises(roman71.InvalidRomanNumeralError,
- roman71.fromRoman, numeral.lower())
+ roman71.from_roman, numeral.lower())
if __name__ == "__main__":
unittest.main()
@@ -7891,75 +7869,75 @@ if __name__ == "__main__":
- The existing known values don't change (they're all still reasonable values to test), but you need to add a few more in the
4000 range. Here I've included 4000 (the shortest), 4500 (the second shortest), 4888 (the longest), and 4999 (the largest).
- - The definition of “large input” has changed. This test used to call
toRoman with 4000 and expect an error; now that 4000-4999 are good values, you need to bump this up to 5000.
- - The definition of “too many repeated numerals” has also changed. This test used to call
fromRoman with 'MMMM' and expect an error; now that MMMM is considered a valid Roman numeral, you need to bump this up to 'MMMMM'.
+ - The definition of “large input” has changed. This test used to call
to_roman() with 4000 and expect an error; now that 4000-4999 are good values, you need to bump this up to 5000.
+ - The definition of “too many repeated numerals” has also changed. This test used to call
from_roman() with 'MMMM' and expect an error; now that MMMM is considered a valid Roman numeral, you need to bump this up to 'MMMMM'.
- The sanity check and case checks loop through every number in the range, from
1 to 3999. Since the range has now expanded, these for loops need to be updated as well to go up to 4999.
Now your test cases are up to date with the new requirements, but your code is not, so you expect several of the test cases
to fail.
Example 15.7. Output of romantest71.py against roman71.py
-fromRoman should only accept uppercase input ... ERROR ①
-toRoman should always return uppercase ... ERROR
-fromRoman should fail with blank string ... ok
-fromRoman should fail with malformed antecedents ... ok
-fromRoman should fail with repeated pairs of numerals ... ok
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ERROR ②
-toRoman should give known result with known input ... ERROR ③
-fromRoman(toRoman(n))==n for all n ... ERROR④
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+from_roman should only accept uppercase input ... ERROR ①
+to_roman should always return uppercase ... ERROR
+from_roman should fail with blank string ... ok
+from_roman should fail with malformed antecedents ... ok
+from_roman should fail with repeated pairs of numerals ... ok
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ERROR ②
+to_roman should give known result with known input ... ERROR ③
+from_roman(to_roman(n))==n for all n ... ERROR④
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
-- Our case checks now fail because they loop from
1 to 4999, but toRoman only accepts numbers from 1 to 3999, so it will fail as soon the test case hits 4000.
- - The
fromRoman known values test will fail as soon as it hits 'MMMM', because fromRoman still thinks this is an invalid Roman numeral.
- - The
toRoman known values test will fail as soon as it hits 4000, because toRoman still thinks this is out of range.
- - The sanity check will also fail as soon as it hits
4000, because toRoman still thinks this is out of range.
+ - Our case checks now fail because they loop from
1 to 4999, but to_roman() only accepts numbers from 1 to 3999, so it will fail as soon as the test case hits 4000.

+ - The
from_roman() known values test will fail as soon as it hits 'MMMM', because from_roman() still thinks this is an invalid Roman numeral.
+ - The
to_roman() known values test will fail as soon as it hits 4000, because to_roman() still thinks this is out of range.
+ - The sanity check will also fail as soon as it hits
4000, because to_roman() still thinks this is out of range.
======================================================================
-ERROR: fromRoman should only accept uppercase input
+ERROR: from_roman should only accept uppercase input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 161, in testFromRomanCase
- numeral = roman71.toRoman(integer)
- File "roman71.py", line 28, in toRoman
+ numeral = roman71.to_roman(integer)
+ File "roman71.py", line 28, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)
======================================================================
-ERROR: toRoman should always return uppercase
+ERROR: to_roman should always return uppercase
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 155, in testToRomanCase
- numeral = roman71.toRoman(integer)
- File "roman71.py", line 28, in toRoman
+ numeral = roman71.to_roman(integer)
+ File "roman71.py", line 28, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)
======================================================================
-ERROR: fromRoman should give known result with known input
+ERROR: from_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 102, in testFromRomanKnownValues
- result = roman71.fromRoman(numeral)
- File "roman71.py", line 47, in fromRoman
+ result = roman71.from_roman(numeral)
+ File "roman71.py", line 47, in from_roman
raise InvalidRomanNumeralError, 'Invalid Roman numeral: %s' % s
InvalidRomanNumeralError: Invalid Roman numeral: MMMM
======================================================================
-ERROR: toRoman should give known result with known input
+ERROR: to_roman should give known result with known input
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 96, in testToRomanKnownValues
- result = roman71.toRoman(integer)
- File "roman71.py", line 28, in toRoman
+ result = roman71.to_roman(integer)
+ File "roman71.py", line 28, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)
======================================================================
-ERROR: fromRoman(toRoman(n))==n for all n
+ERROR: from_roman(to_roman(n))==n for all n
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\docbook\dip\py\roman\stage7\romantest71.py", line 147, in testSanity
- numeral = roman71.toRoman(integer)
- File "roman71.py", line 28, in toRoman
+ numeral = roman71.to_roman(integer)
+ File "roman71.py", line 28, in to_roman
raise OutOfRangeError, "number out of range (must be 1..3999)"
OutOfRangeError: number out of range (must be 1..3999)
----------------------------------------------------------------------
@@ -7996,7 +7974,7 @@ romanNumeralMap = (('M', 1000),
('IV', 4),
('I', 1))
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
if not (0 < n < 5000): ①
raise OutOfRangeError, "number out of range (must be 1..4999)"
@@ -8013,7 +7991,7 @@ def toRoman(n):
#Define pattern to detect valid Roman numerals
romanNumeralPattern = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$' ②
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
if not s:
raise InvalidRomanNumeralError, 'Input can not be blank'
@@ -8029,23 +8007,23 @@ def fromRoman(s):
return result
-toRoman only needs one small change, in the range check. Where you used to check 0 < n < 4000, you now check 0 < n < 5000. And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'. The only reason it didn't do this before is that you explicitly stopped it with the range check.)
-- You don't need to make any changes to
fromRoman at all. The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optional M in the first section of the regular expression. This will allow up to 4 M characters instead of 3, meaning you will allow the Roman numeral equivalents of 4999 instead of 3999. The actual fromRoman function is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how
+ to_roman() only needs one small change, in the range check. Where you used to check 0 < n < 4000, you now check 0 < n < 5000. And you change the error message that you raise to reflect the new acceptable range (1..4999 instead of 1..3999). You don't need to make any changes to the rest of the function; it handles the new cases already. (It merrily adds 'M' for each thousand that it finds; given 4000, it will spit out 'MMMM'. The only reason it didn't do this before is that you explicitly stopped it with the range check.)
+- You don't need to make any changes to
from_roman() at all. The only change is to romanNumeralPattern; if you look closely, you'll notice that you added another optional M in the first section of the regular expression. This will allow up to 4 M characters instead of 3, meaning you will allow the Roman numeral equivalents of 4999 instead of 3999. The actual from_roman() function is completely general; it just looks for repeated Roman numeral characters and adds them up, without caring how
many times they repeat. The only reason it didn't handle 'MMMM' before is that you explicitly stopped it with the regular expression pattern matching.
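The two changes can be sketched side by side in Python 3 syntax. This is an illustration, not the roman72.py listing: the exception classes and the uppercase constant names are assumptions, and the conversion loop is one plausible way to write the part of to_roman() that the listing leaves out.

```python
import re


class OutOfRangeError(ValueError):
    """Assumed exception class for this sketch."""


class NotIntegerError(ValueError):
    """Assumed exception class for this sketch."""


ROMAN_NUMERAL_MAP = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

# Change 2: one more optional M, so up to four thousands are allowed.
ROMAN_PATTERN = '^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'


def to_roman(n):
    """Convert integer to Roman numeral."""
    # Change 1: the upper bound moves from 4000 to 5000.
    if not (0 < n < 5000):
        raise OutOfRangeError('number out of range (must be 1..4999)')
    if int(n) != n:
        raise NotIntegerError('non-integers can not be converted')
    result = ''
    for numeral, integer in ROMAN_NUMERAL_MAP:
        # The loop never cared how many thousands there were.
        while n >= integer:
            result += numeral
            n -= integer
    return result


if __name__ == '__main__':
    print(to_roman(4000))                            # MMMM
    print(to_roman(4999))                            # MMMMCMXCIX
    print(bool(re.search(ROMAN_PATTERN, 'MMMM')))    # True
    print(bool(re.search(ROMAN_PATTERN, 'MMMMM')))   # False
```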
You may be skeptical that these two small changes are all that you need. Hey, don't take my word for it; see for yourself:
-
Example 15.9. Output of romantest72.py against roman72.py
fromRoman should only accept uppercase input ... ok
-toRoman should always return uppercase ... ok
-fromRoman should fail with blank string ... ok
-fromRoman should fail with malformed antecedents ... ok
-fromRoman should fail with repeated pairs of numerals ... ok
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ok
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+Example 15.9. Output of romantest72.py against roman72.py
from_roman should only accept uppercase input ... ok
+to_roman should always return uppercase ... ok
+from_roman should fail with blank string ... ok
+from_roman should fail with malformed antecedents ... ok
+from_roman should fail with repeated pairs of numerals ... ok
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ok
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
----------------------------------------------------------------------
Ran 13 tests in 3.685s
@@ -8059,7 +8037,7 @@ OK ①
the feeling you get when someone else blames you for breaking their code and you can actually prove that you didn't. The best thing about unit testing is that it gives you the freedom to refactor mercilessly.
Refactoring is the process of taking working code and making it work better. Usually, “better” means “faster”, although it can also mean “using less memory”, or “using less disk space”, or simply “more elegantly”. Whatever it means to you, to your project, in your environment, refactoring is important to the long-term health of any
program.
-
Here, “better” means “faster”. Specifically, the fromRoman function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals.
+
Here, “better” means “faster”. Specifically, the from_roman() function is slower than it needs to be, because of that big nasty regular expression that you use to validate Roman numerals.
It's probably not worth trying to do away with the regular expression altogether (it would be difficult, and it might not
end up any faster), but you can speed up the function by precompiling the regular expression.
Example 15.10. Compiling regular expressions
@@ -8089,12 +8067,12 @@ end up any faster), but you can speed up the function by precompiling the regula
This file is available in py/roman/stage8/ in the examples directory.
If you have not already done so, you can download this and other examples used in this book.
-# toRoman and rest of module omitted for clarity
+# to_roman and rest of module omitted for clarity
romanNumeralPattern = \
re.compile('^M?M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$') ①
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
if not s:
raise InvalidRomanNumeralError, 'Input can not be blank'
@@ -8111,7 +8089,7 @@ def fromRoman(s):
- This looks very similar, but in fact a lot has changed. romanNumeralPattern is no longer a string; it is a pattern object which was returned from
re.compile.
- - That means that you can call methods on romanNumeralPattern directly. This will be much, much faster than calling
re.search every time. The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call fromRoman, you can immediately match the input string against the regular expression, without any intermediate steps occurring under
+ - That means that you can call methods on romanNumeralPattern directly. This will be much, much faster than calling
re.search every time. The regular expression is compiled once and stored in romanNumeralPattern when the module is first imported; then, every time you call from_roman(), you can immediately match the input string against the regular expression, without any intermediate steps occurring under
the covers.
So how much faster is it to compile regular expressions? See for yourself:
Example 15.12. Output of romantest81.py against roman81.py
............. ①
@@ -8158,7 +8136,7 @@ OK ②
- More important than any performance boost is the fact that the module still works perfectly. This is the freedom I was talking
about earlier: the freedom to tweak, change, or rewrite any piece of it and verify that you haven't messed anything up in
the process. This is not a license to endlessly tweak your code just for the sake of tweaking it; you had a very specific
- objective (“make
fromRoman faster”), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the
+ objective (“make from_roman() faster”), and you were able to accomplish that objective without any lingering doubts about whether you introduced new bugs in the
process.
One other tweak I would like to make, and then I promise I'll stop refactoring and put this module to bed. As you've seen
repeatedly, regular expressions can get pretty hairy and unreadable pretty quickly. I wouldn't like to come back to this
@@ -8239,26 +8217,26 @@ romanNumeralMap = (('M', 1000),
#Create tables for fast conversion of roman numerals.
#See fillLookupTables() below.
-toRomanTable = [ None ] # Skip an index since Roman numerals have no zero
-fromRomanTable = {}
+to_romanTable = [ None ] # Skip an index since Roman numerals have no zero
+from_romanTable = {}
-def toRoman(n):
+def to_roman(n):
"""convert integer to Roman numeral"""
if not (0 < n <= MAX_ROMAN_NUMERAL):
raise OutOfRangeError, "number out of range (must be 1..%s)" % MAX_ROMAN_NUMERAL
if int(n) <> n:
raise NotIntegerError, "non-integers can not be converted"
- return toRomanTable[n]
+ return to_romanTable[n]
-def fromRoman(s):
+def from_roman(s):
"""convert Roman numeral to integer"""
if not s:
raise InvalidRomanNumeralError, "Input can not be blank"
- if not fromRomanTable.has_key(s):
+ if not from_romanTable.has_key(s):
raise InvalidRomanNumeralError, "Invalid Roman numeral: %s" % s
- return fromRomanTable[s]
+ return from_romanTable[s]
-def toRomanDynamic(n):
+def to_romanDynamic(n):
"""convert integer to Roman numeral using dynamic programming"""
result = ""
for numeral, integer in romanNumeralMap:
@@ -8267,16 +8245,16 @@ def toRomanDynamic(n):
n -= integer
break
if n > 0:
- result += toRomanTable[n]
+ result += to_romanTable[n]
return result
def fillLookupTables():
"""compute all the possible roman numerals"""
#Save the values in two global tables to convert to and from integers.
for integer in range(1, MAX_ROMAN_NUMERAL + 1):
- romanNumber = toRomanDynamic(integer)
- toRomanTable.append(romanNumber)
- fromRomanTable[romanNumber] = integer
+ romanNumber = to_romanDynamic(integer)
+ to_romanTable.append(romanNumber)
+ from_romanTable[romanNumber] = integer
fillLookupTables()
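For comparison, here is a rough Python 3 sketch of the same lookup-table idea. It is not the listing above: it fills the tables with a plain greedy conversion instead of reusing earlier table entries, it raises ValueError instead of the module's custom exceptions, and the limit of 3999 and the snake_case names are assumptions made for illustration.

```python
MAX_ROMAN_NUMERAL = 3999  # assumed limit for this sketch

roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))

to_roman_table = [None]   # skip index 0; Roman numerals have no zero
from_roman_table = {}

def build_lookup_tables():
    """Compute every Roman numeral once and store it in both tables."""
    for integer in range(1, MAX_ROMAN_NUMERAL + 1):
        n = integer
        result = ''
        # Greedy conversion: keep subtracting the largest value that fits.
        for numeral, value in roman_numeral_map:
            while n >= value:
                result += numeral
                n -= value
        to_roman_table.append(result)
        from_roman_table[result] = integer

build_lookup_tables()

def to_roman(n):
    """Convert integer to Roman numeral with a single list lookup."""
    if not isinstance(n, int) or not (0 < n <= MAX_ROMAN_NUMERAL):
        raise ValueError('number out of range (must be an int in 1..%s)' % MAX_ROMAN_NUMERAL)
    return to_roman_table[n]

def from_roman(s):
    """Convert Roman numeral to integer with a single dictionary lookup."""
    if s not in from_roman_table:
        raise ValueError('Invalid Roman numeral: %r' % s)
    return from_roman_table[s]

print(to_roman(1946), from_roman('MDCCCLXXXVIII'))
```

Notice that the Python 2 idioms in the listing (has_key(), the <> operator, and raise with a comma) are replaced here by the in operator, !=-style checks, and raise with a call.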
So how fast is it?
@@ -8328,7 +8306,7 @@ only done once, this is negligible in the long run.
- Using
assertEqual to check that a function returns a known value
- - Using
assertRaises to check that a function raises a known exception
+ - Using
assertRaises to check that a function raises a known exception
- Calling
unittest.main() in your if __name__ clause to run all your test cases at once
@@ -8391,19 +8369,19 @@ buildConnectionString should fail with string input ... ok
buildConnectionString should fail with tuple input ... ok
buildConnectionString handles empty dictionary ... ok
buildConnectionString returns known result with known input ... ok
-fromRoman should only accept uppercase input ... ok ③
-toRoman should always return uppercase ... ok
-fromRoman should fail with blank string ... ok
-fromRoman should fail with malformed antecedents ... ok
-fromRoman should fail with repeated pairs of numerals ... ok
-fromRoman should fail with too many repeated numerals ... ok
-fromRoman should give known result with known input ... ok
-toRoman should give known result with known input ... ok
-fromRoman(toRoman(n))==n for all n ... ok
-toRoman should fail with non-integer input ... ok
-toRoman should fail with negative input ... ok
-toRoman should fail with large input ... ok
-toRoman should fail with 0 input ... ok
+from_roman should only accept uppercase input ... ok ③
+to_roman should always return uppercase ... ok
+from_roman should fail with blank string ... ok
+from_roman should fail with malformed antecedents ... ok
+from_roman should fail with repeated pairs of numerals ... ok
+from_roman should fail with too many repeated numerals ... ok
+from_roman should give known result with known input ... ok
+to_roman should give known result with known input ... ok
+from_roman(to_roman(n))==n for all n ... ok
+to_roman should fail with non-integer input ... ok
+to_roman should fail with negative input ... ok
+to_roman should fail with large input ... ok
+to_roman should fail with 0 input ... ok
kgp a ref test ... ok
kgp b ref test ... ok
kgp c ref test ... ok
diff --git a/dip3.css b/dip3.css
index 7c0ed8a..6d41645 100644
--- a/dip3.css
+++ b/dip3.css
@@ -30,7 +30,7 @@ a:visited{color:darkorchid}
pre{white-space:pre-wrap;padding-left:2.154em;line-height:2.154;border-left:1px dotted}
.widgets{float:left}
.widgets,.widgets a,.download{font-size:small;line-height:2.154}
-.block{clear:left}
+.block,ol{clear:left}
pre a,.widgets a{padding:0.4375em 0;border:0}
.widgets a{text-decoration:underline}
pre a:hover{border:0}
diff --git a/index.html b/index.html
index ad1aef4..ff4e168 100644
--- a/index.html
+++ b/index.html
@@ -23,7 +23,7 @@ li:last-child:before{content:"A. \00a0 \00a0"}
- Regular expressions
-
-
-
-
+
- Unit testing
-
-
-
diff --git a/native-datatypes.html b/native-datatypes.html
index 7d7582a..e96a4e4 100644
--- a/native-datatypes.html
+++ b/native-datatypes.html
@@ -111,7 +111,7 @@ body{counter-reset:h1 2}
- Integers can be arbitrarily large.
-☞Python 2 had separate types for int and long. The int datatype was limited by sys.maxint, which varied by platform but was usually 2³²-1. Python 3 has just one integer type, which behaves mostly like the old long type from Python 2. See PEP 237 for details.
+
☞Python 2 had separate types for int and long. The int datatype was limited by sys.maxint, which varied by platform but was usually 2³²-1. Python 3 has just one integer type, which behaves mostly like the old long type from Python 2. See PEP 237 for details.
You can do all kinds of things with numbers.
@@ -137,7 +137,7 @@ body{counter-reset:h1 2}
- The
% operator gives the remainder after performing integer division. 11 divided by 2 is 5 with a remainder of 1, so the result here is 1.
-☞In Python 2, the / operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the / operator always means floating point division. See PEP 238 for details.
+
☞In Python 2, the / operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the / operator always means floating point division. See PEP 238 for details.
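A quick sketch of how the division operators behave in Python 3. The variable names are made up for illustration.

```python
true_quotient = 11 / 2      # / always performs floating point division
floor_quotient = 11 // 2    # // performs integer (floor) division
remainder = 11 % 2          # % gives the remainder of integer division
negative_floor = -11 // 2   # floor division rounds toward negative infinity

print(true_quotient, floor_quotient, remainder, negative_floor)
```

Even when the result is a whole number, / still gives you a float: 10 / 2 is 5.0, not 5.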
FIXME fractions, math module, numbers in a boolean context
Lists
@@ -357,8 +357,8 @@ KeyError: 'db.diveintopython3.org'
- fractions
- math module
-
- PEP 237
-
- PEP 238
+
- PEP 237
+
- PEP 238
- links to appendix
- ...etc...
diff --git a/porting-code-to-python-3-with-2to3.html b/porting-code-to-python-3-with-2to3.html
index 20d46b6..815f606 100644
--- a/porting-code-to-python-3-with-2to3.html
+++ b/porting-code-to-python-3-with-2to3.html
@@ -145,7 +145,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
long data type
Python 2 had separate int and long types for non-floating-point numbers. An int could not be any larger than sys.maxint, which varied by platform. Longs were defined by appending an L to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called int, which mostly behaves like the long type in Python 2. Since there are no longer two types, there is no need for special syntax to distinguish them.
-
Further reading: PEP 237: Unifying Long Integers and Integers.
+
Further reading: PEP 237: Unifying Long Integers and Integers.
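To see the single integer type in action, here is a small Python 3 sketch. The variable names are hypothetical.

```python
import sys

big = 2 ** 100                 # far larger than any fixed-width machine integer
bigger = big * big + 1         # arithmetic keeps working at any size

type_name = type(big).__name__         # there is only one integer type now
has_maxint = hasattr(sys, 'maxint')    # sys.maxint no longer exists

print(type_name, has_maxint)
print(bigger)
```

There is no L suffix anywhere in the output, because there is nothing left to distinguish.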
Notes
@@ -259,7 +259,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
Modules that have been renamed or reorganized
Several modules in the Python Standard Library have been renamed. Several other modules which are related to each other have been combined or reorganized to make their association more logical.
http
-In Python 3, several related HTTP modules have been combined into a single package, http.
+
In Python 3, several related HTTP modules have been combined into a single package, http.
Notes
@@ -282,10 +282,10 @@ import CGIHttpServer
import http.server
-- The
http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses.
- - The
http.cookies module provides a Pythonic interface to browser cookies that are sent in a Set-Cookie: HTTP header.
+ - The
http.client module implements a low-level library that can request HTTP resources and interpret HTTP responses.
+ - The
http.cookies module provides a Pythonic interface to browser cookies that are sent in a Set-Cookie: HTTP header.
- The
http.cookiejar module manipulates the actual files on disk that popular web browsers use to store cookies.
- - The
http.server module provides a basic HTTP server.
+ - The
http.server module provides a basic HTTP server.
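As a rough illustration, this Python 3 sketch imports the four submodules of the http package and builds a cookie header with http.cookies. The cookie name and value are made up.

```python
import http.client
import http.cookiejar
import http.cookies
import http.server

# Build a cookie and render it as a Set-Cookie: header line.
cookie = http.cookies.SimpleCookie()
cookie['session'] = 'abc123'    # hypothetical cookie name and value
header = cookie.output()

print(header)
```

Nothing here touches the network; it only shows where the pieces live after the reorganization.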
urllib
Python 2 had a rat's nest of overlapping modules to parse, encode, and fetch URLs. In Python 3, these have all been refactored and combined in a single package, urllib.
@@ -319,15 +319,15 @@ from urllib2 import HTTPError
from urllib.error import HTTPError
-- The old
urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme.
+ - The old
urllib module in Python 2 had a variety of functions, including urlopen() for fetching data and splittype(), splithost(), and splituser() for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new urllib package. 2to3 will also change all calls to these functions so they use the new naming scheme.
- The old
urllib2 module in Python 2 has been folded into the urllib package in Python 3. All your urllib2 favorites (the build_opener() method, Request objects, and HTTPBasicAuthHandler and friends) are still available.
- The
urllib.parse module in Python 3 contains all the parsing functions from the old urlparse module in Python 2.
- The
urllib.robotparser module parses robots.txt files.
- - The
FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new urllib.request module. The urlencode function has moved to urllib.parse.
+ - The
FancyURLopener class, which handles HTTP redirects and other status codes, is still available in the new urllib.request module. The urlencode function has moved to urllib.parse.
- The
Request object is still available in urllib.request, but exceptions like HTTPError have been moved to urllib.error.
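The new layout might look like this in practice. This is a sketch that never opens a connection; the URL path, query values, and User-Agent string are invented for illustration.

```python
from urllib.error import HTTPError
from urllib.parse import urlencode, urlparse
from urllib.request import Request

# Parsing functions from the old urlparse module now live in urllib.parse.
parts = urlparse('http://diveintopython3.org/examples/feed.xml?page=2')

# urlencode() moved to urllib.parse as well.
query = urlencode({'q': 'roman numerals', 'page': 2})

# Request comes from urllib.request; building one does not send anything.
request = Request('http://diveintopython3.org/',
                  headers={'User-Agent': 'example-agent'})

print(parts.netloc, parts.path)
print(query)
print(request.get_method(), request.full_url)
```

HTTPError is imported only to show where it lives now; you would catch it around a call to urllib.request.urlopen().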
dbm
-All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package.
+
All the various DBM clones are now in a single package, dbm. If you need a specific variant like GNU DBM, you can import the appropriate module within the dbm package.
Notes
@@ -353,7 +353,7 @@ import whichdb
xmlrpc
-XML-RPC is a lightweight method of performing remote RPC calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc.
+
XML-RPC is a lightweight method of performing remote procedure calls over HTTP. The XML-RPC client library and several XML-RPC server implementations are now combined in a single package, xmlrpc.
Notes
@@ -417,14 +417,14 @@ except ImportError:
- The
copyreg module adds pickle support for custom types defined in C.
- The
queue module implements a multi-producer, multi-consumer queue.
- The
socketserver module provides generic base classes for implementing different kinds of socket servers.
- - The
configparser module parses INI-style configuration files.
+ - The
configparser module parses INI-style configuration files.
- The
reprlib module reimplements the built-in repr() function, but with limits on how many values are represented.
- The
subprocess module allows you to spawn processes, connect to their pipes, and obtain their return codes.
Relative imports within a package
A package is a group of related modules that function as a single entity. In Python 2, when modules within a package need to reference each other, you use import foo or from foo import Bar. The Python 2 interpreter first searches within the current package to find foo.py, and then moves on to the other directories in the Python search path (sys.path). Python 3 works a bit differently. Instead of searching the current package, it goes directly to the Python search path. If you want one module within a package to import another module in the same package, you need to explicitly provide the relative path between the two modules.
Suppose you had this package, with multiple files in the same directory:
-
chardet/
|
+--__init__.py
diff --git a/regular-expressions.html b/regular-expressions.html
index 23aaa1d..3c293f4 100644
--- a/regular-expressions.html
+++ b/regular-expressions.html
@@ -15,7 +15,7 @@ body{counter-reset:h1 4}
Regular expressions
-❝ Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems. ❞
— Jamie Zawinski
+
❝ Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. ❞
— Jamie Zawinski
Diving in
-Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of
-characters. If you've used regular expressions in other languages (like Perl), the syntax will be very familiar, and you get by just reading the summary of the re module to get an overview of the available functions and their arguments.
-
Strings have methods for searching and replacing — index(), find(), split(), count(), replace(), &c. — but they are limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace() and split() methods have the same limitations.
-
If your goal can be accomplished with string functions, you should use them. They're fast and simple and easy to read, and there's a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you're combining them with split() and join() and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
+
Regular expressions are a powerful and standardized way of searching, replacing, and parsing text with complex patterns of
+characters. If you’ve used regular expressions in other languages (like Perl), the syntax will be very familiar, and you can get by just reading the summary of the re module to get an overview of the available functions and their arguments.
+
Strings have methods for searching and replacing: index(), find(), split(), count(), replace(), &c. But these methods are limited to the simplest of cases. For example, the index() method looks for a single, hard-coded substring, and the search is always case-sensitive. To do case-insensitive searches of a string s, you must call s.lower() or s.upper() and make sure your search strings are the appropriate case to match. The replace() and split() methods have the same limitations.
+
If your goal can be accomplished with string functions, you should use them. They’re fast and simple and easy to read, and there’s a lot to be said for fast, simple, readable code. But if you find yourself using a lot of different string functions with if statements to handle special cases, or if you’re combining them with split() and join() and list comprehensions in weird unreadable ways, you may need to move up to regular expressions.
Although the regular expression syntax is tight and unlike normal code, the result can end up being more readable than a hand-rolled solution that uses a long chain of string functions. There are even ways of embedding comments within regular expressions, so you can include fine-grained documentation within them.
Case study: street addresses
-This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don't just make this stuff up; it's actually useful.) This example shows how I approached the problem.
+
This series of examples was inspired by a real-life problem I had in my day job several years ago, when I needed to scrub and standardize street addresses exported from a legacy system before importing them into a newer system. (See, I don’t just make this stuff up; it’s actually useful.) This example shows how I approached the problem.
>>> s = '100 NORTH MAIN ROAD'
>>> s.replace('ROAD', 'RD.') ①
@@ -56,9 +56,9 @@ characters. If you've used regular expressions in other languages (like Perl), t
- My goal is to standardize a street address so that
'ROAD' is always abbreviated as 'RD.'. At first glance, I thought this was simple enough that I could just use the string method replace(). After all, all the data was already uppercase, so case mismatches would not be a problem. And the search string, 'ROAD', was a constant. And in this deceptively simple example, s.replace() does indeed work.
- Life, unfortunately, is full of counterexamples, and I quickly discovered this one. The problem here is that
'ROAD' appears twice in the address, once as part of the street name 'BROAD' and once as its own word. The replace() method sees these two occurrences and blindly replaces both of them; meanwhile, I see my addresses getting destroyed.
- - To solve the problem of addresses with more than one
'ROAD' substring, you could resort to something like this: only search and replace 'ROAD' in the last four characters of the address (s[-4:]), and leave the string alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you're replacing. (If you were replacing 'STREET' with 'ST.', you would need to use s[:-6] and s[-6:].replace(...).) Would you like to come back in six months and debug this? I know I wouldn't.
- - It's time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the
re module.
- - Take a look at the first parameter:
'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means “end of the string.” (There is a corresponding character, the caret ^, which means “beginning of the string.”) Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that's part of the word BROAD, because that's in the middle of s.
+ - To solve the problem of addresses with more than one
'ROAD' substring, you could resort to something like this: only search and replace 'ROAD' in the last four characters of the address (s[-4:]), and leave the string alone (s[:-4]). But you can see that this is already getting unwieldy. For example, the pattern is dependent on the length of the string you’re replacing. (If you were replacing 'STREET' with 'ST.', you would need to use s[:-6] and s[-6:].replace(...).) Would you like to come back in six months and debug this? I know I wouldn’t.
+ - It’s time to move up to regular expressions. In Python, all functionality related to regular expressions is contained in the
re module.
+ - Take a look at the first parameter:
'ROAD$'. This is a simple regular expression that matches 'ROAD' only when it occurs at the end of a string. The $ means “end of the string.” (There is a corresponding character, the caret ^, which means “beginning of the string.”) Using the re.sub function, you search the string s for the regular expression 'ROAD$' and replace it with 'RD.'. This matches the ROAD at the end of the string s, but does not match the ROAD that’s part of the word BROAD, because that’s in the middle of s.
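Put side by side, the difference between the string method and the anchored regular expression looks like this. The sample address is an assumption chosen to contain 'ROAD' twice.

```python
import re

s = '100 NORTH BROAD ROAD'   # assumed sample address

# replace() blindly rewrites every occurrence, including the one in BROAD.
naive = s.replace('ROAD', 'RD.')

# The $ anchor limits the match to a ROAD at the very end of the string.
anchored = re.sub('ROAD$', 'RD.', s)

print(naive)
print(anchored)
```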
Continuing with my story of scrubbing addresses, I soon discovered that the previous example, matching 'ROAD' at the end of the address, was not good enough, because not all addresses included a street designation at all. Some addresses simply ended with the street name. I got away with it most of the time, but if the street name was 'BROAD', then the regular expression would match 'ROAD' at the end of the string as part of the word 'BROAD', which is not what I wanted.
@@ -75,13 +75,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> re.sub(r'\bROAD\b', 'RD.', s) ④
'100 BROAD RD. APT 3'
-- What I really wanted was to match
'ROAD' when it was at the end of the string and it was its own word (and not a part of some larger word). To express this in a regular expression, you use \b, which means “a word boundary must occur right here.” In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it's a bug in syntax or a bug in your regular expression.
+ - What I really wanted was to match
'ROAD' when it was at the end of the string and it was its own word (and not a part of some larger word). To express this in a regular expression, you use \b, which means “a word boundary must occur right here.” In Python, this is complicated by the fact that the '\' character in a string must itself be escaped. This is sometimes referred to as the backslash plague, and it is one reason why regular expressions are easier in Perl than in Python. On the down side, Perl mixes regular expressions with other syntax, so if you have a bug, it may be hard to tell whether it’s a bug in syntax or a bug in your regular expression.
- To work around the backslash plague, you can use what is called a raw string [FIXME reference to strings chapter], by prefixing the string with the letter
r. This tells Python that nothing in this string should be escaped; '\t' is a tab character, but r'\t' is really the backslash character \ followed by the letter t. I recommend always using raw strings when dealing with regular expressions; otherwise, things get too confusing too quickly (and regular expressions are confusing enough already).
- - *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word
'ROAD' as a whole word by itself, but it wasn't at the end, because the address had an apartment number after the street designation. Because 'ROAD' isn't at the very end of the string, it doesn't match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want.
- - To solve this problem, I removed the
$ character and added another \b. Now the regular expression reads “match 'ROAD' when it's a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle.
+ - *sigh* Unfortunately, I soon found more cases that contradicted my logic. In this case, the street address contained the word
'ROAD' as a whole word by itself, but it wasn’t at the end, because the address had an apartment number after the street designation. Because 'ROAD' isn’t at the very end of the string, it doesn’t match, so the entire call to re.sub ends up replacing nothing at all, and you get the original string back, which is not what you want.
+ - To solve this problem, I removed the
$ character and added another \b. Now the regular expression reads “match 'ROAD' when it’s a whole word by itself anywhere in the string,” whether at the end, the beginning, or somewhere in the middle.
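Here is the whole progression in one sketch, using raw strings throughout. The two sample addresses are assumptions: one ends with the street name 'BROAD', the other has an apartment number after 'ROAD'.

```python
import re

ends_with_broad = '100 BROAD'           # assumed sample: no street designation
has_apartment = '100 BROAD ROAD APT 3'  # assumed sample: ROAD is not at the end

# 'ROAD$' wrongly matches the tail end of BROAD.
mangled = re.sub(r'ROAD$', 'RD.', ends_with_broad)

# Requiring a word boundary before ROAD leaves BROAD alone.
untouched = re.sub(r'\bROAD$', 'RD.', ends_with_broad)

# But with $ still in place, a ROAD in the middle is never replaced.
unchanged = re.sub(r'\bROAD$', 'RD.', has_apartment)

# A word boundary on both sides matches ROAD as a whole word anywhere.
fixed = re.sub(r'\bROAD\b', 'RD.', has_apartment)

print(mangled, untouched, unchanged, fixed, sep='\n')
```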
Case study: Roman numerals
-You've most likely seen Roman numerals, even if you didn't recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It's a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
+
You’ve most likely seen Roman numerals, even if you didn’t recognize them. You may have seen them in copyrights of old movies and television shows (“Copyright MCMXLVI” instead of “Copyright 1946”), or on the dedication walls of libraries or universities (“established MDCCCLXXXVIII” instead of “established 1888”). You may also have seen them in outlines and bibliographical references. It’s a system of representing numbers that really does date back to the ancient Roman empire (hence the name).
In Roman numerals, there are seven characters that are repeated and combined in various ways to represent numbers.
I = 1
@@ -95,13 +95,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
The following are some general rules for constructing Roman numerals:
- Characters are additive.
I is 1, II is 2, and III is 3. VI is 6 (literally, “5 and 1”), VII is 7, and VIII is 8.
- - The tens characters (
I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can't represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).
+ - The tens characters (
I, X, C, and M) can be repeated up to three times. At 4, you need to subtract from the next highest fives character. You can’t represent 4 as IIII; instead, it is represented as IV (“1 less than 5”). The number 40 is written as XL (10 less than 50), 41 as XLI, 42 as XLII, 43 as XLIII, and then 44 as XLIV (10 less than 50, then 1 less than 5).
- Similarly, at
9, you need to subtract from the next highest tens character: 8 is VIII, but 9 is IX (1 less than 10), not VIIII (since the I character can not be repeated four times). The number 90 is XC, 900 is CM.
- The fives characters can not be repeated. The number
10 is always represented as X, never as VV. The number 100 is always C, never LL.
- - Roman numerals are always written highest to lowest, and read left to right, so the order the of characters matters very much.
DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can't subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
+ - Roman numerals are always written highest to lowest, and read left to right, so the order of the characters matters very much.
DC is 600; CD is a completely different number (400, 100 less than 500). CI is 101; IC is not even a valid Roman numeral (because you can’t subtract 1 directly from 100; you would need to write it as XCIX, for 10 less than 100, then 1 less than 10).
Checking for thousands
-What would it take to validate that an arbitrary string is a valid Roman numeral? Let's take it one digit at a time. Since Roman numerals are always written highest to lowest, let's start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.
+
What would it take to validate that an arbitrary string is a valid Roman numeral? Let’s take it one digit at a time. Since Roman numerals are always written highest to lowest, let’s start with the highest: the thousands place. For numbers 1000 and higher, the thousands are represented by a series of M characters.
>>> import re
>>> pattern = '^M?M?M?$' ①
@@ -115,11 +115,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> re.search(pattern, '') ⑥
<SRE_Match object at 0106F4A8>
-- This pattern has three parts.
^ matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they're there, are at the beginning of the string. M? optionally matches a single M character. Since this is repeated three times, you're matching anywhere from zero to three M characters in a row. And $ matches the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.
+ - This pattern has three parts.
^ matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they’re there, are at the beginning of the string. M? optionally matches a single M character. Since this is repeated three times, you’re matching anywhere from zero to three M characters in a row. And $ matches the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.
- The essence of the
re module is the search() function, which takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search() returns an object which has various methods to describe the match; if no match is found, search() returns None, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return value of search(). 'M' matches this regular expression, because the first optional M matches and the second and third optional M characters are ignored.
'MM' matches because the first and second optional M characters match and the third M is ignored.
'MMM' matches because all three M characters match.
-'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn't end yet (because of the fourth M). So search() returns None.
+'MMMM' does not match. All three M characters match, but then the regular expression insists on the string ending (because of the $ character), and the string doesn’t end yet (because of the fourth M). So search() returns None.
- Interestingly, an empty string also matches this regular expression, since all the
M characters are optional.
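If you would rather see all of these cases at once, the following sketch runs the thousands pattern against each string and records whether search() found a match. The results dictionary is just a convenience for this example.

```python
import re

pattern = '^M?M?M?$'

# search() returns a match object on success and None on failure,
# so comparing against None turns each result into True or False.
results = {s: re.search(pattern, s) is not None
           for s in ('M', 'MM', 'MMM', 'MMMM', '')}

for s, matched in results.items():
    print(repr(s), matched)
```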
Checking for hundreds
@@ -164,10 +164,10 @@ characters. If you've used regular expressions in other languages (like Perl), t
'MCM' matches because the first M matches, the second and third M characters are ignored, and the CM matches (so the CD and D?C?C?C? patterns are never even considered). MCM is the Roman numeral representation of 1900.
'MD' matches because the first M matches, the second and third M characters are ignored, and the D?C?C?C? pattern matches D (each of the three C characters is optional and is ignored). MD is the Roman numeral representation of 1500.
'MMMCCC' matches because all three M characters match, and the D?C?C?C? pattern matches CCC (the D is optional and is ignored). MMMCCC is the Roman numeral representation of 3300.
-'MCMC' does not match. The first M matches, the second and third M characters are ignored, and the CM matches, but then the $ does not match because you're not at the end of the string yet (you still have an unmatched C character). The C does not match as part of the D?C?C?C? pattern, because the mutually exclusive CM pattern has already matched.
+'MCMC' does not match. The first M matches, the second and third M characters are ignored, and the CM matches, but then the $ does not match because you’re not at the end of the string yet (you still have an unmatched C character). The C does not match as part of the D?C?C?C? pattern, because the mutually exclusive CM pattern has already matched.
- Interestingly, an empty string still matches this pattern, because all the
M characters are optional and ignored, and the empty string matches the D?C?C?C? pattern where all the characters are optional and ignored.
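The hundreds cases above can be replayed in a few lines. This is a sketch that assumes the pattern is the thousands pattern followed by (CM|CD|D?C?C?C?); the helper name is_match is hypothetical.

```python
import re

# Thousands, then one of three mutually exclusive hundreds alternatives.
hundreds = '^M?M?M?(CM|CD|D?C?C?C?)$'

def is_match(pattern, s):
    """Return True if the pattern matches the string, False otherwise."""
    return re.search(pattern, s) is not None

for candidate in ['MCM', 'MD', 'MMMCCC', 'MCMC', '']:
    print(repr(candidate), is_match(hundreds, candidate))
```

'MCMC' is the only one that fails, because once CM has matched there is nothing left in the pattern to absorb the trailing C.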
-Whew! See how quickly regular expressions can get nasty? And you've only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they're exactly the same pattern. But let's look at another way to express the pattern.
+
Whew! See how quickly regular expressions can get nasty? And you’ve only covered the thousands and hundreds places of Roman numerals. But if you followed all that, the tens and ones places are easy, because they’re exactly the same pattern. But let’s look at another way to express the pattern.
Using the {n,m} Syntax
In the previous section, you were dealing with a pattern where the same character could be repeated up to three times. There is another way to express this in regular expressions, which some people find more readable. First look at the method we already used in the previous example.
@@ -184,8 +184,8 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> re.search(pattern, 'MMMM') ④
>>>
-- This matches the start of the string, and then the first optional
M, but not the second and third M (but that's okay because they're optional), and then the end of the string.
- - This matches the start of the string, and then the first and second optional
M, but not the third M (but that's okay because it's optional), and then the end of the string.
+ - This matches the start of the string, and then the first optional
M, but not the second and third M (but that’s okay because they’re optional), and then the end of the string.
+ - This matches the start of the string, and then the first and second optional
M, but not the third M (but that’s okay because it’s optional), and then the end of the string.
- This matches the start of the string, and then all three optional
M, and then the end of the string.
- This matches the start of the string, and then all three optional
M, but then does not match the end of the string (because there is still one unmatched M), so the pattern does not match and search() returns None.
@@ -207,7 +207,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
- This matches the start of the string, then three
M out of a possible three, but then does not match the end of the string. The regular expression allows for up to only three M characters before the end of the string, but you have four, so the pattern does not match and returns None.
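To convince yourself that the two spellings behave the same way, you can compare them side by side. This sketch assumes the two patterns are '^M?M?M?$' and '^M{0,3}$'; the names compact and counted are made up for the comparison.

```python
import re

compact = '^M?M?M?$'   # three separate optional M characters
counted = '^M{0,3}$'   # between zero and three M characters

results = []
for n in range(6):
    s = 'M' * n
    # Record whether each pattern accepts a run of n M characters.
    results.append((s, bool(re.search(compact, s)), bool(re.search(counted, s))))

for s, compact_ok, counted_ok in results:
    print(repr(s), compact_ok, counted_ok)
```

Both columns should agree on every row: runs of zero to three M characters match, and longer runs do not.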
Checking for tens and ones
-Now let's expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
+
Now let’s expand the Roman numeral regular expression to cover the tens and ones place. This example shows the check for tens.
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL') ①
@@ -227,7 +227,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
- This matches the start of the string, then the first optional
M, then CM, then the optional L and all three optional X characters, then the end of the string. MCMLXXX is the Roman numeral representation of 1980.
- This matches the start of the string, then the first optional
M, then CM, then the optional L and all three optional X characters, then fails to match the end of the string because there is still one more X unaccounted for. So the entire pattern fails to match, and returns None. MCMLXXXX is not a valid Roman numeral.
-The expression for the ones place follows the same pattern. I'll spare you the details and show you the end result.
+
The expression for the ones place follows the same pattern. I’ll spare you the details and show you the end result.
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
So what does that look like using this alternate {n,m} syntax? This example shows the new syntax.
@@ -244,19 +244,19 @@ characters. If you've used regular expressions in other languages (like Perl), t
- This matches the start of the string, then one of a possible four
M characters, then D?C{0,3}. Of that, it matches the optional D and zero of three possible C characters. Moving on, it matches L?X{0,3} by matching the optional L and zero of three possible X characters. Then it matches V?I{0,3} by matching the optional V and zero of three possible I characters, and finally the end of the string. MDLV is the Roman numeral representation of 1555.
- This matches the start of the string, then two of a possible four
M characters, then the D?C{0,3} with a D and one of three possible C characters; then L?X{0,3} with an L and one of three possible X characters; then V?I{0,3} with a V and one of three possible I characters; then the end of the string. MMDCLXVI is the Roman numeral representation of 2666.
- - This matches the start of the string, then four out of four
M characters, then D?C{0,3} with a D and three out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3} with a V and three out of three I characters; then the end of the string. MMMMDCCCLXXXVIII is the Roman numeral representation of 3888, and it's the longest Roman numeral you can write without extended syntax.
- - Watch closely. (I feel like a magician. “Watch closely, kids, I'm going to pull a rabbit out of my hat.”) This matches the start of the string, then zero out of four
M, then matches D?C{0,3} by skipping the optional D and matching zero out of three C, then matches L?X{0,3} by skipping the optional L and matching zero out of three X, then matches V?I{0,3} by skipping the optional V and matching one out of three I. Then the end of the string. Whoa.
+ - This matches the start of the string, then four out of four
M characters, then D?C{0,3} with a D and three out of three C characters; then L?X{0,3} with an L and three out of three X characters; then V?I{0,3} with a V and three out of three I characters; then the end of the string. MMMMDCCCLXXXVIII is the Roman numeral representation of 3888, and it’s the longest Roman numeral you can write without extended syntax.
+ - Watch closely. (I feel like a magician. “Watch closely, kids, I’m going to pull a rabbit out of my hat.”) This matches the start of the string, then zero out of four
M, then matches D?C{0,3} by skipping the optional D and matching zero out of three C, then matches L?X{0,3} by skipping the optional L and matching zero out of three X, then matches V?I{0,3} by skipping the optional V and matching one out of three I. Then the end of the string. Whoa.
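One way to build confidence in a rewritten pattern is to run the old and new versions against the same inputs. The sketch below assumes the {n,m} version is a direct rewrite of the compact pattern shown earlier, with each run of three optional characters replaced by {0,3}. That rewrite, and the sample list, are my own illustration rather than the exact pattern used in the example above.

```python
import re

compact = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)(IX|IV|V?I?I?I?)$'
counted = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

samples = ['MDLV', 'MMDCLXVI', 'MMMDCCCLXXXVIII', 'I', 'MCMLXXXX', 'MMMM', '']

# Map each sample to a (compact result, counted result) pair.
agreement = {
    s: (bool(re.search(compact, s)), bool(re.search(counted, s)))
    for s in samples
}

for s, (old, new) in agreement.items():
    print(repr(s), old, new)
```

If the rewrite is faithful, the two results agree for every sample, whether the sample is a valid Roman numeral or not.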
-If you followed all that and understood it on the first try, you're doing better than I did. Now imagine trying to understand someone else's regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I've done it, and it's not a pretty sight.
-
Now let's explore an alternate syntax that can help keep your expressions maintainable.
+
If you followed all that and understood it on the first try, you’re doing better than I did. Now imagine trying to understand someone else’s regular expressions, in the middle of a critical function of a large program. Or even imagine coming back to your own regular expressions a few months later. I’ve done it, and it’s not a pretty sight.
+
Now let’s explore an alternate syntax that can help keep your expressions maintainable.
Verbose Regular Expressions
-So far you've just been dealing with what I'll call “compact” regular expressions. As you've seen, they are difficult to read, and even if you figure out what one does, that's no guarantee that you'll be able to understand it six months later. What you really need is inline documentation.
+
So far you’ve just been dealing with what I’ll call “compact” regular expressions. As you’ve seen, they are difficult to read, and even if you figure out what one does, that’s no guarantee that you’ll be able to understand it six months later. What you really need is inline documentation.
Python allows you to do this with something called verbose regular expressions. A verbose regular expression is different from a compact regular expression in two ways:
-- Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They're not matched at all. (If you want to match a space in a verbose regular expression, you'll need to escape it by putting a backslash in front of it.)
-
- Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a
# character and goes until the end of the line. In this case it's a comment within a multi-line string instead of within your source code, but it works the same way.
+ - Whitespace is ignored. Spaces, tabs, and carriage returns are not matched as spaces, tabs, and carriage returns. They’re not matched at all. (If you want to match a space in a verbose regular expression, you’ll need to escape it by putting a backslash in front of it.)
+
- Comments are ignored. A comment in a verbose regular expression is just like a comment in Python code: it starts with a
# character and goes until the end of the line. In this case it’s a comment within a multi-line string instead of within your source code, but it works the same way.
-This will be more clear with an example. Let's revisit the compact regular expression you've been working with, and make it a verbose regular expression. This example shows how.
+
This will be more clear with an example. Let’s revisit the compact regular expression you’ve been working with, and make it a verbose regular expression. This example shows how.
>>> pattern = """
^ # beginning of string
@@ -277,14 +277,14 @@ characters. If you've used regular expressions in other languages (like Perl), t
<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M') ④
-- The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them:
re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it's a lot more readable.
+ - The most important thing to remember when using verbose regular expressions is that you need to pass an extra argument when working with them:
re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it’s a lot more readable.
- This matches the start of the string, then one of a possible four
M, then CM, then L and three of a possible three X, then IX, then the end of the string.
- This matches the start of the string, then four of a possible four
M, then D and three of a possible three C, then L and three of a possible three X, then V and three of a possible three I, then the end of the string.
- - This does not match. Why? Because it doesn't have the
re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can't auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
+ - This does not match. Why? Because it doesn’t have the
re.VERBOSE flag, so the re.search function is treating the pattern as a compact regular expression, with significant whitespace and literal hash marks. Python can’t auto-detect whether a regular expression is verbose or not. Python assumes every regular expression is compact unless you explicitly state that it is verbose.
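Here is a self-contained sketch of the difference the flag makes. The pattern below is an assumption: it is the compact pattern from earlier written with M{0,3} and spread over several lines with comments, so it may not be identical to the one in the example above.

```python
import re

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands: 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds: 900, 400, 0-300 or 500-800
    (XC|XL|L?X{0,3})    # tens: 90, 40, 0-30 or 50-80
    (IX|IV|V?I{0,3})    # ones: 9, 4, 0-3 or 5-8
    $                   # end of string
    """

# With the flag, whitespace and comments are stripped before matching.
with_flag = re.search(pattern, 'MCMLXXXIX', re.VERBOSE)

# Without the flag, the newlines, spaces, and hash marks are all literal,
# so even a simple numeral fails to match.
without_flag = re.search(pattern, 'M')

print(with_flag is not None, without_flag is not None)
```

The same string constant is used both times; only the presence of re.VERBOSE changes how it is interpreted.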
Case study: parsing phone numbers
-So far you've concentrated on matching whole patterns. Either the pattern matches, or it doesn't. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
-
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company's database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
+
So far you’ve concentrated on matching whole patterns. Either the pattern matches, or it doesn’t. But regular expressions are much more powerful than that. When a regular expression does match, you can pick out specific pieces of it. You can find out what matched where.
+
This example came from another real-world problem I encountered, again from a previous day job. The problem: parsing an American phone number. The client wanted to be able to enter the number free-form (in a single field), but then wanted to store the area code, trunk, number, and optionally an extension separately in the company’s database. I scoured the Web and found many examples of regular expressions that purported to do this, but none of them were permissive enough.
Here are the phone numbers I needed to be able to accept:
800-555-1212
@@ -298,7 +298,7 @@ characters. If you've used regular expressions in other languages (like Perl), t
work 1-(800) 555.1212 #1234
Quite a variety! In each of these cases, I need to know that the area code was 800, the trunk was 555, and the rest of the phone number was 1212. For those with an extension, I need to know that the extension was 1234.
-
Let's work through developing a solution for phone number parsing. This example shows the first step.
+
Let’s work through developing a solution for phone number parsing. This example shows the first step.
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$') ①
>>> phonePattern.search('800-555-1212').groups() ②
@@ -306,9 +306,9 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('800-555-1212-1234') ③
>>>
-- Always read regular expressions from left to right. This one matches the beginning of the string, and then
(\d{3}). What's \d{3}? Well, the {3} means “match exactly three numeric digits”; it's a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
+ - Always read regular expressions from left to right. This one matches the beginning of the string, and then
(\d{3}). What’s \d{3}? Well, the {3} means “match exactly three numeric digits”; it’s a variation on the {n,m} syntax you saw earlier. \d means “any numeric digit” (0 through 9). Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. Then match a literal hyphen. Then match another group of exactly three digits. Then another literal hyphen. Then another group of exactly four digits. Then match the end of the string.
- To get access to the groups that the regular expression parser remembered along the way, use the
groups() method on the object that the search() method returns. It will return a tuple of however many groups were defined in the regular expression. In this case, you defined three groups, one with three digits, one with three digits, and one with four digits.
- - This regular expression is not the final answer, because it doesn't handle a phone number with an extension on the end. For that, you'll need to expand the regular expression.
+
- This regular expression is not the final answer, because it doesn’t handle a phone number with an extension on the end. For that, you’ll need to expand the regular expression.
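As a runnable version of this first step, the sketch below uses the pattern exactly as shown and checks both the success case and the extension case that fails.

```python
import re

# Three remembered groups separated by literal hyphens.
phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

match = phonePattern.search('800-555-1212')
parts = match.groups()          # one string per parenthesized group
print(parts)

# A trailing extension leaves characters before the end of the string,
# so search() returns None. Calling .groups() on that would raise
# AttributeError, which is why you check for None first.
with_extension = phonePattern.search('800-555-1212-1234')
print(with_extension)
```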
>>> phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$') ①
@@ -319,10 +319,10 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('800-555-1212') ④
>>>
-- This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What's new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
+
- This regular expression is almost identical to the previous one. Just as before, you match the beginning of the string, then a remembered group of three digits, then a hyphen, then a remembered group of three digits, then a hyphen, then a remembered group of four digits. What’s new is that you then match another hyphen, and a remembered group of one or more digits, then the end of the string.
- The
groups() method now returns a tuple of four elements, since the regular expression now defines four groups to remember.
- - Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the phone number are separated by hyphens. What if they're separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators.
-
- Oops! Not only does this regular expression not do everything you want, it's actually a step backwards, because now you can't parse phone numbers without an extension. That's not what you wanted at all; if the extension is there, you want to know what it is, but if it's not there, you still want to know what the different parts of the main number are.
+
- Unfortunately, this regular expression is not the final answer either, because it assumes that the different parts of the phone number are separated by hyphens. What if they’re separated by spaces, or commas, or dots? You need a more general solution to match several different types of separators.
+
- Oops! Not only does this regular expression not do everything you want, it’s actually a step backwards, because now you can’t parse phone numbers without an extension. That’s not what you wanted at all; if the extension is there, you want to know what it is, but if it’s not there, you still want to know what the different parts of the main number are.
The next example shows the regular expression to handle separators between the different parts of the phone number.
@@ -336,11 +336,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('800-555-1212') ⑤
>>>
-- Hang on to your hat. You're matching the beginning of the string, then a group of three digits, then
\D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches one or more characters that are not digits. This is what you're using instead of a literal hyphen, to try to match different separators.
+ - Hang on to your hat. You’re matching the beginning of the string, then a group of three digits, then
\D+. What the heck is that? Well, \D matches any character except a numeric digit, and + means “1 or more”. So \D+ matches one or more characters that are not digits. This is what you’re using instead of a literal hyphen, to try to match different separators.
- Using
\D+ instead of - means you can now match phone numbers where the parts are separated by spaces instead of hyphens.
- Of course, phone numbers separated by hyphens still work too.
- Unfortunately, this is still not the final answer, because it assumes that there is a separator at all. What if the phone number is entered without any spaces or hyphens at all?
-
- Oops! This still hasn't fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique.
+
- Oops! This still hasn’t fixed the problem of requiring extensions. Now you have two problems, but you can solve both of them with the same technique.
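The pattern for this step is not repeated here, so the sketch below reconstructs it from the description: the same four groups as before, with each literal hyphen replaced by \D+. Treat the exact pattern as an assumption.

```python
import re

# \D+ means "one or more characters that are not digits".
phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')

spaces = phonePattern.search('800 555 1212 1234').groups()
hyphens = phonePattern.search('800-555-1212-1234').groups()
print(spaces, hyphens)

# Both remaining problems: no separators at all, and no extension.
no_separators = phonePattern.search('80055512121234')
no_extension = phonePattern.search('800-555-1212')
print(no_separators, no_extension)
```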
The next example shows the regular expression for handling phone numbers without separators.
@@ -354,11 +354,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('(800)5551212 x1234') ⑤
>>>
-- The only change you've made since that last step is changing all the
+ to *. Instead of \D+ between the parts of the phone number, you now match on \D*. Remember that + means “1 or more”? Well, * means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
+ - The only change you’ve made since that last step is changing all the
+ to *. Instead of \D+ between the parts of the phone number, you now match on \D*. Remember that + means “1 or more”? Well, * means “zero or more”. So now you should be able to parse phone numbers even when there is no separator character at all.
- Lo and behold, it actually works. Why? You matched the beginning of the string, then a remembered group of three digits (
800), then zero non-numeric characters, then a remembered group of three digits (555), then zero non-numeric characters, then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of an arbitrary number of digits (1234), then the end of the string.
- Other variations work now too: dots instead of hyphens, and both a space and an
x before the extension.
- - Finally, you've solved the other long-standing problem: extensions are optional again. If no extension is found, the
groups() method still returns a tuple of four elements, but the fourth element is just an empty string.
- - I hate to be the bearer of bad news, but you're not finished yet. What's the problem here? There's an extra character before the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of “zero or more non-numeric characters” to skip over the leading characters before the area code.
+
- Finally, you’ve solved the other long-standing problem: extensions are optional again. If no extension is found, the
groups() method still returns a tuple of four elements, but the fourth element is just an empty string.
+ - I hate to be the bearer of bad news, but you’re not finished yet. What’s the problem here? There’s an extra character before the area code, but the regular expression assumes that the area code is the first thing at the beginning of the string. No problem, you can use the same technique of “zero or more non-numeric characters” to skip over the leading characters before the area code.
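Here is the same step as a standalone sketch. The pattern is reconstructed from the description (every + changed to *), so consider it an assumption rather than a copy of the example above.

```python
import re

# \D* means "zero or more non-digits"; \d* makes the extension optional.
phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

run_together = phonePattern.search('80055512121234').groups()
dots_and_x = phonePattern.search('800.555.1212 x1234').groups()
no_extension = phonePattern.search('800-555-1212').groups()
print(run_together, dots_and_x, no_extension)

# Still anchored to the start, so a leading parenthesis breaks the match.
leading_paren = phonePattern.search('(800)5551212 x1234')
print(leading_paren)
```

Note that a missing extension shows up as an empty string in the fourth position, not as a shorter tuple.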
The next example shows how to handle leading characters in phone numbers.
@@ -370,12 +370,12 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('work 1-(800) 555.1212 #1234') ④
>>>
-- This is the same as in the previous example, except now you're matching
\D*, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you're not remembering these non-numeric characters (they're not in parentheses). If you find them, you'll just skip over them and then start remembering the area code whenever you get to it.
- - You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it's treated as a non-numeric separator and matched by the
\D* after the first remembered group.)
- - Just a sanity check to make sure you haven't broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits (
800), then one non-numeric character (the hyphen), then a remembered group of three digits (555), then one non-numeric character (the hyphen), then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
- - This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn't this phone number match? Because there's a
1 before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.
+ - This is the same as in the previous example, except now you’re matching
\D*, zero or more non-numeric characters, before the first remembered group (the area code). Notice that you’re not remembering these non-numeric characters (they’re not in parentheses). If you find them, you’ll just skip over them and then start remembering the area code whenever you get to it.
+ - You can successfully parse the phone number, even with the leading left parenthesis before the area code. (The right parenthesis after the area code is already handled; it’s treated as a non-numeric separator and matched by the
\D* after the first remembered group.)
+ - Just a sanity check to make sure you haven’t broken anything that used to work. Since the leading characters are entirely optional, this matches the beginning of the string, then zero non-numeric characters, then a remembered group of three digits (
800), then one non-numeric character (the hyphen), then a remembered group of three digits (555), then one non-numeric character (the hyphen), then a remembered group of four digits (1212), then zero non-numeric characters, then a remembered group of zero digits, then the end of the string.
+ - This is where regular expressions make me want to gouge my eyes out with a blunt object. Why doesn’t this phone number match? Because there’s a
1 before the area code, but you assumed that all the leading characters before the area code were non-numeric characters (\D*). Aargh.
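To reproduce this step on your own, the sketch below assumes the pattern is the previous one with \D* added right after the ^. The first sample string is my own test input.

```python
import re

# Skip any leading non-digits, but do not remember them (no parentheses).
phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

leading_paren = phonePattern.search('(800)5551212 ext. 1234').groups()
plain = phonePattern.search('800-555-1212').groups()
print(leading_paren, plain)

# The leading "1" is a digit, so \D* cannot skip past it, and "1-(" is
# not three digits, so the area code group cannot start there either.
leading_digit = phonePattern.search('work 1-(800) 555.1212 #1234')
print(leading_digit)
```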
-Let's back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let's take a different approach: don't explicitly match the beginning of the string at all. This approach is shown in the next example.
+
Let’s back up for a second. So far the regular expressions have all matched from the beginning of the string. But now you see that there may be an indeterminate amount of stuff at the beginning of the string that you want to ignore. Rather than trying to match it all just so you can skip over it, let’s take a different approach: don’t explicitly match the beginning of the string at all. This approach is shown in the next example.
>>> phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') ①
>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups() ②
@@ -385,13 +385,13 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('80055512121234') ④
('800', '555', '1212', '1234')
-- Note the lack of
^ in this regular expression. You are not matching the beginning of the string anymore. There's nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
+ - Note the lack of
^ in this regular expression. You are not matching the beginning of the string anymore. There’s nothing that says you need to match the entire input with your regular expression. The regular expression engine will do the hard work of figuring out where the input string starts to match, and go from there.
- Now you can successfully parse a phone number that includes leading characters and a leading digit, plus any number of any kind of separators around each part of the phone number.
- Sanity check. This still works.
- That still works too.
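A quick way to guard against regressions is to run every input through the final pattern in one loop. The sketch below uses the unanchored pattern shown above; the dictionary of expected results is my own summary of what each input should produce.

```python
import re

# No ^ at the front: the engine finds where the match starts.
phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

expected = {
    'work 1-(800) 555.1212 #1234': ('800', '555', '1212', '1234'),
    '800-555-1212': ('800', '555', '1212', ''),
    '80055512121234': ('800', '555', '1212', '1234'),
}

# Parse every input and keep the groups for comparison.
actual = {s: phonePattern.search(s).groups() for s in expected}

for s, groups in actual.items():
    print(repr(s), groups)
```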
See how quickly a regular expression can get out of control? Take a quick glance at any of the previous iterations. Can you tell the difference between one and the next?
-
While you still understand the final answer (and it is the final answer; if you've discovered a case it doesn't handle, I don't want to know about it), let's write it out as a verbose regular expression, before you forget why you made the choices you made.
+
While you still understand the final answer (and it is the final answer; if you’ve discovered a case it doesn’t handle, I don’t want to know about it), let’s write it out as a verbose regular expression, before you forget why you made the choices you made.
>>> phonePattern = re.compile(r'''
# don't match beginning of string, number can start anywhere
@@ -409,11 +409,11 @@ characters. If you've used regular expressions in other languages (like Perl), t
>>> phonePattern.search('800-555-1212') ②
('800', '555', '1212', '')
-- Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it's no surprise that it parses the same inputs.
-
- Final sanity check. Yes, this still works. You're done.
+
- Other than being spread out over multiple lines, this is exactly the same regular expression as the last step, so it’s no surprise that it parses the same inputs.
+
- Final sanity check. Yes, this still works. You’re done.
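If you want the whole thing in one piece you can paste into a file, here is a sketch. It assumes the verbose pattern is the final compact pattern from the previous step, one element per line; the comments are illustrative.

```python
import re

phonePattern = re.compile(r'''
                # no leading anchor: the number can start anywhere
    (\d{3})     # area code is 3 digits
    \D*         # optional separator: any number of non-digits
    (\d{3})     # trunk is 3 digits
    \D*         # optional separator
    (\d{4})     # rest of the number is 4 digits
    \D*         # optional separator
    (\d*)       # extension is optional: any number of digits
    $           # end of string
    ''', re.VERBOSE)

messy = phonePattern.search('work 1-(800) 555.1212 #1234').groups()
plain = phonePattern.search('800-555-1212').groups()
print(messy, plain)
```

Passing re.VERBOSE to re.compile() means you do not have to remember the flag on every later call to search().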
Summary
-This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you're completely overwhelmed by them now, believe me, you ain't seen nothing yet.
+
This is just the tiniest tip of the iceberg of what regular expressions can do. In other words, even though you’re completely overwhelmed by them now, believe me, you ain’t seen nothing yet.
You should now be familiar with the following techniques:
^ matches the beginning of a string.
diff --git a/roman1.py b/roman1.py
index e051037..b9d6311 100644
--- a/roman1.py
+++ b/roman1.py
@@ -27,7 +27,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
-
-def from_roman(s):
- """convert Roman numeral to integer"""
- pass
diff --git a/roman2.py b/roman2.py
index 285abce..c751dfc 100644
--- a/roman2.py
+++ b/roman2.py
@@ -22,8 +22,8 @@ roman_numeral_map = (('M', 1000),
def to_roman(n):
"""convert integer to Roman numeral"""
- if n > 3999:
- raise OutOfRangeError("number out of range (must be less than 3999)")
+# if n > 3999:
+# raise OutOfRangeError("number out of range (must be less than 3999)")
result = ""
for numeral, integer in roman_numeral_map:
@@ -31,7 +31,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
-
-def from_roman(s):
- """convert Roman numeral to integer"""
- pass
diff --git a/roman3.py b/roman3.py
index 832dfcf..42cf9f2 100644
--- a/roman3.py
+++ b/roman3.py
@@ -31,7 +31,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
-
-def from_roman(s):
- """convert Roman numeral to integer"""
- pass
diff --git a/roman4.py b/roman4.py
index e0ddc8b..14ee944 100644
--- a/roman4.py
+++ b/roman4.py
@@ -34,7 +34,3 @@ def to_roman(n):
result += numeral
n -= integer
return result
-
-def from_roman(s):
- """convert Roman numeral to integer"""
- pass
diff --git a/table-of-contents.html b/table-of-contents.html
index a7c1e3c..b05c4d3 100644
--- a/table-of-contents.html
+++ b/table-of-contents.html
@@ -121,22 +121,16 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
- ...mention why from module import * is only allowed at module level
-
- Unit testing
+
- Unit testing
- - Introduction to Roman numerals
-
- Diving in
-
- Introducing romantest.py
-
- Testing for success
-
- Testing for failure
-
- Testing for sanity
+
- (Not) diving in
+
romantest1.py
+ romantest2.py
+ - ...
- Test-first programming
- - roman.py, stage 1
-
- roman.py, stage 2
-
- roman.py, stage 3
-
- roman.py, stage 4
-
- roman.py, stage 5
+
- ...
- Refactoring your code
diff --git a/unit-testing.html b/unit-testing.html
new file mode 100644
index 0000000..2424e55
--- /dev/null
+++ b/unit-testing.html
@@ -0,0 +1,278 @@
+
+
+
+
+Unit testing - Dive into Python 3
+
+
+
+
+
+
+Unit testing
+
+❝ Certitude is not the test of certainty. We have been cocksure of many things that were not so. ❞
— Oliver Wendell Holmes, Jr.
+
+
+- (Not) diving in
+
romantest1.py
+romantest2.py
+- ...
+
+(Not) diving in
+In previous chapters, you “dived in” by immediately looking at code and trying to understand it as quickly as possible. Now that you have some Python under your belt, you're going to step back and look at the steps that happen before the code gets written.
+
In this chapter, you're going to write, debug, and optimize a set of utility functions to convert to and from Roman numerals. You saw the mechanics of constructing and validating Roman numerals in “Case study: roman numerals”. Now let's step back and consider what it would take to expand that into a two-way utility.
+
The rules for Roman numerals lead to a number of interesting observations:
+
+- There is only one correct way to represent a particular number as Roman numerals.
+
- The converse is also true: if a string of characters is a valid Roman numeral, it represents only one number (that is, it can only be read one way).
+
- There is a limited range of numbers that can be expressed as Roman numerals, specifically
1 through 3999. (The Romans did have several ways of expressing larger numbers, for instance by having a bar over a numeral to represent that its normal value should be multiplied by 1000, but you're not going to deal with that. For the purposes of this chapter, let's stipulate that Roman numerals go from 1 to 3999.)
+ - There is no way to represent
0 in Roman numerals.
+ - There is no way to represent negative numbers in Roman numerals.
+
- There is no way to represent fractions or non-integer numbers in Roman numerals.
+
+Let's start mapping out what a roman.py module should do. It will have two main functions, to_roman() and from_roman(). The to_roman() function should take an integer from 1 to 3999 and return the Roman numeral representation as a string…
+Stop right there. Now let's do something a little unexpected: write a test case that checks whether the to_roman() function does what you want it to. You read that right: you're going to write code that tests code that you haven't written yet.
+
This is called unit testing. The set of two conversion functions — to_roman(), and later from_roman() — can be written and tested as a unit, separate from any larger program that imports them. Python has a framework for unit testing, the appropriately-named unittest module.
+
Unit testing is an important part of an overall testing-centric development strategy. If you write unit tests, it is important to write them early (preferably before writing the code that they test), and to keep them updated as code and requirements change. Unit testing is not a replacement for higher-level functional or system testing, but it is important in all phases of development:
+
+- Before writing code, it forces you to detail your requirements in a useful fashion.
+
- While writing code, it keeps you from over-coding. When all the test cases pass, the function is complete.
+
- When refactoring code, it assures you that the new version behaves the same way as the old version.
+
- When maintaining code, it helps you cover your ass when someone comes screaming that your latest change broke their old code. (“But sir, all the unit tests passed when I checked it in...”)
+
- When writing code in a team, it increases confidence that the code you're about to commit isn't going to break someone else's code, because you can run their unit tests first. (I've seen this sort of thing in code sprints. A team breaks up the assignment, everybody takes the specs for their task, writes unit tests for it, then shares their unit tests with the rest of the team. That way, nobody goes off too far into developing code that doesn't play well with others.)
+
+romantest1.py
+A test case answers a single question about the code it is testing. A test case should be able to...
+
+- ...run completely by itself, without any human input. Unit testing is about automation.
+
- ...determine by itself whether the function it is testing has passed or failed, without a human interpreting the results.
+
- ...run in isolation, separate from any other test cases (even if they test the same functions). Each test case is an island.
+
+Given that, let's build a test case for the first requirement:
+
+- The
to_roman() function should return the Roman numeral representation for all integers 1 to 3999.
+
+It is not immediately obvious how this code does… well, anything. It defines a class which has no __init__() method. The class does have another method, but it is never called. The entire script has a __main__ block, but it doesn't reference the class or its method. But it does do something, I promise.
+
import roman1
+import unittest
+
+class KnownValues(unittest.TestCase):               ①
+    known_values = ( (1, 'I'),
+                     (2, 'II'),
+                     (3, 'III'),
+                     (4, 'IV'),
+                     (5, 'V'),
+                     (6, 'VI'),
+                     (7, 'VII'),
+                     (8, 'VIII'),
+                     (9, 'IX'),
+                     (10, 'X'),
+                     (50, 'L'),
+                     (100, 'C'),
+                     (500, 'D'),
+                     (1000, 'M'),
+                     (31, 'XXXI'),
+                     (148, 'CXLVIII'),
+                     (294, 'CCXCIV'),
+                     (312, 'CCCXII'),
+                     (421, 'CDXXI'),
+                     (528, 'DXXVIII'),
+                     (621, 'DCXXI'),
+                     (782, 'DCCLXXXII'),
+                     (870, 'DCCCLXX'),
+                     (941, 'CMXLI'),
+                     (1043, 'MXLIII'),
+                     (1110, 'MCX'),
+                     (1226, 'MCCXXVI'),
+                     (1301, 'MCCCI'),
+                     (1485, 'MCDLXXXV'),
+                     (1509, 'MDIX'),
+                     (1607, 'MDCVII'),
+                     (1754, 'MDCCLIV'),
+                     (1832, 'MDCCCXXXII'),
+                     (1993, 'MCMXCIII'),
+                     (2074, 'MMLXXIV'),
+                     (2152, 'MMCLII'),
+                     (2212, 'MMCCXII'),
+                     (2343, 'MMCCCXLIII'),
+                     (2499, 'MMCDXCIX'),
+                     (2574, 'MMDLXXIV'),
+                     (2646, 'MMDCXLVI'),
+                     (2723, 'MMDCCXXIII'),
+                     (2892, 'MMDCCCXCII'),
+                     (2975, 'MMCMLXXV'),
+                     (3051, 'MMMLI'),
+                     (3185, 'MMMCLXXXV'),
+                     (3250, 'MMMCCL'),
+                     (3313, 'MMMCCCXIII'),
+                     (3408, 'MMMCDVIII'),
+                     (3501, 'MMMDI'),
+                     (3610, 'MMMDCX'),
+                     (3743, 'MMMDCCXLIII'),
+                     (3844, 'MMMDCCCXLIV'),
+                     (3888, 'MMMDCCCLXXXVIII'),
+                     (3940, 'MMMCMXL'),
+                     (3999, 'MMMCMXCIX'))           ②
+
+    def test_to_roman_known_values(self):           ③
+        """to_roman should give known result with known input"""
+        for integer, numeral in self.known_values:
+            result = roman1.to_roman(integer)       ④
+            self.assertEqual(numeral, result)       ⑤
+
+if __name__ == "__main__":
+    unittest.main()
+
+- To write a test case, first subclass the
TestCase class of the unittest module. This class provides many useful methods which you can use in your test case to test specific conditions.
+ - This is a list of integer/numeral pairs that I verified manually. It includes the lowest ten numbers, the highest number, every number that translates to a single-character Roman numeral, and a random sampling of other valid numbers. The point of a unit test is not to test every possible input, but to test a representative sample.
+
- Every individual test is its own method, which must take no parameters and return no value. If the method exits normally without raising an exception, the test is considered passed; if the method raises an exception, the test is considered failed.
+
- Here you call the actual
to_roman() function. (Well, the function hasn't been written yet, but once it is, this is the line that will call it.) Notice that you have now defined the API for the to_roman() function: it must take an integer (the number to convert) and return a string (the Roman numeral representation). If the API is different than that, this test is considered failed. Also notice that you are not trapping any exceptions when you call to_roman(). This is intentional. to_roman() shouldn't raise an exception when you call it with valid input, and these input values are all valid. If to_roman() raises an exception, this test is considered failed.
+ - Assuming the
to_roman() function was defined correctly, called correctly, completed successfully, and returned a value, the last step is to check whether it returned the right value. This is a common question, and the TestCase class provides a method, assertEqual, to check whether two values are equal. If the result returned from to_roman() (result) does not match the known value you were expecting (numeral), assertEqual will raise an exception and the test will fail. If the two values are equal, assertEqual will do nothing. If every value returned from to_roman() matches the known value you expect, assertEqual never raises an exception, so test_to_roman_known_values eventually exits normally, which means to_roman() has passed this test.
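If it is still unclear how a method that nobody calls ends up running, the following sketch may help. It is not part of romantest1.py; TinyExample and its two methods are hypothetical names made up for illustration. It uses the unittest loader directly to show that methods whose names start with test are collected and run, while other methods are ignored.

```python
import unittest


class TinyExample(unittest.TestCase):
    """Hypothetical stand-in for KnownValues, small enough to read at a glance."""

    def test_addition(self):
        """1 + 1 should equal 2"""
        self.assertEqual(2, 1 + 1)

    def helper_not_a_test(self):
        # Not collected: the name does not start with "test".
        raise RuntimeError("never called by the test runner")


# The default loader looks for methods whose names start with "test".
loader = unittest.TestLoader()
test_names = loader.getTestCaseNames(TinyExample)

# Build a suite from the class and run it without exiting the interpreter.
suite = loader.loadTestsFromTestCase(TinyExample)
result = unittest.TestResult()
suite.run(result)

print(test_names)               # names of the collected test methods
print(result.testsRun)          # how many tests actually ran
print(result.wasSuccessful())   # True when there were no failures or errors
```

Calling unittest.main() from a script does roughly the same collecting and running for every TestCase subclass in the module, then prints a report.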
+
+Once you have a test case, you can start coding the to_roman() function. First, you should stub it out as an empty function and make sure the tests fail. If the tests succeed before you've written any code, you're doing it wrong — your tests aren't testing your code at all! Write a test that fails, then code until it passes.
+
# roman1.py
+
+def to_roman(n):
+    """convert integer to Roman numeral"""
+    pass                                   ①
+
+- At this stage, you want to define the API of the
to_roman() function, but you don't want to code it yet. (Your test needs to fail first.) To stub it out, use the Python reserved word pass [FIXME ref], which does precisely nothing.
+
+Execute romantest1.py on the command line to run the test. If you call it with the -v command-line option, it will give more verbose output so you can see exactly what's going on as each test case runs. With any luck, your output should look like this:
+
+you@localhost:~$ python3 romantest1.py -v
+to_roman should give known result with known input ... FAIL ①
+
+======================================================================
+FAIL: to_roman should give known result with known input
+----------------------------------------------------------------------
+Traceback (most recent call last):
+ File "romantest1.py", line 73, in test_to_roman_known_values
+ self.assertEqual(numeral, result)
+AssertionError: 'I' != None ②
+
+----------------------------------------------------------------------
+Ran 1 test in 0.016s ③
+
+FAILED (failures=1) ④
+
+- Running the script runs
unittest.main(), which runs each test case. Each test case is a method within each class in romantest1.py that inherits from unittest.TestCase. For each test case, the unittest module will print out the docstring of the method and whether that test passed or failed. As expected, this test case fails.
+ - For each failed test case,
unittest displays the trace information showing exactly what happened. In this case, the call to assertEqual() raised an AssertionError because it was expecting to_roman(1) to return "I", but it didn't. (Since there was no explicit return statement, the function returned None, the Python null value.)
+ - After the detail of each test,
unittest displays a summary of how many tests were performed and how long it took.
+ - Overall, the unit test failed because at least one test case did not pass. When a test case doesn't pass,
unittest distinguishes between failures and errors. A failure is a call to an assertXYZ method, like assertEqual or assertRaises, that fails because the asserted condition is not true or the expected exception was not raised. An error is any other sort of exception raised in the code you're testing or the unit test case itself.
+
+Now, finally, you can write the to_roman() function.
+
roman_numeral_map = (('M', 1000),
+ ('CM', 900),
+ ('D', 500),
+ ('CD', 400),
+ ('C', 100),
+ ('XC', 90),
+ ('L', 50),
+ ('XL', 40),
+ ('X', 10),
+ ('IX', 9),
+ ('V', 5),
+ ('IV', 4),
+ ('I', 1)) ①
+
+def to_roman(n):
+    """convert integer to Roman numeral"""
+    result = ""
+    for numeral, integer in roman_numeral_map:
+        while n >= integer:                     ②
+            result += numeral
+            n -= integer
+    return result
+
+- roman_numeral_map is a tuple of tuples which defines three things: the character representations of the most basic Roman numerals; the order of the Roman numerals (in descending value order, from
M all the way down to I); the value of each Roman numeral. Each inner tuple is a pair of (numeral, value). It's not just single-character Roman numerals; it also defines two-character pairs like CM (“one hundred less than one thousand”). This makes the to_roman() function code simpler.
+ - Here's where the rich data structure of roman_numeral_map pays off, because you don't need any special logic to handle the subtraction rule. To convert to Roman numerals, simply iterate through roman_numeral_map looking for the largest integer value less than or equal to the input. Once found, add the Roman numeral representation to the end of the output, subtract the corresponding integer value from the input, lather, rinse, repeat.
+
+If you're still not clear how the to_roman() function works, add a print() call to the end of the while loop:
+
+while n >= integer:
+    result += numeral
+    n -= integer
+    print('subtracting {0} from input, adding {1} to output'.format(integer, numeral))
+With the debug print() call in place, the output looks like this:
+
+>>> import roman1
+>>> roman1.to_roman(1424)
+subtracting 1000 from input, adding M to output
+subtracting 400 from input, adding CD to output
+subtracting 10 from input, adding X to output
+subtracting 10 from input, adding X to output
+subtracting 4 from input, adding IV to output
+'MCDXXIV'
+So the to_roman() function appears to work, at least in this manual spot check. But will it pass the test case you wrote?
+
+you@localhost:~$ python3 romantest1.py -v
+to_roman should give known result with known input ... ok
+
+----------------------------------------------------------------------
+Ran 1 test in 0.016s
+
+OK
+
+- Hooray! The
to_roman() function passes the “known values” test case. It's not comprehensive, but it does put the function through its paces with a variety of inputs, including inputs that produce every single-character Roman numeral, the largest possible input (3999), and the input that produces the longest possible Roman numeral (3888). At this point, you can be reasonably confident that the function works for any good input value you could throw at it.
+
+“Good” input? Hmm. What about bad input?
+
romantest2.py
+It is not enough to test that functions succeed when given good input; you must also test that they fail when given bad input. And not just any sort of failure; they must fail in the way you expect.
+
+>>> import roman1
+>>> roman1.to_roman(4000) ①
+'MMMM'
+>>> roman1.to_roman(5000)
+'MMMMM'
+>>> roman1.to_roman(9999)
+'MMMMMMMMMCMXCIX'
+
+- FIXME
+
+The question to ask yourself is, “How can I express this as a testable requirement?” How's this for starters:
+
+The to_roman() function should fail when given an integer greater than 3999.
+
+What would that test look like?
+
class ToRomanBadInput(unittest.TestCase):
+    def test_too_large(self):
+        """to_roman should fail with large input"""
+        self.assertRaises(roman2.OutOfRangeError, roman2.to_roman, 4000)
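For this test to have something to check, the module under test needs an OutOfRangeError exception and a range check in to_roman(). The sketch below is one way that could look, not necessarily how roman2.py ends up. Deriving OutOfRangeError from ValueError is an assumption, and everything is placed in a single file (rather than split between roman2.py and romantest2.py) so it can be run on its own.

```python
import unittest


class OutOfRangeError(ValueError):
    """Assumed definition: signals an integer that Roman numerals cannot express."""


roman_numeral_map = (('M', 1000), ('CM', 900), ('D', 500), ('CD', 400),
                     ('C', 100), ('XC', 90), ('L', 50), ('XL', 40),
                     ('X', 10), ('IX', 9), ('V', 5), ('IV', 4), ('I', 1))


def to_roman(n):
    """convert integer to Roman numeral"""
    # Reject input above the largest number this chapter agreed to support.
    if n > 3999:
        raise OutOfRangeError("number out of range (must be less than 4000)")
    result = ""
    for numeral, integer in roman_numeral_map:
        while n >= integer:
            result += numeral
            n -= integer
    return result


class ToRomanBadInput(unittest.TestCase):
    def test_too_large(self):
        """to_roman should fail with large input"""
        # assertRaises calls to_roman(4000) itself and passes only if
        # the named exception is raised.
        self.assertRaises(OutOfRangeError, to_roman, 4000)


suite = unittest.TestLoader().loadTestsFromTestCase(ToRomanBadInput)
result = unittest.TestResult()
suite.run(result)
print(result.wasSuccessful())
```

Notice that assertRaises takes the function and its arguments separately instead of the result of calling it. That way the unittest module makes the call and can trap the exception.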
+
+...
+
+
+
+
+
+
+
+
© 2001–4, 2009 ℳark Pilgrim, CC-BY-SA-3.0
+
+
diff --git a/your-first-python-program.html b/your-first-python-program.html
index 39939fe..bc400bc 100644
--- a/your-first-python-program.html
+++ b/your-first-python-program.html
@@ -40,7 +40,7 @@ body{counter-reset:h1 1}
Diving in
You know how other books go on and on about programming fundamentals and finally work up to building something useful? Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
-
[download]
+
SUFFIXES = {1000: ['KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'],
1024: ['KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB']}
@@ -70,11 +70,13 @@ if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))
Now let's run this program on the command line. On Windows, it will look something like this:
-
c:\home\diveintopython3> c:\python30\python.exe humansize.py
+
+c:\home\diveintopython3> c:\python30\python.exe humansize.py
1.0 TB
931.3 GiB
On Mac OS X or Linux, it would look something like this:
-
you@localhost:~$ python3 humansize.py
+
+you@localhost:~$ python3 humansize.py
1.0 TB
931.3 GiB
@@ -103,14 +105,14 @@ if __name__ == "__main__":
- A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
- weakly typed language
-- A language in which types are “automagically” coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string
'12' and the integer 3 to get the string '123', then treat that as the integer 123, all without any explicit conversion. [FIXME double-check this]
+- A language in which types are “automagically” coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string
'12' and the integer 3 to get the string '123', then treat that as the integer 123, all without any explicit conversion. [FIXME double-check this]
So Python is both dynamically typed (because it doesn't use explicit datatype declarations) and strongly typed (because once a variable has a datatype, it actually matters).
If you have experience in other programming languages, this table may help you visualize how Python compares to them:
Statically typed Dynamically typed
-Weakly typed C, Objective-C JavaScript, Perl 5, PHP
+Weakly typed C, Objective-C JavaScript, Perl 5, PHP
Strongly typed Pascal, Java Python, Ruby
Writing readable code
@@ -220,11 +222,13 @@ if __name__ == "__main__":
☞Like C, Python uses == for comparison and = for assignment. Unlike C, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
So what makes this if statement special? Well, modules are objects, and all modules have a built-in attribute __name__. A module's __name__ depends on how you're using the module. If you import the module, then __name__ is the module's filename, without a directory path or file extension.
-
>>> import humansize
+
+>>> import humansize
>>> humansize.__name__
'humansize'
But you can also run the module directly as a standalone program, in which case __name__ will be a special default value, __main__. Python will evaluate this if statement, find a true expression, and execute the if code block. In this case, to print two values.
-
c:\home\diveintopython3> c:\python30\python.exe humansize.py
+
+c:\home\diveintopython3> c:\python30\python.exe humansize.py
1.0 TB
931.3 GiB
Further reading