diff --git a/regular-expressions.html b/regular-expressions.html
index 01ab4fc..549544b 100755
--- a/regular-expressions.html
+++ b/regular-expressions.html
@@ -96,14 +96,14 @@ body{counter-reset:h1 5}
>>> import re
>>> pattern = '^M?M?M?$' ①
>>> re.search(pattern, 'M') ②
-<SRE_Match object at 0106FB58>
+<_sre.SRE_Match object at 0106FB58>
>>> re.search(pattern, 'MM') ③
-<SRE_Match object at 0106C290>
+<_sre.SRE_Match object at 0106C290>
>>> re.search(pattern, 'MMM') ④
-<SRE_Match object at 0106AA38>
+<_sre.SRE_Match object at 0106AA38>
>>> re.search(pattern, 'MMMM') ⑤
>>> re.search(pattern, '') ⑥
-<SRE_Match object at 0106F4A8>
+<_sre.SRE_Match object at 0106F4A8>
^ matches what follows only at the beginning of the string. If this were not specified, the pattern would match no matter where the M characters were, which is not what you want. You want to make sure that the M characters, if they’re there, are at the beginning of the string. M? optionally matches a single M character. Since this is repeated three times, you’re matching anywhere from zero to three M characters in a row. And $ matches the end of the string. When combined with the ^ character at the beginning, this means that the pattern must match the entire string, with no other characters before or after the M characters.
re module is the search() function, that takes a regular expression (pattern) and a string ('M') to try to match against the regular expression. If a match is found, search() returns an object which has various methods to describe the match; if no match is found, search() returns None, the Python null value. All you care about at the moment is whether the pattern matches, which you can tell by just looking at the return value of search(). 'M' matches this regular expression, because the first optional M matches and the second and third optional M characters are ignored.
@@ -142,14 +142,14 @@ body{counter-reset:h1 5}
>>> import re
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' ①
>>> re.search(pattern, 'MCM') ②
-<SRE_Match object at 01070390>
+<_sre.SRE_Match object at 01070390>
>>> re.search(pattern, 'MD') ③
-<SRE_Match object at 01073A50>
+<_sre.SRE_Match object at 01073A50>
>>> re.search(pattern, 'MMMCCC') ④
-<SRE_Match object at 010748A8>
+<_sre.SRE_Match object at 010748A8>
>>> re.search(pattern, 'MCMC') ⑤
>>> re.search(pattern, '') ⑥
-<SRE_Match object at 01071D98>
+<_sre.SRE_Match object at 01071D98>
^), then the thousands place (M?M?M?). Then it has the new part, in parentheses, which defines a set of three mutually exclusive patterns, separated by vertical bars: CM, CD, and D?C?C?C? (which is an optional D followed by zero to three optional C characters). The regular expression parser checks for each of these patterns in order (from left to right), takes the first one that matches, and ignores the rest.
'MCM' matches because the first M matches, the second and third M characters are ignored, and the CM matches (so the CD and D?C?C?C? patterns are never even considered). MCM is the Roman numeral representation of 1900.
@@ -168,14 +168,14 @@ body{counter-reset:h1 5}
>>> import re
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'M') ①
-<_sre.SRE_Match object at 0x008EE090>
+<_sre.SRE_Match object at 0x008EE090>
>>> pattern = '^M?M?M?$'
>>> re.search(pattern, 'MM') ②
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> pattern = '^M?M?M?$'
->>> re.search(pattern, 'MMM') ③
+>>> re.search(pattern, 'MMM') ③
<_sre.SRE_Match object at 0x008EE090>
->>> re.search(pattern, 'MMMM') ④
+>>> re.search(pattern, 'MMMM') ④
>>>
M, but not the second and third M (but that’s okay because they’re optional), and then the end of the string.
@@ -186,13 +186,13 @@ body{counter-reset:h1 5}
>>> pattern = '^M{0,3}$' ①
>>> re.search(pattern, 'M') ②
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MM') ③
-<_sre.SRE_Match object at 0x008EE090>
+<_sre.SRE_Match object at 0x008EE090>
>>> re.search(pattern, 'MMM') ④
-<_sre.SRE_Match object at 0x008EEDA8>
+<_sre.SRE_Match object at 0x008EEDA8>
>>> re.search(pattern, 'MMMM') ⑤
->>>
+>>>
M characters, then the end of the string.” The 0 and 3 can be any numbers; if you want to match at least one but no more than three M characters, you could say M{1,3}.
M out of a possible three, then the end of the string.
@@ -205,13 +205,13 @@ body{counter-reset:h1 5}
>>> pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
>>> re.search(pattern, 'MCMXL') ①
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCML') ②
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLX') ③
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXX') ④
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXX') ⑤
>>>
>>> pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
>>> re.search(pattern, 'MDLV') ①
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMDCLXVI') ②
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII') ③
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'I') ④
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
M characters, then D?C{0,3}. Of that, it matches the optional D and zero of three possible C characters. Moving on, it matches L?X{0,3} by matching the optional L and zero of three possible X characters. Then it matches V?I{0,3} by matching the optional V and zero of three possible I characters, and finally the end of the string. MDLV is the Roman numeral representation of 1555.
M characters, then the D?C{0,3} with a D and one of three possible C characters; then L?X{0,3} with an L and one of three possible X characters; then V?I{0,3} with a V and one of three possible I characters; then the end of the string. MMDCLXVI is the Roman numeral representation of 2666.
@@ -267,11 +267,11 @@ body{counter-reset:h1 5}
$ # end of string
'''
>>> re.search(pattern, 'M', re.VERBOSE) ①
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MCMLXXXIX', re.VERBOSE) ②
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE) ③
-<_sre.SRE_Match object at 0x008EEB48>
+<_sre.SRE_Match object at 0x008EEB48>
>>> re.search(pattern, 'M') ④
re.VERBOSE is a constant defined in the re module that signals that the pattern should be treated as a verbose regular expression. As you can see, this pattern has quite a bit of whitespace (all of which is ignored), and several comments (all of which are ignored). Once you ignore the whitespace and the comments, this is exactly the same regular expression as you saw in the previous section, but it’s a lot more readable.
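The equivalence described above can be checked with a small standalone script. This is only a sketch. The compact pattern is copied from the earlier hunk, but the layout and comments of the verbose pattern are an assumed reconstruction, since this diff shows only its last two lines. The helper `matches()` and the names `COMPACT`, `VERBOSE`, and `SAMPLES` are hypothetical and not part of the page.

```python
import re

# Compact pattern, copied from the earlier hunk in this diff.
COMPACT = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'

# Assumed verbose layout of the same pattern (the diff only shows its tail).
VERBOSE = '''
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $                   # end of string
'''

def matches(pattern, text, flags=0):
    """Return True if re.search() finds a match, False if it returns None."""
    return re.search(pattern, text, flags) is not None

SAMPLES = ['M', 'MCMLXXXIX', 'MMMDCCCLXXXVIII', 'MMMM', '']

for s in SAMPLES:
    # With re.VERBOSE, whitespace and comments are ignored, so both forms agree.
    print(repr(s), matches(COMPACT, s), matches(VERBOSE, s, re.VERBOSE))

# Without the flag, the whitespace and comments are treated as literal
# characters, so the verbose pattern no longer matches 'M'.
print(matches(VERBOSE, 'M'))
```

The final call corresponds to the ④ case in the hunk: the same pattern string passed without `re.VERBOSE` does not match.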