From 6cb6013d9033b4621ad13dde6129bd9a86613168 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Tue, 14 Jul 2009 11:02:00 -0400
Subject: [PATCH] added bit about overlapping matches in re.findall
---
 advanced-iterators.html | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/advanced-iterators.html b/advanced-iterators.html
index 6727d13..2427f4a 100755
--- a/advanced-iterators.html
+++ b/advanced-iterators.html
@@ -6,6 +6,7 @@
@@ -102,6 +103,26 @@ if __name__ == '__main__':
  • Here the regular expression pattern matches sequences of letters. Again, the return value is a list, and each item in the list is a string that matched the regular expression pattern. +

    Here’s another example that will stretch your brain a little. + +

    +>>> re.findall(' s.*? s', "The sixth sick sheikh's sixth sheep's sick.")
    +[' sixth s', " sheikh's s", " sheep's s"]
    + +

    Surprised? The regular expression looks for a space, an s, and then the shortest possible series of any character (.*?), then a space, then another s. Well, looking at that input string, I see five matches: + +

      +
    1. The[ sixth s]ick sheikh's sixth sheep's sick. +
    2. The sixth[ sick s]heikh's sixth sheep's sick. +
    3. The sixth sick[ sheikh's s]ixth sheep's sick. +
    4. The sixth sick sheikh's[ sixth s]heep's sick. +
    5. The sixth sick sheikh's sixth[ sheep's s]ick. +
    + +
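If you want to check that count yourself, here is a rough sketch. It wraps the pattern in a zero-width lookahead, so each match consumes no characters and the scan can report a candidate at every starting position. The names text and overlapping, and the lookahead trick itself, are my own illustration and not part of the example above.

```python
import re

text = "The sixth sick sheikh's sixth sheep's sick."

# (?=...) is a lookahead: it matches without consuming any characters,
# so the scan can report a candidate at every starting position.
# The inner group captures the text that the original pattern would match.
overlapping = re.findall(r'(?=( s.*? s))', text)

print(len(overlapping))  # 5
print(overlapping)
# [' sixth s', ' sick s', " sheikh's s", ' sixth s', " sheep's s"]
```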

    But the re.findall() function only returned three matches. Specifically, it returned the first, the third, and the fifth. Why is that? Because it doesn’t return overlapping matches. The first match overlaps with the second, so the first is returned and the second is skipped. Then the third overlaps with the fourth, so the third is returned and the fourth is skipped. Finally, the fifth is returned. Three matches, not five. + +
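To see the skipping in action, a small sketch using re.finditer() can print where each returned match starts and ends. The variable names text and spans are mine, chosen for illustration.

```python
import re

text = "The sixth sick sheikh's sixth sheep's sick."

# finditer() scans left to right like findall(), but yields match objects,
# so we can ask each one for its (start, end) position in the string.
spans = [m.span() for m in re.finditer(r' s.*? s', text)]

print(spans)  # [(3, 11), (14, 25), (29, 39)]
```

Each match begins at or after the point where the previous one ended. The skipped candidate ' sick s' would have to start at index 9, which falls inside the first span (3 to 11), so the scan has already moved past it.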

    This has nothing to do with the alphametics solver; I just thought it was interesting. +

    Finding the unique items in a sequence