mirror of
https://github.com/kennethreitz-archive/.com.git
synced 2026-06-05 23:30:19 +00:00
125 lines
4.9 KiB
ReStructuredText
125 lines
4.9 KiB
ReStructuredText
Python + Regular Expressions
|
||
############################
|
||
|
||
:date: 2009-03-17 05:30
|
||
:category: Code
|
||
|
||
|
||
Have you ever needed to parse through large amounts of text looking
|
||
for a specific pattern? Patterns like “one capital letter followed
|
||
by three numbers” or “dd/mm/yyyy”? This is known as Pattern
|
||
Matching. Regular Expressions allow easy syntax for pattern
|
||
matching, and is an invaluable skill to add to one’s toolkit, no
|
||
matter what your area of expertise/practice is. Whether you’re
|
||
writing a Compiler, Form Validator, Text Editor, Django Project, or
|
||
Language Translator, Regular Expressions will always prove to be
|
||
invaluable. Here is a very basic overview of some syntax: ‘\\d’
|
||
represents a digit. ‘\\s’ represents whitespace. ‘.’ represents any
|
||
character. If you have worked with Python for very long, you are
|
||
probably already familiar with the concept. Take a look at the
|
||
following code::
|
||
|
||
print("Rounded = %05d" % (42))
|
||
|
||
This makes sure that the digit printed has 5 digits, and will
|
||
automatically add 0’s to compensate. If you understand this
|
||
concept, then you shouldn’t have a problem. Perl-style Regular
|
||
Expressions are a very widely-accepted implementation, and Python
|
||
has built in support for this mini-language! It’s easily
|
||
accessible, so let’s get started. The included ‘re’ module will
|
||
give us everything we need to get started: import re
|
||
|
||
Lets give our new module a try! It will enable you to do anything
|
||
you could ever want with regular expressions. Here’s a quick
|
||
example of some basic use::
|
||
|
||
import re
|
||
|
||
string0 = 'Kenneth Reitz is a cool guy!'
|
||
regExp = r’kenneth[- ]?reitz’
|
||
|
||
if re.match(regExp, string0, re.IGNORECASE):
|
||
print “True”
|
||
else:
|
||
print “False”
|
||
|
||
|
||
This script takes the string ``'Kenneth Reitz is a cool guy'``, and
|
||
searches for ``'kenneth reitz'`` inside of it. If ‘kenneth reitz’ is
|
||
found within string0 (re.match compares the expression with the
|
||
string), the script will print ``True``, if not, it will print
|
||
``False``. Additional parameters can be passed to the re.match
|
||
function when needed. Note the ``re.IGNORECASE`` flag used here –-
|
||
This tells the function be case-insensitive. Once you master the
|
||
regular expression syntax, you’ll realize how truly powerful they
|
||
can be. The options become limitless and the usefulness becomes
|
||
undeniable. Here’s another example::
|
||
|
||
import re
|
||
|
||
string0 = '10.03.1988'
|
||
regExp = r'^\d\d[./]\d\d[./]\d\d\d\d?$'
|
||
|
||
if re.match(regExp, string0):
|
||
print 'True'
|
||
else:
|
||
print 'False/
|
||
|
||
|
||
When run, this script prints out ``True``. If we were to change
|
||
string0 to ``‘10.03.88’``, it would print ``False``. Simple, isn’t it?
|
||
Now, while a True/False return could be useful in certain
|
||
applications (i.e. form validation), most of the time, we’re going
|
||
to want to have a bit more information in order for our checks to
|
||
be useful. We can tell Python to show us the data that matches our
|
||
query. To do this, we’re going to have to break our expression up
|
||
into different groups. In the date we have defined, there are three
|
||
obvious groups we could separate this into: the day, month, and
|
||
year. While defining a Regular Expression, you can use parentheses
|
||
to define groups: ``regExp = r’^()././$’``
|
||
|
||
This separates our expression into 3 separate groups. Python also
|
||
supports turning a Regular Expression string into an
|
||
heavily-supported object with the re.compile() function. Once you
|
||
define a string as a Regular Expression object, you can use the
|
||
built in methods to preform powerful parsing. Now we can ask python
|
||
what is in those groups::
|
||
|
||
import re
|
||
|
||
string0 = ‘10.03.1988’
|
||
|
||
regExp = re.compile(‘^()././$’)
|
||
regExpMatches = regExp.match(string0)
|
||
|
||
if re.match(regExp, string0):
|
||
print(“Day: %s\nMonth: %s\nYear: %s” % (regExpMatches.group(1), \
|
||
regExpMatches.group(2), regExpMatches.group(3)))
|
||
else:
|
||
print(“Invalid Date.”)
|
||
|
||
|
||
When executed, this script parses through our validated date,
|
||
breaks it down into groups, and prints the following::
|
||
|
||
Day: 10
|
||
Month: 03
|
||
Year: 1988
|
||
|
||
|
||
The possibilities are limitless! Here’s a quick run-down of the re
|
||
module’s functions, strait from the Python documentation for
|
||
reference: match: Match a regular expression pattern to the
|
||
beginning of a string. search: Search a string for the presence of
|
||
a pattern. sub: Substitute occurrences of a pattern found in a
|
||
string subn: Same as sub, but also return the number of
|
||
substitutions made. split: Split a string by the occurrences of a
|
||
pattern. findall: Find all occurrences of a pattern in a string.
|
||
compile: Compile a pattern into a RegexObject. purge: Clear the
|
||
regular expression cache. escape: Backslash all non-alphanumerics
|
||
in a string.
|
||
|
||
Remember, you can always type help(re) (after importing the re
|
||
module) into the Python interpret to take a quick look at the
|
||
module’s built-in documentation. Good luck and happy coding!
|