Kenneth Reitz 8f3f30005c content update
2011-01-03 00:33:11 -05:00

<!DOCTYPE html>
<html lang="en">
<head>
<title>Python + Regular Expressions</title>
<meta charset="utf-8" />
<link rel="stylesheet" href="./theme/css/main.css" type="text/css" />
<link href="./feeds/all.atom.xml" type="application/atom+xml" rel="alternate" title="Kenneth's log ATOM Feed" />
<!--[if IE]>
<script src="http://html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<!--[if lte IE 7]>
<link rel="stylesheet" type="text/css" media="all" href="./css/ie.css"/>
<script src="./js/IE8.js" type="text/javascript"></script><![endif]-->
<!--[if lt IE 7]>
<link rel="stylesheet" type="text/css" media="all" href="./css/ie6.css"/><![endif]-->
</head>
<body id="index" class="home">
<a href="http://github.com/kennethreitz/">
<img style="position: absolute; top: 0; right: 0; border: 0;" src="http://s3.amazonaws.com/github/ribbons/forkme_right_red_aa0000.png" alt="Fork me on GitHub" />
</a>
<header id="banner" class="body">
<h1>
<a href=".">Kenneth's log </a>
</h1>
<nav><ul>
<li >
<a href="./category/Life.html">Life</a>
</li>
<li class="active">
<a href="./category/Code.html">Code</a>
</li>
<li >
<a href="./category/projects.html">projects</a>
</li>
</ul></nav>
</header><!-- /#banner -->
<section id="content" class="body">
<article>
<header> <h1 class="entry-title"><a href="python-regular-expressions.html"
rel="bookmark" title="Permalink to Python + Regular Expressions">Python + Regular Expressions</a></h1> </header>
<div class="entry-content">
<footer class="post-info">
<abbr class="published" title="2009-03-17T05:30:00">
Tue 17 March 2009
</abbr>
<p>In <a href="./category/Code.html">Code</a>.
</p>
</footer><!-- /.post-info -->
<p>Have you ever needed to parse through large amounts of text looking
for a specific pattern? Patterns like “one capital letter followed
by three numbers” or “dd/mm/yyyy”? This is known as Pattern
Matching. Regular Expressions provide a compact syntax for pattern
matching, and knowing them is an invaluable skill to add to one's toolkit, no
matter what your area of expertise or practice is. Whether you're
writing a Compiler, Form Validator, Text Editor, Django Project, or
Language Translator, Regular Expressions will prove
useful. Here is a very basic overview of some syntax: \d
represents a digit, \s represents whitespace, and . represents any
character except a newline. If you have worked with Python for very long, you are
probably already familiar with the idea of a small formatting language. Take a look at the
following code: print("Rounded = %05d" % (42))</p>
<p>This makes sure that the number is printed at least 5 characters wide,
automatically adding leading 0s to compensate (here, 00042). If you understand this
concept, then you shouldn't have a problem. Perl-style Regular
Expressions are a very widely accepted implementation, and Python
has built-in support for this mini-language! It's easily
accessible, so let's get started. The included re module will
give us everything we need: import re</p>
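<p>As a minimal sketch of the three tokens mentioned above (\d, \s and .),
the snippet below tries each one against a short string. The sample text and
the variable names are made up for illustration only.</p>
<pre class="literal-block">
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Room 42 B"

# \d matches a single digit, so findall returns each digit separately.
digits = re.findall(r"\d", sample)

# \s matches a single whitespace character (here, the two spaces).
spaces = re.findall(r"\s", sample)

# . matches any single character except a newline.
first_three = re.match(r"...", sample).group()

print(digits)       # ['4', '2']
print(spaces)       # [' ', ' ']
print(first_three)  # Roo
</pre>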
<p>Let's give our new module a try! It covers most of what you are likely
to want from regular expressions. Here's a quick
example of some basic use.</p>
<pre class="literal-block">
import re

string0 = 'Kenneth Reitz is a cool guy!'
# An optional hyphen or space may separate the two names.
regExp = r'kenneth[- ]?reitz'

# re.match only tries the pattern at the beginning of the string.
if re.match(regExp, string0, re.IGNORECASE):
    print("True")
else:
    print("False")
</pre>
<p>This script takes the string “Kenneth Reitz is a cool guy!” and
checks whether it begins with “kenneth reitz”. re.match tries the
expression only at the beginning of the string; to look for a pattern
anywhere inside a string, use re.search instead. If the pattern matches,
the script will print “True”; if not, it will print
“False”. Additional parameters can be passed to the re.match
function when needed. Note the re.IGNORECASE flag used here:
it tells the function to be case-insensitive. Once you master the
regular expression syntax, you'll realize how truly powerful it
can be. The options become limitless and the usefulness becomes
undeniable. Here's another example:</p>
<pre class="literal-block">
import re

string0 = '10.03.1988'
# Two digits, a separator (. or /), two digits, a separator, four digits.
regExp = r'^\d\d[./]\d\d[./]\d\d\d\d$'
if re.match(regExp, string0):
    print('True')
else:
    print('False')
</pre>
<p>When run, this script prints out “True”. If we were to change
string0 to '10.03.88', it would print “False”, because a two-digit
year is too short for the pattern. Simple, isn't it?
Now, while a True/False result can be useful in certain
applications (e.g. form validation), most of the time we're going
to want a bit more information in order for our checks to
be useful. We can tell Python to show us the data that matches our
query. To do this, we're going to have to break our expression up
into different groups. In the date we have defined, there are three
obvious groups: the day, month, and
year. While defining a Regular Expression, you can use parentheses
() to define groups: regExp = r'^(\d\d)[./](\d\d)[./](\d\d\d\d)$'</p>
<p>This separates our expression into 3 groups. Python also
supports turning a Regular Expression string into a
full-featured object with the re.compile() function. Once you
compile a string into a Regular Expression object, you can use its
built-in methods to perform powerful parsing. Now we can ask Python
what is in those groups:</p>
<pre class="literal-block">
import re

string0 = '10.03.1988'
regExp = re.compile(r'^(\d\d)[./](\d\d)[./](\d\d\d\d)$')
regExpMatches = regExp.match(string0)

# match() returns None when the pattern does not match.
if regExpMatches:
    print("Day: %s\nMonth: %s\nYear: %s" % (regExpMatches.group(1),
          regExpMatches.group(2), regExpMatches.group(3)))
else:
    print("Invalid Date.")
</pre>
<p>When executed, this script validates our date,
breaks it down into groups, and prints the following three lines:
Day: 10, Month: 03, Year: 1988.</p>
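<p>Calling group() once per group gets repetitive. As a rough sketch, the
match object's groups() method returns all captured groups at once as a tuple.
The names date_pattern and split_date below are hypothetical helpers invented
for this illustration; they are not part of the re module.</p>
<pre class="literal-block">
import re

# The date pattern from the text, with one group each for day, month and year.
date_pattern = re.compile(r'^(\d\d)[./](\d\d)[./](\d\d\d\d)$')


def split_date(text):
    """Return (day, month, year) as strings, or None if text is not a date."""
    found = date_pattern.match(text)
    if found is None:
        return None
    # groups() returns every captured group as one tuple.
    return found.groups()


print(split_date('10.03.1988'))  # ('10', '03', '1988')
print(split_date('10.03.88'))    # None
</pre>
<p>Because the function returns None for anything that does not match, the
same call can serve as both the validity check and the parser.</p>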
<p>The possibilities are limitless! Here's a quick run-down of the re
module's functions, straight from the Python documentation, for
reference:</p>
<ul>
<li>match: Match a regular expression pattern to the beginning of a string.</li>
<li>search: Search a string for the presence of a pattern.</li>
<li>sub: Substitute occurrences of a pattern found in a string.</li>
<li>subn: Same as sub, but also return the number of substitutions made.</li>
<li>split: Split a string by the occurrences of a pattern.</li>
<li>findall: Find all occurrences of a pattern in a string.</li>
<li>compile: Compile a pattern into a RegexObject (a compiled pattern object).</li>
<li>purge: Clear the regular expression cache.</li>
<li>escape: Backslash all non-alphanumerics in a string.</li>
</ul>
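<p>To make the run-down above a little more concrete, here is a small sketch
that exercises most of those functions on one string. The sample text, the
phone-number pattern and the variable names are assumptions made up for this
illustration.</p>
<pre class="literal-block">
import re

text = 'Call 555-1234 or 555-9876 today'  # hypothetical sample text
phone = r'\d\d\d-\d\d\d\d'

# match only succeeds at the beginning of the string; search scans all of it.
at_start = re.match(phone, text)
first = re.search(phone, text).group()

# findall returns every non-overlapping occurrence as a list of strings.
numbers = re.findall(phone, text)

# sub replaces each occurrence; subn also reports how many were replaced.
hidden = re.sub(phone, 'XXX-XXXX', text)
hidden_again, count = re.subn(phone, 'XXX-XXXX', text)

# split cuts the string wherever the pattern matches.
parts = re.split(r'\s+or\s+', text)

# escape backslashes special characters so they are matched literally.
literal = re.search(re.escape('1+1'), 'so 1+1=2').group()

print(at_start)  # None
print(first)     # 555-1234
print(numbers)   # ['555-1234', '555-9876']
print(hidden)    # Call XXX-XXXX or XXX-XXXX today
print(count)     # 2
print(parts)     # ['Call 555-1234', '555-9876 today']
print(literal)   # 1+1
</pre>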
<p>Remember, you can always type help(re) (after importing the re
module) into the Python interpreter to take a quick look at the
module's built-in documentation. Good luck and happy coding!</p>
</div><!-- /.entry-content -->
<div class="comments">
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_identifier = "python-regular-expressions.html";
(function() {
var dsq = document.createElement('script'); dsq.type = 'text/javascript'; dsq.async = true;
dsq.src = 'http://kennethreitz.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
</div>
</article>
</section>
<section id="extras" class="body">
<div class="blogroll">
<h2>Links</h2>
<ul>
<li><a href="http://github.com/kennethreitz">GitHub Repos</a></li>
<li><a href="http://flickr.com/kennethreitz">Photography (Flickr)</a></li>
<li><a href="http://twitter.com/kennethreitz">Latest Tweets</a></li>
<li><a href="http://www.linkedin.com/in/kennethreitz">R&eacute;sum&eacute;</a></li>
<li><a href="http://pick.im/kenneth-reitz">Design Portfolio</a></li>
<li><a href="http://laterstars.com/kennethreitz">Later Stars</a></li>
</ul>
</div><!-- /.blogroll -->
<div class="social">
<ul>
<li><a href="./feeds/all.atom.xml" rel="alternate">atom feed</a></li>
<li><a href="http://facebook.com/kennethreitz">Facebook</a></li>
</ul>
</div><!-- /.social -->
</section><!-- /#extras -->
<footer id="contentinfo" class="body">
<address id="about" class="vcard body">
&copy; 2011 Kenneth Reitz &amp; co. All Rights Reserved.
</address><!-- /#about -->
</footer><!-- /#contentinfo -->
<script type="text/javascript">
var disqus_shortname = 'kennethreitz';
(function () {
var s = document.createElement('script'); s.async = true;
s.type = 'text/javascript';
s.src = 'http://' + disqus_shortname + '.disqus.com/count.js';
(document.getElementsByTagName('HEAD')[0] || document.getElementsByTagName('BODY')[0]).appendChild(s);
}());
</script>
<script type="text/javascript" charset="utf-8">
var disqus_developer = 1;
</script>
</body>
</html>