mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
started strings chapter, rewrote case-study intro, added some FIXMEs for obvious holes
@@ -23,52 +23,12 @@
|
||||
<li><a href="#install.summary">1.9. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper">2. Your First Python Program</a><ul>
|
||||
<li><a href="#odbchelper.divein">2.1. Diving in</a>
|
||||
<li><a href="#odbchelper.funcdef">2.2. Declaring Functions</a><ul>
|
||||
<li><a href="#d0e4188">2.2.1. How Python's Datatypes Compare to Other Programming Languages</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.docstring">2.3. Documenting Functions</a>
|
||||
<li><a href="#odbchelper.objects">2.4. Everything Is an Object</a><ul>
|
||||
<li><a href="#d0e4550">2.4.1. The Import Search Path</a>
|
||||
<li><a href="#d0e4665">2.4.2. What's an Object?</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.indenting">2.5. Indenting Code</a>
|
||||
<li><a href="#odbchelper.testing">2.6. Testing Modules</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#datatypes">3. Native Datatypes</a><ul>
|
||||
<li><a href="#odbchelper.dict">3.1. Introducing Dictionaries</a><ul>
|
||||
<li><a href="#d0e5174">3.1.1. Defining Dictionaries</a>
|
||||
<li><a href="#d0e5269">3.1.2. Modifying Dictionaries</a>
|
||||
<li><a href="#d0e5450">3.1.3. Deleting Items From Dictionaries</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.list">3.2. Introducing Lists</a><ul>
|
||||
<li><a href="#d0e5623">3.2.1. Defining Lists</a>
|
||||
<li><a href="#d0e5887">3.2.2. Adding Elements to Lists</a>
|
||||
<li><a href="#d0e6115">3.2.3. Searching Lists</a>
|
||||
<li><a href="#d0e6277">3.2.4. Deleting List Elements</a>
|
||||
<li><a href="#d0e6392">3.2.5. Using List Operators</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.tuple">3.3. Introducing Tuples</a>
|
||||
<li><a href="#odbchelper.vardef">3.4. Declaring variables</a><ul>
|
||||
<li><a href="#d0e6873">3.4.1. Referencing Variables</a>
|
||||
<li><a href="#odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.stringformatting">3.5. Formatting Strings</a>
|
||||
<li><a href="#odbchelper.map">3.6. Mapping Lists</a>
|
||||
<li><a href="#odbchelper.join">3.7. Joining Lists and Splitting Strings</a><ul>
|
||||
<li><a href="#d0e7982">3.7.1. Historical Note on String Methods</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.summary">3.8. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#apihelper">4. The Power Of Introspection</a><ul>
|
||||
<li><a href="#apihelper.divein">4.1. Diving In</a>
|
||||
<li><a href="#apihelper.optional">4.2. Using Optional and Named Arguments</a>
|
||||
@@ -138,23 +98,6 @@
|
||||
<li><a href="#fileinfo.summary2">6.7. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re">7. Regular Expressions</a><ul>
|
||||
<li><a href="#re.intro">7.1. Diving In</a>
|
||||
<li><a href="#re.matching">7.2. Case Study: Street Addresses</a>
|
||||
<li><a href="#re.roman">7.3. Case Study: Roman Numerals</a><ul>
|
||||
<li><a href="#d0e17592">7.3.1. Checking for Thousands</a>
|
||||
<li><a href="#d0e17785">7.3.2. Checking for Hundreds</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re.nm">7.4. Using the {n,m} Syntax</a><ul>
|
||||
<li><a href="#d0e18326">7.4.1. Checking for Tens and Ones</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re.verbose">7.5. Verbose Regular Expressions</a>
|
||||
<li><a href="#re.phone">7.6. Case study: Parsing Phone Numbers</a>
|
||||
<li><a href="#re.summary">7.7. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#dialect">8. HTML Processing</a><ul>
|
||||
<li><a href="#dialect.divein">8.1. Diving in</a>
|
||||
<li><a href="#dialect.sgmllib">8.2. Introducing sgmllib.py</a>
|
||||
@@ -172,7 +115,6 @@
|
||||
<li><a href="#kgp.divein">9.1. Diving in</a>
|
||||
<li><a href="#kgp.packages">9.2. Packages</a>
|
||||
<li><a href="#kgp.parse">9.3. Parsing XML</a>
|
||||
<li><a href="#kgp.unicode">9.4. Unicode</a>
|
||||
<li><a href="#kgp.search">9.5. Searching for elements</a>
|
||||
<li><a href="#kgp.attributes">9.6. Accessing element attributes</a>
|
||||
<li><a href="#kgp.segue">9.7. Segue</a>
|
||||
@@ -209,23 +151,6 @@
|
||||
<li><a href="#oa.summary">11.10. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#soap">12. SOAP Web Services</a><ul>
|
||||
<li><a href="#soap.divein">12.1. Diving In</a>
|
||||
<li><a href="#soap.install">12.2. Installing the SOAP Libraries</a><ul>
|
||||
<li><a href="#d0e29967">12.2.1. Installing PyXML</a>
|
||||
<li><a href="#d0e30070">12.2.2. Installing fpconst</a>
|
||||
<li><a href="#d0e30171">12.2.3. Installing SOAPpy</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#soap.firststeps">12.3. First Steps with SOAP</a>
|
||||
<li><a href="#soap.debug">12.4. Debugging SOAP Web Services</a>
|
||||
<li><a href="#soap.wsdl">12.5. Introducing WSDL</a>
|
||||
<li><a href="#soap.introspection">12.6. Introspecting SOAP Web Services with WSDL</a>
|
||||
<li><a href="#soap.google">12.7. Searching Google</a>
|
||||
<li><a href="#soap.troubleshooting">12.8. Troubleshooting SOAP Web Services</a>
|
||||
<li><a href="#soap.summary">12.9. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#roman">13. Unit Testing</a><ul>
|
||||
<li><a href="#roman.intro">13.1. Introduction to Roman numerals</a>
|
||||
<li><a href="#roman.divein">13.2. Diving in</a>
|
||||
@@ -614,74 +539,9 @@ hello world
|
||||
<p>You should now have a version of Python installed that works for you.
|
||||
<p>Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing <kbd>python</kbd> on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
|
||||
<p>Congratulations, and welcome to Python.
|
||||
<div class=chapter>
|
||||
<h2 id="odbchelper">Chapter 2. Your First Python Program</h2>
|
||||
<p>You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program?
|
||||
Let's skip all that.
|
||||
<h2 id="odbchelper.divein">2.1. Diving in</h2>
|
||||
<p>Here is a complete, working Python program.
|
||||
<p>It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But
|
||||
read through it first and see what, if anything, you can make of it.
|
||||
<div class=example><h3>Example 2.1. <code>odbchelper.py</code></h3>
|
||||
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
|
||||
<pre><code>
|
||||
def buildConnectionString(params):
|
||||
"""Build a connection string from a dictionary of parameters.
|
||||
|
||||
Returns string."""
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
|
||||
|
||||
if __name__ == "__main__":
|
||||
myParams = {"server":"mpilgrim", \
|
||||
"database":"master", \
|
||||
"uid":"sa", \
|
||||
"pwd":"secret" \
|
||||
}
|
||||
print buildConnectionString(myParams)</pre><p>Now run this program and see what happens.
|
||||
<table id="tip.run.windows" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In the ActivePython <abbr>IDE</abbr> on Windows, you can run the Python program you're editing by choosing
|
||||
File->Run... (<kbd class=shortcut>Ctrl-R</kbd>). Output is displayed in the interactive window.
|
||||
<table id="tip.run.mac" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In the Python <abbr>IDE</abbr> on Mac OS, you can run a Python program with
|
||||
Python->Run window... (<kbd class=shortcut>Cmd-R</kbd>), but there is an important option you must set first. Open the <code>.py</code> file in the <abbr>IDE</abbr>, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.
|
||||
<table id="tip.run.unix" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">On <abbr>UNIX</abbr>-compatible systems (including Mac OS X), you can run a Python program from the command line: <kbd>python <code>odbchelper.py</code></kbd><p id="odbchelper.output">The output of <code>odbchelper.py</code> will look like this:<pre class=screen>server=mpilgrim;uid=sa;database=master;pwd=secret</pre><h2 id="odbchelper.funcdef">2.2. Declaring Functions</h2>
|
||||
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
|
||||
<pre><code>
|
||||
def buildConnectionString(params):</pre><p>Note that the keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments
|
||||
(not shown here) are separated with commas.
|
||||
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value.
|
||||
In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.
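<p>A minimal sketch of that rule, written in Python 3 syntax. The two helper functions are hypothetical and exist only for illustration: one executes a <code>return</code> statement, the other does not, and the second one still hands back <code>None</code>.

```python
def build_greeting(name):
    """Return a greeting string (hypothetical helper, for illustration only)."""
    return "Hello, " + name


def log_greeting(name):
    """Print a greeting; this function never executes a return statement."""
    print("Hello, " + name)


greeting = build_greeting("mpilgrim")
nothing = log_greeting("mpilgrim")   # prints the greeting as a side effect

print(greeting)
print(nothing is None)               # the implicit return value is None
```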
|
||||
<table id="compare.funcdef.vb" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Visual Basic, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
|
||||
<p>The argument, <code>params</code>, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
|
||||
<table id="compare.funcdef.java" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Java, <abbr>C++</abbr>, and other statically-typed languages, you must specify the datatype of the function return value and each function argument.
|
||||
In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
|
||||
<h3>2.2.1. How Python's Datatypes Compare to Other Programming Languages</h3>
|
||||
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
|
||||
<div class=variablelist>
|
||||
<dl>
|
||||
<dt>statically typed language</dt>
|
||||
<dd>A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare
|
||||
all variables with their datatypes before using them. Java and <abbr>C</abbr> are statically typed languages.
|
||||
</dd>
|
||||
<dt>dynamically typed language</dt>
|
||||
<dd>A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
|
||||
</dd>
|
||||
<dt>strongly typed language</dt>
|
||||
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
|
||||
</dd>
|
||||
<dt>weakly typed language</dt>
|
||||
<dd>A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion.
|
||||
</dd>
|
||||
</dl>
|
||||
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
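<p>Both properties can be seen in a few lines. This is a sketch in Python 3 syntax; the variable names are made up for the example. Rebinding a name to a value of a different type is allowed (dynamic typing), but mixing a string and an integer without an explicit conversion is not (strong typing).

```python
value = 12        # value currently refers to an integer
value = "12"      # rebinding it to a string is fine: no declaration needed

try:
    result = value + 3              # str + int: Python refuses to guess
except TypeError as exc:
    result = None
    print("TypeError:", exc)

total = int(value) + 3              # explicit conversion to an integer
label = value + str(3)              # explicit conversion to a string
print(total, label)
```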
|
||||
<h2 id="odbchelper.docstring">2.3. Documenting Functions</h2>
|
||||
<p>You can document a Python function by giving it a <code>docstring</code>.
|
||||
<div class=example><h3 id="odbchelper.triplequotes">Example 2.2. Defining the <code>buildConnectionString</code> Function's <code>docstring</code></h3><pre><code>
|
||||
@@ -729,9 +589,18 @@ them into a larger program.
|
||||
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> discusses the low-level details of <a href="http://www.python.org/doc/current/ref/import.html">importing modules</a>.
|
||||
|
||||
</ul>
|
||||
<div class=chapter>
|
||||
<h2 id="datatypes">Chapter 3. Native Datatypes</h2>
|
||||
<h2 id="odbchelper.list">3.2. Introducing Lists</h2>
|
||||
<h2 id="odbchelper.vardef">3.4. Declaring variables</h2>
|
||||
<p>Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from <a href="#odbchelper">Chapter 2</a>, <code>odbchelper.py</code>.
|
||||
<p>Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
|
||||
@@ -795,65 +664,6 @@ NameError: There is no variable named 'x'</samp>
|
||||
|
||||
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class=citetitle>How to Think Like a Computer Scientist</i></a> shows how to use multi-variable assignment to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap09.htm">swap the values of two variables</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.stringformatting">3.5. Formatting Strings</h2>
|
||||
<p>Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
|
||||
to insert values into a string with the <code>%s</code> placeholder.
|
||||
<table id="compare.stringformatting.c" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">String formatting in Python uses the same syntax as the <code>sprintf</code> function in <abbr>C</abbr>.
|
||||
<div class=example><h3>Example 3.21. Introducing String Formatting</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>k = "uid"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>v = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>"%s=%s" % (k, v)</kbd> <span>①</span>
|
||||
'uid=sa'</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The whole expression evaluates to a string. The first <code>%s</code> is replaced by the value of <var>k</var>; the second <code>%s</code> is replaced by the value of <var>v</var>. All other characters in the string (in this case, the equal sign) stay as they are.
|
||||
<p>Note that <code>(k, v)</code> is a tuple. I told you they were good for something.
|
||||
<p>You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
|
||||
string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
|
||||
<div class=example><h3 id="odbchelper.stringformatting.coerce">Example 3.22. String Formatting vs. Concatenating</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>uid = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>pwd = "secret"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print pwd + " is not a good password for " + uid</kbd> <span>①</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>print "%s is not a good password for %s" % (pwd, uid)</kbd> <span>②</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>userCount = 6</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: %d" % (userCount, )</kbd> <span>③</span> <span>④</span>
|
||||
Users connected: 6
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: " + userCount</kbd> <span>⑤</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
TypeError: cannot concatenate 'str' and 'int' objects</samp></pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li><code>+</code> is the string concatenation operator.
|
||||
<li>In this trivial case, string formatting accomplishes the same result as concatenation.
|
||||
<li><code>(userCount, )</code> is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a
|
||||
tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the
|
||||
comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether <code>(userCount)</code> was a tuple with one element or just the value of <var>userCount</var>.
|
||||
<li>String formatting works with integers by specifying <code>%d</code> instead of <code>%s</code>.
|
||||
<li>Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works
|
||||
only when everything is already a string.
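<p>The points in the callouts above can be sketched as follows. The example uses Python 3 syntax (where <code>print</code> is a function), and the variable names are illustrative. It shows the one-element tuple, the <code>%d</code> placeholder, and the failure you get when concatenating a string with an integer.

```python
user_count = 6

single = (user_count,)       # trailing comma: a tuple with one element
not_a_tuple = (user_count)   # parentheses alone: still just the integer

message = "Users connected: %d" % (user_count,)
print(message)

try:
    broken = "Users connected: " + user_count   # str + int is not allowed
except TypeError as exc:
    broken = None
    print("TypeError:", exc)

# Concatenation works once everything is a string.
fixed = "Users connected: " + str(user_count)
print(fixed)
```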
|
||||
<p>As with <code>printf</code> in <abbr>C</abbr>, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
|
||||
<div class=example><h3 id="odbchelper.stringformatting.numbers">Example 3.23. Formatting Numbers</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %f" % 50.4625</kbd> <span>①</span>
|
||||
Today's stock price: 50.462500
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %.2f" % 50.4625</kbd> <span>②</span>
|
||||
Today's stock price: 50.46
|
||||
<samp class=prompt>>>> </samp><kbd>print "Change since yesterday: %+.2f" % 1.5</kbd> <span>③</span>
|
||||
Change since yesterday: +1.50
|
||||
</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The <code>%f</code> string formatting option treats the value as a floating-point number, and prints it with six digits after the decimal point.
|
||||
<li>The ".2" modifier of the <code>%f</code> option rounds the value to two decimal places.
|
||||
<li>You can even combine modifiers. Adding the <code>+</code> modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding
|
||||
the value to exactly two decimal places.
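<p>As a quick check of these modifiers, here is a sketch in Python 3 syntax. The variable names and the extra value <code>50.4675</code> are added for illustration; the latter shows that ".2" rounds rather than cuts off digits.

```python
price = 50.4625
change = 1.5

default_places = "Today's stock price: %f" % price      # six decimal places
two_places = "Today's stock price: %.2f" % price        # two decimal places
signed = "Change since yesterday: %+.2f" % change       # explicit sign
signed_negative = "%+.2f" % -change                     # sign shown for negatives too
rounded = "%.2f" % 50.4675                              # rounds up, does not truncate

for line in (default_places, two_places, signed, signed_negative, rounded):
    print(line)
```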
|
||||
<div class=itemizedlist>
|
||||
<h3>Further Reading on String Formatting</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/typesseq-strings.html">all the string formatting format characters</a>.
|
||||
|
||||
<li><a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Top"><i class=citetitle>Effective <abbr>AWK</abbr> Programming</i></a> discusses <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Control+Letters">all the format characters</a> and advanced string formatting techniques like <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Format+Modifiers">specifying width, precision, and zero-padding</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.map">3.6. Mapping Lists</h2>
|
||||
<p>One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each
|
||||
@@ -909,75 +719,23 @@ as <code><var>params</var>.<code>items</code>()</code>, but each element in the
|
||||
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007140000000000000000">do nested list comprehensions</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.join">3.7. Joining Lists and Splitting Strings</h2>
|
||||
<p>You have a list of key-value pairs in the form <code><var>key</var>=<var>value</var></code>, and you want to join them into a single string. To join any list of strings into a single string, use the <code>join</code> method of a string object.
|
||||
|
||||
<p>Here is an example of joining a list from the <code>buildConnectionString</code> function:<pre><code>
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</pre><p>One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
|
||||
is an object. You might have thought I meant that string <em>variables</em> are objects. But no, look closely at this example and you'll see that the string <code>";"</code> itself is an object, and you are calling its <code>join</code> method.
|
||||
<p>The <code>join</code> method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't
|
||||
need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
|
||||
<table id="tip.join" class=caution border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements
|
||||
will raise an exception.
|
||||
<div class=example><h3 id="odbchelper.join.example">Example 3.27. Output of <code>odbchelper.py</code></h3><pre class=screen><samp class=prompt>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre><p>This string is then returned from the <code>buildConnectionString</code> function and printed by the calling block, which gives you the output that you marveled at when you started reading this
|
||||
chapter.
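<p>A short sketch of <code>join</code> in Python 3 syntax, including the failure on non-string elements. The <code>mixed</code> list and its port number are invented for the example. Note also that the order of the pairs depends on the dictionary's iteration order, so it may differ from the output shown above.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

pairs = ["%s=%s" % (k, v) for k, v in params.items()]
connection_string = ";".join(pairs)      # the delimiter string does the joining
comma_separated = ", ".join(pairs)       # any string can be the delimiter
print(connection_string)

mixed = ["port", 1433]                   # hypothetical list with a non-string element
try:
    joined = ":".join(mixed)             # join does no type coercion
except TypeError as exc:
    joined = None
    print("TypeError:", exc)

# Converting each element explicitly makes the join succeed.
safe = ":".join(str(item) for item in mixed)
print(safe)
```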
|
||||
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's
|
||||
called <code>split</code>.
|
||||
<div class=example><h3 id="odbchelper.split.example">Example 3.28. Splitting a String</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s = ";".join(li)</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";")</kbd> <span>①</span>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";", 1)</kbd> <span>②</span>
|
||||
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li><code>split</code> reverses <code>join</code> by splitting a string into a multi-element list. Note that the delimiter (“<code>;</code>”) is stripped out completely; it does not appear in any of the elements of the returned list.
|
||||
<li><code>split</code> takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
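<p>The same calls can be tried outside the interactive shell. This sketch uses Python 3 syntax and illustrative variable names; it also shows how limiting the number of splits to 1 separates a key from its value.

```python
s = "server=mpilgrim;uid=sa;database=master;pwd=secret"

all_parts = s.split(";")           # split at every delimiter
first, rest = s.split(";", 1)      # split only once: two elements come back
key, value = first.split("=", 1)   # text before and after the first "="

round_trip = ";".join(all_parts)   # join reverses the full split

print(all_parts)
print(first, "|", rest)
print(key, value)
```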
|
||||
<table id="tip.split" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code><var>anystring</var>.<code>split</code>(<var>delimiter</var>, 1)</code> is a useful technique when you want to search a string for a substring and then work with everything before the substring
|
||||
(which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
|
||||
<div class=itemizedlist>
|
||||
<h3>Further Reading on String Methods</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/480">common questions about strings</a> and has a lot of <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/539">example code using strings</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/string-methods.html">all the string methods</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-string.html"><code>string</code> module</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/FAQ.html"><i class=citetitle>The Whole Python <abbr>FAQ</abbr></i></a> explains <a href="http://www.python.org/cgi-bin/faqw.py?query=4.96&querytype=simple&casefold=yes&req=search">why <code>join</code> is a string method</a> instead of a list method.
|
||||
|
||||
</ul>
|
||||
<h3>3.7.1. Historical Note on String Methods</h3>
|
||||
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story
|
||||
behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed
|
||||
important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of
|
||||
the old <code>string</code> module (which still has a lot of useful stuff in it). I use the new <code>join</code> method exclusively, but you will see code written either way, and if it really bothers you, you can use the old <code>string.join</code> function instead.
|
||||
<h2 id="odbchelper.summary">3.8. Summary</h2>
|
||||
<p>The <code>odbchelper.py</code> program and its output should now make perfect sense.
|
||||
<pre><code>
|
||||
def buildConnectionString(params):
|
||||
"""Build a connection string from a dictionary of parameters.
|
||||
(String splitting stuff was here)
|
||||
Returns string."""
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
|
||||
|
||||
if __name__ == "__main__":
|
||||
myParams = {"server":"mpilgrim", \
|
||||
"database":"master", \
|
||||
"uid":"sa", \
|
||||
"pwd":"secret" \
|
||||
}
|
||||
print buildConnectionString(myParams)</pre>
|
||||
<p>Here is the output of <code>odbchelper.py</code>:<pre class=screen>server=mpilgrim;uid=sa;database=master;pwd=secret</pre><div class=highlights>
|
||||
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
|
||||
<div class=itemizedlist>
|
||||
<ul>
|
||||
@@ -4162,53 +3920,21 @@ u'0'</pre><div class=calloutlist>
|
||||
<li>You can even use the <code>toxml</code> method here, deeply nested within the document.
|
||||
<li>The <code>p</code> element has only one child node (you can't tell that from this example, but look at <code>pNode.childNodes</code> if you don't believe me), and it is a <code>Text</code> node for the single character <code>'0'</code>.
|
||||
<li>The <code>.data</code> attribute of a <code>Text</code> node gives you the actual string that the text node represents. But what is that <code>'u'</code> in front of the string? The answer to that deserves its own section.
|
||||
<h2 id="kgp.unicode">9.4. Unicode</h2>
|
||||
<p>Unicode is a system to represent characters from all the world's different languages. When Python parses an <abbr>XML</abbr> document, all data is stored in memory as unicode.
|
||||
<p>You'll get to all that in a minute, but first, some background.
|
||||
<p><b>Historical note. </b>Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent
|
||||
that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the
|
||||
same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets.
|
||||
Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character
|
||||
encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things.
|
||||
Then think about trying to store these documents in the same place (like in the same database table); you would need to store
|
||||
the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around.
|
||||
Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used
|
||||
escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek
|
||||
mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
|
||||
<p>To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.
|
||||
<sup>[<a name="d0e23786" href="#ftn.d0e23786">5</a>]</sup> Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used
|
||||
in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number.
|
||||
Unicode data is never ambiguous.
|
||||
<p>Of course, there is still the matter of all these legacy encoding systems. 7-bit <abbr>ASCII</abbr>, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “<code>A</code>”, 97 is lowercase “<code>a</code>”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit <abbr>ASCII</abbr>. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit <abbr>ASCII</abbr> characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it
|
||||
(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit <abbr>ASCII</abbr> for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters
|
||||
for other languages with the remaining numbers, 256 through 65535.
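<p>The overlap between <abbr>ASCII</abbr>, ISO-8859-1, and unicode can be checked directly. This is a small sketch in Python 3 syntax (an assumption; in Python 3 every string is a unicode string), using the built-in <code>ord</code> function to look up character numbers. The variable names are illustrative only.

```python
# Characters in the ASCII range keep their ASCII numbers in unicode.
capital_a = ord('A')        # 65
lowercase_a = ord('a')      # 97

# Characters in the 128-255 range keep their ISO-8859-1 numbers.
n_tilde = ord('\xf1')       # n with a tilde over it: 241
u_umlaut = ord('\xfc')      # u with two dots over it: 252

# Encoding them as latin-1 yields single bytes with those same values.
latin1_bytes = '\xf1\xfc'.encode('latin-1')

# Characters from other alphabets are numbered above 255.
cyrillic_pe = ord('\u041f') # Cyrillic capital letter Pe: 1055

print(capital_a, lowercase_a, n_tilde, u_umlaut, list(latin1_bytes), cyrillic_pe)
```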
|
||||
<p>When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding
|
||||
systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding
|
||||
scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an <abbr>XML</abbr> document which explicitly specifies the encoding scheme.
|
||||
<p>And on that note, let's get back to Python.
|
||||
<p>Python has had unicode support throughout the language since version 2.0. The <abbr>XML</abbr> package uses unicode to store all parsed <abbr>XML</abbr> data, but you can use unicode anywhere.
|
||||
<div class=example><h3>Example 9.13. Introducing unicode</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'Dive in'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
u'Dive in'
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
Dive in</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>To create a unicode string instead of a regular <abbr>ASCII</abbr> string, add the letter “<code>u</code>” before the string. Note that this particular string doesn't have any non-<abbr>ASCII</abbr> characters. That's fine; unicode is a superset of <abbr>ASCII</abbr> (a very large superset at that), so any regular <abbr>ASCII</abbr> string can also be stored as unicode.
|
||||
<li>When printing a string, Python will attempt to convert it to your default encoding, which is usually <abbr>ASCII</abbr>. (More on this in a minute.) Since this unicode string is made up of characters that are also <abbr>ASCII</abbr> characters, printing it has the same result as printing a normal <abbr>ASCII</abbr> string; the conversion is seamless, and if you didn't know that <var>s</var> was a unicode string, you'd never notice the difference.
|
||||
<div class=example><h3>Example 9.14. Storing non-<abbr>ASCII</abbr> characters</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>③</span>
|
||||
La Peña</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The real advantage of unicode, of course, is its ability to store non-<abbr>ASCII</abbr> characters, like the Spanish “<code>ñ</code>” (<code>n</code> with a tilde over it). The unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
|
||||
<li>Remember I said that the <code>print</code> statement attempts to convert a unicode string to <abbr>ASCII</abbr> so it can print it? Well, that's not going to work here, because your unicode string contains non-<abbr>ASCII</abbr> characters, so Python raises a <samp>UnicodeError</samp>.
|
||||
<li>Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. <var>s</var> is a unicode string, but <code>print</code> can only print a regular string. To solve this problem, you call the <code>encode</code> method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme,
|
||||
which you pass as a parameter. In this case, you're using <code>latin-1</code> (also known as <code>iso-8859-1</code>), which includes the tilde-n (whereas the default <abbr>ASCII</abbr> encoding scheme did not, since it only includes characters numbered 0 through 127).
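<p>For comparison, here is a rough equivalent of the example above in Python 3 syntax (an assumption; the session above is Python 2). In Python 3 the <code>u</code> prefix is optional, <code>encode</code> returns a <code>bytes</code> object rather than a regular string, and the failure surfaces as <code>UnicodeEncodeError</code>, a subclass of <code>UnicodeError</code>. The names <var>ascii_failure</var> and <var>round_trip</var> are illustrative.

```python
s = 'La Pe\xf1a'

# ASCII only covers 0 through 127, so character 241 cannot be encoded.
try:
    s.encode('ascii')
except UnicodeEncodeError as error:
    ascii_failure = error
else:
    ascii_failure = None

# latin-1 covers 0 through 255, so the same string encodes cleanly.
latin1_bytes = s.encode('latin-1')

# Decoding with the same encoding recovers the original string.
round_trip = latin1_bytes.decode('latin-1')

print(ascii_failure)
print(latin1_bytes)
```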
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
(Unicode stuff was here)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<p>Remember I said Python usually converted unicode to <abbr>ASCII</abbr> whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which
|
||||
you can customize.
|
||||
<div class=example><h3>Example 9.15. <code>sitecustomize.py</code></h3><pre><code>
|
||||
@@ -4233,57 +3959,19 @@ La Peña</pre><div class=calloutlist>
|
||||
<li>This example assumes that you have made the changes listed in the previous example to your <code>sitecustomize.py</code> file, and restarted Python. If your default encoding still says <code>'ascii'</code>, you didn't set up your <code>sitecustomize.py</code> properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even
|
||||
call <code>sys.setdefaultencoding</code> after Python has started up. Dig into <code>site.py</code> and search for “<code>setdefaultencoding</code>” to find out how.)
|
||||
<li>Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.
|
||||
<div class=example><h3>Example 9.17. Specifying encoding in <code>.py</code> files</h3>
|
||||
<p>If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual <code>.py</code> file by putting an encoding declaration at the top of each file. This declaration defines the <code>.py</code> file to be UTF-8:<pre><code>
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: UTF-8 -*-
|
||||
</code></pre><p>Now, what about <abbr>XML</abbr>? Well, every <abbr>XML</abbr> document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R
|
||||
is popular for Russian texts. The encoding, if specified, is in the header of the <abbr>XML</abbr> document.
|
||||
<div class=example><h3>Example 9.18. <code>russiansample.xml</code></h3><pre class=screen><samp>
|
||||
<?xml version="1.0" encoding="koi8-r"?> </samp><span>①</span><samp>
|
||||
<preface>
|
||||
<title>Предисловие</title> </samp><span>②</span><samp>
|
||||
</preface></samp></pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>This is a sample extract from a real Russian <abbr>XML</abbr> document; it's part of a Russian translation of this very book. Note the encoding, <code>koi8-r</code>, specified in the header.
|
||||
<li>These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded
|
||||
using the <code>koi8-r</code> encoding scheme, but they're being displayed in <code>iso-8859-1</code>.
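<p>The gibberish effect can be simulated without a text editor. The sketch below, in Python 3 syntax (an assumption), encodes the Russian title as <code>koi8-r</code> and then deliberately decodes the bytes with the wrong encoding, <code>iso-8859-1</code>. The variable names are illustrative.

```python
title = '\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'

# The bytes as they would be stored in a koi8-r file.
koi8_bytes = title.encode('koi8-r')

# What an editor shows if it assumes iso-8859-1 instead.
gibberish = koi8_bytes.decode('iso-8859-1')

# Nothing is lost: reversing the wrong decode and applying the right
# one recovers the original text.
recovered = gibberish.encode('iso-8859-1').decode('koi8-r')

print(ascii(gibberish))
print(recovered == title)
```

<p>The bytes on disk are identical in both readings; only the assumed encoding differs.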
|
||||
<div class=example><h3>Example 9.19. Parsing <code>russiansample.xml</code></h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>from xml.dom import minidom</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>xmldoc = minidom.parse('russiansample.xml')</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>title = xmldoc.getElementsByTagName('title')[0].firstChild.data</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>title</kbd> <span>②</span>
|
||||
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
|
||||
<samp class=prompt>>>> </samp><kbd>print title</kbd> <span>③</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>convertedtitle = title.encode('koi8-r')</kbd> <span>④</span>
|
||||
<samp class=prompt>>>> </samp><kbd>convertedtitle</kbd>
|
||||
'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
|
||||
<samp class=prompt>>>> </samp><kbd>print convertedtitle</kbd> <span>⑤</span>
|
||||
Предисловие</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>I'm assuming here that you saved the previous example as <code>russiansample.xml</code> in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back
|
||||
to <code>'ascii'</code> by removing your <code>sitecustomize.py</code> file, or at least commenting out the <code>setdefaultencoding</code> line.
|
||||
<li>Note that the text data inside the <abbr>XML</abbr> document's <code>title</code> element (now in the <var>title</var> variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section) is stored in unicode.
|
||||
<li>Printing the title is not possible, because this unicode string contains non-<abbr>ASCII</abbr> characters, and Python can't convert those to <abbr>ASCII</abbr>.
|
||||
<li>You can, however, explicitly convert it to <code>koi8-r</code>, in which case you get a (regular, not unicode) string of single-byte characters (<code>f0</code>, <code>d2</code>, <code>c5</code>, and so forth) that are the <code>koi8-r</code>-encoded versions of the characters in the original unicode string.
|
||||
<li>Printing the <code>koi8-r</code>-encoded string will probably show gibberish on your screen, because your Python <abbr>IDE</abbr> is interpreting those characters as <code>iso-8859-1</code>, not <code>koi8-r</code>. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original
|
||||
<abbr>XML</abbr> document in a non-unicode-aware text editor. Python converted it from <code>koi8-r</code> into unicode when it parsed the <abbr>XML</abbr> document, and you've just converted it back.)
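<p>The same round trip can be sketched in Python 3 syntax (an assumption; the session above is Python 2). To keep the sketch self-contained, it builds the <code>koi8-r</code> bytes in memory and hands them to <code>minidom.parseString</code> instead of reading <code>russiansample.xml</code> from disk; the variable names are illustrative.

```python
from xml.dom import minidom

# The sample document, encoded the way its header says it is.
xml_text = ('<?xml version="1.0" encoding="koi8-r"?>\n'
            '<preface>\n'
            '<title>\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435</title>\n'
            '</preface>')
xml_bytes = xml_text.encode('koi8-r')

# The parser reads the encoding from the header and hands back unicode.
xmldoc = minidom.parseString(xml_bytes)
title = xmldoc.getElementsByTagName('title')[0].firstChild.data

# Converting back to koi8-r gives the original single-byte values.
convertedtitle = title.encode('koi8-r')

print(ascii(title))
print(convertedtitle)
```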
|
||||
<p>To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle
|
||||
in Python. If your <abbr>XML</abbr> documents are all 7-bit <abbr>ASCII</abbr> (like the examples in this chapter), you will literally never think about unicode. Python will convert the <abbr>ASCII</abbr> data in the <abbr>XML</abbr> documents into unicode while parsing, and auto-coerce it back to <abbr>ASCII</abbr> whenever necessary, and you'll never even notice. But if you need to deal with data in other languages, Python is ready.
|
||||
<div class=itemizedlist>
|
||||
<h3>Further reading</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.unicode.org/">Unicode.org</a> is the home page of the unicode standard, including a brief <a href="http://www.unicode.org/standard/principles.html">technical introduction</a>.
|
||||
|
||||
<li><a href="http://www.reportlab.com/i18n/python_unicode_tutorial.html">Unicode Tutorial</a> has some more examples of how to use Python's unicode functions, including how to force Python to coerce unicode into <abbr>ASCII</abbr> even when it doesn't really want to.
|
||||
|
||||
<li><a href="http://www.python.org/peps/pep-0263.html">PEP 263</a> goes into more detail about how and when to define a character encoding in your <code>.py</code> files.
|
||||
|
||||
</ul>
|
||||
|
||||
|
||||
(More Unicode stuff was here)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<h2 id="kgp.search">9.5. Searching for elements</h2>
|
||||
<p>Traversing <abbr>XML</abbr> documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within
|
||||
your <abbr>XML</abbr> document, there is a shortcut you can use to find it quickly: <code>getElementsByTagName</code>.
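<p>As a rough illustration of the shortcut, the sketch below (Python 3 syntax, an assumption) parses a small made-up document and searches it with <code>getElementsByTagName</code>. The element names, <code>id</code> values, and variable names are hypothetical and exist only for this sketch.

```python
from xml.dom import minidom

# A tiny hypothetical document, just large enough to search.
xml_text = """<grammar>
<ref id="bit"><p>0</p><p>1</p></ref>
<ref id="byte"><p>placeholder</p></ref>
</grammar>"""

xmldoc = minidom.parseString(xml_text)

# Searching from the document finds matching elements at any depth.
reflist = xmldoc.getElementsByTagName('ref')
ref_ids = [ref.getAttribute('id') for ref in reflist]

# Searching from an element only looks beneath that element.
plist = reflist[0].getElementsByTagName('p')
first_choices = [p.firstChild.data for p in plist]

# Searching the whole document for the same tag finds every match.
all_p_count = len(xmldoc.getElementsByTagName('p'))

print(ref_ids, first_choices, all_p_count)
```

<p>The method is available on both the document and individual elements, so a search can be narrowed by calling it on the element you have already found.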
|
||||
|
||||