mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 15:00:18 +00:00
started strings chapter, rewrote case-study intro, added some FIXMEs for obvious holes
+1
-1
@@ -11,7 +11,7 @@
|
||||
h1:before{content:""}
|
||||
</style>
|
||||
</head>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>About the book</h1>
|
||||
<p>The content of <cite>Dive Into Python 3</cite> is licensed under the <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>Creative Commons Attribution-ShareAlike 3.0 Unported License</a>.
|
||||
|
||||
@@ -12,20 +12,18 @@ body{counter-reset:h1 20}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#case-study-porting-chardet-to-python-3>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Case study: porting <code>chardet</code> to Python 3</h1>
|
||||
<blockquote class=q>
|
||||
<p><span>❝</span> Words, words. They’re all we have to go on. <span>❞</span><br>— <cite>Rosencrantz and Guildenstern are Dead</cite>
|
||||
</blockquote>
|
||||
<ol>
|
||||
<li><a href=#divingin>What is character encoding?</a>
|
||||
<li><a href=#divingin>Diving in</a>
|
||||
<li><a href=#faq.what>What is character encoding auto-detection?</a>
|
||||
<ol>
|
||||
<li><a href=#faq.what>What is character encoding auto-detection?</a>
|
||||
<li><a href=#faq.impossible>Isn’t that impossible?</a>
|
||||
<li><a href=#faq.who>Who wrote this detection algorithm?</a>
|
||||
<li><a href=#faq.yippie>Yippie! Screw the standards, I’ll just auto-detect everything!</a>
|
||||
<li><a href=#faq.why>Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</a>
|
||||
<li><a href=#faq.who>Does such an algorithm exist?</a>
|
||||
</ol>
|
||||
<li><a href=#divingin2>Diving in</a>
|
||||
<ol>
|
||||
@@ -50,31 +48,26 @@ body{counter-reset:h1 20}
|
||||
</ol>
|
||||
<li><a href=#summary>Summary</a>
|
||||
</ol>
|
||||
<h2 id=divingin>What is character encoding?</h2>
|
||||
<p class=fancy>Usually, when people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
|
||||
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
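<p>To make that concrete, here is a small sketch (my own illustration, not code from <code>chardet</code>) that stores one character in three encodings and then decodes the bytes with the wrong key. The character and the three encodings are arbitrary choices.

```python
# One character, three encodings: the stored bytes differ each time.
text = "é"
encoded = {name: text.encode(name) for name in ("latin-1", "utf-8", "utf-16-le")}

# Decoding UTF-8 bytes with the wrong "key" raises no error here;
# it just produces gibberish.
wrong = encoded["utf-8"].decode("latin-1")
right = encoded["utf-8"].decode("utf-8")
```

<p>Note that the wrong decoding succeeds silently, which is exactly why gibberish text is so common.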
|
||||
<h3 id=faq.what>What is character encoding auto-detection?</h3>
|
||||
<h2 id=divingin>Diving in</h2>
|
||||
<p class=fancy>Unknown or incorrect character encoding is the #1 cause of gibberish text on the web, in your inbox, and indeed across every computer system ever written. In <a href=strings.html>Chapter 3</a>, I talked about the history of character encoding and the creation of Unicode, the “one encoding to rule them all.” I’d love it if I never had to see a gibberish character on a web page again, because all authoring systems stored accurate encoding information, all transfer protocols were Unicode-aware, and every system that handled text maintained perfect fidelity when converting between encodings.
|
||||
<p>I’d also like a pony.
|
||||
<p>A Unicode pony.
|
||||
<p>A Unipony, as it were.
|
||||
<p>I’ll settle for character encoding auto-detection.
|
||||
|
||||
<h2 id=faq.what>What is character encoding auto-detection?</h2>
|
||||
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.
|
||||
|
||||
<h3 id=faq.impossible>Isn’t that impossible?</h3>
|
||||
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
|
||||
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
|
||||
<h3 id=faq.who>Who wrote this detection algorithm?</h3>
|
||||
<p>This library is a port of <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative.
|
||||
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href=http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>A composite approach to language/encoding detection</a>.
|
||||
<h3 id=faq.yippie>Yippie! Screw the standards, I’ll just auto-detect everything!</h3>
|
||||
<p>Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
|
||||
<ul>
|
||||
<li><abbr>HTTP</abbr> can define a <code>charset</code> parameter in the <code>Content-type</code> header.
|
||||
<li><abbr>HTML</abbr> documents can define a <code><meta http-equiv="content-type"></code> element in the <code><head></code> of a web page.
|
||||
<li><abbr>XML</abbr> documents can define an <code>encoding</code> attribute in the <abbr>XML</abbr> prolog.
|
||||
</ul>
|
||||
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over <abbr>HTTP</abbr>, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
|
||||
<p>Despite the complexity, it’s worthwhile to follow standards and <a href=http://www.w3.org/2001/tag/doc/mime-respect>respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
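<p>As a rough sketch of "use the declared encoding, else the default," the snippet below pulls the <code>charset</code> parameter out of a <code>Content-type</code> value using the standard library. The function name <code>declared_charset</code> and the sample header values are assumptions for illustration; a real client would also have to reconcile this with any encoding declared inside the document itself.

```python
from email.message import Message


def declared_charset(content_type, default=None):
    """Return the charset parameter of a Content-type value, or a default.

    Hypothetical helper for illustration; it only looks at the header value
    it is given and knows nothing about HTML or XML declarations.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default


explicit = declared_charset("text/html; charset=ISO-8859-1")
fallback = declared_charset("text/html", default="utf-8")
```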
|
||||
<h3 id=faq.why>Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</h3>
|
||||
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
|
||||
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href=http://feedparser.org/>Universal Feed Parser</a>, which calls this auto-detection library <a href=http://feedparser.org/docs/character-encoding.html>only after exhausting all other options</a>.
|
||||
<h2 id=divingin2>Diving in</h2>
|
||||
<p>This is a brief guide to navigating the code itself.
|
||||
|
||||
<h3 id=faq.who>Does such an algorithm exist?</h3>
|
||||
<p>As it turns out, yes. All major browsers have character encoding auto-detection, because the web is full of pages that have no encoding information whatsoever. <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>Mozilla Firefox contains an encoding auto-detection library</a> which is open source. <a href=http://chardet.feedparser.org/>I ported the library to Python 2</a> and dubbed it the <code>chardet</code> module. This chapter will take you step-by-step through the process of porting the <code>chardet</code> module from Python 2 to Python 3.
|
||||
|
||||
<h2 id=divingin2>Introducing the <code>chardet</code> module</h2>
|
||||
<p>[FIXME download link, possibly on chardet.feedparser.org, possibly local]
|
||||
<p>Before we set off porting the code, it would help if you understood how the code worked! This is a brief guide to navigating the code itself.
|
||||
<p>The main entry point for the detection algorithm is <code>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code>chardet/__init__.py</code>, but that’s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
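<p>The shape of that convenience function can be sketched like this. <code>ToyDetector</code> is a hypothetical stand-in that I made up so the example runs on its own; it only tells pure <abbr>ASCII</abbr> from everything else and is in no way the real <code>UniversalDetector</code> logic. Only the create, feed, close, return-result pattern is the point.

```python
class ToyDetector:
    """Hypothetical stand-in for a detector object (not chardet's code)."""

    def __init__(self):
        self.result = {"encoding": None, "confidence": 0.0}
        self._seen_high_bit = False

    def feed(self, chunk):
        # Remember whether any byte falls outside the ASCII range.
        if any(byte >= 0x80 for byte in chunk):
            self._seen_high_bit = True

    def close(self):
        # Only claim a result when the input was pure ASCII.
        if not self._seen_high_bit:
            self.result = {"encoding": "ascii", "confidence": 1.0}
        return self.result


def detect(data):
    """Convenience wrapper: create a detector, call it, return its result."""
    detector = ToyDetector()
    detector.feed(data)
    detector.close()
    return detector.result


plain = detect(b"hello world")
accented = detect("caf\u00e9".encode("utf-8"))
```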
|
||||
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
|
||||
<ol>
|
||||
@@ -98,11 +91,11 @@ body{counter-reset:h1 20}
|
||||
<h3 id=how.sb>Single-byte encodings</h3>
|
||||
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
|
||||
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
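<p>Here is a deliberately simplified sketch of the 2-character idea. The tiny set of "frequent" English pairs and the plain hits-over-total ratio are my own assumptions; the real language models are much larger, and the real confidence calculation also involves the language-specific distribution ratio mentioned above.

```python
def bigram_confidence(text, frequent_pairs):
    """Fraction of letter pairs in text that appear in frequent_pairs.

    A toy measure for illustration, not the formula used by the probers.
    """
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    # Ignore pairs that span a space or contain punctuation or digits.
    pairs = [pair for pair in pairs if pair.isalpha()]
    if not pairs:
        return 0.0
    hits = sum(1 for pair in pairs if pair in frequent_pairs)
    return hits / len(pairs)


# Hypothetical mini "language model": a handful of common English pairs.
ENGLISH_PAIRS = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}

english = bigram_confidence("the weather in england", ENGLISH_PAIRS)
gibberish = bigram_confidence("txzqjv dasd qqdkjvz", ENGLISH_PAIRS)
```

<p>The English phrase scores well above the gibberish, even though both are made of the same alphabet.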
|
||||
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored "<span class=quote>backwards</span>" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
|
||||
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
|
||||
<h3 id=how.windows1252><code>windows-1252</code></h3>
|
||||
<p>If <code>UniversalDetector</code> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <code>Latin1Prober</code> (defined in <code>latin1prober.py</code>) to try to detect English text in a <code>windows-1252</code> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <code>windows-1252</code> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <code>Latin1Prober</code> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
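<p>A quick illustration of why those symbols are the telltale sign (the sample bytes are made up): the letters decode identically either way, but the bytes for smart quotes only turn into smart quotes under <code>windows-1252</code>.

```python
# Made-up sample: smart quotes around "Hi", then a copyright notice.
raw = b"\x93Hi\x94 \xa9 2009"

as_cp1252 = raw.decode("windows-1252")
as_latin1 = raw.decode("latin-1")

# The plain ASCII letters and digits come out the same in both decodings.
same_ascii = as_cp1252[1:3] == as_latin1[1:3] == "Hi"
```

<p>Under <code>latin-1</code> the bytes 0x93 and 0x94 become invisible control characters, while the copyright symbol (0xA9) happens to be the same in both.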
|
||||
<h2 id=running2to3>Running <code>2to3</code></h2>
|
||||
<p>We’re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we’ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
|
||||
<p>We’re going to migrate the <code>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy — a function was renamed or moved to a different module — but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>. In this chapter, we’ll start by running <code>2to3</code> on the <code>chardet</code> package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
|
||||
<p>The main <code>chardet</code> package is split across several different files, all in the same directory. The <code>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code>2to3</code> will convert each of the files in turn.
|
||||
<p id=noscript>[The code examples will be easier to follow if you enable Javascript, but whatever.]
|
||||
<p class=skip><a href=#skip2to3output>skip over this</a>
|
||||
@@ -604,7 +597,8 @@ RefactoringTool: Skipping implicit fixer: ws_comma
|
||||
<ins>+print(count, 'tests')</ins>
|
||||
RefactoringTool: Files that were modified:
|
||||
RefactoringTool: test.py</samp></pre>
|
||||
<p id=skip2to3outputtest>Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
|
||||
<p id=skip2to3outputtest>[FIXME explain the difference in import syntax]
|
||||
<p>Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
|
||||
<h2 id=manual>Fixing what <code>2to3</code> can’t</h2>
|
||||
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
|
||||
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
|
||||
@@ -643,7 +637,7 @@ else:
|
||||
File "C:\home\chardet\chardet\universaldetector.py", line 29, in <module>
|
||||
import constants, sys
|
||||
ImportError: No module named constants</samp></pre>
|
||||
<p id=skipnomodulenamedconstants>What’s that you say? No module named <code>constants</code>? Of course there’s a module named <code>constants</code>. …Oh wait, no there isn’t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
|
||||
<p id=skipnomodulenamedconstants>What’s that you say? No module named <code>constants</code>? Of course there’s a module named <code>constants</code>. …Oh wait, no there isn’t. Remember when the <code>2to3</code> script fixed up all those import statements? This library has a lot of relative imports — that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
|
||||
<pre><code>from . import constants</code></pre>
|
||||
<p>But wait. Wasn’t the <code>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code>constants</code> module within the library, and an absolute import of the <code>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the <code>2to3</code> script is not smart enough to split the import statement into two.
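<p>If you want to see the difference for yourself, the following sketch builds a throwaway package in a temporary directory and tries both styles under Python 3. The names <code>toypkg</code>, <code>toy_constants</code>, <code>old_style</code>, and <code>new_style</code> are all made up for this demonstration; they are not part of <code>chardet</code>.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny package with a sibling module to import.
    pkg = Path(tmp) / "toypkg"
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "toy_constants.py").write_text("DETECTING = 0\n")
    # Python 2 habit: sibling module and standard library module on one line.
    (pkg / "old_style.py").write_text("import toy_constants, sys\n")
    # Python 3: explicit relative import, plus a separate absolute import.
    (pkg / "new_style.py").write_text("from . import toy_constants\nimport sys\n")

    sys.path.insert(0, tmp)
    try:
        try:
            importlib.import_module("toypkg.old_style")
            old_style_failed = False
        except ImportError:
            # The sibling module is not found as a top-level module.
            old_style_failed = True
        new_style = importlib.import_module("toypkg.new_style")
        detecting = new_style.toy_constants.DETECTING
    finally:
        sys.path.remove(tmp)
```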
|
||||
<p>The solution is to split the import statement manually. So this two-in-one import:
|
||||
@@ -685,7 +679,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
|
||||
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128–255.
|
||||
<p>And therein lies the problem.
|
||||
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
|
||||
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string — that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string — again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code>universaldetector.py</code>:
|
||||
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
|
||||
<pre><code>def feed(self, aBuf):
|
||||
.
|
||||
@@ -701,7 +695,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
.
|
||||
for line in open(f, 'rb'):
|
||||
u.feed(line)</code></pre>
|
||||
<p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <code>'b'</code> is for “binary.” Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
|
||||
<p id=skiptestharnessfeedcode>And here we find our answer: in the <code>UniversalDetector.feed()</code> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <code>'b'</code> is for “binary.” Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string — an array of Unicode characters — according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <code>UniversalDetector.feed()</code>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit… characters. But we don’t have characters; we have bytes. Oops.
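<p>Here is a small self-contained sketch of that difference. The file name and its contents are invented for the demonstration, and I pass <code>encoding</code> explicitly in text mode so the result does not depend on your system default.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    # Write one line containing a non-ASCII character, encoded as UTF-8.
    with open(path, "wb") as f:
        f.write("caf\u00e9\n".encode("utf-8"))

    # Binary mode: each line is a bytes object, exactly as stored on disk.
    with open(path, "rb") as f:
        binary_lines = [line for line in f]

    # Text mode: each line is decoded into a string of Unicode characters.
    with open(path, "r", encoding="utf-8") as f:
        text_lines = [line for line in f]
```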
|
||||
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
|
||||
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. (There is one other case of this same problem, on the very next line.)
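<p>A minimal sketch of the mismatch and the fix, separate from the <code>chardet</code> source itself (the variable names and the sample data are mine):

```python
import re

# Same character class, defined once with a string and once with bytes.
str_pattern = re.compile(r"[\x80-\xFF]")
bytes_pattern = re.compile(rb"[\x80-\xFF]")

data = "caf\u00e9".encode("utf-8")  # a bytes object, like a line read in 'rb' mode

try:
    str_pattern.search(data)
    mismatch_raised = False
except TypeError:
    # A string pattern cannot search a bytes object.
    mismatch_raised = True

# A bytes pattern can: it finds the first high-bit byte.
match = bytes_pattern.search(data)
first_high_byte = match.group()
```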
|
||||
<p class=skip><a href=#skip-cant-use-a-string-pattern-solution>skip over this code listing</a>
|
||||
|
||||
@@ -23,52 +23,12 @@
|
||||
<li><a href="#install.summary">1.9. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper">2. Your First Python Program</a><ul>
|
||||
<li><a href="#odbchelper.divein">2.1. Diving in</a>
|
||||
<li><a href="#odbchelper.funcdef">2.2. Declaring Functions</a><ul>
|
||||
<li><a href="#d0e4188">2.2.1. How Python's Datatypes Compare to Other Programming Languages</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.docstring">2.3. Documenting Functions</a>
|
||||
<li><a href="#odbchelper.objects">2.4. Everything Is an Object</a><ul>
|
||||
<li><a href="#d0e4550">2.4.1. The Import Search Path</a>
|
||||
<li><a href="#d0e4665">2.4.2. What's an Object?</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.indenting">2.5. Indenting Code</a>
|
||||
<li><a href="#odbchelper.testing">2.6. Testing Modules</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#datatypes">3. Native Datatypes</a><ul>
|
||||
<li><a href="#odbchelper.dict">3.1. Introducing Dictionaries</a><ul>
|
||||
<li><a href="#d0e5174">3.1.1. Defining Dictionaries</a>
|
||||
<li><a href="#d0e5269">3.1.2. Modifying Dictionaries</a>
|
||||
<li><a href="#d0e5450">3.1.3. Deleting Items From Dictionaries</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.list">3.2. Introducing Lists</a><ul>
|
||||
<li><a href="#d0e5623">3.2.1. Defining Lists</a>
|
||||
<li><a href="#d0e5887">3.2.2. Adding Elements to Lists</a>
|
||||
<li><a href="#d0e6115">3.2.3. Searching Lists</a>
|
||||
<li><a href="#d0e6277">3.2.4. Deleting List Elements</a>
|
||||
<li><a href="#d0e6392">3.2.5. Using List Operators</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.tuple">3.3. Introducing Tuples</a>
|
||||
<li><a href="#odbchelper.vardef">3.4. Declaring variables</a><ul>
|
||||
<li><a href="#d0e6873">3.4.1. Referencing Variables</a>
|
||||
<li><a href="#odbchelper.multiassign">3.4.2. Assigning Multiple Values at Once</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.stringformatting">3.5. Formatting Strings</a>
|
||||
<li><a href="#odbchelper.map">3.6. Mapping Lists</a>
|
||||
<li><a href="#odbchelper.join">3.7. Joining Lists and Splitting Strings</a><ul>
|
||||
<li><a href="#d0e7982">3.7.1. Historical Note on String Methods</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#odbchelper.summary">3.8. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#apihelper">4. The Power Of Introspection</a><ul>
|
||||
<li><a href="#apihelper.divein">4.1. Diving In</a>
|
||||
<li><a href="#apihelper.optional">4.2. Using Optional and Named Arguments</a>
|
||||
@@ -138,23 +98,6 @@
|
||||
<li><a href="#fileinfo.summary2">6.7. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re">7. Regular Expressions</a><ul>
|
||||
<li><a href="#re.intro">7.1. Diving In</a>
|
||||
<li><a href="#re.matching">7.2. Case Study: Street Addresses</a>
|
||||
<li><a href="#re.roman">7.3. Case Study: Roman Numerals</a><ul>
|
||||
<li><a href="#d0e17592">7.3.1. Checking for Thousands</a>
|
||||
<li><a href="#d0e17785">7.3.2. Checking for Hundreds</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re.nm">7.4. Using the {n,m} Syntax</a><ul>
|
||||
<li><a href="#d0e18326">7.4.1. Checking for Tens and Ones</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#re.verbose">7.5. Verbose Regular Expressions</a>
|
||||
<li><a href="#re.phone">7.6. Case study: Parsing Phone Numbers</a>
|
||||
<li><a href="#re.summary">7.7. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#dialect">8. HTML Processing</a><ul>
|
||||
<li><a href="#dialect.divein">8.1. Diving in</a>
|
||||
<li><a href="#dialect.sgmllib">8.2. Introducing sgmllib.py</a>
|
||||
@@ -172,7 +115,6 @@
|
||||
<li><a href="#kgp.divein">9.1. Diving in</a>
|
||||
<li><a href="#kgp.packages">9.2. Packages</a>
|
||||
<li><a href="#kgp.parse">9.3. Parsing XML</a>
|
||||
<li><a href="#kgp.unicode">9.4. Unicode</a>
|
||||
<li><a href="#kgp.search">9.5. Searching for elements</a>
|
||||
<li><a href="#kgp.attributes">9.6. Accessing element attributes</a>
|
||||
<li><a href="#kgp.segue">9.7. Segue</a>
|
||||
@@ -209,23 +151,6 @@
|
||||
<li><a href="#oa.summary">11.10. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#soap">12. SOAP Web Services</a><ul>
|
||||
<li><a href="#soap.divein">12.1. Diving In</a>
|
||||
<li><a href="#soap.install">12.2. Installing the SOAP Libraries</a><ul>
|
||||
<li><a href="#d0e29967">12.2.1. Installing PyXML</a>
|
||||
<li><a href="#d0e30070">12.2.2. Installing fpconst</a>
|
||||
<li><a href="#d0e30171">12.2.3. Installing SOAPpy</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#soap.firststeps">12.3. First Steps with SOAP</a>
|
||||
<li><a href="#soap.debug">12.4. Debugging SOAP Web Services</a>
|
||||
<li><a href="#soap.wsdl">12.5. Introducing WSDL</a>
|
||||
<li><a href="#soap.introspection">12.6. Introspecting SOAP Web Services with WSDL</a>
|
||||
<li><a href="#soap.google">12.7. Searching Google</a>
|
||||
<li><a href="#soap.troubleshooting">12.8. Troubleshooting SOAP Web Services</a>
|
||||
<li><a href="#soap.summary">12.9. Summary</a>
|
||||
</ul>
|
||||
|
||||
<li><a href="#roman">13. Unit Testing</a><ul>
|
||||
<li><a href="#roman.intro">13.1. Introduction to Roman numerals</a>
|
||||
<li><a href="#roman.divein">13.2. Diving in</a>
|
||||
@@ -614,74 +539,9 @@ hello world
|
||||
<p>You should now have a version of Python installed that works for you.
|
||||
<p>Depending on your platform, you may have more than one version of Python installed. If so, you need to be aware of your paths. If simply typing <kbd>python</kbd> on the command line doesn't run the version of Python that you want to use, you may need to enter the full pathname of your preferred version.
|
||||
<p>Congratulations, and welcome to Python.
|
||||
<div class=chapter>
|
||||
<h2 id="odbchelper">Chapter 2. Your First Python Program</h2>
|
||||
<p>You know how other books go on and on about programming fundamentals and finally work up to building a complete, working program?
|
||||
Let's skip all that.
|
||||
<h2 id="odbchelper.divein">2.1. Diving in</h2>
|
||||
<p>Here is a complete, working Python program.
|
||||
<p>It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But
|
||||
read through it first and see what, if anything, you can make of it.
|
||||
<div class=example><h3>Example 2.1. <code>odbchelper.py</code></h3>
|
||||
<p>If you have not already done so, you can <a href="http://diveintopython3.org/download/diveintopython3-examples-5.4.zip" title="Download example scripts">download this and other examples</a> used in this book.
|
||||
<pre><code>
|
||||
def buildConnectionString(params):
|
||||
"""Build a connection string from a dictionary of parameters.
|
||||
|
||||
Returns string."""
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
|
||||
|
||||
if __name__ == "__main__":
|
||||
myParams = {"server":"mpilgrim", \
|
||||
"database":"master", \
|
||||
"uid":"sa", \
|
||||
"pwd":"secret" \
|
||||
}
|
||||
print buildConnectionString(myParams)</pre><p>Now run this program and see what happens.
|
||||
<table id="tip.run.windows" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In the ActivePython <abbr>IDE</abbr> on Windows, you can run the Python program you're editing by choosing
|
||||
File->Run... (<kbd class=shortcut>Ctrl-R</kbd>). Output is displayed in the interactive window.
|
||||
<table id="tip.run.mac" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In the Python <abbr>IDE</abbr> on Mac OS, you can run a Python program with
|
||||
Python->Run window... (<kbd class=shortcut>Cmd-R</kbd>), but there is an important option you must set first. Open the <code>.py</code> file in the <abbr>IDE</abbr>, pop up the options menu by clicking the black triangle in the upper-right corner of the window, and make sure the Run as __main__ option is checked. This is a per-file setting, but you'll only need to do it once per file.
|
||||
<table id="tip.run.unix" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">On <abbr>UNIX</abbr>-compatible systems (including Mac OS X), you can run a Python program from the command line: <kbd>python <code>odbchelper.py</code></kbd><p id="odbchelper.output">The output of <code>odbchelper.py</code> will look like this:<pre class=screen>server=mpilgrim;uid=sa;database=master;pwd=secret</pre><h2 id="odbchelper.funcdef">2.2. Declaring Functions</h2>
|
||||
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
|
||||
<pre><code>
|
||||
def buildConnectionString(params):</pre><p>Note that the keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments
|
||||
(not shown here) are separated with commas.
|
||||
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value.
|
||||
In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.
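<p>A minimal sketch of that rule. The function names <code>with_return</code> and <code>without_return</code> are made up for illustration and are not part of <code>odbchelper.py</code>; the sketch uses <code>print()</code> as a function, so it assumes Python 3.

```python
def with_return(params):
    """Explicitly return a value: the number of parameters."""
    return len(params)


def without_return(params):
    """Never executes a return statement, so the caller gets None."""
    len(params)


explicit_result = with_return({"uid": "sa"})
implicit_result = without_return({"uid": "sa"})

print(explicit_result)   # 1
print(implicit_result)   # None
```

<p>Both functions are declared the same way; nothing in the <code>def</code> line says whether a value comes back.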
|
||||
<table id="compare.funcdef.vb" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Visual Basic, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
|
||||
<p>The argument, <code>params</code>, doesn't specify a datatype. In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
|
||||
<table id="compare.funcdef.java" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">In Java, <abbr>C++</abbr>, and other statically-typed languages, you must specify the datatype of the function return value and each function argument.
|
||||
In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
|
||||
<h3>2.2.1. How Python's Datatypes Compare to Other Programming Languages</h3>
|
||||
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
|
||||
<div class=variablelist>
|
||||
<dl>
|
||||
<dt>statically typed language</dt>
|
||||
<dd>A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare
|
||||
all variables with their datatypes before using them. Java and <abbr>C</abbr> are statically typed languages.
|
||||
</dd>
|
||||
<dt>dynamically typed language</dt>
|
||||
<dd>A language in which types are discovered at execution time; the opposite of statically typed. VBScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
|
||||
</dd>
|
||||
<dt>strongly typed language</dt>
|
||||
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
|
||||
</dd>
|
||||
<dt>weakly typed language</dt>
|
||||
<dd>A language in which types may be ignored; the opposite of strongly typed. VBScript is weakly typed. In VBScript, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion.
|
||||
</dd>
|
||||
</dl>
|
||||
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
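<p>Here is a small sketch of both properties at once. The variable names are invented for this illustration, and the code assumes Python 3.

```python
value = 12                    # no declaration; the name now refers to an int
type_when_int = type(value)

value = "12"                  # dynamically typed: the same name can refer to a str
type_when_str = type(value)

# Strongly typed: a string and an integer are not silently mixed.
try:
    "12" + 3
    mixing_allowed = True
except TypeError:
    mixing_allowed = False

# An explicit conversion is required to treat the string as a number.
converted_sum = int("12") + 3

print(type_when_int, type_when_str, mixing_allowed, converted_sum)
```

<p>The rebinding of <code>value</code> is the dynamic part; the <code>TypeError</code> is the strong part.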
|
||||
<h2 id="odbchelper.docstring">2.3. Documenting Functions</h2>
|
||||
<p>You can document a Python function by giving it a <code>docstring</code>.
|
||||
<div class=example><h3 id="odbchelper.triplequotes">Example 2.2. Defining the <code>buildConnectionString</code> Function's <code>docstring</code></h3><pre><code>
|
||||
@@ -729,9 +589,18 @@ them into a larger program.
|
||||
<li><a href="http://www.python.org/doc/current/ref/"><i class=citetitle>Python Reference Manual</i></a> discusses the low-level details of <a href="http://www.python.org/doc/current/ref/import.html">importing modules</a>.
|
||||
|
||||
</ul>
|
||||
<div class=chapter>
|
||||
<h2 id="datatypes">Chapter 3. Native Datatypes</h2>
|
||||
<h2 id="odbchelper.list">3.2. Introducing Lists</h2>
|
||||
<h2 id="odbchelper.vardef">3.4. Declaring variables</h2>
|
||||
<p>Now that you know something about dictionaries, tuples, and lists (oh my!), let's get back to the sample program from <a href="#odbchelper">Chapter 2</a>, <code>odbchelper.py</code>.
|
||||
<p>Python has local and global variables like most other languages, but it has no explicit variable declarations. Variables spring
|
||||
@@ -795,65 +664,6 @@ NameError: There is no variable named 'x'</samp>
|
||||
|
||||
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class=citetitle>How to Think Like a Computer Scientist</i></a> shows how to use multi-variable assignment to <a href="http://www.ibiblio.org/obp/thinkCSpy/chap09.htm">swap the values of two variables</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.stringformatting">3.5. Formatting Strings</h2>
|
||||
<p>Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
|
||||
to insert values into a string with the <code>%s</code> placeholder.
|
||||
<table id="compare.stringformatting.c" class=note border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/note.png" alt="Note" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%">String formatting in Python uses the same syntax as the <code>sprintf</code> function in <abbr>C</abbr>.
|
||||
<div class=example><h3>Example 3.21. Introducing String Formatting</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>k = "uid"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>v = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>"%s=%s" % (k, v)</kbd> <span>①</span>
|
||||
'uid=sa'</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The whole expression evaluates to a string. The first <code>%s</code> is replaced by the value of <var>k</var>; the second <code>%s</code> is replaced by the value of <var>v</var>. All other characters in the string (in this case, the equal sign) stay as they are.
|
||||
<p>Note that <code>(k, v)</code> is a tuple. I told you they were good for something.
|
||||
<p>You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
|
||||
string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
|
||||
<div class=example><h3 id="odbchelper.stringformatting.coerce">Example 3.22. String Formatting vs. Concatenating</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>uid = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>pwd = "secret"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print pwd + " is not a good password for " + uid</kbd> <span>①</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>print "%s is not a good password for %s" % (pwd, uid)</kbd> <span>②</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>userCount = 6</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: %d" % (userCount, )</kbd> <span>③</span> <span>④</span>
|
||||
Users connected: 6
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: " + userCount</kbd> <span>⑤</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
TypeError: cannot concatenate 'str' and 'int' objects</samp></pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li><code>+</code> is the string concatenation operator.
|
||||
<li>In this trivial case, string formatting accomplishes the same result as concatenation.
|
||||
<li><code>(userCount, )</code> is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a
|
||||
tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the
|
||||
comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether <code>(userCount)</code> was a tuple with one element or just the value of <var>userCount</var>.
|
||||
<li>String formatting works with integers by specifying <code>%d</code> instead of <code>%s</code>.
|
||||
<li>Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works
|
||||
only when everything is already a string.
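<p>The callouts above can be condensed into one short sketch. It is written for Python 3 (so <code>print</code> is called as a function, unlike the listing above), and the variable names are chosen only for this illustration.

```python
user_count = 6

# Formatting coerces the integer for you.
formatted = "Users connected: %d" % (user_count,)

# Concatenation does not: str + int raises TypeError.
try:
    concatenated = "Users connected: " + user_count
except TypeError:
    concatenated = None

# Concatenation works once you convert explicitly.
converted = "Users connected: " + str(user_count)

# The trailing comma is what makes a one-element tuple.
one_element_tuple = (user_count,)
just_the_value = (user_count)

print(formatted)                  # Users connected: 6
print(converted)                  # Users connected: 6
print(type(one_element_tuple))    # <class 'tuple'>
print(type(just_the_value))       # <class 'int'>
```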
|
||||
<p>As with <code>printf</code> in <abbr>C</abbr>, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
|
||||
<div class=example><h3 id="odbchelper.stringformatting.numbers">Example 3.23. Formatting Numbers</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %f" % 50.4625</kbd> <span>①</span>
|
||||
Today's stock price: 50.462500
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %.2f" % 50.4625</kbd> <span>②</span>
|
||||
Today's stock price: 50.46
|
||||
<samp class=prompt>>>> </samp><kbd>print "Change since yesterday: %+.2f" % 1.5</kbd> <span>③</span>
|
||||
Change since yesterday: +1.50
|
||||
</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The <code>%f</code> string formatting option treats the value as a decimal, and prints it to six decimal places.
|
||||
<li>The ".2" modifier of the <code>%f</code> option rounds the value to two decimal places.
|
||||
<li>You can even combine modifiers. Adding the <code>+</code> modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding
|
||||
the value to exactly two decimal places.
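<p>A quick way to check these modifiers yourself is to build the strings without printing them. This sketch assumes Python 3; the value <code>50.468</code> is an extra test value, not from the example above, added to show that <code>%.2f</code> rounds rather than chops off digits.

```python
price = 50.4625

six_places = "%f" % price        # default precision is six decimal places
two_places = "%.2f" % price      # precision of two decimal places
signed = "%+.2f" % 1.5           # explicit sign, padded to two places
rounded_up = "%.2f" % 50.468     # rounds up; truncation would give 50.46

print(six_places)    # 50.462500
print(two_places)    # 50.46
print(signed)        # +1.50
print(rounded_up)    # 50.47
```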
|
||||
<div class=itemizedlist>
|
||||
<h3>Further Reading on String Formatting</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/typesseq-strings.html">all the string formatting format characters</a>.
|
||||
|
||||
<li><a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Top"><i class=citetitle>Effective <abbr>AWK</abbr> Programming</i></a> discusses <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Control+Letters">all the format characters</a> and advanced string formatting techniques like <a href="http://www-gnats.gnu.org:8080/cgi-bin/info2www?(gawk)Format+Modifiers">specifying width, precision, and zero-padding</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.map">3.6. Mapping Lists</h2>
|
||||
<p>One of the most powerful features of Python is the list comprehension, which provides a compact way of mapping a list into another list by applying a function to each
|
||||
@@ -909,75 +719,23 @@ as <code><var>params</var>.<code>items</code>()</code>, but each element in the
|
||||
<li><a href="http://www.python.org/doc/current/tut/tut.html"><i class=citetitle>Python Tutorial</i></a> shows how to <a href="http://www.python.org/doc/current/tut/node7.html#SECTION007140000000000000000">do nested list comprehensions</a>.
|
||||
|
||||
</ul>
|
||||
<h2 id="odbchelper.join">3.7. Joining Lists and Splitting Strings</h2>
|
||||
<p>You have a list of key-value pairs in the form <code><var>key</var>=<var>value</var></code>, and you want to join them into a single string. To join any list of strings into a single string, use the <code>join</code> method of a string object.
|
||||
|
||||
<p>Here is an example of joining a list from the <code>buildConnectionString</code> function:<pre><code>
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</pre><p>One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
|
||||
is an object. You might have thought I meant that string <em>variables</em> are objects. But no, look closely at this example and you'll see that the string <code>";"</code> itself is an object, and you are calling its <code>join</code> method.
|
||||
<p>The <code>join</code> method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't
|
||||
need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
|
||||
<table id="tip.join" class=caution border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/caution.png" alt="Caution" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code>join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements
|
||||
will raise an exception.
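<p>As a sketch of that caution (Python 3 assumed; the list <code>mixed</code> is a made-up example), joining fails on the first non-string element, and converting each element with <code>str</code> first is one way around it.

```python
mixed = ["port", 5432]

# join does no type coercion, so the integer triggers a TypeError.
try:
    joined = "=".join(mixed)
except TypeError:
    joined = None

# Convert every element to a string before joining.
joined_after_conversion = "=".join(str(item) for item in mixed)

print(joined)                    # None
print(joined_after_conversion)   # port=5432
```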
|
||||
<div class=example><h3 id="odbchelper.join.example">Example 3.27. Output of <code>odbchelper.py</code></h3><pre class=screen><samp class=prompt>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre><p>This string is then returned from the <code>buildConnectionString</code> function and printed by the calling block, which gives you the output that you marveled at when you started reading this
|
||||
chapter.
|
||||
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's
|
||||
called <code>split</code>.
|
||||
<div class=example><h3 id="odbchelper.split.example">Example 3.28. Splitting a String</h3><pre class=screen><samp class=prompt>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s = ";".join(li)</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";")</kbd> <span>①</span>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";", 1)</kbd> <span>②</span>
|
||||
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li><code>split</code> reverses <code>join</code> by splitting a string into a multi-element list. Note that the delimiter (“<code>;</code>”) is stripped out completely; it does not appear in any of the elements of the returned list.
|
||||
<li><code>split</code> takes an optional second argument, which is the number of times to split. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)
|
||||
<table id="tip.split" class=tip border="0" summary="">
|
||||
|
||||
<td rowspan="2" align="center" valign="top" width="1%"><img src="images/tip.png" alt="Tip" title="" width="24" height="24"><td colspan="2" align="left" valign="top" width="99%"><code><var>anystring</var>.<code>split</code>(<var>delimiter</var>, 1)</code> is a useful technique when you want to search a string for a substring and then work with everything before the substring
|
||||
(which ends up in the first element of the returned list) and everything after it (which ends up in the second element).
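<p>To see the two-argument form at work, here is a sketch that reverses <code>buildConnectionString</code>. The helper <code>parse_connection_string</code> is hypothetical (it does not appear in <code>odbchelper.py</code>), and the code assumes Python 3.

```python
def parse_connection_string(connection_string):
    """Turn 'k1=v1;k2=v2' back into a dictionary.

    Hypothetical helper for illustration only. Each pair is split
    on the first '=' so that values may themselves contain '='.
    """
    pairs = connection_string.split(";")
    return dict(pair.split("=", 1) for pair in pairs)


s = "server=mpilgrim;uid=sa;database=master;pwd=a=b"
params = parse_connection_string(s)

# Everything before the first '=' is the key; everything after is the value.
key, value = "pwd=a=b".split("=", 1)

print(params["server"])   # mpilgrim
print(params["pwd"])      # a=b
print(key, value)         # pwd a=b
```

<p>Without the second argument, <code>"pwd=a=b".split("=")</code> would return three elements and the conversion to a dictionary would fail.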
|
||||
<div class=itemizedlist>
|
||||
<h3>Further Reading on String Methods</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.faqts.com/knowledge-base/index.phtml/fid/199/">Python Knowledge Base</a> answers <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/480">common questions about strings</a> and has a lot of <a href="http://www.faqts.com/knowledge-base/index.phtml/fid/539">example code using strings</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> summarizes <a href="http://www.python.org/doc/current/lib/string-methods.html">all the string methods</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/current/lib/"><i class=citetitle>Python Library Reference</i></a> documents the <a href="http://www.python.org/doc/current/lib/module-string.html"><code>string</code> module</a>.
|
||||
|
||||
<li><a href="http://www.python.org/doc/FAQ.html"><i class=citetitle>The Whole Python <abbr>FAQ</abbr></i></a> explains <a href="http://www.python.org/cgi-bin/faqw.py?query=4.96&querytype=simple&casefold=yes&req=search">why <code>join</code> is a string method</a> instead of a list method.
|
||||
|
||||
</ul>
|
||||
<h3>3.7.1. Historical Note on String Methods</h3>
|
||||
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story
|
||||
behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed
|
||||
important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of
|
||||
the old <code>string</code> module (which still has a lot of useful stuff in it). I use the new <code>join</code> method exclusively, but you will see code written either way, and if it really bothers you, you can use the old <code>string.join</code> function instead.
|
||||
<h2 id="odbchelper.summary">3.8. Summary</h2>
|
||||
<p>The <code>odbchelper.py</code> program and its output should now make perfect sense.
|
||||
<pre><code>
|
||||
def buildConnectionString(params):
|
||||
"""Build a connection string from a dictionary of parameters.
|
||||
(String splitting stuff was here)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
Returns string."""
|
||||
return ";".join(["%s=%s" % (k, v) for k, v in params.items()])
|
||||
|
||||
if __name__ == "__main__":
|
||||
myParams = {"server":"mpilgrim", \
|
||||
"database":"master", \
|
||||
"uid":"sa", \
|
||||
"pwd":"secret" \
|
||||
}
|
||||
print buildConnectionString(myParams)</pre>
|
||||
<p>Here is the output of <code>odbchelper.py</code>:<pre class=screen>server=mpilgrim;uid=sa;database=master;pwd=secret</pre><div class=highlights>
|
||||
<p>Before diving into the next chapter, make sure you're comfortable doing all of these things:
|
||||
<div class=itemizedlist>
|
||||
<ul>
|
||||
@@ -4162,53 +3920,21 @@ u'0'</pre><div class=calloutlist>
|
||||
<li>You can even use the <code>toxml</code> method here, deeply nested within the document.
|
||||
<li>The <code>p</code> element has only one child node (you can't tell that from this example, but look at <code>pNode.childNodes</code> if you don't believe me), and it is a <code>Text</code> node for the single character <code>'0'</code>.
|
||||
<li>The <code>.data</code> attribute of a <code>Text</code> node gives you the actual string that the text node represents. But what is that <code>'u'</code> in front of the string? The answer to that deserves its own section.
|
||||
<h2 id="kgp.unicode">9.4. Unicode</h2>
|
||||
<p>Unicode is a system to represent characters from all the world's different languages. When Python parses an <abbr>XML</abbr> document, all data is stored in memory as unicode.
|
||||
<p>You'll get to all that in a minute, but first, some background.
|
||||
<p><b>Historical note. </b>Before unicode, there were separate character encoding systems for each language, each using the same numbers (0-255) to represent
|
||||
that language's characters. Some languages (like Russian) have multiple conflicting standards about how to represent the
|
||||
same characters; other languages (like Japanese) have so many characters that they require multiple-byte character sets.
|
||||
Exchanging documents between systems was difficult because there was no way for a computer to tell for certain which character
|
||||
encoding scheme the document author had used; the computer only saw numbers, and the numbers could mean different things.
|
||||
Then think about trying to store these documents in the same place (like in the same database table); you would need to store
|
||||
the character encoding alongside each piece of text, and make sure to pass it around whenever you passed the text around.
|
||||
Then think about multilingual documents, with characters from multiple languages in the same document. (They typically used
|
||||
escape codes to switch modes; poof, you're in Russian koi8-r mode, so character 241 means this; poof, now you're in Mac Greek
|
||||
mode, so character 241 means something else. And so on.) These are the problems which unicode was designed to solve.
|
||||
<p>To solve these problems, unicode represents each character as a 2-byte number, from 0 to 65535.
|
||||
<sup>[<a name="d0e23786" href="#ftn.d0e23786">5</a>]</sup> Each 2-byte number represents a unique character used in at least one of the world's languages. (Characters that are used
|
||||
in multiple languages have the same numeric code.) There is exactly 1 number per character, and exactly 1 character per number.
|
||||
Unicode data is never ambiguous.
|
||||
<p>Of course, there is still the matter of all these legacy encoding systems. 7-bit <abbr>ASCII</abbr>, for instance, which stores English characters as numbers ranging from 0 to 127. (65 is capital “<code>A</code>”, 97 is lowercase “<code>a</code>”, and so forth.) English has a very simple alphabet, so it can be completely expressed in 7-bit <abbr>ASCII</abbr>. Western European languages like French, Spanish, and German all use an encoding system called ISO-8859-1 (also called “latin-1”), which uses the 7-bit <abbr>ASCII</abbr> characters for the numbers 0 through 127, but then extends into the 128-255 range for characters like n-with-a-tilde-over-it
|
||||
(241), and u-with-two-dots-over-it (252). And unicode uses the same characters as 7-bit <abbr>ASCII</abbr> for 0 through 127, and the same characters as ISO-8859-1 for 128 through 255, and then extends from there into characters
|
||||
for other languages with the remaining numbers, 256 through 65535.
|
||||
<p>When dealing with unicode data, you may at some point need to convert the data back into one of these other legacy encoding
|
||||
systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding
|
||||
scheme, or to print it to a non-unicode-aware terminal or printer. Or to store it in an <abbr>XML</abbr> document which explicitly specifies the encoding scheme.
|
||||
<p>And on that note, let's get back to Python.
|
||||
<p>Python has had unicode support throughout the language since version 2.0. The <abbr>XML</abbr> package uses unicode to store all parsed <abbr>XML</abbr> data, but you can use unicode anywhere.
|
||||
<div class=example><h3>Example 9.13. Introducing unicode</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'Dive in'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
u'Dive in'
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
Dive in</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>To create a unicode string instead of a regular <abbr>ASCII</abbr> string, add the letter “<code>u</code>” before the string. Note that this particular string doesn't have any non-<abbr>ASCII</abbr> characters. That's fine; unicode is a superset of <abbr>ASCII</abbr> (a very large superset at that), so any regular <abbr>ASCII</abbr> string can also be stored as unicode.
|
||||
<li>When printing a string, Python will attempt to convert it to your default encoding, which is usually <abbr>ASCII</abbr>. (More on this in a minute.) Since this unicode string is made up of characters that are also <abbr>ASCII</abbr> characters, printing it has the same result as printing a normal <abbr>ASCII</abbr> string; the conversion is seamless, and if you didn't know that <var>s</var> was a unicode string, you'd never notice the difference.
|
||||
<div class=example><h3>Example 9.14. Storing non-<abbr>ASCII</abbr> characters</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>③</span>
|
||||
La Peña</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The real advantage of unicode, of course, is its ability to store non-<abbr>ASCII</abbr> characters, like the Spanish “<code>ñ</code>” (<code>n</code> with a tilde over it). The unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
|
||||
<li>Remember I said that the <code>print</code> statement attempts to convert a unicode string to <abbr>ASCII</abbr> so it can print it? Well, that's not going to work here, because your unicode string contains non-<abbr>ASCII</abbr> characters, so Python raises a <samp>UnicodeError</samp>.
|
||||
<li>Here's where the conversion-from-unicode-to-other-encoding-schemes comes in. <var>s</var> is a unicode string, but <code>print</code> can only print a regular string. To solve this problem, you call the <code>encode</code> method, available on every unicode string, to convert the unicode string to a regular string in the given encoding scheme,
|
||||
which you pass as a parameter. In this case, you're using <code>latin-1</code> (also known as <code>iso-8859-1</code>), which includes the tilde-n (whereas the default <abbr>ASCII</abbr> encoding scheme did not, since it only includes characters numbered 0 through 127).
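<p>The same conversion can be sketched in Python 3 terms. This is an assumption about how the example carries over, not the listing above: in Python 3 every <code>str</code> is already unicode, so there is no <code>u</code> prefix, and <code>encode</code> returns a <code>bytes</code> object rather than a regular string.

```python
s = "La Pe\xf1a"                  # \xf1 is the character n-with-tilde

# latin-1 has a single byte for every code point up to 255.
as_latin1 = s.encode("latin-1")

# ASCII stops at 127, so encoding fails with a UnicodeEncodeError.
try:
    as_ascii = s.encode("ascii")
except UnicodeEncodeError:
    as_ascii = None

# UTF-8 can represent the character, using two bytes for it.
as_utf8 = s.encode("utf-8")

print(as_latin1)   # b'La Pe\xf1a'
print(as_ascii)    # None
print(as_utf8)     # b'La Pe\xc3\xb1a'
```

<p>Decoding with the matching encoding, for example <code>as_latin1.decode("latin-1")</code>, gives back the original string.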
|
||||
(Unicode stuff was here)
|
||||
|
||||
<p>Remember I said Python usually converted unicode to <abbr>ASCII</abbr> whenever it needed to make a regular string out of a unicode string? Well, this default encoding scheme is an option which
|
||||
you can customize.
|
||||
<div class=example><h3>Example 9.15. <code>sitecustomize.py</code></h3><pre><code>
|
||||
@@ -4233,57 +3959,19 @@ La Peña</pre><div class=calloutlist>
|
||||
<li>This example assumes that you have made the changes listed in the previous example to your <code>sitecustomize.py</code> file, and restarted Python. If your default encoding still says <code>'ascii'</code>, you didn't set up your <code>sitecustomize.py</code> properly, or you didn't restart Python. The default encoding can only be changed during Python startup; you can't change it later. (Due to some wacky programming tricks that I won't get into right now, you can't even
|
||||
call <code>sys.setdefaultencoding</code> after Python has started up. Dig into <code>site.py</code> and search for “<code>setdefaultencoding</code>” to find out how.)
|
||||
<li>Now that the default encoding scheme includes all the characters you use in your string, Python has no problem auto-coercing the string and printing it.
|
||||
<div class=example><h3>Example 9.17. Specifying encoding in <code>.py</code> files</h3>
|
||||
<p>If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual <code>.py</code> file by putting an encoding declaration at the top of each file. This declaration defines the <code>.py</code> file to be UTF-8:<pre><code>
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: UTF-8 -*-
|
||||
</pre><p>Now, what about <abbr>XML</abbr>? Well, every <abbr>XML</abbr> document is in a specific encoding. Again, ISO-8859-1 is a popular encoding for data in Western European languages. KOI8-R
|
||||
is popular for Russian texts. The encoding, if specified, is in the header of the <abbr>XML</abbr> document.
|
||||
<div class=example><h3>Example 9.18. <code>russiansample.xml</code></h3><pre class=screen><samp>
|
||||
<?xml version="1.0" encoding="koi8-r"?> </samp><span>①</span><samp>
|
||||
<preface>
|
||||
<title>Предисловие</title> </samp><span>②</span><samp>
|
||||
</preface></samp></pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>This is a sample extract from a real Russian <abbr>XML</abbr> document; it's part of a Russian translation of this very book. Note the encoding, <code>koi8-r</code>, specified in the header.
|
||||
<li>These are Cyrillic characters which, as far as I know, spell the Russian word for “Preface”. If you open this file in a regular text editor, the characters will most likely look like gibberish, because they're encoded
|
||||
using the <code>koi8-r</code> encoding scheme, but they're being displayed in <code>iso-8859-1</code>.
|
||||
<div class=example><h3>Example 9.19. Parsing <code>russiansample.xml</code></h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>from xml.dom import minidom</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>xmldoc = minidom.parse('russiansample.xml')</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>title = xmldoc.getElementsByTagName('title')[0].firstChild.data</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>title</kbd> <span>②</span>
|
||||
u'\u041f\u0440\u0435\u0434\u0438\u0441\u043b\u043e\u0432\u0438\u0435'
|
||||
<samp class=prompt>>>> </samp><kbd>print title</kbd> <span>③</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>convertedtitle = title.encode('koi8-r')</kbd> <span>④</span>
|
||||
<samp class=prompt>>>> </samp><kbd>convertedtitle</kbd>
|
||||
'\xf0\xd2\xc5\xc4\xc9\xd3\xcc\xcf\xd7\xc9\xc5'
|
||||
<samp class=prompt>>>> </samp><kbd>print convertedtitle</kbd> <span>⑤</span>
|
||||
Предисловие</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>I'm assuming here that you saved the previous example as <code>russiansample.xml</code> in the current directory. I am also, for the sake of completeness, assuming that you've changed your default encoding back
|
||||
to <code>'ascii'</code> by removing your <code>sitecustomize.py</code> file, or at least commenting out the <code>setdefaultencoding</code> line.
|
||||
<li>Note that the text data of the <abbr>XML</abbr> document's <code>title</code> element (now in the <var>title</var> variable, thanks to that long concatenation of Python functions which I hastily skipped over and, annoyingly, won't explain until the next section) is stored in unicode.
|
||||
<li>Printing the title is not possible, because this unicode string contains non-<abbr>ASCII</abbr> characters, and Python has no sensible way to convert them to <abbr>ASCII</abbr>.
|
||||
<li>You can, however, explicitly convert it to <code>koi8-r</code>, in which case you get a (regular, not unicode) string of single-byte characters (<code>f0</code>, <code>d2</code>, <code>c5</code>, and so forth) that are the <code>koi8-r</code>-encoded versions of the characters in the original unicode string.
|
||||
<li>Printing the <code>koi8-r</code>-encoded string will probably show gibberish on your screen, because your Python <abbr>IDE</abbr> is interpreting those characters as <code>iso-8859-1</code>, not <code>koi8-r</code>. But at least they do print. (And, if you look carefully, it's the same gibberish that you saw when you opened the original
|
||||
<abbr>XML</abbr> document in a non-unicode-aware text editor. Python converted it from <code>koi8-r</code> into unicode when it parsed the <abbr>XML</abbr> document, and you've just converted it back.)
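<p>For comparison, here is a rough Python 3 sketch of the same round trip. It assumes a Python 3 interpreter and builds the sample document in memory with <code>parseString</code> instead of reading <code>russiansample.xml</code> from disk; the variable names are mine, not part of the original example.

```python
from xml.dom import minidom

# The sample document from the example above, built in memory.
xml_text = (
    '<?xml version="1.0" encoding="koi8-r"?>\n'
    "<preface>\n"
    "<title>Предисловие</title>\n"
    "</preface>\n"
)
xml_bytes = xml_text.encode("koi8-r")    # the bytes a koi8-r file would contain

# The parser reads the encoding from the XML header and hands back str objects.
xmldoc = minidom.parseString(xml_bytes)
title = xmldoc.getElementsByTagName("title")[0].firstChild.data

# Encoding explicitly gives you the single-byte koi8-r form again.
converted_title = title.encode("koi8-r")

print(ascii(title))         # the escaped code points, safe on any terminal
print(converted_title)      # a bytes object
```

<p>In Python 3 the parsed text is an ordinary <code>str</code>, and the encoded form is a separate <code>bytes</code> object, so the two can no longer be mixed up by accident.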
|
||||
<p>To sum up, unicode itself is a bit intimidating if you've never seen it before, but unicode data is really very easy to handle
|
||||
in Python. If your <abbr>XML</abbr> documents are all 7-bit <abbr>ASCII</abbr> (like the examples in this chapter), you will literally never think about unicode. Python will convert the <abbr>ASCII</abbr> data in the <abbr>XML</abbr> documents into unicode while parsing, and auto-coerce it back to <abbr>ASCII</abbr> whenever necessary, and you'll never even notice. But if you need to deal with text in other languages, Python is ready.
|
||||
<div class=itemizedlist>
|
||||
<h3>Further reading</h3>
|
||||
<ul>
|
||||
<li><a href="http://www.unicode.org/">Unicode.org</a> is the home page of the unicode standard, including a brief <a href="http://www.unicode.org/standard/principles.html">technical introduction</a>.
|
||||
|
||||
<li><a href="http://www.reportlab.com/i18n/python_unicode_tutorial.html">Unicode Tutorial</a> has some more examples of how to use Python's unicode functions, including how to force Python to coerce unicode into <abbr>ASCII</abbr> even when it doesn't really want to.
|
||||
|
||||
<li><a href="http://www.python.org/peps/pep-0263.html">PEP 263</a> goes into more detail about how and when to define a character encoding in your <code>.py</code> files.
|
||||
|
||||
</ul>
|
||||
|
||||
|
||||
(More Unicode stuff was here)
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
<h2 id="kgp.search">9.5. Searching for elements</h2>
|
||||
<p>Traversing <abbr>XML</abbr> documents by stepping through each node can be tedious. If you're looking for something in particular, buried deep within
|
||||
your <abbr>XML</abbr> document, there is a shortcut you can use to find it quickly: <code>getElementsByTagName</code>.
|
||||
|
||||
+17
-4
@@ -8,20 +8,28 @@
|
||||
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
|
||||
<link rel=stylesheet type=text/css href=dip3.css>
|
||||
<style>
|
||||
.first{clear:both;margin-top:0;padding-top:1.75em}
|
||||
h1:before{content:""}
|
||||
li:last-child{list-style:none;margin:0 0 0 -1.7em}
|
||||
li:last-child:before{content:"A. \00a0 \00a0"}
|
||||
li.todo{background:white;color:gainsboro}
|
||||
li.todo{color:#ddd}
|
||||
span{cursor:default}
|
||||
</style>
|
||||
</head>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
|
||||
<p class=nav>You are here:  <span title="Ce n'est pas un point">•</span>
|
||||
|
||||
<h1>Dive Into Python 3</h1>
|
||||
|
||||
<p class=first><cite>Dive Into Python 3</cite> will cover Python 3 and its differences from Python 2. Compared to the original <cite><a href=http://diveintopython.org/>Dive Into Python</a></cite>, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final version will be published on paper by Apress. The book will remain online under the <a rel=license href=http://creativecommons.org/licenses/by-sa/3.0/>CC-BY-SA-3.0</a> license.
|
||||
|
||||
<p>You can see the <a href=table-of-contents.html>full table of contents</a> (<strong>not finalized</strong>), or read what I’ve written so far:</p>
|
||||
|
||||
<ol start=0>
|
||||
<li class=todo>Installing Python
|
||||
<li><a href=your-first-python-program.html>Your first Python program</a>
|
||||
<li><a href=native-datatypes.html>Native datatypes</a>
|
||||
<li class=todo>Strings
|
||||
<li><a href=strings.html>Strings</a>
|
||||
<li><a href=regular-expressions.html>Regular expressions</a>
|
||||
<li class=todo>The power of introspection
|
||||
<li class=todo>Objects and object-orientation
|
||||
@@ -41,8 +49,13 @@ li.todo{background:white;color:gainsboro}
|
||||
<li><a href=case-study-porting-chardet-to-python-3.html>Case study: porting <code>chardet</code> to Python 3</a>
|
||||
<li><a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code>2to3</code></a>
|
||||
</ol>
|
||||
|
||||
<p>There is a <a href=http://hg.diveintopython3.org/>changelog</a>, a <a type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>feed</a>, and <a href="http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&sort=new">discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
|
||||
|
||||
<pre><samp class=prompt>you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
|
||||
|
||||
<p>The final version will be downloadable as <abbr>HTML</abbr> and <abbr>PDF</abbr>.
|
||||
|
||||
<p class=c>This site is optimized for Lynx just because fuck you.<br>I’m told it also looks good in graphical browsers.
|
||||
|
||||
<p class=c>© 2001–4, 2009 <span>ℳ</span>ark Pilgrim • <a href=about.html>open standards • open content • open source</a>
|
||||
|
||||
@@ -12,7 +12,7 @@ body{counter-reset:h1 2}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#native-datatypes>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Native datatypes</h1>
|
||||
<blockquote class=q>
|
||||
|
||||
@@ -14,7 +14,7 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#porting-code-to-python-3-with-2to3>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Porting code to Python 3 with <code>2to3</code></h1>
|
||||
<blockquote class=q>
|
||||
|
||||
@@ -12,7 +12,7 @@ body{counter-reset:h1 4}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#regular-expressions>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Regular expressions</h1>
|
||||
<blockquote class=q>
|
||||
|
||||
+268
@@ -0,0 +1,268 @@
|
||||
<!DOCTYPE html>
|
||||
<html lang=en>
|
||||
<head>
|
||||
<meta charset=utf-8>
|
||||
<title>Strings - Dive into Python 3</title>
|
||||
<!--[if IE]><script src=html5.js></script><![endif]-->
|
||||
<link rel="shortcut icon" href=data:image/ico,>
|
||||
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
|
||||
<link rel=stylesheet type=text/css href=dip3.css>
|
||||
<style>
|
||||
body{counter-reset:h1 3}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#strings>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Strings</h1>
|
||||
<blockquote class=q>
|
||||
<p><span>❝</span> I’m telling you this ’cause you’re one of my friends.<br>
|
||||
My alphabet starts where your alphabet ends! <span>❞</span><br>— <cite>Dr. Seuss, On Beyond Zebra!</cite>
|
||||
</blockquote>
|
||||
<ol>
|
||||
<li><a href=#divingin>Diving in</a>
|
||||
<li><a href=#one-ring-to-rule-them-all>Unicode</a>
|
||||
<ol>
|
||||
<li>How strings are stored in memory
|
||||
<li>Converting between different character encodings
|
||||
<li><a href=#py-encoding>Specifying character encoding in <code>.py</code> files</a>
|
||||
</ol>
|
||||
<li>Strings in Python 3
|
||||
<li>Common string operations
|
||||
<li>Formatting strings
|
||||
<li><a href=#string-module>The <code>string</code> module</a>
|
||||
<li><a href=#byte-arrays>Strings vs. bytes</a>
|
||||
<li><a href=#furtherreading>Further reading</a>
|
||||
</ol>
|
||||
<h2 id=divingin>Diving in</h2>
|
||||
<p class=fancy>Chinese has thousands of characters. The <a href="http://en.wikipedia.org/wiki/Rotokas_alphabet">Rotokas alphabet</a> of <a href="http://en.wikipedia.org/wiki/Bougainville_Province">Bougainville</a> is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.
|
||||
|
||||
<p>When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <i>character encoding</i>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
|
||||
|
||||
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key. Whenever someone gives you a sequence of bytes — a file, a web page, whatever — and claims it’s “text,” you need to know what character encoding they used so you can decode the bytes into characters. If they give you the wrong key or no key at all, you’re left with the unenviable task of cracking the code yourself. Chances are you’ll get it wrong, and the result will be gibberish.
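<p>The “decryption key” idea can be sketched in a few lines of Python 3. This is only an illustration; the sample word and the two encodings are my own choice.

```python
# The same bytes, decoded with the right key and with the wrong one.
original = "Предисловие"              # Russian for "Preface"
raw = original.encode("koi8-r")       # what actually gets stored or transmitted

right = raw.decode("koi8-r")          # correct key: the text comes back intact
wrong = raw.decode("iso-8859-1")      # wrong key: no error, just gibberish

print(right == original)              # True
print(wrong == original)              # False
print(ascii(wrong))                   # escaped view of the gibberish
```

<p>Note that the wrong key does not raise an error here. Every byte means <em>something</em> in a single-byte encoding, which is exactly why a bad guess produces garbage instead of a warning.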
|
||||
|
||||
<p>Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.
|
||||
|
||||
<p>As I mentioned, there are separate character encodings for each major language in the world, and a lot of minor ones. Since each language is different, and disk space has historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. <abbr>ASCII</abbr>, for instance, stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
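<p>You can poke at those numbers from a Python 3 shell. A minimal sketch, using only built-in functions; the sample strings are arbitrary.

```python
# Characters map to numbers and back.
print(ord("A"), ord("a"))        # 65 97
print(chr(65), chr(97))          # A a

# Encoding English text as ASCII yields one small number per character.
ascii_bytes = "Dive in".encode("ascii")
print(list(ascii_bytes))         # [68, 105, 118, 101, 32, 105, 110]
fits_in_seven_bits = all(b < 128 for b in ascii_bytes)

# Anything outside the 0-127 range simply has no ASCII representation.
try:
    "La Pe\xf1a".encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(fits_in_seven_bits, ascii_failed)   # True True
```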
|
||||
|
||||
<p>Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with <abbr>ASCII</abbr> in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), and so on. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
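<p>Here is a small check of those two numbers, assuming Python 3 and its bundled <code>cp1252</code> codec.

```python
# Each accented letter still fits in a single byte under CP-1252.
n_tilde = "\xf1".encode("cp1252")      # n with a tilde over it
u_umlaut = "\xfc".encode("cp1252")     # u with two dots over it
print(list(n_tilde), list(u_umlaut))   # [241] [252]

# The 0-127 range is shared with ASCII, byte for byte.
same_as_ascii = "Dive in".encode("cp1252") == "Dive in".encode("ascii")
print(same_as_ascii)                   # True
```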
|
||||
|
||||
<p>Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent.
|
||||
|
||||
<p>That was mostly OK in a non-networked world, where “text” was something you typed yourself and occasionally printed. There wasn’t much “plain text” — your word processor had its own format with stored character encoding information, rich styling, and so on. Word processors were customized for each language, so they automatically used the most appropriate character encoding in the Russian edition and in the English edition and in the Spanish edition. People who read these documents were using the same word processing program as the original author, so everything worked, more or less.
|
||||
|
||||
<p>Now think about the rise of global networks like email and the web. Lots of “plain text” flying around the globe, being authored on one computer, transmitted through a second computer, and received and displayed by a third computer. Computers can only see numbers, but the numbers could mean different things. Oh no! What to do? Well, systems had to be designed to carry encoding information along with every piece of “plain text.” Remember, it’s the decryption key that maps computer-readable numbers to human-readable characters. A missing decryption key means garbled text, gibberish, or worse.
|
||||
|
||||
<p>Now think about trying to store multiple pieces of text in the same place, like in the same database table that holds all the email you’ve ever received. You still need to store the character encoding alongside each piece of text so you can display it properly. Think that’s hard? Try searching your email database, which means converting between multiple encodings on the fly. Doesn’t that sound fun?
|
||||
|
||||
<p>Now think about the possibility of multilingual documents, where characters from several languages are next to each other in the same document. (Hint: programs that tried to do this typically used escape codes to switch “modes.” Poof, you’re in Russian koi8-r mode, so 241 means this character; poof, now you’re in Mac Greek mode, so 241 means some other character.) And of course you’ll want to search <em>those</em> documents, too.
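<p>To make the “241 means this character, no wait, that character” problem concrete, here is a sketch that decodes the single byte 241 under two different encodings. I picked koi8-r and CP-1252 because Python 3 ships codecs for both; any two single-byte encodings would make the same point.

```python
one_byte = bytes([241])

as_koi8r = one_byte.decode("koi8-r")     # a Cyrillic capital letter
as_cp1252 = one_byte.decode("cp1252")    # n with a tilde over it

# Print the resulting code points rather than the characters themselves,
# so the output is readable on any terminal.
print(hex(ord(as_koi8r)))     # 0x42f
print(hex(ord(as_cp1252)))    # 0xf1
print(as_koi8r == as_cp1252)  # False
```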
|
||||
|
||||
<p>Now cry a lot, because everything you thought you knew about strings is wrong, and there ain’t no such thing as “plain text.”
|
||||
|
||||
<hr>
|
||||
|
||||
<p><b>Nothing below this line is really done yet. Thanks for reading this far! Stop now!</b>
|
||||
|
||||
<h2 id=one-ring-to-rule-them-all>Unicode</h2>
|
||||
|
||||
<p><i>Enter Unicode.</i>
|
||||
|
||||
<p>Unicode is a system designed to represent <em>every</em> character from <em>every</em> language. Unicode represents each letter, character, or ideograph as a number (a <i>code point</i>) in the range 0–1114111 (that's <code>10FFFF</code> in hexadecimal). The most straightforward way to store such a number is as a 4-byte value. Each number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; Unicode data is never ambiguous.
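<p>A rough sketch of what “one number per character” looks like in Python 3, and how many bytes that number costs under different Unicode encodings. The helper name <code>byte_sizes</code> and the sample characters are mine, chosen only for illustration.

```python
def byte_sizes(ch):
    """Return how many bytes one character occupies in three Unicode encodings."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# A plain English letter, a Spanish letter, a Chinese ideograph,
# and a character whose number is larger than 65535.
samples = ["A", "\xf1", "\u4e2d", "\U0001F600"]

for ch in samples:
    print(hex(ord(ch)), byte_sizes(ch))
```

<p>Storing every character as a fixed 4-byte number (UTF-32) is simple but wasteful for mostly-English text; the variable-width encodings trade that simplicity for space.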
|
||||
|
||||
<p>Right away, problems leap out at you. 4 bytes? For every single character<span>‽</span> [FIXME incomplete paragraph]
|
||||
|
||||
<p>Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph]
|
||||
|
||||
<p>[FIXME stuff about UTF-32, UTF-16, and finally UTF-8]
|
||||
<!--
|
||||
<p>UTF-8 uses the same characters as 7-bit <abbr>ASCII</abbr> for 0 through 127
|
||||
|
||||
|
||||
|
||||
|
||||
<p>When dealing with Unicode data, you may at some point need to convert the data back into one of these other legacy encoding
|
||||
systems. For instance, to integrate with some other computer system which expects its data in a specific 1-byte encoding
|
||||
scheme, or to print it to a non-Unicode-aware terminal or printer.
|
||||
|
||||
|
||||
|
||||
|
||||
FIXME: update for Python 3
|
||||
|
||||
<p>Python has had Unicode support throughout the language since version 2.0. The <abbr>XML</abbr> package uses Unicode to store all parsed <abbr>XML</abbr> data, but you can use Unicode anywhere.
|
||||
<div class=example><h3>Example 9.13. Introducing Unicode</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'Dive in'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
u'Dive in'
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
Dive in</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>To create a Unicode string instead of a regular <abbr>ASCII</abbr> string, add the letter “<code>u</code>” before the string. Note that this particular string doesn't have any non-<abbr>ASCII</abbr> characters. That's fine; Unicode is a superset of <abbr>ASCII</abbr> (a very large superset at that), so any regular <abbr>ASCII</abbr> string can also be stored as Unicode.
|
||||
<li>When printing a string, Python will attempt to convert it to your default encoding, which is usually <abbr>ASCII</abbr>. (More on this in a minute.) Since this Unicode string is made up of characters that are also <abbr>ASCII</abbr> characters, printing it has the same result as printing a normal <abbr>ASCII</abbr> string; the conversion is seamless, and if you didn't know that <var>s</var> was a Unicode string, you'd never notice the difference.
|
||||
<div class=example><h3>Example 9.14. Storing non-<abbr>ASCII</abbr> characters</h3><pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>s = u'La Pe\xf1a'</kbd> <span>①</span>
|
||||
<samp class=prompt>>>> </samp><kbd>print s</kbd> <span>②</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
UnicodeError: ASCII encoding error: ordinal not in range(128)</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>print s.encode('latin-1')</kbd> <span>③</span>
|
||||
La Peña</pre><div class=calloutlist>
|
||||
<ol>
|
||||
<li>The real advantage of Unicode, of course, is its ability to store non-<abbr>ASCII</abbr> characters, like the Spanish “<code>ñ</code>” (<code>n</code> with a tilde over it). The Unicode character code for the tilde-n is <code>0xf1</code> in hexadecimal (241 in decimal), which you can type like this: <code>\xf1</code>.
|
||||
<li>Remember I said that the <code>print</code> function attempts to convert a Unicode string to <abbr>ASCII</abbr> so it can print it? Well, that's not going to work here, because your Unicode string contains non-<abbr>ASCII</abbr> characters, so Python raises a <samp>UnicodeError</samp> error.
|
||||
<li>Here's where the conversion-from-Unicode-to-other-encoding-schemes comes in. <var>s</var> is a Unicode string, but <code>print</code> can only print a regular string. To solve this problem, you call the <code>encode</code> method, available on every Unicode string, to convert the Unicode string to a regular string in the given encoding scheme,
|
||||
which you pass as a parameter. In this case, you're using <code>latin-1</code> (also known as <code>iso-8859-1</code>), which includes the tilde-n (whereas the default <abbr>ASCII</abbr> encoding scheme did not, since it only includes characters numbered 0 through 127).
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
-->
|
||||
<h3 id=py-encoding>Specifying character encoding in <code>.py</code> files</h3>
|
||||
|
||||
<!--
|
||||
http://www.python.org/dev/peps/pep-0263/ - HOWTO specify encoding in .py files
|
||||
http://www.python.org/dev/peps/pep-3120/ - UTF-8 is now the default encoding (Python 2 defaulted to US-ASCII)
|
||||
-->
|
||||
|
||||
<p>[FIXME this appears to be mostly the same in Python 3, except the default encoding is now UTF-8, not ASCII.]
|
||||
|
||||
<p>If you are going to be storing non-ASCII strings within your Python code, you'll need to specify the encoding of each individual <code>.py</code> file by putting an encoding declaration at the top of each file. This declaration defines the <code>.py</code> file to be UTF-8:<pre><code>
|
||||
#!/usr/bin/env python
|
||||
# -*- coding: UTF-8 -*-</code></pre>
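<p>If you want to see which encoding Python 3 will use for a given source file, the standard library's <code>tokenize.detect_encoding</code> function reads the declaration for you. The sketch below feeds it source code held in memory; the helper name <code>declared_encoding</code> and the sample sources are hypothetical.

```python
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the encoding Python would use to read these source bytes."""
    encoding, _lines_read = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding

# A file with an explicit declaration on its second line...
with_declaration = b"#!/usr/bin/env python\n# -*- coding: koi8-r -*-\nprint('hi')\n"
# ...and one with no declaration at all.
without_declaration = b"#!/usr/bin/env python\nprint('hi')\n"

print(declared_encoding(with_declaration))      # koi8-r
print(declared_encoding(without_declaration))   # utf-8
```

<p>The second result is the Python 3 default: a source file with no declaration is read as UTF-8.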
|
||||
|
||||
<p>[FIXME maybe some examples here]
|
||||
|
||||
<h2 id=formatting-strings>Formatting strings</h2>
|
||||
|
||||
<p>[FIXME this is all completely different in Python 3. Cover the new way, then maybe show some examples from the old way? Or maybe not. Hey, maybe just point to the original "Dive Into Python".]
|
||||
|
||||
<p>Python supports formatting values into strings. Although this can include very complicated expressions, the most basic usage is
|
||||
to insert values into a string with the <code>%s</code> placeholder.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>k = "uid"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>v = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>"%s=%s" % (k, v)</kbd> <span>①</span>
|
||||
<samp>'uid=sa'</samp></pre>
|
||||
<ol>
|
||||
<li>The whole expression evaluates to a string. The first <code>%s</code> is replaced by the value of <var>k</var>; the second <code>%s</code> is replaced by the value of <var>v</var>. All other characters in the string (in this case, the equal sign) stay as they are.
|
||||
</ol>
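<p>For what it's worth, here is the same substitution written both ways under Python 3: with the <code>%</code> operator shown above, and with the <code>format</code> method that strings also have. Treat it as a sketch, not as the final word on how this chapter will cover formatting.

```python
k = "uid"
v = "sa"

old_style = "%s=%s" % (k, v)          # placeholder substitution with a tuple
new_style = "{0}={1}".format(k, v)    # numbered replacement fields

print(old_style)    # uid=sa
print(new_style)    # uid=sa
```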
|
||||
|
||||
<p>Note that <code>(k, v)</code> is a tuple. I told you they were good for something.
|
||||
|
||||
<p>You might be thinking that this is a lot of work just to do simple string concatenation, and you would be right, except that
|
||||
string formatting isn't just concatenation. It's not even just formatting. It's also type coercion.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>uid = "sa"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>pwd = "secret"</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print pwd + " is not a good password for " + uid</kbd> <span>①</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>print "%s is not a good password for %s" % (pwd, uid)</kbd> <span>②</span>
|
||||
secret is not a good password for sa
|
||||
<samp class=prompt>>>> </samp><kbd>userCount = 6</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: %d" % (userCount, )</kbd> <span>③</span> <span>④</span>
|
||||
Users connected: 6
|
||||
<samp class=prompt>>>> </samp><kbd>print "Users connected: " + userCount</kbd> <span>⑤</span>
|
||||
<samp class=traceback>Traceback (innermost last):
|
||||
File "<interactive input>", line 1, in ?
|
||||
TypeError: cannot concatenate 'str' and 'int' objects</samp></pre>
|
||||
<ol>
|
||||
<li><code>+</code> is the string concatenation operator.
|
||||
<li>In this trivial case, string formatting accomplishes the same result as concatenation.
|
||||
<li><code>(userCount, )</code> is a tuple with one element. Yes, the syntax is a little strange, but there's a good reason for it: it's unambiguously a tuple. In fact, you can always include a comma after the last element when defining a list, tuple, or dictionary, but the comma is required when defining a tuple with one element. If the comma weren't required, Python wouldn't know whether <code>(userCount)</code> was a tuple with one element or just the value of <var>userCount</var>.
|
||||
<li>String formatting works with integers by specifying <code>%d</code> instead of <code>%s</code>.
|
||||
<li>Trying to concatenate a string with a non-string raises an exception. Unlike string formatting, string concatenation works only when everything is already a string.
|
||||
</ol>
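<p>The session above uses the Python 2 <code>print</code> statement. A Python 3 sketch of the same coercion point might look like this; the variable names are mine, and I am deliberately not quoting the exception message, since its wording varies between versions.

```python
user_count = 6

# String formatting coerces the integer for you.
formatted = "Users connected: %d" % (user_count,)
print(formatted)                       # Users connected: 6

# Concatenation does not: mixing str and int raises TypeError.
try:
    "Users connected: " + user_count
    concat_failed = False
except TypeError:
    concat_failed = True
print(concat_failed)                   # True

# Converting explicitly makes concatenation work again.
fixed = "Users connected: " + str(user_count)
print(fixed == formatted)              # True
```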
|
||||
|
||||
<p>As with <code>printf</code> in <abbr>C</abbr>, string formatting in Python is like a Swiss Army knife. There are options galore, and modifier strings to specially format many different types of values.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %f" % 50.4625</kbd> <span>①</span>
|
||||
<samp>Today's stock price: 50.462500</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Today's stock price: %.2f" % 50.4625</kbd> <span>②</span>
|
||||
<samp>Today's stock price: 50.46</samp>
|
||||
<samp class=prompt>>>> </samp><kbd>print "Change since yesterday: %+.2f" % 1.5</kbd> <span>③</span>
|
||||
<samp>Change since yesterday: +1.50</samp></pre>
|
||||
<ol>
|
||||
<li>The <code>%f</code> string formatting option treats the value as a decimal, and prints it to six decimal places.
|
||||
<li>The ".2" modifier of the <code>%f</code> option rounds the value to two decimal places.
|
||||
<li>You can even combine modifiers. Adding the <code>+</code> modifier displays a plus or minus sign before the value. Note that the ".2" modifier is still in place, and is padding the value to exactly two decimal places.
|
||||
</ol>
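<p>Here are the same three modifiers as a Python 3 sketch, storing the results in variables (the names are mine) so you can see the complete strings that come back.

```python
price_default = "Today's stock price: %f" % 50.4625     # six decimal places
price_rounded = "Today's stock price: %.2f" % 50.4625   # two decimal places
change_signed = "Change since yesterday: %+.2f" % 1.5   # explicit sign, two places

print(price_default)    # Today's stock price: 50.462500
print(price_rounded)    # Today's stock price: 50.46
print(change_signed)    # Change since yesterday: +1.50
```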
|
||||
|
||||
<h2 id=common-string-operations>Common string operations</h2>
|
||||
|
||||
<p>[FIXME is it worth keeping this section on joining lists / splitting strings? All the examples are from an old code sample that isn't used at all anymore.]
|
||||
|
||||
<p>You have a list of key-value pairs in the form <code><var>key</var>=<var>value</var></code>, and you want to join them into a single string. To join any list of strings into a single string, use the <code>join</code> method of a string object.
|
||||
|
||||
<p>Here is an example of joining a list from the <code>buildConnectionString</code> function:
|
||||
|
||||
<pre><code>return ";".join(["%s=%s" % (k, v) for k, v in params.items()])</code></pre>
|
||||
|
||||
<p>One interesting note before you continue. I keep repeating that functions are objects, strings are objects... everything
|
||||
is an object. You might have thought I meant that string <em>variables</em> are objects. But no, look closely at this example and you'll see that the string <code>";"</code> itself is an object, and you are calling its <code>join</code> method.
|
||||
<p>The <code>join</code> method joins the elements of the list into a single string, with each element separated by a semi-colon. The delimiter doesn't need to be a semi-colon; it doesn't even need to be a single character. It can be any string.
|
||||
|
||||
<!--<code>join</code> works only on lists of strings; it does not do any type coercion. Joining a list that has one or more non-string elements will raise an exception.-->
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>params = {"server":"mpilgrim", "database":"master", "uid":"sa", "pwd":"secret"}</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>["%s=%s" % (k, v) for k, v in params.items()]</kbd>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>";".join(["%s=%s" % (k, v) for k, v in params.items()])</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'</pre>
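<p>A Python 3 sketch of the same join, with the pairs sorted so that the output does not depend on dictionary ordering. It also shows that <code>join</code> does no type coercion: every element has to be a string already.

```python
params = {"server": "mpilgrim", "database": "master", "uid": "sa", "pwd": "secret"}

# Build the key=value pairs in sorted order, then glue them together.
pairs = ["%s=%s" % (k, v) for k, v in sorted(params.items())]
connection_string = ";".join(pairs)
print(connection_string)    # database=master;pwd=secret;server=mpilgrim;uid=sa

# The delimiter can be any string, not just one character.
print(" and ".join(["a", "b", "c"]))    # a and b and c

# Non-string elements are rejected rather than converted.
try:
    ";".join([1, 2, 3])
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)          # True
```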
|
||||
|
||||
<p>This string is then returned from the <code>buildConnectionString</code> function and printed by the calling block, which gives you the output that you marveled at when you started reading this chapter.
|
||||
|
||||
<p>You're probably wondering if there's an analogous method to split a string into a list. And of course there is, and it's called <code>split</code>.
|
||||
|
||||
<pre class=screen>
|
||||
<samp class=prompt>>>> </samp><kbd>li = ['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s = ";".join(li)</kbd>
|
||||
<samp class=prompt>>>> </samp><kbd>s</kbd>
|
||||
'server=mpilgrim;uid=sa;database=master;pwd=secret'
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";")</kbd> <span>①</span>
|
||||
['server=mpilgrim', 'uid=sa', 'database=master', 'pwd=secret']
|
||||
<samp class=prompt>>>> </samp><kbd>s.split(";", 1)</kbd> <span>②</span>
|
||||
['server=mpilgrim', 'uid=sa;database=master;pwd=secret']</pre>
|
||||
<ol>
|
||||
<li><code>split</code> reverses <code>join</code> by splitting a string into a multi-element list. Note that the delimiter (“<code>;</code>”) is stripped out completely; it does not appear in any of the elements of the returned list.
|
||||
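<li><code>split</code> takes an optional second argument, which is the maximum number of times to split. Here the string is split only at the first semicolon, so the returned list has two elements and the rest of the string is left intact in the second one. (“Oooooh, optional arguments...” You'll learn how to do this in your own functions in the next chapter.)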
|
||||
</ol>
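<p>Putting the two forms of <code>split</code> together, here is a minimal sketch that turns the joined string back into a dictionary. The loop and the <code>params</code> name are illustrative assumptions, not code from <code>odbchelper</code>. Splitting each pair on <code>"="</code> at most once means a value that itself contains an equals sign would stay in one piece.

```python
s = "server=mpilgrim;uid=sa;database=master;pwd=secret"

params = {}
for pair in s.split(";"):
    # Split on the first "=" only: everything before it is the key,
    # everything after it (even if it contains "=") is the value.
    key, value = pair.split("=", 1)
    params[key] = value

print(params)
```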
|
||||
|
||||
<!--<code><var>anystring</var>.<code>split</code>(<var>delimiter</var>, 1)</code> is a useful technique when you want to search a string for a substring and then work with everything before the substring (which ends up in the first element of the returned list) and everything after it (which ends up in the second element).-->
|
||||
|
||||
<h2 id=string-module>The <code>string</code> module</h2>
|
||||
|
||||
<p>[FIXME is this worth keeping? The module still exists in 3.0; check if it's going away in 3.1 or something.]
|
||||
|
||||
<p>When I first learned Python, I expected <code>join</code> to be a method of a list, which would take the delimiter as an argument. Many people feel the same way, and there's a story behind the <code>join</code> method. Prior to Python 1.6, strings didn't have all these useful methods. There was a separate <code>string</code> module that contained all the string functions; each function took a string as its first argument. The functions were deemed important enough to put onto the strings themselves, which made sense for functions like <code>lower</code>, <code>upper</code>, and <code>split</code>. But many hard-core Python programmers objected to the new <code>join</code> method, arguing that it should be a method of the list instead, or that it shouldn't move at all but simply stay a part of the old <code>string</code> module (which still has a lot of useful stuff in it). In Python 3 the argument is settled: the old <code>string.join</code> function no longer exists, so the <code>join</code> method is the only way to do it. You will still see <code>string.join</code> in code written for Python 2.
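<p>A small sketch of where things stand in Python 3, assuming a standard installation: the <code>string</code> module still imports and still carries useful constants such as <code>ascii_lowercase</code>, but it has no <code>join</code> function, so joining goes through the string method. The variable names are illustrative.

```python
import string

li = ["server=mpilgrim", "uid=sa"]

# The string method is the way to join in Python 3.
joined = ";".join(li)

# The module-level function from Python 2 is gone...
has_old_join = hasattr(string, "join")

# ...but the module still holds useful constants.
sample_constant = string.ascii_lowercase

print(joined)
print(has_old_join)
print(sample_constant)
```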
|
||||
|
||||
<h2 id=byte-arrays>Strings vs. bytes</h2>
|
||||
|
||||
<h2 id=furtherreading>Further reading</h2>
|
||||
|
||||
<p>FIXME proper links
|
||||
|
||||
<pre>
|
||||
http://docs.python.org/dev/3.0/howto/unicode.html - Unicode HOWTO
|
||||
http://docs.python.org/dev/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit - changes in Python 3
|
||||
http://blog.whatwg.org/the-road-to-html-5-character-encoding
|
||||
http://www.joelonsoftware.com/articles/Unicode.html
|
||||
http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
|
||||
http://www.tbray.org/ongoing/When/200x/2003/04/13/Strings
|
||||
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
|
||||
http://www.w3.org/People/Dürst/papers.html
|
||||
http://rishida.net/scripts/chinese/
|
||||
</pre>
|
||||
|
||||
<p class=c>© 2001–4, 2009 <span>ℳ</span>ark Pilgrim • <a href=about.html>open standards • open content • open source</a>
|
||||
<script src=jquery.js></script>
|
||||
<script src=dip3.js></script>
|
||||
+54
-24
@@ -15,7 +15,7 @@ ul{list-style:none;margin:0;padding:0}
|
||||
ul li ol{margin:0;padding:0 0 0 2.5em}
|
||||
</style>
|
||||
</head>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> Dive Into Python 3 <span>‣</span>
|
||||
<h1>Table of contents</h1>
|
||||
<ol start=0>
|
||||
@@ -47,34 +47,64 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
|
||||
<li><a href=your-first-python-program.html#furtherreading>Further reading</a>
|
||||
</ol>
|
||||
<li id=native-python-datatypes><a href=native-datatypes.html>Native Python datatypes</a>
|
||||
<ol>
|
||||
<li><a href=native-datatypes.html#divingin>Diving in</a>
|
||||
<li><a href=native-datatypes.html#booleans>Booleans</a>
|
||||
<li><a href=native-datatypes.html#numbers>Numbers</a>
|
||||
<ol>
|
||||
<li><a href=native-datatypes.html#divingin>Diving in</a>
|
||||
<li><a href=native-datatypes.html#booleans>Booleans</a>
|
||||
<li><a href=native-datatypes.html#numbers>Numbers</a>
|
||||
<li><a href=native-datatypes.html#lists>Lists</a>
|
||||
<li><a href=native-datatypes.html#number-coercion>Coercing integers to floats and vice-versa</a>
|
||||
<li><a href=native-datatypes.html#common-numerical-operations>Common numerical operations</a>
|
||||
<li><a href=native-datatypes.html#fractions>Fractions</a>
|
||||
<li><a href=native-datatypes.html#trig>Trigonometry</a>
|
||||
<li><a href=native-datatypes.html#numbers-in-a-boolean-context>Numbers in a boolean context</a>
|
||||
</ol>
|
||||
<li><a href=native-datatypes.html#lists>Lists</a>
|
||||
<ol>
|
||||
<li><a href=native-datatypes.html#creatinglists>Creating a list</a>
|
||||
<li><a href=native-datatypes.html#slicinglists>Slicing a list</a>
|
||||
<li><a href=native-datatypes.html#extendinglists>Adding items to a list</a>
|
||||
<li><a href=native-datatypes.html#searchinglists>Searching for values in a list</a>
|
||||
<li><a href=native-datatypes.html#lists-in-a-boolean-context>Lists in a boolean context</a>
|
||||
</ol>
|
||||
<!--
|
||||
<li><a href=native-datatypes.html#sets>Sets</a>
|
||||
-->
|
||||
<li><a href=native-datatypes.html#dictionaries>Dictionaries</a>
|
||||
<li><a href=native-datatypes.html#none><code>None</code></a>
|
||||
<li><a href=native-datatypes.html#furtherreading>Further reading</a>
|
||||
</ol>
|
||||
<li>Strings
|
||||
<ol>
|
||||
<li>There ain't no such thing as plain text
|
||||
<li><a href=native-datatypes.html#sets>Sets</a>
|
||||
<ol>
|
||||
<li>A brief history of character encoding
|
||||
<li>What's a character?
|
||||
<li>How strings are stored in memory
|
||||
<li>Converting between different character encodings
|
||||
<li>Creating a new set
|
||||
<li>Modifying a set
|
||||
<li>Deleting items from a set
|
||||
<li>Common operations on sets (union, intersection, and difference)
|
||||
<li>Frozen sets
|
||||
</ol>
|
||||
<li>Formatting strings
|
||||
<li>What's my string?
|
||||
<li>Lists and strings
|
||||
<li>Historical note on the string module
|
||||
<li>Byte streams
|
||||
<li>Summary
|
||||
-->
|
||||
<li><a href=native-datatypes.html#dictionaries>Dictionaries</a>
|
||||
<ol>
|
||||
<li><a href=native-datatypes.html#creating-dictionaries>Creating a dictionary</a>
|
||||
<li><a href=native-datatypes.html#modifying-dictionaries>Modifying a dictionary</a>
|
||||
<li><a href=native-datatypes.html#mixed-value-dictionaries>Mixed-value dictionaries</a>
|
||||
<li><a href=native-datatypes.html#dictionaries-in-a-boolean-context>Dictionaries in a boolean context</a>
|
||||
</ol>
|
||||
<li><a href=native-datatypes.html#none><code>None</code></a>
|
||||
<ol>
|
||||
<li><a href=native-datatypes.html#none-in-a-boolean-context><code>None</code> in a boolean context</a>
|
||||
</ol>
|
||||
<li><a href=native-datatypes.html#furtherreading>Further reading</a>
|
||||
</ol>
|
||||
<li id=strings><a href=strings.html>Strings</a>
|
||||
<ol>
|
||||
<li><a href=strings.html#divingin>Diving in</a>
|
||||
<li><a href=strings.html#one-ring-to-rule-them-all>Unicode</a>
|
||||
<ol>
|
||||
<li>How strings are stored in memory
|
||||
<li>Converting between different character encodings
|
||||
<li><a href=strings.html#py-encoding>Specifying character encoding in <code>.py</code> files</a>
|
||||
</ol>
|
||||
<li>Strings in Python 3
|
||||
<li>Common string operations
|
||||
<li>Formatting strings
|
||||
<li><a href=strings.html#string-module>The <code>string</code> module</a>
|
||||
<li><a href=strings.html#byte-arrays>Strings vs. bytes</a>
|
||||
<li><a href=strings.html#furtherreading>Further reading</a>
|
||||
</ol>
|
||||
<li id=regular-expressions><a href=regular-expressions.html>Regular expressions</a>
|
||||
<ol>
|
||||
<li><a href=regular-expressions.html#divingin>Diving in</a>
|
||||
|
||||
+1
-1
@@ -12,7 +12,7 @@ body{counter-reset:h1 7}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=root value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#unit-testing>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Unit testing</h1>
|
||||
<blockquote class=q>
|
||||
|
||||
@@ -12,7 +12,7 @@ body{counter-reset:h1 1}
|
||||
</style>
|
||||
</head>
|
||||
<p class=skip><a href=#divingin>skip to main content</a>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<form action=http://www.google.com/cse><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8> <input name=q size=31> <input type=submit name=sa value=Search></div></form>
|
||||
<p class=nav>You are here: <a href=/>Home</a> <span>‣</span> <a href=table-of-contents.html#your-first-python-program>Dive Into Python 3</a> <span>‣</span>
|
||||
<h1>Your first Python program</h1>
|
||||
<blockquote class=q>
|
||||
|
||||