several subsections of "lists" in native-datatypes, more js fiddling

Mark Pilgrim
2009-02-12 18:04:02 -05:00
parent 3c39352a32
commit 63b7a47c55
11 changed files with 7104 additions and 2947 deletions
+126 -129
@@ -1,106 +1,105 @@
<!DOCTYPE html>
<html lang="en">
<html lang=en>
<head>
<meta charset="utf-8">
<meta charset=utf-8>
<title>Case study: porting chardet to Python 3 - Dive into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
body{counter-reset:h1 20}
</style>
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
<p class="nav">You are here: <a href="/">Home</a> <span>&#8227;</span> <a href="table-of-contents.html">Dive Into Python 3</a> <span>&#8227;</span>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Case study: porting <code>chardet</code> to Python 3</h1>
<blockquote class="q">
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <cite>Rosencrantz and Guildenstern are Dead</cite>
<blockquote class=q>
<p><span>&#x275D;</span> Words, words. They&#8217;re all we have to go on. <span>&#x275E;</span><br>&mdash; <cite>Rosencrantz and Guildenstern are Dead</cite>
</blockquote>
<ol>
<li><a href="#divingin">Introducing <code class="filename">chardet</code></a>
<li><a href=#divingin>Introducing <code class=filename>chardet</code></a>
<ol>
<li><a href="#faq.what">What is character encoding auto-detection?</a>
<li><a href="#faq.impossible">Isn&#8217;t that impossible?</a>
<li><a href="#faq.who">Who wrote this detection algorithm?</a>
<li><a href="#faq.yippie">Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</a>
<li><a href="#faq.why">Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</a>
<li><a href=#faq.what>What is character encoding auto-detection?</a>
<li><a href=#faq.impossible>Isn&#8217;t that impossible?</a>
<li><a href=#faq.who>Who wrote this detection algorithm?</a>
<li><a href=#faq.yippie>Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</a>
<li><a href=#faq.why>Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</a>
</ol>
<li><a href="#divingin2">Diving in</a>
<li><a href=#divingin2>Diving in</a>
<ol>
<li><a href="#how.bom"><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></a>
<li><a href="#how.esc">Escaped encodings</a>
<li><a href="#how.mb">Multi-byte encodings</a>
<li><a href="#how.sb">Single-byte encodings</a>
<li><a href="#how.windows1252"><code>windows-1252</code></a>
<li><a href=#how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></a>
<li><a href=#how.esc>Escaped encodings</a>
<li><a href=#how.mb>Multi-byte encodings</a>
<li><a href=#how.sb>Single-byte encodings</a>
<li><a href=#how.windows1252><code>windows-1252</code></a>
</ol>
<li><a href="#running2to3">Running <code class="filename">2to3</code></a>
<li><a href="#manual">Fixing what <code class="filename">2to3</code> can&#8217;t</a>
<li><a href=#running2to3>Running <code class=filename>2to3</code></a>
<li><a href=#manual>Fixing what <code class=filename>2to3</code> can&#8217;t</a>
<ol>
<li><a href="#falseisinvalidsyntax"><code>False</code> is invalid syntax</a>
<li><a href="#nomodulenamedconstants">No module named <code class="filename">constants</code></a>
<li><a href="#namefileisnotdefined">Name '<var>file</var>' is not defined</a>
<li><a href="#cantuseastringpattern">Can&#8217;t use a string pattern on a bytes-like object</a>
<li><a href="#cantconvertbytesobject">Can&#8217;t convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
<li><a href=#falseisinvalidsyntax><code>False</code> is invalid syntax</a>
<li><a href=#nomodulenamedconstants>No module named <code class=filename>constants</code></a>
<li><a href=#namefileisnotdefined>Name '<var>file</var>' is not defined</a>
<li><a href=#cantuseastringpattern>Can&#8217;t use a string pattern on a bytes-like object</a>
<li><a href=#cantconvertbytesobject>Can&#8217;t convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
</ol>
</ol>
<h2 id="divingin">Introducing <code class="filename">chardet</code>: a mini-FAQ</h2>
<p class="fancy">When you think of &#8220;text,&#8221; you probably think of &#8220;characters and symbols I see on my computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it&#8217;s &#8220;text&#8221;, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
<h3 id="faq.what">What is character encoding auto-detection?</h3>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It&#8217;s like cracking a code when you don&#8217;t have the decryption key.
<h3 id="faq.impossible">Isn&#8217;t that impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds &#8220;txzqJv 2!dasd0a QqdKjvz&#8221; will instantly recognize that that isn&#8217;t English (even though it is composed entirely of English letters). By studying lots of &#8220;typical&#8221; text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text&#8217;s language.
<h2 id=divingin>Introducing <code class=filename>chardet</code>: a mini-FAQ</h2>
<p class=fancy>When you think of &#8220;text,&#8221; you probably think of &#8220;characters and symbols I see on my computer screen.&#8221; But computers don&#8217;t deal in characters and symbols; they deal in bits and bytes. Every piece of text you&#8217;ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
<p>In reality, it&#8217;s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it&#8217;s &#8220;text&#8221;, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
<h3 id=faq.what>What is character encoding auto-detection?</h3>
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It&#8217;s like cracking a code when you don&#8217;t have the decryption key.
<h3 id=faq.impossible>Isn&#8217;t that impossible?</h3>
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds &#8220;txzqJv 2!dasd0a QqdKjvz&#8221; will instantly recognize that that isn&#8217;t English (even though it is composed entirely of English letters). By studying lots of &#8220;typical&#8221; text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text&#8217;s language.
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
<h3 id="faq.who">Who wrote this detection algorithm?</h3>
<p>This library is a port of <a href="http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/">the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors&#8217; comments, which are quite extensive and informative.
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href="http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html">A composite approach to language/encoding detection</a>.
<h3 id="faq.yippie">Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</h3>
<p>Don&#8217;t do that. Virtually every format and protocol contains a method for specifying character encoding.
<h3 id=faq.who>Who wrote this detection algorithm?</h3>
<p>This library is a port of <a href=http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/>the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors&#8217; comments, which are quite extensive and informative.
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href=http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html>A composite approach to language/encoding detection</a>.
<h3 id=faq.yippie>Yippie! Screw the standards, I&#8217;ll just auto-detect everything!</h3>
<p>Don&#8217;t do that. Virtually every format and protocol contains a method for specifying character encoding.
<ul>
<li>HTTP can define a <code>charset</code> parameter in the <code>Content-type</code> header.
<li>HTML documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
<li>XML documents can define an <code>encoding</code> attribute in the XML prolog.
</ul>
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>Despite the complexity, it&#8217;s worthwhile to follow standards and <a href="http://www.w3.org/2001/tag/doc/mime-respect">respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
<h3 id="faq.why">Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn&#8217;t work. There are also some poorly designed standards that have no way to specify encoding at all.
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href="http://feedparser.org/">Universal Feed Parser</a>, which calls this auto-detection library <a href="http://feedparser.org/docs/character-encoding.html">only after exhausting all other options</a>.
<h2 id="divingin2">Diving in</h2>
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>Despite the complexity, it&#8217;s worthwhile to follow standards and <a href=http://www.w3.org/2001/tag/doc/mime-respect>respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
<h3 id=faq.why>Why bother with auto-detection if it&#8217;s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn&#8217;t work. There are also some poorly designed standards that have no way to specify encoding at all.
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href=http://feedparser.org/>Universal Feed Parser</a>, which calls this auto-detection library <a href=http://feedparser.org/docs/character-encoding.html>only after exhausting all other options</a>.
<h2 id=divingin2>Diving in</h2>
<p>This is a brief guide to navigating the code itself.
<p>The main entry point for the detection algorithm is <code class="filename">universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code class="filename">chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>The main entry point for the detection algorithm is <code class=filename>universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code class=filename>chardet/__init__.py</code>, but that&#8217;s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
<ol>
<li><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr>. This includes <code>UTF-8</code>, both <abbr title="Big Endian">BE</abbr> and <abbr title="Little Endian">LE</abbr> variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr title="Byte Order Mark">BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr>. This includes <code>UTF-8</code>, both <abbr title="Big Endian">BE</abbr> and <abbr title="Little Endian">LE</abbr> variants of <code>UTF-16</code>, and all 4 byte-order variants of <code>UTF-32</code>.
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr title="Byte Order Mark">BOM</abbr>.
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn&#8217;t know a character encoding from a hole in the ground.
</ol>
<h3 id="how.bom"><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
<h3 id="how.esc">Escaped encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code class="filename">escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code class="filename">escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
<h3 id="how.mb">Multi-byte encodings</h3>
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code class="filename">mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code class="filename">mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code class="filename">chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code class="filename">sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code class="filename">jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
<h3 id="how.sb">Single-byte encodings</h3>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code class="filename">sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code class="filename">sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code class="filename">hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored "<span class="quote">backwards</span>" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
<h3 id="how.windows1252"><code>windows-1252</code></h3>
<p>If <code>UniversalDetector</code> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <code>Latin1Prober</code> (defined in <code class="filename">latin1prober.py</code>) to try to detect English text in a <code>windows-1252</code> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <code>windows-1252</code> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <code>Latin1Prober</code> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
<h2 id="running2to3">Running <code class="filename">2to3</code></h2>
<p>We&#8217;re going to migrate the <code class="filename">chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code class="filename">2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href="porting-code-to-python-3-with-2to3.html">Porting code to Python 3 with <code class="filename">2to3</code></a>. In this chapter, we&#8217;ll start by running <code class="filename">2to3</code> on the <code class="filename">chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code class="filename">chardet</code> package is split across several different files, all in the same directory. The <code class="filename">2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code class="filename">2to3</code> will convert each of the files in turn.
<p class="skip"><a href="#skip2to3output">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<h3 id=how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that&#8217;s what it&#8217;s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
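<p>As a rough illustration of the idea (not the code in <code>UniversalDetector</code> itself), a <abbr title="Byte Order Mark">BOM</abbr> check can be sketched with the constants in the standard <code>codecs</code> module. The function name <code>sniff_bom</code> and the returned labels are assumptions made for this example, and only the two standard byte orders of <code>UTF-32</code> are covered.

```python
import codecs

# Check the 4-byte UTF-32 BOMs first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data):
    """Return an encoding label if data starts with a known BOM, else None.

    Hypothetical helper for illustration; the labels are this example's
    own choice, not necessarily what chardet reports.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

<p>The ordering of the table matters more than anything else here: testing for <code>UTF-16</code> before <code>UTF-32</code> would misreport little-endian <code>UTF-32</code> text.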
<h3 id=how.esc>Escaped encodings</h3>
<p>If the text contains a recognizable escape sequence that might indicate an escaped encoding, <code>UniversalDetector</code> creates an <code>EscCharSetProber</code> (defined in <code class=filename>escprober.py</code>) and feeds it the text.
<p><code>EscCharSetProber</code> creates a series of state machines, based on models of <code>HZ-GB-2312</code>, <code>ISO-2022-CN</code>, <code>ISO-2022-JP</code>, and <code>ISO-2022-KR</code> (defined in <code class=filename>escsm.py</code>). <code>EscCharSetProber</code> feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, <code>EscCharSetProber</code> immediately returns the positive result to <code>UniversalDetector</code>, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
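<p>The &#8220;feed every machine one byte at a time, drop the ones that fail&#8221; loop can be sketched as follows. This is a toy under stated assumptions: <code>EscapeMachine</code> and <code>probe_escaped</code> are hypothetical names, each machine only recognizes a small subset of its encoding&#8217;s escape sequences, and the real models in <code class=filename>escsm.py</code> are table-driven and far more thorough.

```python
START, ERROR, FOUND = "start", "error", "found"


class EscapeMachine:
    """Toy stand-in for one escaped-encoding state machine (hypothetical)."""

    def __init__(self, name, sequences):
        self.name = name
        self.sequences = sequences  # escape sequences this machine accepts
        self.pending = b""          # bytes seen since the last ESC

    def feed(self, byte):
        # Outside an escape sequence, plain bytes tell us nothing.
        if not self.pending and byte != 0x1B:
            return START
        self.pending += bytes([byte])
        if self.pending in self.sequences:
            self.pending = b""
            return FOUND
        if any(seq.startswith(self.pending) for seq in self.sequences):
            return START  # still a valid prefix, keep reading
        self.pending = b""
        return ERROR      # an escape sequence this encoding does not use


def probe_escaped(data):
    """Return the first encoding whose machine matches, or None."""
    machines = [
        # Only a few well-known sequences per encoding, for illustration.
        EscapeMachine("ISO-2022-JP", (b"\x1b$B", b"\x1b$@", b"\x1b(J")),
        EscapeMachine("ISO-2022-KR", (b"\x1b$)C",)),
    ]
    for byte in data:
        for machine in list(machines):  # copy, so removal is safe
            state = machine.feed(byte)
            if state == FOUND:
                return machine.name
            if state == ERROR:
                machines.remove(machine)  # dropped; the others carry on
    return None
```

<p>Feeding <code>b"\x1b$)C"</code> shows the drop in action: the Japanese machine fails on the third byte and is removed, and the Korean machine goes on to match.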
<h3 id=how.mb>Multi-byte encodings</h3>
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of &#8220;probers&#8221; for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code class=filename>mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
<p>Most of the multi-byte encoding probers are inherited from <code>MultiByteCharSetProber</code> (defined in <code class=filename>mbcharsetprober.py</code>), and simply hook up the appropriate state machine and distribution analyzer and let <code>MultiByteCharSetProber</code> do the rest of the work. <code>MultiByteCharSetProber</code> runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, <code>MultiByteCharSetProber</code> feeds the text to an encoding-specific distribution analyzer.
<p>The distribution analyzers (each defined in <code class=filename>chardistribution.py</code>) use language-specific models of which characters are used most frequently. Once <code>MultiByteCharSetProber</code> has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, <code>MultiByteCharSetProber</code> returns the result to <code>MBCSGroupProber</code>, which returns it to <code>UniversalDetector</code>, which returns it to the caller.
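<p>To make the three inputs concrete, here is one plausible way to combine them into a confidence value. It is a sketch, not a transcription of <code class=filename>chardistribution.py</code>: the function name, the <code>min_chars</code> threshold, and the 0.01 and 0.99 bounds are all assumptions chosen for the example.

```python
def distribution_confidence(freq_chars, total_chars, typical_ratio, min_chars=4):
    """Estimate confidence from character-frequency counts (illustrative).

    freq_chars    -- characters seen that the language model calls frequent
    total_chars   -- all characters seen by the analyzer
    typical_ratio -- language-specific ratio of frequent to rare characters
    """
    # Too little data: report a very low confidence instead of guessing.
    if total_chars <= 0 or freq_chars < min_chars:
        return 0.01
    rare_chars = total_chars - freq_chars
    if rare_chars == 0:
        return 0.99
    # Compare the observed frequent/rare ratio with the typical one,
    # capping the result so it never claims absolute certainty.
    return min(freq_chars / (rare_chars * typical_ratio), 0.99)
```

<p>With a typical ratio of 0.75, a text where 90 of 100 characters are frequent hits the 0.99 cap, while one where only 10 of 100 are frequent scores about 0.15.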
<p>The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between <code>EUC-JP</code> and <code>SHIFT_JIS</code>, so the <code>SJISProber</code> (defined in <code class=filename>sjisprober.py</code>) also uses 2-character distribution analysis. <code>SJISContextAnalysis</code> and <code>EUCJPContextAnalysis</code> (both defined in <code class=filename>jpcntx.py</code> and both inheriting from a common <code>JapaneseContextAnalysis</code> class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to <code>SJISProber</code>, which checks both analyzers and returns the higher confidence level to <code>MBCSGroupProber</code>.
<h3 id=how.sb>Single-byte encodings</h3>
<p>The single-byte encoding prober, <code>SBCSGroupProber</code> (defined in <code class=filename>sbcsgroupprober.py</code>), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: <code>windows-1251</code>, <code>KOI8-R</code>, <code>ISO-8859-5</code>, <code>MacCyrillic</code>, <code>IBM855</code>, and <code>IBM866</code> (Russian); <code>ISO-8859-7</code> and <code>windows-1253</code> (Greek); <code>ISO-8859-5</code> and <code>windows-1251</code> (Bulgarian); <code>ISO-8859-2</code> and <code>windows-1250</code> (Hungarian); <code>TIS-620</code> (Thai); <code>windows-1255</code> and <code>ISO-8859-8</code> (Hebrew).
<p><code>SBCSGroupProber</code> feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, <code>SingleByteCharSetProber</code> (defined in <code class=filename>sbcharsetprober.py</code>), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. <code>SingleByteCharSetProber</code> processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
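<p>The tallying step can be sketched in a few lines. This is a simplified illustration, not the real <code>SingleByteCharSetProber</code>: the function name is hypothetical, the language model is reduced to a plain set of frequent 2-byte sequences, and the total is counted in sequences rather than characters.

```python
def bigram_confidence(data, frequent_bigrams, typical_ratio):
    """Hypothetical 2-character sequence scoring (illustrative only).

    data             -- the bytes to analyze
    frequent_bigrams -- set of 2-byte sequences common in the language
    typical_ratio    -- language-specific share of frequent sequences
    """
    # Slide a 2-byte window across the data.
    bigrams = [data[i:i + 2] for i in range(len(data) - 1)]
    if not bigrams:
        return 0.0
    # Count how many of the observed sequences the model considers frequent.
    hits = sum(1 for bigram in bigrams if bigram in frequent_bigrams)
    return min(0.99, hits / len(bigrams) / typical_ratio)
```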
<p>Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, <code>HebrewProber</code> (defined in <code class=filename>hebrewprober.py</code>) tries to distinguish between Visual Hebrew (where the source text is actually stored "<span class=quote>backwards</span>" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (<code>windows-1255</code> for Logical Hebrew, or <code>ISO-8859-8</code> for Visual Hebrew).
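<p>The direction guess can be illustrated with a toy version of the idea. This sketch works on already-decoded text and simply counts where the five final-form letters appear; the function name and the voting rule are assumptions, and the real <code>HebrewProber</code> works on raw bytes with a more careful scoring scheme.

```python
# Final-form Hebrew letters: kaf, mem, nun, pe, tsadi.
FINAL_LETTERS = set('\u05da\u05dd\u05df\u05e3\u05e5')

def guess_hebrew_encoding(text):
    """Hypothetical direction guess based on final-form letters (illustrative only)."""
    logical_votes = 0
    visual_votes = 0
    for word in text.split():
        # In reading order, a final-form letter sits at the end of a word.
        if word[-1] in FINAL_LETTERS:
            logical_votes += 1
        # In text stored backwards, the same letter shows up at the start.
        if word[0] in FINAL_LETTERS:
            visual_votes += 1
    # Ties fall back to Logical Hebrew; that default is an arbitrary choice here.
    return 'ISO-8859-8' if visual_votes > logical_votes else 'windows-1255'
```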
<h3 id=how.windows1252><code>windows-1252</code></h3>
<p>If <code>UniversalDetector</code> detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a <code>Latin1Prober</code> (defined in <code class=filename>latin1prober.py</code>) to try to detect English text in a <code>windows-1252</code> encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish <code>windows-1252</code> is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. <code>Latin1Prober</code> automatically reduces its confidence rating to allow more accurate probers to win if at all possible.
<h2 id=running2to3>Running <code class=filename>2to3</code></h2>
<p>We&#8217;re going to migrate the <code class=filename>chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code class=filename>2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href=porting-code-to-python-3-with-2to3.html>Porting code to Python 3 with <code class=filename>2to3</code></a>. In this chapter, we&#8217;ll start by running <code class=filename>2to3</code> on the <code class=filename>chardet</code> package, but as you&#8217;ll see, there will still be a lot of work to do after the automated tools have performed their magic.
<p>The main <code class=filename>chardet</code> package is split across several different files, all in the same directory. The <code class=filename>2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code class=filename>2to3</code> will convert each of the files in turn.
<p class=skip><a href=#skip2to3output>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w chardet\</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
@@ -566,9 +565,9 @@ RefactoringTool: chardet\sbcsgroupprober.py
RefactoringTool: chardet\sjisprober.py
RefactoringTool: chardet\universaldetector.py
RefactoringTool: chardet\utf8prober.py</samp></pre>
<p id="skip2to3output">Now run the <code class="filename">2to3</code> script on the testing harness, <code class="filename">test.py</code>.
<p class="skip"><a href="#skip2to3outputtest">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
<p id=skip2to3output>Now run the <code class=filename>2to3</code> script on the testing harness, <code class=filename>test.py</code>.
<p class=skip><a href=#skip2to3outputtest>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python c:\Python30\Tools\Scripts\2to3.py -w test.py</kbd>
<samp>RefactoringTool: Skipping implicit fixer: buffer
RefactoringTool: Skipping implicit fixer: idioms
RefactoringTool: Skipping implicit fixer: set_literal
@@ -598,21 +597,21 @@ RefactoringTool: Skipping implicit fixer: ws_comma
+print(count, 'tests')
RefactoringTool: Files that were modified:
RefactoringTool: test.py</samp></pre>
<p id="skip2to3outputtest">Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<h2 id="manual">Fixing what <code class="filename">2to3</code> can&#8217;t</h2>
<h3 id="falseisinvalidsyntax"><code>False</code> is invalid syntax</h3>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
<p class="skip"><a href="#skipinvalidsyntax">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class="traceback">Traceback (most recent call last):
<p id=skip2to3outputtest>Well, that wasn&#8217;t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it&#8217;ll work?
<h2 id=manual>Fixing what <code class=filename>2to3</code> can&#8217;t</h2>
<h3 id=falseisinvalidsyntax><code>False</code> is invalid syntax</h3>
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it&#8217;s a good way to test our ported code to make sure there aren&#8217;t any bugs lurking anywhere.
<p class=skip><a href=#skipinvalidsyntax>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 1, in &lt;module>
from chardet.universaldetector import UniversalDetector
File "C:\home\chardet\chardet\universaldetector.py", line 51
self.done = constants.False
^
SyntaxError: invalid syntax</samp></pre>
<p id="skipinvalidsyntax">Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can&#8217;t use it as a variable name. Let&#8217;s look at <code class="filename">constants.py</code> to see where it&#8217;s defined. Here&#8217;s the original version from <code class="filename">constants.py</code>, before the <code class="filename">2to3</code> script changed it:
<p class="skip"><a href="#skipbuiltincode">skip over this</a>
<p id=skipinvalidsyntax>Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can&#8217;t use it as a variable name. Let&#8217;s look at <code class=filename>constants.py</code> to see where it&#8217;s defined. Here&#8217;s the original version from <code class=filename>constants.py</code>, before the <code class=filename>2to3</code> script changed it:
<p class=skip><a href=#skipbuiltincode>skip over this</a>
<pre><code>import __builtin__
if not hasattr(__builtin__, 'False'):
False = 0
@@ -620,84 +619,84 @@ if not hasattr(__builtin__, 'False'):
else:
False = __builtin__.False
True = __builtin__.True</code></pre>
<p id="skipbuiltincode">This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>Boolean</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
<p>However, Python 3 will always have a <code>Boolean</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code class="filename">constants.py</code>.
<p>So this line in <code class="filename">universaldetector.py</code>:
<p id=skipbuiltincode>This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>Boolean</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
<p>However, Python 3 will always have a <code>Boolean</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code class=filename>constants.py</code>.
<p>So this line in <code class=filename>universaldetector.py</code>:
<pre><code>self.done = constants.False</code></pre>
<p>Becomes
<pre><code>self.done = False</code></pre>
<p>Ah, wasn&#8217;t that satisfying? The code is shorter and more readable already.
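<p>If you want to see for yourself why the original line could not survive, a quick check along these lines should do it under Python 3. It uses only the standard <code>keyword</code> module and the built-in <code>compile()</code> function; the variable names are just for illustration.

```python
import keyword

# In Python 3, False is a keyword, not an ordinary name.
is_reserved = keyword.iskeyword('False')

# Because it is a keyword, it cannot be used as an attribute name either,
# so the old line fails before any code runs.
try:
    compile('self.done = constants.False', '<example>', 'exec')
    old_line_compiles = True
except SyntaxError:
    old_line_compiles = False

# The replacement line is perfectly ordinary Python 3.
compile('self.done = False', '<example>', 'exec')

print(is_reserved, old_line_compiles)
```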
<h3 id="nomodulenamedconstants">No module named <code class="filename">constants</code></h3>
<p>Time to run <code class="filename">test.py</code> again and see how far it gets.
<p class="skip"><a href="#skipnomodulenamedconstants">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class="traceback">Traceback (most recent call last):
<h3 id=nomodulenamedconstants>No module named <code class=filename>constants</code></h3>
<p>Time to run <code class=filename>test.py</code> again and see how far it gets.
<p class=skip><a href=#skipnomodulenamedconstants>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 1, in &lt;module>
from chardet.universaldetector import UniversalDetector
File "C:\home\chardet\chardet\universaldetector.py", line 29, in &lt;module>
import constants, sys
ImportError: No module named constants</samp></pre>
<p id="skipnomodulenamedconstants">What&#8217;s that you say? No module named <code class="filename">constants</code>? Of course there&#8217;s a module named <code class="filename">constants</code>. ... Oh wait, no there isn&#8217;t. Remember when the <code class="filename">2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<p id=skipnomodulenamedconstants>What&#8217;s that you say? No module named <code class=filename>constants</code>? Of course there&#8217;s a module named <code class=filename>constants</code>. ... Oh wait, no there isn&#8217;t. Remember when the <code class=filename>2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
<pre><code>from . import constants</code></pre>
<p>But wait. Wasn&#8217;t the <code class="filename">2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code class="filename">constants</code> module within the library, and an absolute import of the <code class="filename">sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can&#8217;t, and the <code class="filename">2to3</code> script is not smart enough to split the import statement into two.
<p>The solution is to split the import statement manually. So this two-in-one import:
<p>But wait. Wasn&#8217;t the <code class=filename>2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code class=filename>constants</code> module within the library, and an absolute import of the <code class=filename>sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can&#8217;t, and the <code class=filename>2to3</code> script is not smart enough to split the import statement into two.
<p>The solution is to split the import statement manually. So this two-in-one import:
<pre><code>import constants, sys</code></pre>
<p>Needs to become two separate imports:
<pre><code>from . import constants
import sys</code></pre>
<p>There are variations of this problem scattered throughout the <code class="filename">chardet</code> library. In some places it&#8217;s "<code>import constants, sys</code>"; in other places, it&#8217;s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>There are variations of this problem scattered throughout the <code class=filename>chardet</code> library. In some places it&#8217;s "<code>import constants, sys</code>"; in other places, it&#8217;s "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
<p>Onward!
<h3 id="namefileisnotdefined">Name '<var>file</var>' is not defined</h3>
<h3 id=namefileisnotdefined>Name '<var>file</var>' is not defined</h3>
<p>FIXME intro
<p class="skip"><a href="#skipnamefileisnotdefined">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=skip><a href=#skipnamefileisnotdefined>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 9, in &lt;module>
for line in file(f, 'rb'):
NameError: name 'file' is not defined</samp></pre>
<p id="skipnamefileisnotdefined">This one surprised me, because I&#8217;ve been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code class="filename">io</code> module. [FIXME-LINK PEP 3116] I&#8217;ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it&#8217;s an alias for <var>io.open()</var>, but never mind that right now.)
<p id=skipnamefileisnotdefined>This one surprised me, because I&#8217;ve been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code class=filename>io</code> module. [FIXME-LINK PEP 3116] I&#8217;ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it&#8217;s an alias for <var>io.open()</var>, but never mind that right now.)
<p>Thus, the simplest solution to the problem of the missing <var>file()</var> is to call <var>open()</var> instead:
<pre><code>for line in open(f, 'rb'):</code></pre>
<p>And that&#8217;s all I have to say about that.
<h3 id="cantuseastringpattern">Can&#8217;t use a string pattern on a bytes-like object</h3>
<h3 id=cantuseastringpattern>Can&#8217;t use a string pattern on a bytes-like object</h3>
<p>FIXME intro
<p class="skip"><a href="#skipcantuseastringpattern">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=skip><a href=#skipcantuseastringpattern>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
u.feed(line)
File "C:\home\chardet\chardet\universaldetector.py", line 98, in feed
if self._highBitDetector.search(aBuf):
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
<p id="skipcantuseastringpattern">Now things are starting to get interesting. And by &#8220;interesting,&#8221; I mean &#8220;confusing as all hell.&#8221;
<p>First, let&#8217;s see what <var>self._highBitDetector</var> is. It&#8217;s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
<p class="skip"><a href="#skiphighbitdetectorcode">skip over this</a>
<p id=skipcantuseastringpattern>Now things are starting to get interesting. And by &#8220;interesting,&#8221; I mean &#8220;confusing as all hell.&#8221;
<p>First, let&#8217;s see what <var>self._highBitDetector</var> is. It&#8217;s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
<p class=skip><a href=#skiphighbitdetectorcode>skip over this</a>
<pre><code>class UniversalDetector:
def __init__(self):
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p id="skiphighbitdetectorcode">This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that&#8217;s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code class="filename">universaldetector.py</code>:
<p class="skip"><a href="#skipfeedhighbitdetectorcode">skip over this</a>
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we&#8217;re searching is not a string, it&#8217;s a byte array. Looking at the traceback, this error occurred in <code class=filename>universaldetector.py</code>:
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
<pre><code>def feed(self, aBuf):
.
.
.
if self._mInputState == ePureAscii:
if self._highBitDetector.search(aBuf):</code></pre>
<p id="skipfeedhighbitdetectorcode">And what is <var>aBuf</var>? Let&#8217;s backtrack further to a place that calls <var>UniversalDetector.feed()</var>. One place that calls it is the test harness, <code class="filename">test.py</code>.
<p class="skip"><a href="#skiptestharnessfeedcode">skip over this</a>
<p id=skipfeedhighbitdetectorcode>And what is <var>aBuf</var>? Let&#8217;s backtrack further to a place that calls <var>UniversalDetector.feed()</var>. One place that calls it is the test harness, <code class=filename>test.py</code>.
<p class=skip><a href=#skiptestharnessfeedcode>skip over this</a>
<pre><code>u = UniversalDetector()
.
.
.
for line in open(f, 'rb'):
u.feed(line)</code></pre>
<p id="skiptestharnessfeedcode">And here we find our answer: in the <var>UniversalDetector.feed()</var> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <var>UniversalDetector.feed()</var>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit... characters. But we don&#8217;t have characters; we have bytes. Oops.
<p id=skiptestharnessfeedcode>And here we find our answer: in the <var>UniversalDetector.feed()</var> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for &#8220;read&#8221;; OK, big deal, we&#8217;re reading the file. Ah, but <code>'b'</code> is for &#8220;binary.&#8221; Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <var>UniversalDetector.feed()</var>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit... characters. But we don&#8217;t have characters; we have bytes. Oops.
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
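<p>The difference between the two modes is easy to see in isolation. This sketch writes a small temporary file and reads its first line back both ways; the helper name <code>read_first_line</code> and the sample text are made up for the demonstration.

```python
import os
import tempfile

def read_first_line(path, mode, **kwargs):
    """Open the file in the given mode and return its first line."""
    with open(path, mode, **kwargs) as f:
        return f.readline()

# Write one line containing a non-ASCII character, encoded as UTF-8.
fd, path = tempfile.mkstemp()
os.write(fd, 'caf\u00e9\n'.encode('utf-8'))
os.close(fd)

as_bytes = read_first_line(path, 'rb')                  # binary mode: raw bytes
as_text = read_first_line(path, 'r', encoding='utf-8')  # text mode: decoded string
os.remove(path)

print(type(as_bytes).__name__, len(as_bytes))
print(type(as_text).__name__, len(as_text))
```

<p>The binary read is one item longer than the text read, because the accented letter is a single character but takes two bytes in UTF-8.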
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:
<p>Once you realize that, the solution is not difficult. Regular expressions defined with strings can search strings. Regular expressions defined with byte arrays can search byte arrays. To define a byte array pattern, we simply change the type of the argument we use to define the regular expression to a byte array. So instead of this:
<pre><code>self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
<p>We now have this:
<pre><code>self._highBitDetector = re.compile(b'[\x80-\xFF]')</code></pre>
@@ -705,20 +704,18 @@ for line in open(f, 'rb'):
<pre><code>self._escDetector = re.compile(r'(\033|~{)')</code></pre>
<p>Again, this is going to be used to search a byte array (the same <var>aBuf</var> variable, in fact), so the regular expression pattern needs to be defined as a byte array:
<pre><code>self._escDetector = re.compile(b'(\033|~{)')</code></pre>
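<p>Here is a small standalone check of the rule that string patterns search strings and byte patterns search bytes. The sample data is invented; the two patterns are the before and after versions of the high-bit detector shown above.

```python
import re

str_pattern = re.compile(r'[\x80-\xFF]')    # defined with a string
bytes_pattern = re.compile(b'[\x80-\xFF]')  # defined with a byte array

data = b'caf\xe9'  # bytes, as read from a file opened with 'rb'

# A string pattern refuses to search bytes.
try:
    str_pattern.search(data)
    outcome = 'searched'
except TypeError:
    outcome = 'TypeError'

# A bytes pattern searches bytes without complaint.
match = bytes_pattern.search(data)

print(outcome, match.group())
```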
<h3 id="cantconvertbytesobject">Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</h3>
<h3 id=cantconvertbytesobject>Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</h3>
<p>Curiouser and curiouser...
<p class="skip"><a href="#skipcantconvertbytesobject">skip over this</a>
<pre class="screen"><samp class="prompt">C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<p class=skip><a href=#skipcantconvertbytesobject>skip over this</a>
<pre class=screen><samp class=prompt>C:\home\chardet> </samp><kbd>python test.py tests\*\*</kbd>
<samp>tests\ascii\howto.diveintomark.org.xml</samp>
<samp class="traceback">Traceback (most recent call last):
<samp class=traceback>Traceback (most recent call last):
File "test.py", line 10, in &lt;module>
u.feed(line)
File "C:\home\chardet\chardet\universaldetector.py", line 100, in feed
elif (self._mInputState == ePureAscii) and self._escDetector.search(self._mLastChar + aBuf):
TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
<p id="skipcantconvertbytesobject">...
<p class="c">&copy; 2001-4, 2009 <span>&#x2133;</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript" src="dip3.js"></script>
</body>
</html>
<p id=skipcantconvertbytesobject>...
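<p>One plausible reading of this traceback, sketched outside of <code class=filename>chardet</code>: <var>aBuf</var> is a byte array, so the complaint must come from the other operand being a string. The variable names below are stand-ins, and the assumption that the left-hand value starts life as an empty string is a guess based on the error message, not something confirmed above.

```python
buf = b'some bytes'  # stand-in for aBuf, read from a file opened with 'rb'

# Assumption: the "last character" placeholder was initialized as a string.
last_char = ''
try:
    combined = last_char + buf
    mixed_result = 'concatenated'
except TypeError:
    # Python 3 will not silently mix strings and bytes.
    mixed_result = 'TypeError'

# If both operands are bytes, the concatenation is fine.
last_char = b''
combined = last_char + buf

print(mixed_result, combined)
```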
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
<script type=text/javascript src=jquery.js></script>
<script type=text/javascript src=dip3.js></script>
+1580 -2057
File diff suppressed because it is too large
+6 -7
@@ -1,11 +1,10 @@
/*
var LANGS = {'python2': 'Python 2', 'java': 'Java', 'perl5': 'Perl 5', 'clang': 'C'};
*/
var HIDESHOW = {'visible': 'hide', 'hidden': 'show'};
google.load("jquery", "1.3");
google.setOnLoadCallback(function() {
//google.load("jquery", "1.3");
//google.setOnLoadCallback(function() {
$(document).ready(function() {
var HS = {'visible': 'hide', 'hidden': 'show'};
/*
// toggle-able language comparisons
for (var lang in LANGS) {
@@ -26,7 +25,7 @@ $(document).ready(function() {
$("pre.code, pre.screen").each(function(i) {
this.id = "autopre" + i;
$(this).wrapInner('<div class="block"></div>');
$(this).prepend('<div class="widgets">[<a class="toggle" href="javascript:toggleCodeBlock(\'' + this.id + '\')">' + HIDESHOW['visible'] + '</a>] [<a href="javascript:plainTextOnClick(\'' + this.id + '\')">open in new window</a>]</div>');
$(this).prepend('<div class="widgets">[<a class="toggle" href="javascript:toggleCodeBlock(\'' + this.id + '\')">' + HS['visible'] + '</a>] [<a href="javascript:plainTextOnClick(\'' + this.id + '\')">open in new window</a>]</div>');
$(this).prev("p.download").each(function(i) {
$(this).next("pre").find("div.widgets").append(" " + $(this).html());
@@ -57,7 +56,7 @@ $(document).ready(function() {
});
}); /* document.ready */
}); /* google.setOnLoadCallback */
//}); /* google.setOnLoadCallback */
/*
function toggleComparisonNotes(lang) {
@@ -70,7 +69,7 @@ function toggleComparisonNotes(lang) {
function toggleCodeBlock(id) {
$("#" + id).find("div.block").toggle();
var a = $("#" + id).find("a.toggle");
a.text(a.text() == HIDESHOW['visible'] ? HIDESHOW['hidden'] : HIDESHOW['visible']);
a.text(a.text() == HS['visible'] ? HS['hidden'] : HS['visible']);
}
function plainTextOnClick(id) {
+20 -23
@@ -1,38 +1,35 @@
<!DOCTYPE html>
<html lang="en">
<html lang=en>
<head>
<meta charset="utf-8">
<meta charset=utf-8>
<title>Dive Into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
p.first{clear:both;margin-top:0;padding-top:1.75em}
ul{list-style:none}
</style>
</head>
<body>
<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8"><input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
<p class="first"><cite>Dive Into Python 3</cite> will cover Python 3 and its differences from Python 2. Compared to the original <cite><a href="http://diveintopython.org/">Dive Into Python</a></cite>, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final version will be published on paper by Apress. The book will remain online under the <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a> license.
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p class=first><cite>Dive Into Python 3</cite> will cover Python 3 and its differences from Python 2. Compared to the original <cite><a href=http://diveintopython.org/>Dive Into Python</a></cite>, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final version will be published on paper by Apress. The book will remain online under the <a rel=license href=http://creativecommons.org/licenses/by-sa/3.0/>CC-BY-SA-3.0</a> license.
<p>Here&#8217;s what I&#8217;ve written so far:</p>
<ul>
<li><a href="table-of-contents.html">Table of contents</a> (<strong>not finalized</strong>)
<li><a href="your-first-python-program.html">Chapter 1. Your first Python program</a>
<li><a href="native-datatypes.html">Chapter 2. Native datatypes</a>
<li><a href="case-study-porting-chardet-to-python-3.html">Chapter 20. Case study: porting <code>chardet</code> to Python 3</a>
<li><a href="porting-code-to-python-3-with-2to3.html">Appendix A. Porting code to Python 3 with <code>2to3</code></a>
<li><a href=table-of-contents.html>Table of contents</a> (<strong>not finalized</strong>)
<li><a href=your-first-python-program.html>Chapter 1. Your first Python program</a>
<li><a href=native-datatypes.html>Chapter 2. Native datatypes</a>
<li><a href=case-study-porting-chardet-to-python-3.html>Chapter 20. Case study: porting <code>chardet</code> to Python 3</a>
<li><a href=porting-code-to-python-3-with-2to3.html>Appendix A. Porting code to Python 3 with <code>2to3</code></a>
</ul>
<p>There is a <a href="http://hg.diveintopython3.org/">changelog</a>, a <a rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">feed</a>, and <a href="http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&amp;sort=new">discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
<pre><samp class="prompt">you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
<p>There is a <a href=http://hg.diveintopython3.org/>changelog</a>, a <a rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>feed</a>, and <a href=http://www.reddit.com/search?q=%22Dive+Into+Python+3%22&amp;sort=new>discussion on Reddit</a>. During development, you can download the book by cloning the Mercurial repository:
<pre><samp class=prompt>you@localhost:~$ </samp><kbd>hg clone http://hg.diveintopython3.org/ diveintopython3</kbd></pre>
<p>The final version will be downloadable as <abbr>HTML</abbr> and <abbr>PDF</abbr>.
<p class="c">This site is optimized for Lynx just because fuck you.<br>I&#8217;m told it also looks good in graphical browsers.
<p class="c">&copy; 2001-4, 2009 <span>&#x2133;</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>
<p class=c>This site is optimized for Lynx just because fuck you.<br>I&#8217;m told it also looks good in graphical browsers.
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
<!--
As I write this, the year is 2009, and the internet is STILL a battleground of so-called "intellectual property" disputes. Some people would have you believe that without proper financial incentives, music, literature, and software would disappear. After all, who would make music if they can't make money on it? Who would write? Who would program?
As I write this, the year is 2009, and the internet is STILL a battleground of so-called intellectual property disputes. Some people would have you believe that without proper financial incentives, music, literature, and software would disappear. After all, who would make music if they can't make money on it? Who would write? Who would program?
I know the answer. The answer is that musicians will make music, not because they can make money, but because musicians are the people who can't not make music. Writers will write because they can't not write. Most of the people you think of as artists are really just showmen. They collect a paycheck and go home at 5 o'clock. That's not art, that's commerce.
I know the answer. The answer is that musicians will make music, not because they can make money, but because musicians are the people who can't not make music. Writers will write because they can't not write. Most of the people you think of as artists are really just showmen. They collect a paycheck and go home at 5 o'clock. That's not art, that's commerce.
I've been programming since 1983 and releasing my code under Free Software licenses since 1993. I've been writing and publishing under Free Content licenses since 2000. I can't imagine not doing this. If you can imagine yourself not doing what you're doing, do something else. Do whatever it is you can't not do.
I've been programming since 1983 and releasing my code under Free Software licenses since 1993. I've been writing and publishing under Free Content licenses since 2000. I can't imagine not doing this. If you can imagine yourself not doing what you're doing, do something else. Do whatever it is you can't not do.
-->
</body>
</html>
+525 -132
View File
@@ -1,38 +1,41 @@
<!DOCTYPE html>
<html lang="en">
<html lang=en>
<head>
<meta charset="utf-8">
<meta charset=utf-8>
<title>Native datatypes - Dive into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
body{counter-reset:h1 2}
</style>
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="root" value="Search"></div></form>
<p class="nav">You are here: <a href="/">Home</a> <span>&#8227;</span> <a href="table-of-contents.html">Dive Into Python 3</a> <span>&#8227;</span>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=root value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Native datatypes</h1>
<blockquote class="q">
<blockquote class=q>
<p><span>&#x275D;</span> Wonder is the foundation of all philosophy, research its progress, ignorance its end. <span>&#x275E;</span><br>&mdash; <cite>Michel de Montaigne</cite>
</blockquote>
<ol>
<li><a href="#divingin">Diving in</a>
<li><a href="#booleans">Booleans</a>
<li><a href="#numbers">Numbers</a>
<li><a href=#divingin>Diving in</a>
<li><a href=#booleans>Booleans</a>
<li><a href=#numbers>Numbers</a>
<!--
<ol>
<li><a href="#integers">Integers</a>
<li><a href="#floats">Floating point numbers</a>
<li><a href="#fractions">Fractions</a>
<li><a href="#complexnumbers">Complex numbers</a>
<li><a href="#numberoperations">Common operations on numbers</a>
<li><a href="#math">The <code>math</code> module</a>
<li><a href=#integers>Integers</a>
<li><a href=#floats>Floating point numbers</a>
<li><a href=#fractions>Fractions</a>
<li><a href=#complexnumbers>Complex numbers</a>
<li><a href=#numberoperations>Common operations on numbers</a>
<li><a href=#math>The <code>math</code> module</a>
</ol>
-->
<li><a href="#lists">Lists</a>
<li><a href=#lists>Lists</a>
<ol>
<li><a href=#creatinglists>Creating lists</a>
<li><a href=#slicinglists>Slicing lists</a>
</ol>
<!--
<ol>
<li>Creating a new list
@@ -42,7 +45,7 @@ body{counter-reset:h1 2}
<li>Common operations on lists
</ol>
-->
<li><a href="#sets">Sets</a>
<li><a href=#sets>Sets</a>
<!--
<ol>
<li>Creating a new set
@@ -52,13 +55,13 @@ body{counter-reset:h1 2}
<li>Frozen sets
</ol>
-->
<li><a href="#dictionaries">Dictionaries</a>
<li><a href="#none"><code>None</code></a>
<li><a href="#furtherreading">Further reading</a>
<li><a href=#dictionaries>Dictionaries</a>
<li><a href=#none><code>None</code></a>
<li><a href=#furtherreading>Further reading</a>
</ol>
<h2 id="divingin">Diving in</h2>
<p class="fancy">A short digression is in order. Put aside <a href="your-first-python-program.html">your first Python program</a> for just a minute, and let's talk about datatypes. <a href="your-first-python-program.html#datatypes">Every variable has a datatype</a>, even though you don't declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.
<p>Python has many native datatypes. Here are the important ones:
<h2 id=divingin>Diving in</h2>
<p class=fancy>A short digression is in order. Put aside <a href=your-first-python-program.html>your first Python program</a> for just a minute, and let's talk about datatypes. <a href=your-first-python-program.html#datatypes>Every variable has a datatype</a>, even though you don't declare it explicitly. Based on each variable's original assignment, Python figures out what type it is and keeps track of that internally.
<p>Python has many native datatypes. Here are the important ones:
<ol>
<li><b>Booleans</b> are either <code>True</code> or <code>False</code>.
<li><b>Numbers</b> can be integers (<code>1</code> and <code>2</code>), floats (<code>1.1</code> and <code>1.2</code>), fractions (<code>1/2</code> and <code>2/3</code>), or even complex numbers (<code><var>i</var></code>, the square root of <code>-1</code>).
@@ -68,187 +71,579 @@ body{counter-reset:h1 2}
<li><b>Sets</b> are unordered bags of values.
<li><b>Dictionaries</b> are unordered bags of key-value pairs.
</ol>
<p>Of course, there are a lot more types than these seven. <a href="your-first-python-program.html#everythingisanobject">Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You've already seen some of these: <a href="your-first-python-program.html#runningscripts">modules have names</a>, <a href="your-first-python-program.html#docstrings">functions have <code>docstrings</code></a>, <i class="baa">&amp;</i>c. You'll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let's look at the others first.
<h2 id="booleans">Booleans</h2>
<p>Booleans are either true or false. Python has two constants, <code>True</code> and <code>False</code>, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like <code>if</code> statements), Python expects an expression to evaluate to a boolean value. These places are called <i>boolean contexts</i>. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)
<p>For example, take this snippet from <a href="your-first-python-program.html#divingin"><code>humansize.py</code></a>:
<p>Of course, there are a lot more types than these seven. <a href=your-first-python-program.html#everythingisanobject>Everything is an object</a> in Python, so there are types like <i>module</i>, <i>function</i>, <i>class</i>, <i>method</i>, <i>file</i>, and even <i>compiled code</i>. You've already seen some of these: <a href=your-first-python-program.html#runningscripts>modules have names</a>, <a href=your-first-python-program.html#docstrings>functions have <code>docstrings</code></a>, <i class=baa>&amp;</i>c. You'll learn about classes in [FIXME xref] and files in [FIXME xref].
<p>Strings and bytes are important enough &mdash; and complicated enough &mdash; that they get their own chapter. Let's look at the others first.
<h2 id=booleans>Booleans</h2>
<p>Booleans are either true or false. Python has two constants, <code>True</code> and <code>False</code>, which can be used to assign boolean values directly. Expressions can also evaluate to a boolean value. In certain places (like <code>if</code> statements), Python expects an expression to evaluate to a boolean value. These places are called <i>boolean contexts</i>. You can use virtually any expression in a boolean context, and Python will try to determine its truth value. Different datatypes have different rules about which values are true or false in a boolean context. (This will make more sense once you see some concrete examples later in this chapter.)
<p>For example, take this snippet from <a href=your-first-python-program.html#divingin><code>humansize.py</code></a>:
<pre><code>if size &lt; 0:
raise ValueError('number must be non-negative')</code></pre>
<p><var>size</var> is an integer, <code>0</code> is an integer, and <code>&lt;</code> is a numerical operator. The result of the expression <code>size &lt; 0</code> is always a boolean. You can test this yourself in the Python interactive shell:
<pre class="screen">
<samp class="prompt">>>> </samp><kbd>size = 1</kbd>
<samp class="prompt">>>> </samp><kbd>size &lt; 0</kbd>
<p><var>size</var> is an integer, <code>0</code> is an integer, and <code>&lt;</code> is a numerical operator. The result of the expression <code>size &lt; 0</code> is always a boolean. You can test this yourself in the Python interactive shell:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>size = 1</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp>False</samp>
<samp class="prompt">>>> </samp><kbd>size = 0</kbd>
<samp class="prompt">>>> </samp><kbd>size &lt; 0</kbd>
<samp class=prompt>>>> </samp><kbd>size = 0</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp>False</samp>
<samp class="prompt">>>> </samp><kbd>size = -1</kbd>
<samp class="prompt">>>> </samp><kbd>size &lt; 0</kbd>
<samp class=prompt>>>> </samp><kbd>size = -1</kbd>
<samp class=prompt>>>> </samp><kbd>size &lt; 0</kbd>
<samp>True</samp></pre>
<h2 id="numbers">Numbers</h2>
<p>Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.
<pre class="screen">
<a><samp class="prompt">>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span></a>
<h2 id=numbers>Numbers</h2>
<p>Numbers are awesome. There are so many to choose from. Python supports both integers and floating point numbers. There's no type declaration to distinguish them; Python tells them apart by the presence or absence of a decimal point.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>type(1)</kbd> <span>&#x2460;</span></a>
<samp>&lt;class 'int'></samp>
<a><samp class="prompt">>>> </samp><kbd>1 + 1</kbd> <span>&#x2461;</span></a>
<a><samp class=prompt>>>> </samp><kbd>1 + 1</kbd> <span>&#x2461;</span></a>
<samp>2</samp>
<a><samp class="prompt">>>> </samp><kbd>1 + 1.0</kbd> <span>&#x2462;</span></a>
<a><samp class=prompt>>>> </samp><kbd>1 + 1.0</kbd> <span>&#x2462;</span></a>
<samp>2.0</samp>
<samp class="prompt">>>> </samp><kbd>type(2.0)</kbd>
<samp class=prompt>>>> </samp><kbd>type(2.0)</kbd>
<samp>&lt;class 'float'></samp></pre>
<ol>
<li>You can use the <code>type()</code> function to check the type of any value or variable. As you might expect, <code>1</code> is an <code>int</code>.
<li>You can use the <code>type()</code> function to check the type of any value or variable. As you might expect, <code>1</code> is an <code>int</code>.
<li>Adding an <code>int</code> to an <code>int</code> yields an <code>int</code>.
<li>Adding an <code>int</code> to a <code>float</code> yields a <code>float</code>. Python coerces the <code>int</code> into a <code>float</code> to perform the addition, then returns a <code>float</code> as the result.
<li>Adding an <code>int</code> to a <code>float</code> yields a <code>float</code>. Python coerces the <code>int</code> into a <code>float</code> to perform the addition, then returns a <code>float</code> as the result.
</ol>
<p>As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.
<pre class="screen">
<a><samp class="prompt">>>> </samp><kbd>float(2)</kbd> <span>&#x2460;</span></a>
<p>As you just saw, some operators (like addition) will coerce integers to floating point numbers as needed. You can also coerce them by yourself.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>float(2)</kbd> <span>&#x2460;</span></a>
<samp>2.0</samp>
<a><samp class="prompt">>>> </samp><kbd>int(2.0)</kbd> <span>&#x2461;</span></a>
<a><samp class=prompt>>>> </samp><kbd>int(2.0)</kbd> <span>&#x2461;</span></a>
<samp>2</samp>
<a><samp class="prompt">>>> </samp><kbd>int(2.5)</kbd> <span>&#x2462;</span></a>
<a><samp class=prompt>>>> </samp><kbd>int(2.5)</kbd> <span>&#x2462;</span></a>
<samp>2</samp>
<a><samp class="prompt">>>> </samp><kbd>int(-2.5)</kbd> <span>&#x2463;</span></a>
<a><samp class=prompt>>>> </samp><kbd>int(-2.5)</kbd> <span>&#x2463;</span></a>
<samp>-2</samp>
<a><samp class="prompt">>>> </samp><kbd>1.12345678901234567890</kbd> <span>&#x2464;</span></a>
<a><samp class=prompt>>>> </samp><kbd>1.12345678901234567890</kbd> <span>&#x2464;</span></a>
<samp>1.1234567890123457</samp>
<a><samp class="prompt">>>> </samp><kbd>type(1000000000000000)</kbd> <span>&#x2465;</span></a>
<a><samp class=prompt>>>> </samp><kbd>type(1000000000000000)</kbd> <span>&#x2465;</span></a>
<samp>&lt;class 'int'></samp></pre>
<ol>
<li>You can explicitly coerce an <code>int</code> to a <code>float</code> by calling the <code>float()</code> function.
<li>Unsurprisingly, you can also coerce a <code>float</code> to an <code>int</code> by calling <code>int()</code>.
<li>The <code>int()</code> function will truncate, not round.
<li>The <code>int()</code> function truncates negative numbers towards <code>0</code>. It's a true truncate function, not a floor function.
<li>The <code>int()</code> function truncates negative numbers towards <code>0</code>. It's a true truncate function, not a floor function.
<li>Floating point numbers are accurate to 15 decimal places.
<li>Integers can be arbitrarily large.
</ol>
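<p>If the difference between truncating and flooring is still fuzzy, here's a minimal sketch that puts them side by side. The variable names are just for illustration; <code>math.floor()</code> comes from the standard <code>math</code> module, which this chapter hasn't covered yet.

```python
import math

# int() truncates toward zero, so -2.5 becomes -2.
truncated = int(-2.5)

# math.floor() rounds toward negative infinity, so -2.5 becomes -3.
floored = math.floor(-2.5)

print(truncated, floored)
```

<p>For positive numbers the two agree; they only part ways below zero.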
<blockquote class="note compare python2">
<p><span>&#x261E;</span>Python 2 had separate types for <code>int</code> and <code>long</code>. The <code>int</code> datatype was limited by <code>sys.maxint</code>, which varied by platform but was usually <code>2<sup>31</sup>-1</code>. Python 3 has just one integer type, which behaves mostly like the old <code>long</code> type from Python 2. See <a href="http://www.python.org/dev/peps/pep-0237">PEP 237</a> for details.
<p><span>&#x261E;</span>Python 2 had separate types for <code>int</code> and <code>long</code>. The <code>int</code> datatype was limited by <code>sys.maxint</code>, which varied by platform but was usually <code>2<sup>31</sup>-1</code>. Python 3 has just one integer type, which behaves mostly like the old <code>long</code> type from Python 2. See <a href=http://www.python.org/dev/peps/pep-0237>PEP 237</a> for details.
</blockquote>
<p>You can do all kinds of things with numbers.
<pre class="screen">
<a><samp class="prompt">>>> </samp><kbd>11 / 2</kbd> <span>&#x2460;</span></a>
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>11 / 2</kbd> <span>&#x2460;</span></a>
<samp>5.5</samp>
<a><samp class="prompt">>>> </samp><kbd>11 // 2</kbd> <span>&#x2461;</span></a>
<a><samp class=prompt>>>> </samp><kbd>11 // 2</kbd> <span>&#x2461;</span></a>
<samp>5</samp>
<a><samp class="prompt">>>> </samp><kbd>&minus;11 // 2</kbd> <span>&#x2462;</span></a>
<a><samp class=prompt>>>> </samp><kbd>&minus;11 // 2</kbd> <span>&#x2462;</span></a>
<samp>&minus;6</samp>
<a><samp class="prompt">>>> </samp><kbd>11.0 // 2</kbd> <span>&#x2463;</span></a>
<a><samp class=prompt>>>> </samp><kbd>11.0 // 2</kbd> <span>&#x2463;</span></a>
<samp>5.0</samp>
<a><samp class="prompt">>>> </samp><kbd>11 ** 2</kbd> <span>&#x2464;</span></a>
<a><samp class=prompt>>>> </samp><kbd>11 ** 2</kbd> <span>&#x2464;</span></a>
<samp>121</samp>
<a><samp class="prompt">>>> </samp><kbd>11 % 2</kbd> <span>&#x2465;</span></a>
<a><samp class=prompt>>>> </samp><kbd>11 % 2</kbd> <span>&#x2465;</span></a>
<samp>1</samp>
</pre>
<ol>
<li>The <code>/</code> operator performs floating point division. It returns a <code>float</code> even if both the numerator and denominator are <code>int</code>s.
<li>The <code>//</code> operator performs a quirky kind of integer division. When the result is positive, you can think of it as truncating (not rounding) to <code>0</code> decimal places, but be careful with that.
<li>When integer-dividing negative numbers, the <code>//</code> operator rounds &#8220;up&#8221; to the nearest integer. Mathematically speaking, it's rounding &#8220;down&#8221; since <code>&minus;6</code> is less than <code>&minus;5</code>, but it could trip you up if you're expecting it to truncate to <code>&minus;5</code>.
<li>The <code>//</code> operator doesn't always return an integer. If either the numerator or denominator is a <code>float</code>, it will still round to the nearest integer, but the actual return value will be a <code>float</code>.
<li>The <code>/</code> operator performs floating point division. It returns a <code>float</code> even if both the numerator and denominator are <code>int</code>s.
<li>The <code>//</code> operator performs a quirky kind of integer division. When the result is positive, you can think of it as truncating (not rounding) to <code>0</code> decimal places, but be careful with that.
<li>When integer-dividing negative numbers, the <code>//</code> operator rounds &#8220;up&#8221; to the nearest integer. Mathematically speaking, it's rounding &#8220;down&#8221; since <code>&minus;6</code> is less than <code>&minus;5</code>, but it could trip you up if you're expecting it to truncate to <code>&minus;5</code>.
<li>The <code>//</code> operator doesn't always return an integer. If either the numerator or denominator is a <code>float</code>, it will still round to the nearest integer, but the actual return value will be a <code>float</code>.
<li>The <code>**</code> operator means &#8220;raised to the power of.&#8221; <code>11<sup>2</sup></code> is <code>121</code>.
<li>The <code>%</code> operator gives the remainder after performing integer division. <code>11</code> divided by <code>2</code> is <code>5</code> with a remainder of <code>1</code>, so the result here is <code>1</code>.
<li>The <code>%</code> operator gives the remainder after performing integer division. <code>11</code> divided by <code>2</code> is <code>5</code> with a remainder of <code>1</code>, so the result here is <code>1</code>.
</ol>
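<p>One way to keep <code>//</code> and <code>%</code> straight is to remember that they always fit back together: the quotient times the denominator, plus the remainder, gives you the numerator again. Here's a small sketch of that relationship with a negative numerator; the variable names are made up for this example.

```python
numerator = -11
denominator = 2

# // floors the result, so -5.5 becomes -6 rather than -5.
quotient = numerator // denominator

# % returns the remainder that goes with that floored quotient.
remainder = numerator % denominator

# The two results always recombine into the original numerator.
recombined = quotient * denominator + remainder

print(quotient, remainder, recombined)
```

<p>Note that the remainder here is <code>1</code>, not <code>-1</code>, because it has to make up the difference from the floored quotient.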
<blockquote class="note compare python2">
<p><span>&#x261E;</span>In Python 2, the <code>/</code> operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the <code>/</code> operator always means floating point division. See <a href="http://www.python.org/dev/peps/pep-0238/">PEP 238</a> for details.
<p><span>&#x261E;</span>In Python 2, the <code>/</code> operator usually meant integer division, but you could make it behave like floating point division by including a special directive in your code. In Python 3, the <code>/</code> operator always means floating point division. See <a href=http://www.python.org/dev/peps/pep-0238/>PEP 238</a> for details.
</blockquote>
<p>FIXME fractions, math module, numbers in a boolean context
<h2 id="lists">Lists</h2>
<h2 id=lists>Lists</h2>
<p>Lists are Python's workhorse datatype. When I say "list," you might be thinking "array whose size I have to declare in advance, that can only contain items of the same type, <i class=baa>&amp;</i>c." Don't think that. Lists are much cooler than that.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>A list in Python is like an array in Perl 5. In Perl 5, variables that store arrays always start with the <code>@</code> character; in Python, variables can be named anything, and Python keeps track of the datatype internally.
</blockquote>
<blockquote class="note compare java">
<p><span>&#x261E;</span>A list in Python is much more than an array in Java (although it can be used as one if that's really all you want out of life). A better analogy would be to the <code>ArrayList</code> class, which can hold arbitrary objects and can expand dynamically as new items are added.
</blockquote>
<h3 id=creatinglists>Creating lists</h3>
<p>Creating a list is easy: use square brackets to wrap a comma-separated list of values.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>a_list = ["a", "b", "mpilgrim", "z", "example"]</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[0]</kbd> <span>&#x2461;</span></a>
<samp>'a'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[4]</kbd> <span>&#x2462;</span></a>
<samp>'example'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[-1]</kbd> <span>&#x2463;</span></a>
<samp>'example'</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[-3]</kbd> <span>&#x2464;</span></a>
<samp>'mpilgrim'</samp></pre>
<ol>
<li>First, you define a list of five items. Note that they retain their original order. This is not an accident. A list is an ordered set of items.
<li>A list can be used like a zero-based array. The first element of any non-empty list is always <code>a_list[0]</code>.
<li>The last element of this five-element list is <code>a_list[4]</code>, because lists are always zero-based.
<li>A negative index accesses elements from the end of the list counting backwards. The last element of any non-empty list is always <code>a_list[-1]</code>.
<li>If the negative index is confusing to you, think of it this way: <code>a_list[-<var>n</var>] == a_list[len(a_list) - <var>n</var>]</code>. So in this list, <code>a_list[-3] == a_list[5 - 3] == a_list[2]</code>.
</ol>
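<p>You can check the negative-index rule from the last note yourself. This is just a sketch using the same five-item list; <var>n</var>, <var>from_end</var>, and <var>from_start</var> are names invented for the example.

```python
a_list = ["a", "b", "mpilgrim", "z", "example"]
n = 3

# Counting backwards from the end of the list...
from_end = a_list[-n]

# ...lands on the same item as counting forwards from len(a_list) - n.
from_start = a_list[len(a_list) - n]

print(from_end, from_start)
```

<p>The rule only applies when <var>n</var> is between <code>1</code> and the length of the list; anything outside that range raises an <code>IndexError</code>.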
<h3 id=slicinglists>Slicing lists</h3>
<p>Once you've defined a list, you can get any part of it as a new list. This is called <i>slicing</i> the list.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_list</kbd>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[1:3]</kbd> <span>&#x2460;</span></a>
<samp>['b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[1:-1]</kbd> <span>&#x2461;</span></a>
<samp>['b', 'mpilgrim', 'z']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[0:3]</kbd> <span>&#x2462;</span></a>
<samp>['a', 'b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[:3]</kbd> <span>&#x2463;</span></a>
<samp>['a', 'b', 'mpilgrim']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[3:]</kbd> <span>&#x2464;</span></a>
<samp>['z', 'example']</samp>
<a><samp class=prompt>>>> </samp><kbd>a_list[:]</kbd> <span>&#x2465;</span></a>
<samp>['a', 'b', 'mpilgrim', 'z', 'example']</samp></pre>
<ol>
<li>You can get a part of a list, called a &#8220;slice&#8221;, by specifying two indices. The return value is a new list containing all the elements of the list, in order, starting with the first slice index (in this case <code>a_list[1]</code>), up to but not including the second slice index (in this case <code>a_list[3]</code>).
<li>Slicing works if one or both of the slice indices is negative. If it helps, you can think of it this way: reading the list from left to right, the first slice index specifies the first element you want, and the second slice index specifies the first element you don't want. The return value is everything in between.
<li>Lists are zero-based, so <code>a_list[0:3]</code> returns the first three elements of the list, starting at <code>a_list[0]</code>, up to but not including <code>a_list[3]</code>.
<li>If the left slice index is <code>0</code>, you can leave it out, and <code>0</code> is implied. So <code>a_list[:3]</code> is the same as <code>a_list[0:3]</code>, because the starting <code>0</code> is implied.
<li>Similarly, if the right slice index is the length of the list, you can leave it out. So <code>a_list[3:]</code> is the same as <code>a_list[3:5]</code>, because this list has five elements. There is a pleasing symmetry here. In this five-element list, <code>a_list[:3]</code> returns the first 3 elements, and <code>a_list[3:]</code> returns the last two elements. In fact, <code>a_list[:<var>n</var>]</code> will always return the first <var>n</var> elements, and <code>a_list[<var>n</var>:]</code> will return the rest, regardless of the length of the list.
<li>If both slice indices are left out, all elements of the list are included. But this is not the same as the original <var>a_list</var> variable. It is a new list that happens to have all the same elements. <code>a_list[:]</code> is shorthand for making a complete copy of a list.
</ol>
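<p>The last two points are easier to believe once you see them run. Here's a minimal sketch, using the same list, that splits it at an index, glues the halves back together, and shows that <code>a_list[:]</code> really is a separate list. The names <var>first</var>, <var>rest</var>, and <var>a_copy</var> are made up for this example.

```python
a_list = ["a", "b", "mpilgrim", "z", "example"]

# a_list[:n] and a_list[n:] split the list into two halves with no overlap.
first = a_list[:3]
rest = a_list[3:]

# a_list[:] builds a new list with the same elements.
a_copy = a_list[:]

# Changing the copy leaves the original list alone.
a_copy.append("new")

print(first, rest)
print(len(a_list), len(a_copy))
```

<p>The copy is shallow: it is a new list, but it holds the same items as the original, which only matters once your list contains things you can change in place (like other lists).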
<!--
<h3>3.2.2. Adding Elements to Lists</h3>
<div class="example"><h3>Example 3.10. Adding Elements to a List</h3><pre class=screen><samp class=prompt>>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example']
<samp class=prompt>>>> </samp>li.append("new") <img id=odbchelper.list.5.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12>
<samp class=prompt>>>> </samp>li
['a', 'b', 'mpilgrim', 'z', 'example', 'new']
<samp class=prompt>>>> </samp>li.insert(2, "new") <img id=odbchelper.list.5.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12>
<samp class=prompt>>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new']
<samp class=prompt>>>> </samp>li.extend(["two", "elements"]) <img id=odbchelper.list.5.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12>
<samp class=prompt>>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>append</code> adds a single element to the end of the list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>insert</code> inserts a single element into a list. The numeric argument is the index of the first element that gets bumped out of position.
Note that list elements do not need to be unique; there are now two separate elements with the value <code>'new'</code>, <code>a_list[2]</code> and <code>a_list[6]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>extend</code> concatenates lists. Note that you do not call <code>extend</code> with multiple arguments; you call it with one argument, a list. In this case, that list has two elements.
</td>
</tr>
</table>
<div class="example"><h3 id=odbchelper.list.append.vs.extend>Example 3.11. The Difference between <code>extend</code> and <code>append</code></h3><pre class=screen>
<samp class=prompt>>>> </samp>a_list = ['a', 'b', 'c']
<samp class=prompt>>>> </samp>a_list.extend(['d', 'e', 'f']) <img id="odbchelper.list.5.4" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>a_list
['a', 'b', 'c', 'd', 'e', 'f']
<samp class=prompt>>>> </samp>len(a_list) <img id="odbchelper.list.5.5" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
6
<samp class=prompt>>>> </samp>a_list[-1]
'f'
<samp class=prompt>>>> </samp>a_list = ['a', 'b', 'c']
<samp class=prompt>>>> </samp>a_list.append(['d', 'e', 'f']) <img id="odbchelper.list.5.6" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>a_list
['a', 'b', 'c', ['d', 'e', 'f']]
<samp class=prompt>>>> </samp>len(a_list) <img id="odbchelper.list.5.7" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
4
<samp class=prompt>>>> </samp>a_list[-1]
['d', 'e', 'f']
</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.4><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lists have two methods, <code>extend</code> and <code>append</code>, that look like they do the same thing, but are in fact completely different. <code>extend</code> takes a single argument, normally a list (any iterable is accepted), and adds each of its elements to the original list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.5><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Here you started with a list of three elements (<code>'a'</code>, <code>'b'</code>, and <code>'c'</code>), and you extended the list with a list of another three elements (<code>'d'</code>, <code>'e'</code>, and <code>'f'</code>), so you now have a list of six elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.6><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">On the other hand, <code>append</code> takes one argument, which can be any data type, and simply adds it to the end of the list. Here, you're calling the <code>append</code> method with a single argument, which is a list of three elements.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.5.7><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Now the original list, which started as a list of three elements, contains four elements. Why four? Because the last element
that you just appended <em>is itself a list</em>. Lists can contain any type of data, including other lists. That may be what you want, or maybe not. Don't use <code>append</code> if you mean <code>extend</code>.
</td>
</tr>
</table>
<h3>3.2.3. Searching Lists</h3>
<div class="example"><h3 id=odbchelper.list.search>Example 3.12. Searching a List</h3><pre class=screen><samp class=prompt>>>> </samp>a_list
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
<samp class=prompt>>>> </samp>a_list.index("example") <img id="odbchelper.list.6.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
5
<samp class=prompt>>>> </samp>a_list.index("new") <img id="odbchelper.list.6.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
2
<samp class=prompt>>>> </samp>a_list.index("c") <img id="odbchelper.list.6.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: list.index(x): x not in list</samp>
<samp class=prompt>>>> </samp>"c" in a_list <img id="odbchelper.list.6.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
False</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.6.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>index</code> finds the first occurrence of a value in the list and returns the index.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.6.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>index</code> finds the <em>first</em> occurrence of a value in the list. In this case, <code>'new'</code> occurs twice in the list, in <code>a_list[2]</code> and <code>a_list[6]</code>, but <code>index</code> will return only the first index, <code>2</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.6.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the value is not found in the list, Python raises an exception. This is notably different from most languages, which will return some invalid index. While this may
seem annoying, it is a good thing, because it means your program will crash at the source of the problem, rather than later
on when you try to use the invalid index.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.6.4><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">To test whether a value is in the list, use <code>in</code>, which returns <code>True</code> if the value is found or <code>False</code> if it is not.
</td>
</tr>
</table>
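<p>Because <code>index</code> raises an exception for a missing value, a lookup that might fail is usually guarded. The following is a minimal sketch of one way to do that; the helper name <code>find_index</code> is hypothetical and is not part of Python or of this book.

```python
def find_index(a_list, value):
    """Return the index of the first occurrence of value, or None if absent.

    find_index is a hypothetical helper written for illustration only.
    """
    try:
        return a_list.index(value)
    except ValueError:
        # list.index raises ValueError when the value is not in the list
        return None


a_list = ['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
print(find_index(a_list, 'example'))  # 5
print(find_index(a_list, 'new'))      # 2 (first occurrence only)
print(find_index(a_list, 'c'))        # None
print('c' in a_list)                  # False
```

<p>Returning <code>None</code> gives up the fail-at-the-source behavior described above, so a helper like this only makes sense where a missing value is an expected case.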
</div><table id="tip.boolean" class="note" border="0" summary="">
<tr>
<td colspan="2" align="left" valign="top" width="99%">Before version 2.2.1, Python had no separate boolean datatype. To compensate for this, Python accepted almost anything in a boolean context (like an <code>if</code> statement), according to the following rules:
<div class="itemizedlist">
<ul>
<li><code>0</code> is false; all other numbers are true.
<li>An empty string (<code>""</code>) is false; all other strings are true.
<li>An empty list (<code>[]</code>) is false; all other lists are true.
<li>An empty tuple (<code>()</code>) is false; all other tuples are true.
<li>An empty dictionary (<code>{}</code>) is false; all other dictionaries are true.
</ul>
</div>These rules still apply in Python 2.2.1 and beyond, but now you can also use an actual boolean, which has a value of <code>True</code> or <code>False</code>. Note the capitalization; these values, like everything else in Python, are case-sensitive.
</td>
</tr>
</table>
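<p>The rules in the note above can be checked directly with the built-in <code>bool</code> function. This is only an illustrative sketch; the <code>samples</code> list is made up for the purpose.

```python
# Each pair is (value, expected truth value) under the rules listed above.
samples = [
    (0, False), (1, True), (-1, True),
    ('', False), ('a', True),
    ([], False), ([0], True),
    ((), False), ((0,), True),
    ({}, False), ({'a': 1}, True),
]

for value, expected in samples:
    # bool() reports how the value behaves in a boolean context such as "if"
    assert bool(value) is expected
    print(repr(value), 'is', bool(value))
```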
<h3>3.2.4. Deleting List Elements</h3>
<div class="example"><h3 id=odbchelper.list.removingelements>Example 3.13. Removing Elements from a List</h3><pre class=screen><samp class=prompt>>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'z', 'example', 'new', 'two', 'elements']
<samp class=prompt>>>> </samp>li.remove("z") <img id="odbchelper.list.7.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>li
['a', 'b', 'new', 'mpilgrim', 'example', 'new', 'two', 'elements']
<samp class=prompt>>>> </samp>li.remove("new") <img id="odbchelper.list.7.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new', 'two', 'elements']
<samp class=prompt>>>> </samp>li.remove("c") <img id="odbchelper.list.7.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
ValueError: list.remove(x): x not in list</samp>
<samp class=prompt>>>> </samp>li.pop() <img id="odbchelper.list.7.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
'elements'
<samp class=prompt>>>> </samp>li
['a', 'b', 'mpilgrim', 'example', 'new', 'two']</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.7.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>remove</code> removes the first occurrence of a value from a list.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.7.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>remove</code> removes <em>only</em> the first occurrence of a value. In this case, <code>'new'</code> appeared twice in the list, but <code>li.remove("new")</code> removed only the first occurrence.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.7.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">If the value is not found in the list, Python raises an exception. This mirrors the behavior of the <code>index</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.7.4><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left"><code>pop</code> is an interesting beast. It does two things: it removes the last element of the list, and it returns the value that it removed.
Note that this is different from <code>li[-1]</code>, which returns a value but does not change the list, and different from <code>li.remove(<var>value</var>)</code>, which changes the list but does not return a value.
</td>
</tr>
</table>
<h3>3.2.5. Using List Operators</h3>
<div class="example"><h3 id=odbchelper.list.operators>Example 3.14. List Operators</h3><pre class=screen><samp class=prompt>>>> </samp>a_list = ['a', 'b', 'mpilgrim']
<samp class=prompt>>>> </samp>a_list = a_list + ['example', 'new'] <img id="odbchelper.list.8.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>a_list
['a', 'b', 'mpilgrim', 'example', 'new']
<samp class=prompt>>>> </samp>a_list += ['two'] <img id="odbchelper.list.8.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>a_list
['a', 'b', 'mpilgrim', 'example', 'new', 'two']
<samp class=prompt>>>> </samp>a_list = [1, 2] * 3 <img id="odbchelper.list.8.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>a_list
[1, 2, 1, 2, 1, 2]</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.8.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Lists can also be concatenated with the <code>+</code> operator. <code><var>list</var> = <var>list</var> + <var>otherlist</var></code> has the same result as <code><var>list</var>.extend(<var>otherlist</var>)</code>. But the <code>+</code> operator returns a new (concatenated) list as a value, whereas <code>extend</code> only alters an existing list. This means that <code>extend</code> is faster, especially for large lists.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.8.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Python supports the <code>+=</code> operator. <code>a_list += ['two']</code> is equivalent to <code>a_list.extend(['two'])</code>. The <code>+=</code> operator works for lists, strings, and integers, and it can be overloaded to work for user-defined classes as well. (More
on classes in <a href=#fileinfo>Chapter 5</a>.)
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.list.8.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The <code>*</code> operator works on lists as a repeater. <code>a_list = [1, 2] * 3</code> is equivalent to <code>a_list = [1, 2] + [1, 2] + [1, 2]</code>, which concatenates the three lists into one.
</td>
</tr>
</table>
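<p>The difference between building a new list with <code>+</code> and altering an existing one with <code>extend</code> or <code>+=</code> is easiest to see when a second name refers to the same list. A small sketch, with variable names chosen only for this illustration:

```python
original = ['a', 'b', 'mpilgrim']
alias = original                 # a second name for the same list object

combined = original + ['example', 'new']
print(combined is original)      # False: + built a new list
print(original)                  # ['a', 'b', 'mpilgrim'] (unchanged)

original.extend(['example', 'new'])
print(alias)                     # ['a', 'b', 'mpilgrim', 'example', 'new']

original += ['two']              # += also modifies the list in place
print(alias is original)         # True
print(alias)                     # ['a', 'b', 'mpilgrim', 'example', 'new', 'two']

repeated = [1, 2] * 3
print(repeated == [1, 2] + [1, 2] + [1, 2])  # True
```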
<div class="itemizedlist">
<h3>Further Reading on Lists</h3>
<ul>
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> teaches about lists and makes an important point about <a href=http://www.ibiblio.org/obp/thinkCSpy/chap08.htm>passing lists as function arguments</a>.
<li><a href=http://www.python.org/doc/current/tut/tut.html><i class="citetitle">Python Tutorial</i></a> shows how to <a href=http://www.python.org/doc/current/tut/node7.html#SECTION007110000000000000000>use lists as stacks and queues</a>.
<li><a href=http://www.faqts.com/knowledge-base/index.phtml/fid/199/>Python Knowledge Base</a> answers <a href=http://www.faqts.com/knowledge-base/index.phtml/fid/534>common questions about lists</a> and has a lot of <a href=http://www.faqts.com/knowledge-base/index.phtml/fid/540>example code using lists</a>.
<li><a href=http://www.python.org/doc/current/lib/><i class="citetitle">Python Library Reference</i></a> summarizes <a href=http://www.python.org/doc/current/lib/typesseq-mutable.html>all the list methods</a>.
</ul>
<h2 id=odbchelper.tuple>3.3. Introducing Tuples</h2>
<p>A tuple is an immutable list. A tuple can not be changed in any way once it is created.
<div class="example"><h3>Example 3.15. Defining a tuple</h3><pre class=screen><samp class=prompt>>>> </samp>t = ("a", "b", "mpilgrim", "z", "example") <img id="odbchelper.tuple.1.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class=prompt>>>> </samp>t
('a', 'b', 'mpilgrim', 'z', 'example')
<samp class=prompt>>>> </samp>t[0] <img id="odbchelper.tuple.1.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
'a'
<samp class=prompt>>>> </samp>t[-1] <img id="odbchelper.tuple.1.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
'example'
<samp class=prompt>>>> </samp>t[1:3] <img id="odbchelper.tuple.1.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
('b', 'mpilgrim')</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.1.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">A tuple is defined in the same way as a list, except that the whole set of elements is enclosed in parentheses instead of
square brackets.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.1.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">The elements of a tuple have a defined order, just like a list. Tuple indices are zero-based, just like a list, so the first
element of a non-empty tuple is always <code>t[0]</code>.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.1.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Negative indices count from the end of the tuple, just as with a list.</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.1.4><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">Slicing works too, just like a list. Note that when you slice a list, you get a new list; when you slice a tuple, you get
a new tuple.
</td>
</tr>
</table>
<div class="example"><h3 id=odbchelper.tuplemethods>Example 3.16. Tuples Have No Methods That Change Them</h3><pre class=screen><samp class=prompt>>>> </samp>t
('a', 'b', 'mpilgrim', 'z', 'example')
<samp class=prompt>>>> </samp>t.append("new") <img id="odbchelper.tuple.2.1" src="images/callouts/1.png" alt="1" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'append'</samp>
<samp class=prompt>>>> </samp>t.remove("z") <img id="odbchelper.tuple.2.2" src="images/callouts/2.png" alt="2" border="0" width="12" height="12">
<samp class="traceback">Traceback (innermost last):
File "&lt;interactive input>", line 1, in ?
AttributeError: 'tuple' object has no attribute 'remove'</samp>
<samp class=prompt>>>> </samp>t.index("example") <img id="odbchelper.tuple.2.3" src="images/callouts/3.png" alt="3" border="0" width="12" height="12">
4
<samp class=prompt>>>> </samp>"z" in t <img id="odbchelper.tuple.2.4" src="images/callouts/4.png" alt="4" border="0" width="12" height="12">
True</pre><div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.2.1><img src="images/callouts/1.png" alt="1" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can't add elements to a tuple. Tuples have no <code>append</code> or <code>extend</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.2.2><img src="images/callouts/2.png" alt="2" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can't remove elements from a tuple. Tuples have no <code>remove</code> or <code>pop</code> method.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.2.3><img src="images/callouts/3.png" alt="3" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can find elements in a tuple, since searching does not change the tuple. In Python 3, tuples have an <code>index</code> method that works like the one on lists.
</td>
</tr>
<tr>
<td width="12" valign="top" align="left"><a href=#odbchelper.tuple.2.4><img src="images/callouts/4.png" alt="4" border="0" width="12" height="12"></a>
</td>
<td valign="top" align="left">You can also use <code>in</code> to see if an element exists in the tuple.
</td>
</tr>
</table>
<p>So what are tuples good for?
<div class="itemizedlist">
<ul>
<li>Tuples are faster than lists. If you're defining a constant set of values and all you're ever going to do with it is iterate
through it, use a tuple instead of a list.
<li>It makes your code safer if you &#8220;write-protect&#8221; data that does not need to be changed. Using a tuple instead of a list is like having an implied <code>assert</code> statement that shows this data is constant, and that special thought (and a specific function) is required to override that.
<li>Remember that I said that <a href="#odbchelper.dictionarytypes" title="Example 3.4. Mixing Datatypes in a Dictionary">dictionary keys</a> can be integers, strings, and &#8220;a few other types&#8221;? Tuples are one of those types. Tuples can be used as keys in a dictionary, but lists can't be used this way. Actually, it's more complicated than that. Dictionary keys must be immutable. Tuples themselves are immutable, but if you
have a tuple of lists, that counts as mutable and isn't safe to use as a dictionary key. Only tuples of strings, numbers,
or other dictionary-safe tuples can be used as dictionary keys.
<li>Tuples are used in string formatting, as you'll see shortly.
</ul>
</div><table id="tip.tuple" class="note" border="0" summary="">
<tr>
<td colspan="2" align="left" valign="top" width="99%">Tuples can be converted into lists, and vice-versa. The built-in <code>tuple</code> function takes a list and returns a tuple with the same elements, and the <code>list</code> function takes a tuple and returns a list. In effect, <code>tuple</code> freezes a list, and <code>list</code> thaws a tuple.
</td>
</tr>
</table>
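<p>The conversion functions and the dictionary-key rule can be tried together. The sketch below is only an illustration; the <code>grid</code> dictionary and its contents are invented for the example.

```python
a_list = ['a', 'b', 'mpilgrim']
a_tuple = tuple(a_list)          # "freeze" the list
print(a_tuple)                   # ('a', 'b', 'mpilgrim')
print(list(a_tuple) == a_list)   # True: "thawing" gives back an equal list

# A tuple of strings or numbers is hashable, so it can be a dictionary key.
grid = {(0, 0): 'origin'}        # made-up sample data
print(grid[(0, 0)])              # origin

# A tuple that contains a list is not hashable and is rejected as a key.
rejected = False
try:
    {(1, [2, 3]): 'value'}
except TypeError as error:
    rejected = True
    print('unusable key:', error)
```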
<div class="itemizedlist">
<h3>Further Reading on Tuples</h3>
<ul>
<li><a href="http://www.ibiblio.org/obp/thinkCSpy/" title="Python book for computer science majors"><i class="citetitle">How to Think Like a Computer Scientist</i></a> teaches about tuples and shows how to <a href=http://www.ibiblio.org/obp/thinkCSpy/chap10.htm>concatenate tuples</a>.
<li><a href=http://www.faqts.com/knowledge-base/index.phtml/fid/199/>Python Knowledge Base</a> shows how to <a href=http://www.faqts.com/knowledge-base/view.phtml/aid/4553/fid/587>sort a tuple</a>.
<li><a href=http://www.python.org/doc/current/tut/tut.html><i class="citetitle">Python Tutorial</i></a> shows how to <a href=http://www.python.org/doc/current/tut/node7.html#SECTION007300000000000000000>define a tuple with one element</a>.
</ul>
-->
<h2 id=sets>Sets</h2>
<p>FIXME
<h2 id=dictionaries>Dictionaries</h2>
<p>One of Python's most important datatypes is the dictionary, which defines one-to-one relationships between keys and values.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>A dictionary in Python is like a hash in Perl 5. In Perl 5, variables that store hashes always start with a <code>%</code> character. In Python, variables can be named anything, and Python keeps track of the datatype internally.
</blockquote>
<p>Creating a dictionary is easy. The syntax is similar to <a href=#sets>sets</a>, but instead of values, you have key-value pairs. Once you have a dictionary, you can look up values by their key.
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>a_dict = {"server":"db.diveintopython3.org", "database":"mysql"}</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'mysql'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["server"]</kbd> <span>&#x2461;</span></a>
'db.diveintopython3.org'
<a><samp class=prompt>>>> </samp><kbd>a_dict["database"]</kbd> <span>&#x2462;</span></a>
'mysql'
<a><samp class=prompt>>>> </samp><kbd>a_dict["db.diveintopython3.org"]</kbd> <span>&#x2463;</span></a>
<samp class=traceback>Traceback (most recent call last):
File "&lt;stdin>", line 1, in &lt;module>
KeyError: 'db.diveintopython3.org'</samp></pre>
<ol>
<li>First, you create a new dictionary with two elements and assign it to the variable <var>a_dict</var>. Each element is a key-value pair, and the whole set of elements is enclosed in curly braces.
<li><code>'server'</code> is a key, and its associated value, referenced by <code>a_dict["server"]</code>, is <code>'db.diveintopython3.org'</code>.
<li><code>'database'</code> is a key, and its associated value, referenced by <code>a_dict["database"]</code>, is <code>'mysql'</code>.
<li>You can get values by key, but you can't get keys by value. So <code>a_dict["server"]</code> is <code>'db.diveintopython3.org'</code>, but <code>a_dict["db.diveintopython3.org"]</code> raises an exception, because <code>'db.diveintopython3.org'</code> is not a key.
</ol>
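<p>When a key might be missing, there are a few standard ways to avoid or handle the <code>KeyError</code>. The sketch below reuses <var>a_dict</var> from the example above; the <code>"port"</code> key and the variable <var>keys_for_value</var> are made up for illustration.

```python
a_dict = {"server": "db.diveintopython3.org", "database": "mysql"}

# Test for a key before using it, or supply a default with get().
print("server" in a_dict)                        # True
print("db.diveintopython3.org" in a_dict)        # False: that is a value, not a key
print(a_dict.get("port", "no such key"))         # no such key

# Catch the exception when a missing key is a real possibility.
try:
    a_dict["db.diveintopython3.org"]
except KeyError as error:
    print("missing key:", error)

# Going from a value back to its key means searching every pair.
keys_for_value = [key for key, value in a_dict.items() if value == "mysql"]
print(keys_for_value)                            # ['database']
```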
<p>Dictionaries do not have any predefined size limit. You can add new key-value pairs to a dictionary at any time, or you can modify the value of an existing key. Continuing from the previous example:
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'mysql'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["database"] = "blog"</kbd> <span>&#x2460;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["user"] = "mark"</kbd> <span>&#x2461;</span></a>
<a><samp class=prompt>>>> </samp><kbd>a_dict</kbd> <span>&#x2462;</span></a>
<samp>{'server': 'db.diveintopython3.org', 'user': 'mark', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["user"] = "dora"</kbd> <span>&#x2463;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp>{'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}</samp>
<a><samp class=prompt>>>> </samp><kbd>a_dict["User"] = "mark"</kbd> <span>&#x2464;</span></a>
<samp class=prompt>>>> </samp><kbd>a_dict</kbd>
<samp>{'User': 'mark', 'server': 'db.diveintopython3.org', 'user': 'dora', 'database': 'blog'}</samp></pre>
<ol>
<li>You can not have duplicate keys in a dictionary. Assigning a value to an existing key will wipe out the old value.
<li>You can add new key-value pairs at any time. This syntax is identical to modifying existing values.
<li>The new dictionary item (key <code>'user'</code>, value <code>'mark'</code>) appears to be in the middle. In fact, it was just a coincidence that the elements appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now.
<li>Assigning a value to an existing dictionary key simply replaces the old value with the new one.
<li>Will this change the value of the <code>user</code> key back to "mark"? No! Look at the key closely &mdash; that's a capital <kbd>U</kbd> in <kbd>"User"</kbd>. Dictionary keys are case-sensitive, so this statement is creating a new key-value pair, not overwriting an existing one. It may look similar to you, but as far as Python is concerned, it's completely different.
</ol>
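<p>The case-sensitivity of keys is easy to confirm, and one possible way to sidestep it is to normalize keys before storing them. That normalization is an assumption made for this sketch, not something the text prescribes; the <var>normalized</var> dictionary is hypothetical.

```python
a_dict = {"server": "db.diveintopython3.org", "database": "blog"}
a_dict["user"] = "dora"
a_dict["User"] = "mark"          # a different key, so a fourth pair is added
print(len(a_dict))               # 4
print(a_dict["user"], a_dict["User"])   # dora mark

# One possible convention (an assumption for this sketch):
# lowercase every key before storing it, so "User" and "user" collide.
normalized = {}
for key, value in [("user", "dora"), ("User", "mark")]:
    normalized[key.lower()] = value
print(normalized)                # {'user': 'mark'}
```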
<p>Dictionaries aren't just for strings. Dictionary values can be any datatype, including integers, booleans, arbitrary objects, or even other dictionaries. And within a single dictionary, the values don't all need to be the same type; you can mix and match as needed. Dictionary keys are more restricted, but they can be strings, integers, and a few other types. You can also mix and match key datatypes within a dictionary.
<p>In fact, you've already seen a dictionary with non-string keys and values, in <a href=your-first-python-program.html#divingin>your first Python program</a>.
<pre><code>SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}</code></pre>
<p>Let's tear that apart in the interactive shell.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),</kbd>
<samp class=prompt>... </samp><kbd> 1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}</kbd>
<a><samp class=prompt>>>> </samp><kbd>len(SUFFIXES)</kbd> <span>&#x2460;</span></a>
<samp>2</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1000]</kbd> <span>&#x2461;</span></a>
<samp>('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB')</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1024]</kbd> <span>&#x2462;</span></a>
<samp>('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')</samp>
<a><samp class=prompt>>>> </samp><kbd>SUFFIXES[1000][3]</kbd> <span>&#x2463;</span></a>
<samp>'TB'</samp></pre>
<ol>
<li>As with <a href=#lists>lists</a> and <a href=#sets>sets</a>, the <code>len()</code> function gives you the number of items in a dictionary.
<li><code>1000</code> is a key in the <code>SUFFIXES</code> dictionary; its value is a tuple of eight items (eight strings, to be precise).
<li>Similarly, <code>1024</code> is a key in the <code>SUFFIXES</code> dictionary; its value is also a tuple of eight items.
<li>Since <code>SUFFIXES[1000]</code> is a tuple, you can address individual items in the tuple by their 0-based index.
</ol>
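<p>Since the values here are tuples, a loop over the dictionary can unpack each key with its tuple, and keys of different types can live side by side. A short sketch; the <var>mixed</var> dictionary is invented for illustration.

```python
SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
            1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}

# Keys are integers and values are tuples, so lookups chain naturally.
for multiple, suffixes in SUFFIXES.items():
    print(multiple, len(suffixes), suffixes[0], suffixes[-1])

print(SUFFIXES[1024][3])          # TiB

# Keys of different types can share one dictionary.
mixed = {1000: 'decimal', 'binary': 1024, (1, 2): ['a', 'list', 'value']}
print(mixed[(1, 2)][0])           # a
```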
<h2 id=none><code>None</code></h2>
<p><code>None</code> is a special constant in Python. It is a null value. <code>None</code> is not <code>False</code>; it is not <code>0</code>; it is not an empty string. Comparing <code>None</code> to anything other than <code>None</code> will always return <code>False</code>.
<p><code>None</code> is the only null value. It has its own datatype (<code>NoneType</code>). You can assign <code>None</code> to any variable, but you can not create other <code>NoneType</code> objects. All variables whose value is <code>None</code> are equal to each other.
<pre class=screen>
<samp class=prompt>>>> </samp><kbd>type(None)</kbd>
<samp>&lt;class 'NoneType'></samp>
<samp class="prompt">>>> </samp><kbd>None == False</kbd>
<samp class=prompt>>>> </samp><kbd>None == False</kbd>
<samp>False</samp>
<samp class="prompt">>>> </samp><kbd>None == 0</kbd>
<samp class=prompt>>>> </samp><kbd>None == 0</kbd>
<samp>False</samp>
<samp class="prompt">>>> </samp><kbd>None == ''</kbd>
<samp class=prompt>>>> </samp><kbd>None == ''</kbd>
<samp>False</samp>
<samp class="prompt">>>> </samp><kbd>None == None</kbd>
<samp class=prompt>>>> </samp><kbd>None == None</kbd>
<samp>True</samp>
<samp class="prompt">>>> </samp><kbd>x = None</kbd>
<samp class="prompt">>>> </samp><kbd>x == None</kbd>
<samp class=prompt>>>> </samp><kbd>x = None</kbd>
<samp class=prompt>>>> </samp><kbd>x == None</kbd>
<samp>True</samp>
<samp class="prompt">>>> </samp><kbd>y = None</kbd>
<samp class="prompt">>>> </samp><kbd>x == y</kbd>
<samp class=prompt>>>> </samp><kbd>y = None</kbd>
<samp class=prompt>>>> </samp><kbd>x == y</kbd>
<samp>True</samp>
</pre>
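<p>The same checks can be collected into a short script. This is only an illustrative sketch: the variable names <code>placeholder</code> and <code>other</code> are hypothetical and not part of the chapter. It also uses the <code>is</code> operator to show that two variables set to <code>None</code> refer to the same single object, which is why they compare equal.

```python
# Illustrative sketch; `placeholder` and `other` are hypothetical names.
placeholder = None
other = None

# None has its own datatype, NoneType.
none_type_name = type(None).__name__

# None is not equal to False, 0, or an empty string.
comparisons = [None == False, None == 0, None == '']

# Both variables refer to the one and only None object.
same_object = placeholder is other

print(none_type_name)
print(comparisons)
print(same_object)
```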
<h3 id="furtherreading">Further reading</h3>
<h3 id=furtherreading>Further reading</h3>
<ul>
<li>fractions
<li>math module
@@ -257,8 +652,6 @@ KeyError: 'db.diveintopython3.org'</samp></pre>
<li>links to appendix
<li>...etc...
</ul>
<p class="c">&copy; 2001-4, 2009 <span>&#x2133;</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript" src="dip3.js"></script>
</body>
</html>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
<script type=text/javascript src=jquery.js></script>
<script type=text/javascript src=dip3.js></script>
File diff suppressed because it is too large
+21 -5
@@ -1,13 +1,29 @@
#!/bin/sh
set -x
# make build directory and copy original files there for preflighting
rm -rf build
mkdir build
cp *.html *.py *.txt .htaccess build/
cp *.html *.py *.txt .htaccess *.js *.css build/
# replace local jquery reference with Google API loader
sed -i -e "s|jquery\.js|http://www.google.com/jsapi|g" build/*.html
sed -i -e "s|//google\.|google.|g" build/dip3.js
sed -i -e "s|//}.; /\* google\..*|});|g" build/dip3.js
# minimize JS and CSS
revision=`hg log|grep changeset|cut -d":" -f3|head -1`
java -jar yuicompressor-2.4.2.jar dip3.js > build/dip3.$revision.min.js
java -jar yuicompressor-2.4.2.jar dip3.css > build/dip3.$revision.min.css
java -jar yuicompressor-2.4.2.jar build/dip3.js > build/dip3.$revision.min.js
java -jar yuicompressor-2.4.2.jar build/dip3.css > build/dip3.$revision.min.css
#rm build/dip3.js
#rm build/dip3.css
sed -i -e "s|dip3\.js|http://wearehugh.com/dip3/dip3.${revision}.min.js|g" build/*.html
sed -i -e "s|dip3\.css|http://wearehugh.com/dip3/dip3.${revision}.min.css|g" build/*.html
# set file permissions for public consumption
chmod 644 build/*.html build/*.css build/*.js build/*.py build/*.txt build/.htaccess
rsync -essh -avzP --delete --delete-after build/*.min.css build/*.min.js diveintomark.org:~/web/wearehugh.com/dip3/
rsync -essh -avzP build/*.html build/*.py build/*.txt build/.htaccess diveintomark.org:~/web/diveintopython3.org/
# and push to production
#rsync -essh -avzP --delete --delete-after build/*.min.css build/*.min.js diveintomark.org:~/web/wearehugh.com/dip3/
#rsync -essh -avzP build/*.html build/*.py build/*.txt build/.htaccess diveintomark.org:~/web/diveintopython3.org/
+114 -117
@@ -1,12 +1,12 @@
<!DOCTYPE html>
<html lang="en">
<html lang=en>
<head>
<meta charset="utf-8">
<meta charset=utf-8>
<title>Table of contents - Dive Into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
h1:before{content:""}
ol,ul{font-weight:bold}
li ol{font-weight:normal}
@@ -14,11 +14,10 @@ ul{list-style:none;margin:0;padding:0}
ul li ol{margin:0;padding:0 0 0 2.5em}
</style>
</head>
<body>
<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8"><input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
<p class="nav">You are here: <a href="/">Home</a> <span>&#8227;</span> Dive Into Python 3 <span>&#8227;</span>
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8><input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> Dive Into Python 3 <span>&#8227;</span>
<h1>Table of contents</h1>
<ol start="0">
<ol start=0>
<li>Installing Python
<ol>
<li>Python on Windows
@@ -27,49 +26,49 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>Python from source
<li>The interactive shell
</ol>
<li><a href="your-first-python-program.html">Your first Python program</a>
<li><a href=your-first-python-program.html>Your first Python program</a>
<ol>
<li><a href="your-first-python-program.html#divingin">Diving in</a>
<li><a href="your-first-python-program.html#declaringfunctions">Declaring functions</a>
<li><a href="your-first-python-program.html#readability">Writing readable code</a>
<li><a href=your-first-python-program.html#divingin>Diving in</a>
<li><a href=your-first-python-program.html#declaringfunctions>Declaring functions</a>
<li><a href=your-first-python-program.html#readability>Writing readable code</a>
<ol>
<li><a href="your-first-python-program.html#docstrings">Docstrings</a>
<li><a href="your-first-python-program.html#functionannotations">Function annotations</a>
<li><a href="your-first-python-program.html#styleconventions">Style conventions</a>
<li><a href=your-first-python-program.html#docstrings>Docstrings</a>
<li><a href=your-first-python-program.html#functionannotations>Function annotations</a>
<li><a href=your-first-python-program.html#styleconventions>Style conventions</a>
</ol>
<li><a href="your-first-python-program.html#everythingisanobject">Everything is an object</a>
<li><a href=your-first-python-program.html#everythingisanobject>Everything is an object</a>
<ol>
<li><a href="your-first-python-program.html#importsearchpath">The <code>import</code> search path</a>
<li><a href="your-first-python-program.html#whatsanobject">What's an object?</a>
<li><a href=your-first-python-program.html#importsearchpath>The <code>import</code> search path</a>
<li><a href=your-first-python-program.html#whatsanobject>What's an object?</a>
</ol>
<li><a href="your-first-python-program.html#indentingcode">Indenting code</a>
<li><a href="your-first-python-program.html#runningscripts">Running scripts</a>
<li><a href="your-first-python-program.html#furtherreading">Further reading</a>
<li><a href=your-first-python-program.html#indentingcode>Indenting code</a>
<li><a href=your-first-python-program.html#runningscripts>Running scripts</a>
<li><a href=your-first-python-program.html#furtherreading>Further reading</a>
</ol>
<li><a href="native-datatypes.html">Native Python datatypes</a>
<li><a href=native-datatypes.html>Native Python datatypes</a>
<ol>
<li><a href="native-datatypes.html#divingin">Diving in</a>
<li><a href="native-datatypes.html#booleans">Booleans</a>
<li><a href="native-datatypes.html#numbers">Numbers</a>
<li><a href=native-datatypes.html#divingin>Diving in</a>
<li><a href=native-datatypes.html#booleans>Booleans</a>
<li><a href=native-datatypes.html#numbers>Numbers</a>
<!--
<ol>
<li><a href="native-datatypes.html#integers">Integers</a>
<li><a href="native-datatypes.html#floats">Floating point numbers</a>
<li><a href="native-datatypes.html#fractions">Fractions</a>
<li><a href="native-datatypes.html#complexnumbers">Complex numbers</a>
<li><a href="native-datatypes.html#numberoperations">Common operations on numbers</a>
<li><a href="native-datatypes.html#math">The <code>math</code> module</a>
<li><a href=native-datatypes.html#integers>Integers</a>
<li><a href=native-datatypes.html#floats>Floating point numbers</a>
<li><a href=native-datatypes.html#fractions>Fractions</a>
<li><a href=native-datatypes.html#complexnumbers>Complex numbers</a>
<li><a href=native-datatypes.html#numberoperations>Common operations on numbers</a>
<li><a href=native-datatypes.html#math>The <code>math</code> module</a>
</ol>
-->
<li><a href="native-datatypes.html#lists">Lists</a>
<li><a href="native-datatypes.html#sets">Sets</a>
<li><a href="native-datatypes.html#dictionaries">Dictionaries</a>
<li><a href="native-datatypes.html#none"><code>None</code></a>
<li><a href="native-datatypes.html#furtherreading">Further reading</a>
<li><a href=native-datatypes.html#lists>Lists</a>
<li><a href=native-datatypes.html#sets>Sets</a>
<li><a href=native-datatypes.html#dictionaries>Dictionaries</a>
<li><a href=native-datatypes.html#none><code>None</code></a>
<li><a href=native-datatypes.html#furtherreading>Further reading</a>
</ol>
<li>Strings
<ol>
<li>There ain't no such thing as "plain text"
<li>There ain't no such thing as "plain text"
<ol>
<li>A brief history of character encoding
<li>What's a character?
@@ -112,7 +111,7 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>...stuff about decorators...
<li>...stuff about importing modules...
<ol>
<li>...mention why "from module import *" is only allowed at module level
<li>...mention why "from module import *" is only allowed at module level
</ol>
</ol>
<li>Exceptions
@@ -253,7 +252,7 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
</ol>
<li>Creating graphics with the Python Imaging Library
<ol>
<li>...<a href="http://www.reddit.com/r/Python/comments/7sj39/dive_into_python_3/c07b3cq">will likely get ported in time</a>...
<li>...<a href=http://www.reddit.com/r/Python/comments/7sj39/dive_into_python_3/c07b3cq>will likely get ported in time</a>...
</ol>
<li>Where to go from here (tentative because most of these have not been ported to Python 3 yet)
<ol>
@@ -267,93 +266,93 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>PyPy
<li>Stackless Python
</ol>
<li><a href="case-study-porting-chardet-to-python-3.html">Case study: porting <code>chardet</code> to Python 3</a>
<li><a href=case-study-porting-chardet-to-python-3.html>Case study: porting <code>chardet</code> to Python 3</a>
<ol>
<li><a href="case-study-porting-chardet-to-python-3.html#divingin">Introducing <code class="filename">chardet</code>: a mini-FAQ</a>
<li><a href=case-study-porting-chardet-to-python-3.html#divingin>Introducing <code class=filename>chardet</code>: a mini-FAQ</a>
<ol>
<li><a href="case-study-porting-chardet-to-python-3.html#faq.what">What is character encoding auto-detection?</a>
<li><a href="case-study-porting-chardet-to-python-3.html#faq.impossible">Isn't that impossible?</a>
<li><a href="case-study-porting-chardet-to-python-3.html#faq.who">Who wrote this detection algorithm?</a>
<li><a href="case-study-porting-chardet-to-python-3.html#faq.yippie">Yippie! Screw the standards, I'll just auto-detect everything!</a>
<li><a href="case-study-porting-chardet-to-python-3.html#faq.why">Why bother with auto-detection if it's slow, inaccurate, and non-standard?</a>
<li><a href=case-study-porting-chardet-to-python-3.html#faq.what>What is character encoding auto-detection?</a>
<li><a href=case-study-porting-chardet-to-python-3.html#faq.impossible>Isn't that impossible?</a>
<li><a href=case-study-porting-chardet-to-python-3.html#faq.who>Who wrote this detection algorithm?</a>
<li><a href=case-study-porting-chardet-to-python-3.html#faq.yippie>Yippie! Screw the standards, I'll just auto-detect everything!</a>
<li><a href=case-study-porting-chardet-to-python-3.html#faq.why>Why bother with auto-detection if it's slow, inaccurate, and non-standard?</a>
</ol>
<li><a href="case-study-porting-chardet-to-python-3.html#divingin2">Diving in</a>
<li><a href=case-study-porting-chardet-to-python-3.html#divingin2>Diving in</a>
<ol>
<li><a href="case-study-porting-chardet-to-python-3.html#how.bom"><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></a>
<li><a href="case-study-porting-chardet-to-python-3.html#how.esc">Escaped encodings</a>
<li><a href="case-study-porting-chardet-to-python-3.html#how.mb">Multi-byte encodings</a>
<li><a href="case-study-porting-chardet-to-python-3.html#how.sb">Single-byte encodings</a>
<li><a href="case-study-porting-chardet-to-python-3.html#how.windows1252"><code>windows-1252</code></a>
<li><a href=case-study-porting-chardet-to-python-3.html#how.bom><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></a>
<li><a href=case-study-porting-chardet-to-python-3.html#how.esc>Escaped encodings</a>
<li><a href=case-study-porting-chardet-to-python-3.html#how.mb>Multi-byte encodings</a>
<li><a href=case-study-porting-chardet-to-python-3.html#how.sb>Single-byte encodings</a>
<li><a href=case-study-porting-chardet-to-python-3.html#how.windows1252><code>windows-1252</code></a>
</ol>
<li><a href="case-study-porting-chardet-to-python-3.html#running2to3">Running <code class="filename">2to3</code></a>
<li><a href="case-study-porting-chardet-to-python-3.html#manual">Fixing what <code class="filename">2to3</code> can't</a>
<li><a href=case-study-porting-chardet-to-python-3.html#running2to3>Running <code class=filename>2to3</code></a>
<li><a href=case-study-porting-chardet-to-python-3.html#manual>Fixing what <code class=filename>2to3</code> can't</a>
<ol>
<li><a href="case-study-porting-chardet-to-python-3.html#falseisinvalidsyntax"><code>False</code> is invalid syntax</a>
<li><a href="case-study-porting-chardet-to-python-3.html#nomodulenamedconstants">No module named <code class="filename">constants</code></a>
<li><a href="case-study-porting-chardet-to-python-3.html#namefileisnotdefined">Name '<var>file</var>' is not defined</a>
<li><a href="case-study-porting-chardet-to-python-3.html#cantuseastringpattern">Can't use a string pattern on a bytes-like object</a>
<li><a href="case-study-porting-chardet-to-python-3.html#cantconvertbytesobject">Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
<li><a href=case-study-porting-chardet-to-python-3.html#falseisinvalidsyntax><code>False</code> is invalid syntax</a>
<li><a href=case-study-porting-chardet-to-python-3.html#nomodulenamedconstants>No module named <code class=filename>constants</code></a>
<li><a href=case-study-porting-chardet-to-python-3.html#namefileisnotdefined>Name '<var>file</var>' is not defined</a>
<li><a href=case-study-porting-chardet-to-python-3.html#cantuseastringpattern>Can't use a string pattern on a bytes-like object</a>
<li><a href=case-study-porting-chardet-to-python-3.html#cantconvertbytesobject>Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
</ol>
</ol>
</ol>
<ul>
<li><a href="porting-code-to-python-3-with-2to3.html">Appendix A. Porting code to Python 3 with <code class="filename">2to3</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html>Appendix A. Porting code to Python 3 with <code class=filename>2to3</code></a>
<ol>
<li><a href="porting-code-to-python-3-with-2to3.html#divingin">Diving in</a>
<li><a href="porting-code-to-python-3-with-2to3.html#print"><code>print</code> statement</a>
<li><a href="porting-code-to-python-3-with-2to3.html#unicodeliteral">Unicode string literals</a>
<li><a href="porting-code-to-python-3-with-2to3.html#unicode"><code>unicode()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#long"><code>long</code> data type</a>
<li><a href="porting-code-to-python-3-with-2to3.html#ne">&lt;> comparison</a>
<li><a href="porting-code-to-python-3-with-2to3.html#has_key"><code>has_key()</code> dictionary method</a>
<li><a href="porting-code-to-python-3-with-2to3.html#dict">Dictionary methods that return lists</a>
<li><a href="porting-code-to-python-3-with-2to3.html#imports">Modules that have been renamed or reorganized</a>
<li><a href=porting-code-to-python-3-with-2to3.html#divingin>Diving in</a>
<li><a href=porting-code-to-python-3-with-2to3.html#print><code>print</code> statement</a>
<li><a href=porting-code-to-python-3-with-2to3.html#unicodeliteral>Unicode string literals</a>
<li><a href=porting-code-to-python-3-with-2to3.html#unicode><code>unicode()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#long><code>long</code> data type</a>
<li><a href=porting-code-to-python-3-with-2to3.html#ne>&lt;> comparison</a>
<li><a href=porting-code-to-python-3-with-2to3.html#has_key><code>has_key()</code> dictionary method</a>
<li><a href=porting-code-to-python-3-with-2to3.html#dict>Dictionary methods that return lists</a>
<li><a href=porting-code-to-python-3-with-2to3.html#imports>Modules that have been renamed or reorganized</a>
<ol>
<li><a href="porting-code-to-python-3-with-2to3.html#http"><code>http</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#urllib"><code>urllib</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#dbm"><code>dbm</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#xmlrpc"><code>xmlrpc</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#othermodules">Other modules</a>
<li><a href=porting-code-to-python-3-with-2to3.html#http><code>http</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#urllib><code>urllib</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#dbm><code>dbm</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#xmlrpc><code>xmlrpc</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#othermodules>Other modules</a>
</ol>
<li><a href="porting-code-to-python-3-with-2to3.html#import">Relative imports within a package</a>
<li><a href="porting-code-to-python-3-with-2to3.html#next"><code>next()</code> iterator method</a>
<li><a href="porting-code-to-python-3-with-2to3.html#filter"><code>filter()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#map"><code>map()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#reduce"><code>reduce()</code> global function</a> (3.1+)
<li><a href="porting-code-to-python-3-with-2to3.html#apply"><code>apply()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#intern"><code>intern()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#exec"><code>exec</code> statement</a>
<li><a href="porting-code-to-python-3-with-2to3.html#execfile"><code>execfile</code> statement</a> (3.1+)
<li><a href="porting-code-to-python-3-with-2to3.html#repr"><code>repr</code> literals (backticks)</a>
<li><a href="porting-code-to-python-3-with-2to3.html#except"><code>try...except</code> statement</a>
<li><a href="porting-code-to-python-3-with-2to3.html#raise"><code>raise</code> statement</a>
<li><a href="porting-code-to-python-3-with-2to3.html#throw"><code>throw</code> method on generators</a>
<li><a href="porting-code-to-python-3-with-2to3.html#xrange"><code>xrange()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#raw_input"><code>raw_input()</code> and <code>input()</code> global functions</a>
<li><a href="porting-code-to-python-3-with-2to3.html#funcattrs"><code>func_*</code> function attributes</a>
<li><a href="porting-code-to-python-3-with-2to3.html#xreadlines"><code>xreadlines()</code> I/O method</a>
<li><a href="porting-code-to-python-3-with-2to3.html#tuple_params"><code>lambda</code> functions with multiple parameters</a>
<li><a href="porting-code-to-python-3-with-2to3.html#methodattrs">Special method attributes</a>
<li><a href="porting-code-to-python-3-with-2to3.html#nonzero"><code>__nonzero__</code> special class attribute</a>
<li><a href="porting-code-to-python-3-with-2to3.html#numliterals">Octal literals</a>
<li><a href="porting-code-to-python-3-with-2to3.html#renames"><code>sys.maxint</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#callable"><code>callable()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#zip"><code>zip()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#standarderror"><code>StandardError()</code> exception</a>
<li><a href="porting-code-to-python-3-with-2to3.html#types"><code>types</code> module constants</a>
<li><a href="porting-code-to-python-3-with-2to3.html#isinstance"><code>isinstance()</code> global function</a> (3.1+)
<li><a href="porting-code-to-python-3-with-2to3.html#basestring"><code>basestring</code> datatype</a>
<li><a href="porting-code-to-python-3-with-2to3.html#itertools"><code>itertools</code> module</a>
<li><a href="porting-code-to-python-3-with-2to3.html#sys_exc"><code>sys.exc_type</code>, <code>sys.exc_value</code>, <code>sys.exc_traceback</code></a>
<li><a href="porting-code-to-python-3-with-2to3.html#paren">List comprehensions over tuples</a>
<li><a href="porting-code-to-python-3-with-2to3.html#getcwdu"><code>os.getcwdu()</code> function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#metaclass">Metaclasses</a>
<li><a href="porting-code-to-python-3-with-2to3.html#nitpick">Matters of style</a>
<li><a href=porting-code-to-python-3-with-2to3.html#import>Relative imports within a package</a>
<li><a href=porting-code-to-python-3-with-2to3.html#next><code>next()</code> iterator method</a>
<li><a href=porting-code-to-python-3-with-2to3.html#filter><code>filter()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#map><code>map()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#reduce><code>reduce()</code> global function</a> (3.1+)
<li><a href=porting-code-to-python-3-with-2to3.html#apply><code>apply()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#intern><code>intern()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#exec><code>exec</code> statement</a>
<li><a href=porting-code-to-python-3-with-2to3.html#execfile><code>execfile</code> statement</a> (3.1+)
<li><a href=porting-code-to-python-3-with-2to3.html#repr><code>repr</code> literals (backticks)</a>
<li><a href=porting-code-to-python-3-with-2to3.html#except><code>try...except</code> statement</a>
<li><a href=porting-code-to-python-3-with-2to3.html#raise><code>raise</code> statement</a>
<li><a href=porting-code-to-python-3-with-2to3.html#throw><code>throw</code> method on generators</a>
<li><a href=porting-code-to-python-3-with-2to3.html#xrange><code>xrange()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#raw_input><code>raw_input()</code> and <code>input()</code> global functions</a>
<li><a href=porting-code-to-python-3-with-2to3.html#funcattrs><code>func_*</code> function attributes</a>
<li><a href=porting-code-to-python-3-with-2to3.html#xreadlines><code>xreadlines()</code> I/O method</a>
<li><a href=porting-code-to-python-3-with-2to3.html#tuple_params><code>lambda</code> functions with multiple parameters</a>
<li><a href=porting-code-to-python-3-with-2to3.html#methodattrs>Special method attributes</a>
<li><a href=porting-code-to-python-3-with-2to3.html#nonzero><code>__nonzero__</code> special class attribute</a>
<li><a href=porting-code-to-python-3-with-2to3.html#numliterals>Octal literals</a>
<li><a href=porting-code-to-python-3-with-2to3.html#renames><code>sys.maxint</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#callable><code>callable()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#zip><code>zip()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#standarderror><code>StandardError()</code> exception</a>
<li><a href=porting-code-to-python-3-with-2to3.html#types><code>types</code> module constants</a>
<li><a href=porting-code-to-python-3-with-2to3.html#isinstance><code>isinstance()</code> global function</a> (3.1+)
<li><a href=porting-code-to-python-3-with-2to3.html#basestring><code>basestring</code> datatype</a>
<li><a href=porting-code-to-python-3-with-2to3.html#itertools><code>itertools</code> module</a>
<li><a href=porting-code-to-python-3-with-2to3.html#sys_exc><code>sys.exc_type</code>, <code>sys.exc_value</code>, <code>sys.exc_traceback</code></a>
<li><a href=porting-code-to-python-3-with-2to3.html#paren>List comprehensions over tuples</a>
<li><a href=porting-code-to-python-3-with-2to3.html#getcwdu><code>os.getcwdu()</code> function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#metaclass>Metaclasses</a>
<li><a href=porting-code-to-python-3-with-2to3.html#nitpick>Matters of style</a>
<ol>
<li><a href="porting-code-to-python-3-with-2to3.html#set_literal"><code>set()</code> literals</a>
<li><a href="porting-code-to-python-3-with-2to3.html#buffer"><code>buffer()</code> global function</a>
<li><a href="porting-code-to-python-3-with-2to3.html#wscomma">Whitespace around commas</a>
<li><a href="porting-code-to-python-3-with-2to3.html#idioms">Common idioms</a>
<li><a href=porting-code-to-python-3-with-2to3.html#set_literal><code>set()</code> literals</a>
<li><a href=porting-code-to-python-3-with-2to3.html#buffer><code>buffer()</code> global function</a>
<li><a href=porting-code-to-python-3-with-2to3.html#wscomma>Whitespace around commas</a>
<li><a href=porting-code-to-python-3-with-2to3.html#idioms>Common idioms</a>
</ol>
</ol>
</ul>
@@ -367,6 +366,4 @@ ul li ol{margin:0;padding:0 0 0 2.5em}
<li>Dictionary comprehensions
<li>Views (several dictionary methods return them, they're dynamic, update when the dictionary changes, etc.)
</ul>
<p class="c">&copy; 2001-4, 2009 <span>&#x2133;</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>
</body>
</html>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
+102 -105
@@ -1,48 +1,47 @@
<!DOCTYPE html>
<html lang="en">
<html lang=en>
<head>
<meta charset="utf-8">
<meta charset=utf-8>
<title>Your first Python program - Dive into Python 3</title>
<link rel="stylesheet" type="text/css" href="dip3.css">
<link rel="shortcut icon" href="data:image/ico,">
<link rel="alternate" type="application/atom+xml" href="http://hg.diveintopython3.org/atom-log">
<style type="text/css">
<link rel=stylesheet type=text/css href=dip3.css>
<link rel="shortcut icon" href=data:image/ico,>
<link rel=alternate type=application/atom+xml href=http://hg.diveintopython3.org/atom-log>
<style type=text/css>
body{counter-reset:h1 1}
</style>
</head>
<body>
<p class="skip"><a href="#divingin">skip to main content</a>
<form action="http://www.google.com/cse" id="search"><div><input type="hidden" name="cx" value="014021643941856155761:l5eihuescdw"><input type="hidden" name="ie" value="UTF-8">&nbsp;<input name="q" size="31">&nbsp;<input type="submit" name="sa" value="Search"></div></form>
<p class="nav">You are here: <a href="/">Home</a> <span>&#8227;</span> <a href="table-of-contents.html">Dive Into Python 3</a> <span>&#8227;</span>
<p class=skip><a href=#divingin>skip to main content</a>
<form action=http://www.google.com/cse id=search><div><input type=hidden name=cx value=014021643941856155761:l5eihuescdw><input type=hidden name=ie value=UTF-8>&nbsp;<input name=q size=31>&nbsp;<input type=submit name=sa value=Search></div></form>
<p class=nav>You are here: <a href=/>Home</a> <span>&#8227;</span> <a href=table-of-contents.html>Dive Into Python 3</a> <span>&#8227;</span>
<h1>Your first Python program</h1>
<blockquote class="q">
<p><span>&#x275D;</span> Don&#8217;t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. <span>&#x275E;</span><br>&mdash; <cite>Ven. Henepola Gunaratana</cite>
<blockquote class=q>
<p><span>&#x275D;</span> Don&#8217;t bury your burden in saintly silence. You have a problem? Great. Rejoice, dive in, and investigate. <span>&#x275E;</span><br>&mdash; <cite>Ven. Henepola Gunaratana</cite>
</blockquote>
<ol>
<li><a href="#divingin">Diving in</a>
<li><a href="#declaringfunctions">Declaring functions</a>
<li><a href=#divingin>Diving in</a>
<li><a href=#declaringfunctions>Declaring functions</a>
<ol>
<li><a href="#datatypes">How Python's datatypes compare to other programming languages</a>
<li><a href=#datatypes>How Python's datatypes compare to other programming languages</a>
</ol>
<li><a href="#readability">Writing readable code</a>
<li><a href=#readability>Writing readable code</a>
<ol>
<li><a href="#docstrings">Docstrings</a>
<li><a href="#functionannotations">Function annotations</a>
<li><a href="#styleconventions">Style conventions</a>
<li><a href=#docstrings>Docstrings</a>
<li><a href=#functionannotations>Function annotations</a>
<li><a href=#styleconventions>Style conventions</a>
</ol>
<li><a href="#everythingisanobject">Everything is an object</a>
<li><a href=#everythingisanobject>Everything is an object</a>
<ol>
<li><a href="#importsearchpath">The <code>import</code> search path</a>
<li><a href="#whatsanobject">What's an object?</a>
<li><a href=#importsearchpath>The <code>import</code> search path</a>
<li><a href=#whatsanobject>What's an object?</a>
</ol>
<li><a href="#indentingcode">Indenting code</a>
<li><a href="#runningscripts">Running scripts</a>
<li><a href="#furtherreading">Further reading</a>
<li><a href=#indentingcode>Indenting code</a>
<li><a href=#runningscripts>Running scripts</a>
<li><a href=#furtherreading>Further reading</a>
</ol>
<h2 id="divingin">Diving in</h2>
<p class="fancy">You know how other books go on and on about programming fundamentals and finally work up to building something useful? Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<h2 id=divingin>Diving in</h2>
<p class=fancy>You know how other books go on and on about programming fundamentals and finally work up to building something useful? Let's skip all that. Here is a complete, working Python program. It probably makes absolutely no sense to you. Don't worry about that, because you're going to dissect it line by line. But read through it first and see what, if anything, you can make of it.
<!-- FIXME: download link -->
<p class="download">[<a href="humansize.py">download</a>]</p>
<p class=download>[<a href=humansize.py>download</a>]</p>
<pre><code>SUFFIXES = {1000: ('KB', 'MB', 'GB', 'TB', 'PB', 'EB', 'ZB', 'YB'),
1024: ('KiB', 'MiB', 'GiB', 'TiB', 'PiB', 'EiB', 'ZiB', 'YiB')}
@@ -71,54 +70,54 @@ def approximate_size(size, a_kilobyte_is_1024_bytes=True):
if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<p>Now let's run this program on the command line. On Windows, it will look something like this:
<pre class="screen"><samp class="prompt">c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<p>Now let's run this program on the command line. On Windows, it will look something like this:
<pre class=screen><samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<p>On Mac OS X or Linux, it would look something like this:
<pre class="screen"><samp class="prompt">you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<pre class=screen><samp class=prompt>you@localhost:~$ </samp><kbd>python3 humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
<!-- FIXME: this would be a good place to explain what the program, you know, actually does -->
<h2 id="declaringfunctions">Declaring functions</h2>
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
<h2 id=declaringfunctions>Declaring functions</h2>
<p>Python has functions like most other languages, but it does not have separate header files like <abbr>C++</abbr> or <code>interface</code>/<code>implementation</code> sections like Pascal. When you need a function, just declare it, like this:
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):</code></pre>
<p>The keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas.
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
<blockquote class="note">
<p><span>&#x261E;</span>In some languages, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
<p>The keyword <code>def</code> starts the function declaration, followed by the function name, followed by the arguments in parentheses. Multiple arguments are separated with commas.
<p>Also note that the function doesn't define a return datatype. Python functions do not specify the datatype of their return value; they don't even specify whether or not they return a value. (In fact, every Python function returns a value; if the function ever executes a <code>return</code> statement, it will return that value, otherwise it will return <code>None</code>, the Python null value.)
<blockquote class=note>
<p><span>&#x261E;</span>In some languages, functions (that return a value) start with <code>function</code>, and subroutines (that do not return a value) start with <code>sub</code>. There are no subroutines in Python. Everything is a function, all functions return a value (even if it's <code>None</code>), and all functions start with <code>def</code>.
</blockquote>
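<p>A minimal sketch of the return-value rule described above. The names <code>log_size</code> and <code>double_size</code> are hypothetical and not part of <code>humansize.py</code>; they only illustrate that a function without a <code>return</code> statement still hands back <code>None</code>.

```python
def log_size(size):
    # No return statement here, so Python returns None implicitly.
    print('size is', size)


def double_size(size):
    # An explicit return statement hands back a value.
    return size * 2


result_without_return = log_size(10)
result_with_return = double_size(10)
print(result_without_return)
print(result_with_return)
```

<p>Both are declared with <code>def</code>; the only difference is whether a <code>return</code> statement is ever executed.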
<p>The <code>approximate_size</code> function takes the two arguments &mdash; <var>size</var> and <var>a_kilobyte_is_1024_bytes</var> &mdash; but neither argument specifies a datatype. (As you might guess from the <code>=True</code> syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<p>The <code>approximate_size</code> function takes the two arguments &mdash; <var>size</var> and <var>a_kilobyte_is_1024_bytes</var> &mdash; but neither argument specifies a datatype. (As you might guess from the <code>=True</code> syntax, the second argument is a boolean. You'll learn what that syntax does in [FIXME xref-was-#apihelper].) In Python, variables are never explicitly typed. Python figures out what type a variable is and keeps track of it internally.
<blockquote class="note compare java">
<p><span>&#x261E;</span>In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
<p><span>&#x261E;</span>In Java and other statically-typed languages, you must specify the datatype of the function return value and each function argument. In Python, you never explicitly specify the datatype of anything. Based on what value you assign, Python keeps track of the datatype internally.
</blockquote>
<h3 id="datatypes">How Python's datatypes compare to other programming languages</h3>
<h3 id=datatypes>How Python's datatypes compare to other programming languages</h3>
<p>An erudite reader sent me this explanation of how Python compares to other programming languages:
<dl>
<dt>statically typed language</dt>
<dd>A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare all variables with their datatypes before using them. Java and <abbr>C</abbr> are statically typed languages.
<dd>A language in which types are fixed at compile time. Most statically typed languages enforce this by requiring you to declare all variables with their datatypes before using them. Java and <abbr>C</abbr> are statically typed languages.
</dd>
<dt>dynamically typed language</dt>
<dd>A language in which types are discovered at execution time; the opposite of statically typed. JavaScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
<dd>A language in which types are discovered at execution time; the opposite of statically typed. JavaScript and Python are dynamically typed, because they figure out what type a variable is when you first assign it a value.
</dd>
<dt>strongly typed language</dt>
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
<dd>A language in which types are always enforced. Java and Python are strongly typed. If you have an integer, you can't treat it like a string without explicitly converting it.
</dd>
<dt>weakly typed language</dt>
<dd>A language in which types are &#8220;automagically&#8221; coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion. [FIXME double-check this]
<dd>A language in which types are &#8220;automagically&#8221; coerced to other types as needed; the opposite of strongly typed. PHP is weakly typed. In PHP, you can concatenate the string <code>'12'</code> and the integer <code>3</code> to get the string <code>'123'</code>, then treat that as the integer <code>123</code>, all without any explicit conversion. [FIXME double-check this]
</dd>
</dl>
<p>So Python is both <em>dynamically typed</em> (because it doesn't use explicit datatype declarations) and <em>strongly typed</em> (because once a variable has a datatype, it actually matters).
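<p>Here is a rough sketch of both properties at once, using the same <code>'12'</code> and <code>3</code> values as the PHP example. The variable names are made up for illustration.

```python
value = 12                     # no declaration; value now refers to an int
first_type = type(value)
value = '12'                   # the same name can later refer to a str
second_type = type(value)

try:
    combined = value + 3       # str + int is not coerced automatically
except TypeError:
    combined = None            # Python refuses instead of guessing

explicit = value + str(3)      # explicit conversion to str gives '123'
as_number = int(explicit)      # explicit conversion back to an int
```

<p>The failed addition is the strong typing at work; the absence of any type declarations is the dynamic typing.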
<p>If you have experience in other programming languages, this table may help you visualize how Python compares to them:
<table class="simple">
<table class=simple>
<tr><th></th><th>Statically typed</th><th>Dynamically typed</th></tr>
<tr><th>Weakly typed</th><td>C, Objective-C</td><td>JavaScript, Perl 5, PHP</td></tr>
<tr><th>Strongly typed</th><td>Pascal, Java</td><td>Python, Ruby</td></tr>
</table>
<h2 id="readability">Writing readable code</h2>
<p>I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
<h3 id="docstrings">Documentation strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size</code> function has a <code>docstring</code>:
<h2 id=readability>Writing readable code</h2>
<p>I won't bore you with a long finger-wagging speech about the importance of documenting your code. Just know that code is written once but read many times, and the most important audience for your code is yourself, six months after writing it (i.e. after you've forgotten everything but need to fix something). Python makes it easy to write readable code, so take advantage of it. You'll thank me in six months.
<h3 id=docstrings>Documentation strings</h3>
<p>You can document a Python function by giving it a documentation string (<code>docstring</code> for short). In this program, the <code>approximate_size</code> function has a <code>docstring</code>:
<pre><code>def approximate_size(size, a_kilobyte_is_1024_bytes=True):
"""Convert a file size to human-readable form.
@@ -130,26 +129,26 @@ if __name__ == "__main__":
Returns: string
"""</code></pre>
<p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a <code>docstring</code>.
<p>Triple quotes signify a multi-line string. Everything between the start and end quotes is part of a single string, including carriage returns, leading white space, and other quote characters. You can use them anywhere, but you'll see them most often used when defining a <code>docstring</code>.
<blockquote class="note compare perl5">
<p><span>&#x261E;</span>Triple quotes are also an easy way to define a string with both single and double quotes, like <code>qq/.../</code> in Perl 5.
</blockquote>
<p>Everything between the triple quotes is the function's <code>docstring</code>, which documents what the function does. A <code>docstring</code>, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don't technically need to give your function a <code>docstring</code>, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the <code>docstring</code> is available at runtime as an attribute of the function.
<blockquote class="note">
<p><span>&#x261E;</span>Many Python <abbr>IDE</abbr>s use the <code>docstring</code> to provide context-sensitive documentation, so that when you type a function name, its <code>docstring</code> appears as a tooltip. This can be incredibly helpful, but it's only as good as the <code>docstring</code>s you write.
<p>Everything between the triple quotes is the function's <code>docstring</code>, which documents what the function does. A <code>docstring</code>, if it exists, must be the first thing defined in a function (that is, on the next line after the function declaration). You don't technically need to give your function a <code>docstring</code>, but you always should. I know you've heard this in every programming class you've ever taken, but Python gives you an added incentive: the <code>docstring</code> is available at runtime as an attribute of the function.
<blockquote class=note>
<p><span>&#x261E;</span>Many Python <abbr>IDE</abbr>s use the <code>docstring</code> to provide context-sensitive documentation, so that when you type a function name, its <code>docstring</code> appears as a tooltip. This can be incredibly helpful, but it's only as good as the <code>docstring</code>s you write.
</blockquote>
<h3 id="functionannotations">Function annotations</h3>
<h3 id=functionannotations>Function annotations</h3>
<p>FIXME
<h3 id="styleconventions">Style conventions</h3>
<h3 id=styleconventions>Style conventions</h3>
<p>FIXME
<h2 id="everythingisanobject">Everything is an object</h2>
<p>In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.
<h2 id=everythingisanobject>Everything is an object</h2>
<p>In case you missed it, I just said that Python functions have attributes, and that those attributes are available at runtime. A function, like everything else in Python, is an object.
<p>Run the interactive Python shell and follow along:
<pre class="screen">
<a><samp class="prompt">>>> </samp><kbd>import humansize</kbd> <span>&#x2460;</span></a>
<a><samp class="prompt">>>> </samp><kbd>print(humansize.approximate_size(4096, True))</kbd> <span>&#x2461;</span></a>
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>import humansize</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>print(humansize.approximate_size(4096, True))</kbd> <span>&#x2461;</span></a>
<samp>4.0 KiB</samp>
<a><samp class="prompt">>>> </samp><kbd>print(humansize.approximate_size.__doc__)</kbd> <span>&#x2462;</span></a>
<a><samp class=prompt>>>> </samp><kbd>print(humansize.approximate_size.__doc__)</kbd> <span>&#x2462;</span></a>
<samp>Convert a file size to human-readable form.
Keyword arguments:
@@ -161,34 +160,34 @@ if __name__ == "__main__":
</samp></pre>
<ol>
<li>The first line imports the <code>humansize</code> program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you'll see a lot more of it throughout this book.
<li>When you want to use functions defined in imported modules, you need to include the module name. So you can't just say <code>approximate_size</code>; it must be <code>humansize.approximate_size</code>. If you've used classes in Java, this should feel vaguely familiar.
<li>The first line imports the <code>humansize</code> program as a module -- a chunk of code that you can use interactively, or from a larger Python program. (You'll see examples of multi-module Python programs in [FIXME xref].) Once you import a module, you can reference any of its public functions, classes, or attributes. Modules can do this to access functionality in other modules, and you can do it in the Python interactive shell too. This is an important concept, and you'll see a lot more of it throughout this book.
<li>When you want to use functions defined in imported modules, you need to include the module name. So you can't just say <code>approximate_size</code>; it must be <code>humansize.approximate_size</code>. If you've used classes in Java, this should feel vaguely familiar.
<li>Instead of calling the function as you would expect to, you asked for one of the function's attributes, <code>__doc__</code>.
</ol>
<blockquote class="note compare perl5">
<p><span>&#x261E;</span><code>import</code> in Python is like <code>require</code> in Perl. Once you <code>import</code> a Python module, you access its functions with <code><var>module</var>.<var>function</var></code>; once you <code>require</code> a Perl module, you access its functions with <code><var>module</var>::<var>function</var></code>.
<p><span>&#x261E;</span><code>import</code> in Python is like <code>require</code> in Perl. Once you <code>import</code> a Python module, you access its functions with <code><var>module</var>.<var>function</var></code>; once you <code>require</code> a Perl module, you access its functions with <code><var>module</var>::<var>function</var></code>.
</blockquote>
<h3 id="importsearchpath">The <code>import</code> search path</h3>
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
<pre class="screen">
<a><samp class="prompt">>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
<a><samp class="prompt">>>> </samp><kbd>sys.path</kbd> <span>&#x2461;</span></a>
<h3 id=importsearchpath>The <code>import</code> search path</h3>
<p>Before this goes any further, I want to briefly mention the library search path. Python looks in several places when you try to import a module. Specifically, it looks in all the directories defined in <code>sys.path</code>. This is just a list, and you can easily view it or modify it with standard list methods. (You'll learn more about lists later in this chapter.)
<pre class=screen>
<a><samp class=prompt>>>> </samp><kbd>import sys</kbd> <span>&#x2460;</span></a>
<a><samp class=prompt>>>> </samp><kbd>sys.path</kbd> <span>&#x2461;</span></a>
<samp>['', '/usr/lib/python30.zip', '/usr/lib/python3.0', '/usr/lib/python3.0/plat-linux2@EXTRAMACHDEPPATH@', '/usr/lib/python3.0/lib-dynload', '/usr/lib/python3.0/dist-packages', '/usr/local/lib/python3.0/dist-packages']</samp>
<a><samp class="prompt">>>> </samp><kbd>sys</kbd> <span>&#x2462;</span></a>
<a><samp class=prompt>>>> </samp><kbd>sys</kbd> <span>&#x2462;</span></a>
<samp>&lt;module 'sys' (built-in)></samp>
<a><samp class="prompt">>>> </samp><kbd>sys.path.append('/my/new/path')</kbd> <span>&#x2463;</span></a></pre>
<a><samp class=prompt>>>> </samp><kbd>sys.path.append('/my/new/path')</kbd> <span>&#x2463;</span></a></pre>
<ol>
<li>Importing the <code>sys</code> module makes all of its functions and attributes available.
<li><code>sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file whose name matches what you're trying to import.
<li>Actually, I lied; the truth is more complicated than that, because not all modules are stored as <code>.py</code> files. Some, like the <code>sys</code> module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The <code>sys</code> module is written in <abbr>C</abbr>.)
<li>You can add a new directory to Python's search path at runtime by appending the directory name to <code>sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about <code>append()</code> and other list methods in [FIXME xref-was-#datatypes].)
<li><code>sys.path</code> is a list of directory names that constitute the current search path. (Yours will look different, depending on your operating system, what version of Python you're running, and where it was originally installed.) Python will look through these directories (in this order) for a <code>.py</code> file whose name matches what you're trying to import.
<li>Actually, I lied; the truth is more complicated than that, because not all modules are stored as <code>.py</code> files. Some, like the <code>sys</code> module, are "built-in modules"; they are actually baked right into Python itself. Built-in modules behave just like regular modules, but their Python source code is not available, because they are not written in Python! (The <code>sys</code> module is written in <abbr>C</abbr>.)
<li>You can add a new directory to Python's search path at runtime by appending the directory name to <code>sys.path</code>, and then Python will look in that directory as well, whenever you try to import a module. The effect lasts as long as Python is running. (You'll learn more about <code>append()</code> and other list methods in [FIXME xref-was-#datatypes].)
</ol>
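<p>To see the search path in action, the sketch below writes a throwaway module into a temporary directory, appends that directory to <code>sys.path</code>, and imports the module. The module name <code>dip3_demo_module</code> and its <code>GREETING</code> attribute are invented for this example, and the call to <code>importlib.invalidate_caches()</code> is a precaution for files created while Python is already running.

```python
import importlib
import os
import sys
import tempfile

# Create a directory that is not on the search path yet.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, 'dip3_demo_module.py'), 'w') as f:
    f.write('GREETING = "hello from a new path"\n')

# Add the directory to the search path, then import by name.
sys.path.append(module_dir)
importlib.invalidate_caches()
import dip3_demo_module

print(dip3_demo_module.GREETING)
```

<p>As with the <code>/my/new/path</code> example, the change to <code>sys.path</code> only lasts as long as this Python process is running.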
<h3 id="whatsanobject">What's an object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <var>docstring</var> defined in the function's source code. The <code>sys</code> module is an object which has (among other things) an attribute called <var>path</var>. And so forth.
<p>Still, this doesn't answer the more fundamental question: what is an object?  Different programming languages define &#8220;object&#8221; in different ways.  In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more on this in [FIXME xref-was-#apihelp]).
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
<h2 id="indentingcode">Indenting code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<h3 id=whatsanobject>What's an object?</h3>
<p>Everything in Python is an object, and almost everything has attributes and methods. All functions have a built-in attribute <code>__doc__</code>, which returns the <var>docstring</var> defined in the function's source code. The <code>sys</code> module is an object which has (among other things) an attribute called <var>path</var>. And so forth.
<p>Still, this doesn't answer the more fundamental question: what is an object? Different programming languages define &#8220;object&#8221; in different ways. In some, it means that <em>all</em> objects <em>must</em> have attributes and methods; in others, it means that all objects are subclassable. In Python, the definition is looser; some objects have neither attributes nor methods (more on this in [FIXME xref-was-#datatypes]), and not all objects are subclassable (more on this in [FIXME xref-was-#fileinfo]). But everything is an object in the sense that it can be assigned to a variable or passed as an argument to a function (more on this in [FIXME xref-was-#apihelp]).
<p>This is so important that I'm going to repeat it in case you missed it the first few times: <em>everything in Python is an object</em>. Strings are objects. Lists are objects. Functions are objects. Even modules are objects.
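<p>The &#8220;assigned to a variable or passed as an argument&#8221; test can be tried directly. This is only a sketch; <code>approximate_label</code> and <code>apply</code> are hypothetical helpers, not functions from <code>humansize.py</code>.

```python
import sys


def approximate_label(size):
    """Return a rough label for a size (hypothetical helper)."""
    return 'big' if size >= 1000 else 'small'


def apply(func, argument):
    # func is just another object that happens to be callable.
    return func(argument)


alias = approximate_label            # a function assigned to a variable
label = apply(approximate_label, 10) # a function passed as an argument

# A string, a list, a function, and a module are all objects.
things = ['a string', [1, 2, 3], approximate_label, sys]
all_objects = all(isinstance(thing, object) for thing in things)
```

<p>Note that <code>alias</code> and <code>approximate_label</code> are two names for the same function object, so <code>alias.__doc__</code> is the same <code>docstring</code>.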
<h2 id=indentingcode>Indenting code</h2>
<p>Python functions have no explicit <code>begin</code> or <code>end</code>, and no curly braces to mark where the function code starts and stops. The only delimiter is a colon (<code>:</code>) and the indentation of the code itself.
<pre><code>
<a>def approximate_size(size, a_kilobyte_is_1024_bytes=True): <span>&#x2460;</span></a>
<a> if size &lt; 0: <span>&#x2461;</span></a>
@@ -202,42 +201,40 @@ if __name__ == "__main__":
raise ValueError('number too large')</code></pre>
<ol>
<li>Code blocks are defined by their indentation. By "code block," I mean functions, <code>if</code> statements, <code>for</code> loops, <code>while</code> loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function.
<li>In Python, an <code>if</code> statement is followed by a code block. If the <code>if</code> expression evaluates to true, the indented block is executed, otherwise it falls to the <code>else</code> block (if any). (Note the lack of parentheses around the expression.)
<li>This line is inside the <code>if</code> code block. This <code>raise</code> statement will raise an exception (of type <code>ValueError</code>), but only if <code>size &lt; 0</code>.
<li>This is <em>not</em> the end of the function. Completely blank lines don't count. The function continues on the next line.
<li>The <code>for</code> loop also marks the start of a code block. Code blocks can contain multiple lines, as long as they are all indented the same amount. This <code>for</code> loop has three lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on with your life.
<li>Code blocks are defined by their indentation. By "code block," I mean functions, <code>if</code> statements, <code>for</code> loops, <code>while</code> loops, and so forth. Indenting starts a block and unindenting ends it. There are no explicit braces, brackets, or keywords. This means that whitespace is significant, and must be consistent. In this example, the function code is indented four spaces. It doesn't need to be four spaces, it just needs to be consistent. The first line that is not indented marks the end of the function.
<li>In Python, an <code>if</code> statement is followed by a code block. If the <code>if</code> expression evaluates to true, the indented block is executed, otherwise it falls to the <code>else</code> block (if any). (Note the lack of parentheses around the expression.)
<li>This line is inside the <code>if</code> code block. This <code>raise</code> statement will raise an exception (of type <code>ValueError</code>), but only if <code>size &lt; 0</code>.
<li>This is <em>not</em> the end of the function. Completely blank lines don't count. The function continues on the next line.
<li>The <code>for</code> loop also marks the start of a code block. Code blocks can contain multiple lines, as long as they are all indented the same amount. This <code>for</code> loop has three lines of code in it. There is no other special syntax for multi-line code blocks. Just indent and get on with your life.
</ol>
<p>After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people's Python code.
<p>After some initial protests and several snide analogies to Fortran, you will make peace with this and start seeing its benefits. One major benefit is that all Python programs look similar, since indentation is a language requirement and not a matter of style. This makes it easier to read and understand other people's Python code.
<blockquote class="note compare java">
<p><span>&#x261E;</span>Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. <abbr>C++</abbr> and Java use semicolons to separate statements and curly braces to separate code blocks.
<p><span>&#x261E;</span>Python uses carriage returns to separate statements and a colon and indentation to separate code blocks. <abbr>C++</abbr> and Java use semicolons to separate statements and curly braces to separate code blocks.
</blockquote>
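<p>As a small illustration of the indentation rules, here is a hypothetical function (not taken from <code>humansize.py</code>) with nested blocks and a blank line in the middle. The thresholds and labels are arbitrary.

```python
def classify(sizes):
    labels = []
    for size in sizes:                 # the for block starts here
        if size < 0:                   # a nested if block
            raise ValueError('number must be non-negative')
        elif size < 1024:
            labels.append('small')
        else:
            labels.append('large')

    # The blank line above does not end the function;
    # this line is still indented, so it still belongs to classify.
    return labels


print(classify([1, 2048]))
```

<p>The <code>print</code> call is back at the left margin, so it is outside the function.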
<h2 id="runningscripts">Running scripts</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of <code>humansize.py</code>:
<h2 id=runningscripts>Running scripts</h2>
<p>Python modules are objects and have several useful attributes. You can use this to easily test your modules as you write them, by including a special block of code that executes when you run the Python file on the command line. Take the last few lines of <code>humansize.py</code>:
<pre><code>
if __name__ == "__main__":
print(approximate_size(1000000000000, False))
print(approximate_size(1000000000000))</code></pre>
<blockquote class="note compare clang">
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
<p><span>&#x261E;</span>Like <abbr>C</abbr>, Python uses <code>==</code> for comparison and <code>=</code> for assignment. Unlike <abbr>C</abbr>, Python does not support in-line assignment, so there's no chance of accidentally assigning the value you thought you were comparing.
</blockquote>
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension.
<pre class="screen"><samp class="prompt">>>> </samp><kbd>import humansize</kbd>
<samp class="prompt">>>> </samp><kbd>humansize.__name__</kbd>
<p>So what makes this <code>if</code> statement special? Well, modules are objects, and all modules have a built-in attribute <code>__name__</code>. A module's <code>__name__</code> depends on how you're using the module. If you <code>import</code> the module, then <code>__name__</code> is the module's filename, without a directory path or file extension.
<pre class=screen><samp class=prompt>>>> </samp><kbd>import humansize</kbd>
<samp class=prompt>>>> </samp><kbd>humansize.__name__</kbd>
<samp>'humansize'</samp></pre>
<p>But you can also run the module directly as a standalone program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>.  Python will evaluate this <code>if</code> statement, find a true expression, and execute the <code>if</code> code block.  In this case, it prints two values.
<pre class="screen"><samp class="prompt">c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<p>But you can also run the module directly as a standalone program, in which case <code>__name__</code> will be a special default value, <code>__main__</code>. Python will evaluate this <code>if</code> statement, find a true expression, and execute the <code>if</code> code block. In this case, it prints two values.
<pre class=screen><samp class=prompt>c:\home\diveintopython3> </samp><kbd>c:\python30\python.exe humansize.py</kbd>
<samp>1.0 TB
931.3 GiB</samp></pre>
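<p>The two values of <code>__name__</code> can also be observed from inside one Python session. The sketch below is an assumption-laden shortcut: it uses the standard library's <code>runpy.run_path</code> to execute a throwaway file once under the name <code>__main__</code> (as if run from the command line) and once under an ordinary module name (roughly as if imported). The file name <code>name_demo.py</code> and the <code>RESULT</code> variable are invented for the example.

```python
import os
import runpy
import tempfile

# A one-line script that records how it was run.
source = 'RESULT = "script" if __name__ == "__main__" else "module"\n'
script_path = os.path.join(tempfile.mkdtemp(), 'name_demo.py')
with open(script_path, 'w') as f:
    f.write(source)

# run_path returns the globals of the executed file as a dictionary.
as_script = runpy.run_path(script_path, run_name='__main__')
as_module = runpy.run_path(script_path, run_name='name_demo')

print(as_script['RESULT'])
print(as_module['RESULT'])
```

<p>The same comparison is what lets <code>humansize.py</code> print its two test values only when it is run directly.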
<h3 id="furtherreading">Further reading</h3>
<h3 id=furtherreading>Further reading</h3>
<ul>
<li><a href="http://www.python.org/dev/peps/pep-0257/">PEP 257: Docstring Conventions</a> explains what distinguishes a good <code>docstring</code> from a great <code>docstring</code>.
<li><a href="http://docs.python.org/3.0/tutorial/controlflow.html#documentation-strings">Python Tutorial: Documentation Strings</a> also touches on the subject.
<li><a href="http://www.python.org/dev/peps/pep-0008/">PEP 8: Style Guide for Python Code</a> discusses good indentation style.
<li><a href="http://docs.python.org/3.0/reference/"><cite>Python Reference Manual</cite></a> explains what it means to say that <a href="http://docs.python.org/3.0/reference/datamodel.html#objects-values-and-types">everything in Python is an object</a>, because some people are pedantic and like to discuss that sort of thing at great length.
<li><a href=http://www.python.org/dev/peps/pep-0257/>PEP 257: Docstring Conventions</a> explains what distinguishes a good <code>docstring</code> from a great <code>docstring</code>.
<li><a href=http://docs.python.org/3.0/tutorial/controlflow.html#documentation-strings>Python Tutorial: Documentation Strings</a> also touches on the subject.
<li><a href=http://www.python.org/dev/peps/pep-0008/>PEP 8: Style Guide for Python Code</a> discusses good indentation style.
<li><a href=http://docs.python.org/3.0/reference/><cite>Python Reference Manual</cite></a> explains what it means to say that <a href=http://docs.python.org/3.0/reference/datamodel.html#objects-values-and-types>everything in Python is an object</a>, because some people are pedantic and like to discuss that sort of thing at great length.
</ul>
<p class="c">&copy; 2001-4, 2009 <span>&#x2133;</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA-3.0</a>
<script type="text/javascript" src="http://www.google.com/jsapi"></script>
<script type="text/javascript" src="dip3.js"></script>
</body>
</html>
<p class=c>&copy; 2001&ndash;4, 2009 <span>&#x2133;</span>ark Pilgrim, <a href=http://creativecommons.org/licenses/by-sa/3.0/ rel=license>CC-BY-SA-3.0</a>
<script type=text/javascript src=jquery.js></script>
<script type=text/javascript src=dip3.js></script>
Binary file not shown.