mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
several more 2to3 sections completed
@@ -12,17 +12,17 @@ body{counter-reset:h1 19}
|
||||
<h1>Case study: porting <code class="filename">chardet</code> to Python 3</h1>
|
||||
|
||||
<blockquote class="q">
|
||||
<p><span>❝</span> Words, words. They're all we have to go on. <span>❞</span><br>— <cite>Rosencrantz and Guildenstern are Dead</cite>
|
||||
<p><span>❝</span> Words, words. They’re all we have to go on. <span>❞</span><br>— <cite>Rosencrantz and Guildenstern are Dead</cite>
|
||||
</blockquote>
|
||||
|
||||
<ol>
|
||||
<li><a href="#faq">Introducing <code class="filename">chardet</code>: a mini-FAQ</a>
|
||||
<ol>
|
||||
<li><a href="#faq.what">What is character encoding auto-detection?</a>
|
||||
<li><a href="#faq.impossible">Isn't that impossible?</a>
|
||||
<li><a href="#faq.impossible">Isn’t that impossible?</a>
|
||||
<li><a href="#faq.who">Who wrote this detection algorithm?</a>
|
||||
<li><a href="#faq.yippie">Yippie! Screw the standards, I'll just auto-detect everything!</a>
|
||||
<li><a href="#faq.why">Why bother with auto-detection if it's slow, inaccurate, and non-standard?</a>
|
||||
<li><a href="#faq.yippie">Yippie! Screw the standards, I’ll just auto-detect everything!</a>
|
||||
<li><a href="#faq.why">Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</a>
|
||||
</ol>
|
||||
<li><a href="#divingin">Diving in</a>
|
||||
<ol>
|
||||
@@ -33,40 +33,40 @@ body{counter-reset:h1 19}
|
||||
<li><a href="#how.windows1252"><code>windows-1252</code></a>
|
||||
</ol>
|
||||
<li><a href="#running2to3">Running <code class="filename">2to3</code></a>
|
||||
<li><a href="#manual">Fixing what <code class="filename">2to3</code> can't</a>
|
||||
<li><a href="#manual">Fixing what <code class="filename">2to3</code> can’t</a>
|
||||
<ol>
|
||||
<li><a href="#falseisinvalidsyntax"><code>False</code> is invalid syntax</a>
|
||||
<li><a href="#nomodulenamedconstants">No module named <code class="filename">constants</code></a>
|
||||
<li><a href="#namefileisnotdefined">Name '<var>file</var>' is not defined</a>
|
||||
<li><a href="#cantuseastringpattern">Can't use a string pattern on a bytes-like object</a>
|
||||
<li><a href="#cantconvertbytesobject">Can't convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
|
||||
<li><a href="#cantuseastringpattern">Can’t use a string pattern on a bytes-like object</a>
|
||||
<li><a href="#cantconvertbytesobject">Can’t convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
|
||||
</ol>
|
||||
</ol>
|
||||
|
||||
<h2 id="faq">Introducing <code class="filename">chardet</code>: a mini-FAQ</h2>
|
||||
|
||||
<p class="fancy">When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
|
||||
<p class="fancy">When you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
|
||||
|
||||
<p>In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
|
||||
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
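<p>A minimal sketch of the point above, using only the standard library. The sample character and the three encodings are arbitrary choices for illustration, not anything taken from <code class="filename">chardet</code>.

```python
# One character, three encodings, three different byte sequences.
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

encoded = {name: text.encode(name) for name in ("utf-8", "latin-1", "utf-16-le")}
for name, raw in encoded.items():
    print(name, raw)

# Decoding with the wrong "key" does not raise an error in this case;
# it quietly produces different characters.
misread = encoded["utf-8"].decode("latin-1")
print(ascii(misread))
```

<p>The last line is the interesting one: the bytes are valid in both encodings, so nothing fails, and the only sign of trouble is that the text is wrong.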
|
||||
|
||||
<h3 id="faq.what">What is character encoding auto-detection?</h3>
|
||||
|
||||
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key.
|
||||
<p>It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.
|
||||
|
||||
<h3 id="faq.impossible">Isn't that impossible?</h3>
|
||||
<h3 id="faq.impossible">Isn’t that impossible?</h3>
|
||||
|
||||
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
|
||||
<p>In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.
|
||||
<p>In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
|
||||
|
||||
<h3 id="faq.who">Who wrote this detection algorithm?</h3>
|
||||
|
||||
<p>This library is a port of <a href="http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/">the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors' comments, which are quite extensive and informative.
|
||||
<p>This library is a port of <a href="http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/base/">the auto-detection code in Mozilla</a>. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative.
|
||||
|
||||
<p>You may also be interested in the research paper which led to the Mozilla implementation, <a href="http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html">A composite approach to language/encoding detection</a>.
|
||||
|
||||
<h3 id="faq.yippie">Yippie! Screw the standards, I'll just auto-detect everything!</h3>
|
||||
<h3 id="faq.yippie">Yippie! Screw the standards, I’ll just auto-detect everything!</h3>
|
||||
|
||||
<p>Don't do that. Virtually every format and protocol contains a method for specifying character encoding.
|
||||
<p>Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
|
||||
|
||||
<ul>
|
||||
<li>HTTP can define a <code>charset</code> parameter in the <code>Content-type</code> header.
|
||||
@@ -76,11 +76,11 @@ body{counter-reset:h1 19}
|
||||
|
||||
<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
|
||||
|
||||
<p>Despite the complexity, it's worthwhile to follow standards and <a href="http://www.w3.org/2001/tag/doc/mime-respect">respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
|
||||
<p>Despite the complexity, it’s worthwhile to follow standards and <a href="http://www.w3.org/2001/tag/doc/mime-respect">respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
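<p>As a rough illustration of respecting explicit encoding information, here is one way to read the <code>charset</code> parameter out of a <code>Content-type</code> header value with the standard library. The helper name <code>charset_from_content_type</code> is hypothetical, and the sketch deliberately ignores the harder question of which standard wins when two of them disagree.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the declared charset (lowercased), or `default` if none is given.

    Hypothetical helper for illustration; it looks only at the header value.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
print(declared, missing)
```

<p>When the header carries no <code>charset</code>, the helper returns the fallback you pass in, which is where a default defined by the relevant standard would go.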
|
||||
|
||||
<h3 id="faq.why">Why bother with auto-detection if it's slow, inaccurate, and non-standard?</h3>
|
||||
<h3 id="faq.why">Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</h3>
|
||||
|
||||
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn't work. There are also some poorly designed standards that have no way to specify encoding at all.
|
||||
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
|
||||
|
||||
<p>If following the relevant standards gets you nowhere, <em>and</em> you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my <a href="http://feedparser.org/">Universal Feed Parser</a>, which calls this auto-detection library <a href="http://feedparser.org/docs/character-encoding.html">only after exhausting all other options</a>.
|
||||
|
||||
@@ -88,7 +88,7 @@ body{counter-reset:h1 19}
|
||||
|
||||
<p>This is a brief guide to navigating the code itself.
|
||||
|
||||
<p>The main entry point for the detection algorithm is <code class="filename">universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code class="filename">chardet/__init__.py</code>, but that's really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
|
||||
<p>The main entry point for the detection algorithm is <code class="filename">universaldetector.py</code>, which has one class, <code>UniversalDetector</code>. (You might think the main entry point is the <code>detect</code> function in <code class="filename">chardet/__init__.py</code>, but that’s really just a convenience function that creates a <code>UniversalDetector</code> object, calls it, and returns its result.)
|
||||
|
||||
<p>There are 5 categories of encodings that <code>UniversalDetector</code> handles:
|
||||
|
||||
@@ -97,12 +97,12 @@ body{counter-reset:h1 19}
|
||||
<li>Escaped encodings, which are entirely 7-bit <abbr>ASCII</abbr> compatible, where non-<abbr>ASCII</abbr> characters start with an escape sequence. Examples: <code>ISO-2022-JP</code> (Japanese) and <code>HZ-GB-2312</code> (Chinese).
|
||||
<li>Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: <code>Big5</code> (Chinese), <code>SHIFT_JIS</code> (Japanese), <code>EUC-KR</code> (Korean), and <code>UTF-8</code> without a <abbr title="Byte Order Mark">BOM</abbr>.
|
||||
<li>Single-byte encodings, where each character is represented by one byte. Examples: <code>KOI8-R</code> (Russian), <code>windows-1255</code> (Hebrew), and <code>TIS-620</code> (Thai).
|
||||
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn't know a character encoding from a hole in the ground.
|
||||
<li><code>windows-1252</code>, which is used primarily on Microsoft Windows by middle managers who wouldn’t know a character encoding from a hole in the ground.
|
||||
</ol>
|
||||
|
||||
<h3 id="how.bom"><code>UTF-n</code> with a <abbr title="Byte Order Mark">BOM</abbr></h3>
|
||||
|
||||
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that's what it's for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
|
||||
<p>If the text starts with a <abbr title="Byte Order Mark">BOM</abbr>, we can reasonably assume that the text is encoded in <code>UTF-8</code>, <code>UTF-16</code>, or <code>UTF-32</code>. (The <abbr title="Byte Order Mark">BOM</abbr> will tell us exactly which one; that’s what it’s for.) This is handled inline in <code>UniversalDetector</code>, which returns the result immediately without any further processing.
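<p>A check like this can be sketched with the <abbr title="Byte Order Mark">BOM</abbr> constants from the standard <code class="filename">codecs</code> module. The function name <code>sniff_bom</code> and the returned labels are assumptions made for this example; they are not the names or return values used inside <code>UniversalDetector</code>.

```python
import codecs

# Order matters: the UTF-32 little-endian BOM begins with the same two
# bytes as the UTF-16 little-endian BOM, so the longer marks go first.
BOMS = (
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
)


def sniff_bom(data):
    """Return a label for the BOM at the start of `data`, or None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


print(sniff_bom(codecs.BOM_UTF8 + b"hello"))
print(sniff_bom(b"plain ASCII"))
```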
|
||||
|
||||
<h3 id="how.esc">Escaped encodings</h3>
|
||||
|
||||
@@ -112,7 +112,7 @@ body{counter-reset:h1 19}
|
||||
|
||||
<h3 id="how.mb">Multi-byte encodings</h3>
|
||||
|
||||
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of "<span class="quote">probers</span>" for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
|
||||
<p>Assuming no <abbr title="Byte Order Mark">BOM</abbr>, <code>UniversalDetector</code> checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, <code>windows-1252</code>.
|
||||
|
||||
<p>The multi-byte encoding prober, <code>MBCSGroupProber</code> (defined in <code class="filename">mbcsgroupprober.py</code>), is really just a shell that manages a group of other probers, one for each multi-byte encoding: <code>Big5</code>, <code>GB2312</code>, <code>EUC-TW</code>, <code>EUC-KR</code>, <code>EUC-JP</code>, <code>SHIFT_JIS</code>, and <code>UTF-8</code>. <code>MBCSGroupProber</code> feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to <code>UniversalDetector</code>.<code>feed()</code> will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, <code>MBCSGroupProber</code> reports this positive result to <code>UniversalDetector</code>, which reports the result to the caller.
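<p>The “shell that manages a group of probers” idea can be sketched in a few lines. This is a deliberately simplified stand-in, not the <code>MBCSGroupProber</code> code: the class names are hypothetical, and each prober here just attempts a strict incremental decode, whereas the real probers also estimate how confident they are.

```python
import codecs


class StrictDecodeProber:
    """Hypothetical prober: stays active until it sees an illegal byte sequence."""

    def __init__(self, encoding):
        self.encoding = encoding
        self.active = True
        self._decoder = codecs.getincrementaldecoder(encoding)(errors="strict")

    def feed(self, chunk):
        try:
            self._decoder.decode(chunk)
        except UnicodeDecodeError:
            self.active = False  # illegal sequence: rule this encoding out


class GroupProber:
    """Hypothetical shell that feeds each chunk to every still-active prober."""

    def __init__(self, encodings):
        self.probers = [StrictDecodeProber(name) for name in encodings]

    def feed(self, chunk):
        for prober in self.probers:
            if prober.active:  # dropped probers are skipped on later calls
                prober.feed(chunk)

    def candidates(self):
        return [p.encoding for p in self.probers if p.active]


group = GroupProber(["utf-8", "shift_jis"])
group.feed("\u65e5\u672c\u8a9e".encode("shift_jis"))
print(group.candidates())
```

<p>Ruling encodings out is only half of the job. Several encodings can accept the same bytes without error, which is why the real probers also need a notion of confidence.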
|
||||
|
||||
@@ -136,7 +136,7 @@ body{counter-reset:h1 19}
|
||||
|
||||
<h2 id="running2to3">Running <code class="filename">2to3</code></h2>
|
||||
|
||||
<p>We're going to migrate the <code class="filename">chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code class="filename">2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different modules -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href="porting-code-to-python-3-with-2to3.html">Porting code to Python 3 with <code class="filename">2to3</code></a>. In this chapter, we'll start by running <code class="filename">2to3</code> on the <code class="filename">chardet</code> package, but as you'll see, there will still be a lot of work to do after the automated tools have performed their magic.
|
||||
<p>We’re going to migrate the <code class="filename">chardet</code> module from Python 2 to Python 3. Python 3 comes with a utility script called <code class="filename">2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. In some cases this is easy -- a function was renamed or moved to a different module -- but in other cases it can get pretty complex. To get a sense of all that it <em>can</em> do, refer to the appendix, <a href="porting-code-to-python-3-with-2to3.html">Porting code to Python 3 with <code class="filename">2to3</code></a>. In this chapter, we’ll start by running <code class="filename">2to3</code> on the <code class="filename">chardet</code> package, but as you’ll see, there will still be a lot of work to do after the automated tools have performed their magic.
|
||||
|
||||
<p>The main <code class="filename">chardet</code> package is split across several different files, all in the same directory. The <code class="filename">2to3</code> script makes it easy to convert multiple files at once: just pass a directory as a command line argument, and <code class="filename">2to3</code> will convert each of the files in turn.
|
||||
|
||||
@@ -642,13 +642,13 @@ RefactoringTool: Skipping implicit fixer: ws_comma
|
||||
RefactoringTool: Files that were modified:
|
||||
RefactoringTool: test.py</samp></pre>
|
||||
|
||||
<p id="skip2to3outputtest">Well, that wasn't so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it'll work?
|
||||
<p id="skip2to3outputtest">Well, that wasn’t so hard. Just a few imports and print statements to convert. Time to run the new version. Do you think it’ll work?
|
||||
|
||||
<h2 id="manual">Fixing what <code class="filename">2to3</code> can't</h2>
|
||||
<h2 id="manual">Fixing what <code class="filename">2to3</code> can’t</h2>
|
||||
|
||||
<h3 id="falseisinvalidsyntax"><code>False</code> is invalid syntax</h3>
|
||||
|
||||
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it's a good way to test our ported code to make sure there aren't any bugs lurking anywhere.
|
||||
<p>Now for the real test: running the test harness against the test suite. Since the test suite is designed to cover all the possible code paths, it’s a good way to test our ported code to make sure there aren’t any bugs lurking anywhere.
|
||||
|
||||
<p class="skip"><a href="#skipinvalidsyntax">skip over this</a>
|
||||
<pre class="screen"><samp class="prompt">C:\home\chardet></samp><kbd>python test.py tests\*\*</kbd>
|
||||
@@ -660,7 +660,7 @@ RefactoringTool: test.py</samp></pre>
|
||||
^
|
||||
SyntaxError: invalid syntax</samp></pre>
|
||||
|
||||
<p id="skipinvalidsyntax">Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can't use it as a variable name. Let's look at <code class="filename">constants.py</code> to see where it's defined. Here's the original version from <code class="filename">constants.py</code>, before the <code class="filename">2to3</code> script changed it:
|
||||
<p id="skipinvalidsyntax">Hmm, a small snag. In Python 3, <code>False</code> is a reserved word, so you can’t use it as a variable name. Let’s look at <code class="filename">constants.py</code> to see where it’s defined. Here’s the original version from <code class="filename">constants.py</code>, before the <code class="filename">2to3</code> script changed it:
|
||||
|
||||
<p class="skip"><a href="#skipbuiltincode">skip over this</a>
|
||||
<pre><code>import __builtin__
|
||||
@@ -673,7 +673,7 @@ else:
|
||||
|
||||
<p id="skipbuiltincode">This piece of code is designed to allow this library to run under older versions of Python 2. Prior to Python 2.3 [FIXME-LINK], Python had no built-in <code>Boolean</code> type. This code detects the absence of the built-in constants <code>True</code> and <code>False</code>, and defines them if necessary.
|
||||
|
||||
<p>However, Python 3 will always have a <code>Boolean</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of "<code>constants.True</code>" and "<code>constants.False</code>" with "<code>True</code>" and "<code>False</code>", respectively, then delete this dead code from <code class="filename">constants.py</code>.
|
||||
<p>However, Python 3 will always have a <code>Boolean</code> type, so this entire code snippet is unnecessary. The simplest solution is to replace all instances of <code>constants.True</code> and <code>constants.False</code> with <code>True</code> and <code>False</code>, respectively, then delete this dead code from <code class="filename">constants.py</code>.
|
||||
|
||||
<p>So this line in <code class="filename">universaldetector.py</code>:
|
||||
|
||||
@@ -683,7 +683,7 @@ else:
|
||||
|
||||
<pre><code>self.done = False</code></pre>
|
||||
|
||||
<p>Ah, wasn't that satisfying? The code is shorter and more readable already.
|
||||
<p>Ah, wasn’t that satisfying? The code is shorter and more readable already.
|
||||
|
||||
<h3 id="nomodulenamedconstants">No module named <code class="filename">constants</code></h3>
|
||||
|
||||
@@ -698,11 +698,11 @@ else:
|
||||
import constants, sys
|
||||
ImportError: No module named constants</samp></pre>
|
||||
|
||||
<p id="skipnomodulenamedconstants">What's that you say? No module named <code class="filename">constants</code>? Of course there's a module named <code class="filename">constants</code>. ... Oh wait, no there isn't. Remember when the <code class="filename">2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
|
||||
<p id="skipnomodulenamedconstants">What’s that you say? No module named <code class="filename">constants</code>? Of course there’s a module named <code class="filename">constants</code>. ... Oh wait, no there isn’t. Remember when the <code class="filename">2to3</code> script fixed up all those import statements? This library has a lot of relative imports -- that is, modules that import other modules within the library. In Python 3, all import statements are absolute by default [FIXME-LINK PEP 0328]. To do relative imports, you need to do something like this instead:
|
||||
|
||||
<pre><code>from . import constants</code></pre>
|
||||
|
||||
<p>But wait. Wasn't the <code class="filename">2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code class="filename">constants</code> module within the library, and an absolute import of the <code class="filename">sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can't, and the <code class="filename">2to3</code> script is not smart enough to split the import statement into two.
|
||||
<p>But wait. Wasn’t the <code class="filename">2to3</code> script supposed to take care of these for you? Well, it did, but this particular import statement combines two different types of imports into one line: a relative import of the <code class="filename">constants</code> module within the library, and an absolute import of the <code class="filename">sys</code> module that is pre-installed in the Python standard library. In Python 2, you could combine these into one import statement. In Python 3, you can’t, and the <code class="filename">2to3</code> script is not smart enough to split the import statement into two.
|
||||
|
||||
<p>The solution is to split the import statement manually. So this two-in-one import:
|
||||
|
||||
@@ -713,7 +713,7 @@ ImportError: No module named constants</samp></pre>
|
||||
<pre><code>from . import constants
|
||||
import sys</code></pre>
|
||||
|
||||
<p>There are variations of this problem scattered throughout the <code class="filename">chardet</code> library. In some places it's "<code>import constants, sys</code>"; in other places, it's "<code>import constants, re</code>". The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
|
||||
<p>There are variations of this problem scattered throughout the <code class="filename">chardet</code> library. In some places it’s <code>import constants, sys</code>; in other places, it’s <code>import constants, re</code>. The fix is the same: manually split the import statement into two lines, one for the relative import, the other for the absolute import.
|
||||
|
||||
<p>Onward!
|
||||
|
||||
@@ -729,15 +729,15 @@ import sys</code></pre>
|
||||
for line in file(f, 'rb'):
|
||||
NameError: name 'file' is not defined</samp></pre>
|
||||
|
||||
<p id="skipnamefileisnotdefined">This one surprised me, because I've been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code class="filename">io</code> module. [FIXME-LINK PEP 3116] I'll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it's an alias for <var>io.open()</var>, but never mind that right now.)
|
||||
<p id="skipnamefileisnotdefined">This one surprised me, because I’ve been using this idiom as long as I can remember. In Python 2, the global <var>file()</var> function was an alias for <var>open()</var>, which was the standard way of opening files for reading. In Python 3, the entire system for reading and writing files has been refactored into the <code class="filename">io</code> module. [FIXME-LINK PEP 3116] I’ll cover the new I/O module in more detail in Chapter FIXME, but for now, the important bit is that the global <var>file()</var> function no longer exists. However, the <var>open()</var> function does still exist. (Technically, it’s an alias for <var>io.open()</var>, but never mind that right now.)
|
||||
|
||||
<p>Thus, the simplest solution to the problem of the missing <var>file()</var> is to call <var>open()</var> instead:
|
||||
|
||||
<pre><code>for line in open(f, 'rb'):</code></pre>
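<p>Here is a small, self-contained way to try the same loop under Python 3. The temporary file and its contents are made up for the example, and the <code>with</code> blocks are an addition that closes the file promptly; the loop itself is the one shown above.

```python
import os
import tempfile

# Create a throwaway file holding two lines of UTF-8 encoded text.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("caf\u00e9\nna\u00efve\n".encode("utf-8"))
    path = tmp.name

try:
    with open(path, "rb") as handle:
        lines = [line for line in handle]  # binary mode: each line is bytes
finally:
    os.remove(path)

for line in lines:
    print(type(line).__name__, line)
```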
|
||||
|
||||
<p>And that's all I have to say about that.
|
||||
<p>And that’s all I have to say about that.
|
||||
|
||||
<h3 id="cantuseastringpattern">Can't use a string pattern on a bytes-like object</h3>
|
||||
<h3 id="cantuseastringpattern">Can’t use a string pattern on a bytes-like object</h3>
|
||||
|
||||
<p>FIXME intro
|
||||
|
||||
@@ -751,20 +751,20 @@ NameError: name 'file' is not defined</samp></pre>
|
||||
if self._highBitDetector.search(aBuf):
|
||||
TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
|
||||
<p id="skipcantuseastringpattern">Now things are starting to get interesting. And by "interesting," I mean "confusing as all hell."
|
||||
<p id="skipcantuseastringpattern">Now things are starting to get interesting. And by “interesting,” I mean “confusing as all hell.”
|
||||
|
||||
<p>First, let's see what <var>self._highBitDetector</var> is. It's defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
|
||||
<p>First, let’s see what <var>self._highBitDetector</var> is. It’s defined in the <var>__init__</var> method of the <var>UniversalDetector</var> class:
|
||||
|
||||
<p class="skip"><a href="#skiphighbitdetectorcode">skip over this</a>
|
||||
<pre><code>class UniversalDetector:
|
||||
def __init__(self):
|
||||
self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
|
||||
|
||||
<p id="skiphighbitdetectorcode">This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that's not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
|
||||
<p id="skiphighbitdetectorcode">This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
|
||||
|
||||
<p>And therein lies the problem.
|
||||
|
||||
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we're searching is not a string, it's a byte array. Looking at the traceback, this error occurred in <code class="filename">universaldetector.py</code>:
|
||||
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code class="filename">universaldetector.py</code>:
|
||||
|
||||
<p class="skip"><a href="#skipfeedhighbitdetectorcode">skip over this</a>
|
||||
<pre><code>def feed(self, aBuf):
|
||||
@@ -774,7 +774,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
if self._mInputState == ePureAscii:
|
||||
if self._highBitDetector.search(aBuf):</code></pre>
|
||||
|
||||
<p id="skipfeedhighbitdetectorcode">And what is <var>aBuf</var>? Let's backtrack further to a place that calls <var>UniversalDetector.feed()</var>. One place that calls it is the test harness, <code class="filename">test.py</code>.
|
||||
<p id="skipfeedhighbitdetectorcode">And what is <var>aBuf</var>? Let’s backtrack further to a place that calls <var>UniversalDetector.feed()</var>. One place that calls it is the test harness, <code class="filename">test.py</code>.
|
||||
|
||||
<p class="skip"><a href="#skiptestharnessfeedcode">skip over this</a>
|
||||
<pre><code>u = UniversalDetector()
|
||||
@@ -784,7 +784,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
|
||||
for line in open(f, 'rb'):
|
||||
u.feed(line)</code></pre>
|
||||
|
||||
<p id="skiptestharnessfeedcode">And here we find our answer: in the <var>UniversalDetector.feed()</var> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for "read"; OK, big deal, we're reading the file. Ah, but <code>'b'</code> is for "bytes." Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <var>UniversalDetector.feed()</var>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit... characters. But we don't have characters; we have bytes. Oops.
|
||||
<p id="skiptestharnessfeedcode">And here we find our answer: in the <var>UniversalDetector.feed()</var> method, <var>aBuf</var> is a line read from a file on disk. Look carefully at the parameters used to open the file: <code>'rb'</code>. <code>'r'</code> is for “read”; OK, big deal, we’re reading the file. Ah, but <code>'b'</code> is for “binary.” Without the <code>'b'</code> flag, this <code>for</code> loop would read the file, line by line, and convert each line into a string -- an array of Unicode characters -- according to the system default character encoding. (You could override the system encoding with another parameter to <var>open()</var>, but never mind that for now.) But with the <code>'b'</code> flag, this <code>for</code> loop reads the file, line by line, and stores each line exactly as it appears in the file, as an array of bytes. That byte array gets passed to <var>UniversalDetector.feed()</var>, and eventually gets passed to the pre-compiled regular expression, <var>self._highBitDetector</var>, to search for high-bit... characters. But we don’t have characters; we have bytes. Oops.
|
||||
|
||||
<p>What we need this regular expression to search is not an array of characters, but an array of bytes.
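<p>As a minimal sketch of the distinction (the name <code>high_bit_detector</code> and the sample lines are illustrative stand-ins, not the actual <code>chardet</code> source), a pattern compiled from a bytes literal can search a byte array, while a pattern compiled from a string cannot:

```python
import re

# Hypothetical stand-in for the pre-compiled pattern; the b'' prefix makes it a bytes pattern.
high_bit_detector = re.compile(b'[\x80-\xFF]')

ascii_line = b'plain ASCII text\n'
latin1_line = 'caf\u00e9\n'.encode('latin-1')  # the e-acute becomes the single byte 0xE9

# A bytes pattern searches bytes without complaint.
found_in_ascii = high_bit_detector.search(ascii_line)
found_in_latin1 = high_bit_detector.search(latin1_line)

# A string pattern refuses to search bytes and raises TypeError.
str_pattern_failed = False
try:
    re.compile('[\x80-\xFF]').search(latin1_line)
except TypeError:
    str_pattern_failed = True
```

<p>The only difference between the two patterns is the <code>b</code> prefix on the literal, which is why this class of error is easy to miss when reading converted code.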
|
||||
|
||||
@@ -821,7 +821,7 @@ TypeError: Can't convert 'bytes' object to str implicitly</samp></pre>
|
||||
<p id="skipcantconvertbytesobject">...
|
||||
|
||||
<footer>
|
||||
<p class="c">© 2001-4, 2009 Mark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a>
|
||||
<p class="c">© 2001-4, 2009 <span>ℳ</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
|
||||
@@ -4,7 +4,7 @@ a{background:transparent;text-decoration:none;border-bottom:1px dotted}
|
||||
a:hover{border-bottom:1px solid}
|
||||
a:link{color:#1b67c9}
|
||||
a:visited{color:darkorchid}
|
||||
a[href^="http:"]:before,a[href^="https:"]:before{content:"\27A6 "}
|
||||
/*a[href^="http:"]:before,a[href^="https:"]:before{content:"\27A6 "}*/
|
||||
h1 a,h2 a,h3 a,#nav a{color:inherit !important}
|
||||
abbr,.p{border:0;letter-spacing:0.1em;text-transform:lowercase;font-variant:small-caps}
|
||||
h1,h2,h3,p,ul,ol,#nav{margin:1.75em 0}
|
||||
@@ -18,7 +18,8 @@ img{border:0}
|
||||
.framed{border:1px solid}
|
||||
pre{line-height:2.154;margin:2.154em 0;padding:0 0 0 2.154em;border-left:1px dotted}
|
||||
td pre{margin:0;padding:0;border:0}
|
||||
.c{text-align:center;clear:both;font-size:small}
|
||||
.c/*,.z*/{text-align:center;clear:both;font-size:small}
|
||||
/*.z{font-size:xx-large;line-height:0.875em;margin:0;padding:0}*/
|
||||
p.fancy:first-letter{float:left;background:transparent;color:gainsboro;padding:0.11em 4px 0 0;font:normal 4em/0.68 serif}
|
||||
blockquote.q{margin:auto;text-align:right;font-style:oblique}
|
||||
figure{display:block;text-align:center;margin:1.75em 0}
|
||||
@@ -31,7 +32,7 @@ table{width:100%;border-collapse:collapse}
|
||||
th{text-align:left;padding:0 0.5em;vertical-align:baseline;border:1px dotted}
|
||||
th,td{width:45%;vertical-align:top}
|
||||
th:first-child{width:10%;text-align:center}
|
||||
.q span,.note p:first-child,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style:normal}
|
||||
.q span,.c span,.note p:first-child,tr + tr th:first-child{font-family:'Arial Unicode MS',sans-serif;font-style:normal}
|
||||
.note p:first-child{float:left;font-size:xx-large;line-height:0.875em;margin:0 0.22em 0 0}
|
||||
.q span{font-size:large}
|
||||
td{border:1px dotted;padding:0 0.5em}
|
||||
|
||||
+6
-6
@@ -9,9 +9,9 @@
|
||||
<meta name="description" content="Python 3 from novice to pro">
|
||||
</head>
|
||||
<body id="index">
|
||||
<p><cite>Dive Into Python 3</cite> will cover Python 3 and its differences from Python 2. Compared to the original <cite><a href="http://diveintopython.org/">Dive Into Python</a></cite>, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a> license.</p>
|
||||
<p>Below is the draft table of contents. It is <b>not finalized</b>. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.</p>
|
||||
<p>Yes, that is <code>PapayaWhip</code>. All hail <code>PapayaWhip</code>.</p>
|
||||
<p><cite>Dive Into Python 3</cite> will cover Python 3 and its differences from Python 2. Compared to the original <cite><a href="http://diveintopython.org/">Dive Into Python</a></cite>, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a> license.
|
||||
<p>Below is the draft table of contents. It is <b>not finalized</b>. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.
|
||||
<p>Yes, that is <code>PapayaWhip</code>. All hail <code>PapayaWhip</code>.
|
||||
<h1>Installing Python</h1>
|
||||
<h2>Python on Windows</h2>
|
||||
<h2>Python on Mac OS X</h2>
|
||||
@@ -253,7 +253,7 @@
|
||||
<h2>...<a href="http://www.reddit.com/r/Python/comments/7sj39/dive_into_python_3/c07b3cq">will likely get ported in time</a>...</h2>
|
||||
|
||||
<h1>Where to go from here</h1>
|
||||
<p>Tentative because most of these have not been ported to Python 3 yet.</p>
|
||||
<p>Tentative because most of these have not been ported to Python 3 yet.
|
||||
<h2>WSGI</h2>
|
||||
<h2>Django</h2>
|
||||
<h2>Pylons</h2>
|
||||
@@ -329,8 +329,8 @@
|
||||
</div>
|
||||
|
||||
<footer>
|
||||
<p class="c">This site is optimized for Lynx just because fuck you.<br>I'm told it also looks good in graphical browsers.</p>
|
||||
<p class="c">© 2001-4, 2009 Mark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a></p>
|
||||
<p class="c">This site is optimized for Lynx just because fuck you.<br>I’m told it also looks good in graphical browsers.
|
||||
<p class="c">© 2001-4, 2009 <span>ℳ</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a>
|
||||
</footer>
|
||||
</body>
|
||||
</html>
|
||||
|
||||
@@ -11,7 +11,6 @@ h3:before{counter-increment:h3;content:"A." counter(h2) "." counter(h3) ". "}
|
||||
</style>
|
||||
<script type="text/javascript">
|
||||
window.onload = function() {
|
||||
if (!window.addEventListener) { return; }
|
||||
var arTables = document.getElementsByTagName('table');
|
||||
for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
var elmTable = arTables[i];
|
||||
@@ -20,36 +19,34 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
var arNotes = olNotes.getElementsByTagName('li');
|
||||
var arTableRows = elmTable.getElementsByTagName('tr');
|
||||
if (arNotes.length == 0) { continue; }
|
||||
//if (arNotes.length != arTableRows.length - 1) { alert(elmTable.id + "table has " + arTableRows.length + " rows but the list below it has " + arNotes.length + " items!"); }
|
||||
for (var j = arTableRows.length - 1; j >= 1; j--) {
|
||||
var elmTableRow = arTableRows[j];
|
||||
var elmNote = arNotes[j - 1];
|
||||
elmTableRow._li = elmNote;
|
||||
elmNote._tr = elmTableRow;
|
||||
elmTableRow.addEventListener('mouseover', function() {
|
||||
elmTableRow.onmouseover = function() {
|
||||
this.className = 'hover';
|
||||
this._li.className = 'hover';
|
||||
}, true);
|
||||
elmNote.addEventListener('mouseover', function() {
|
||||
};
|
||||
elmNote.onmouseover = function() {
|
||||
this.className = 'hover';
|
||||
this._tr.className = 'hover';
|
||||
}, true);
|
||||
elmTableRow.addEventListener('mouseout', function() {
|
||||
};
|
||||
elmTableRow.onmouseout = function() {
|
||||
this.className = '';
|
||||
this._li.className = '';
|
||||
}, true);
|
||||
elmNote.addEventListener('mouseout', function() {
|
||||
};
|
||||
elmNote.onmouseout = function() {
|
||||
this.className = '';
|
||||
this._tr.className = '';
|
||||
}, true);
|
||||
};
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
</script>
|
||||
</head>
|
||||
<body>
|
||||
<h1>Porting code to Python 3 with <code>2to3</code></h1>
|
||||
<h1>Porting code to Python 3 with <code class="filename">2to3</code></h1>
|
||||
|
||||
<blockquote class="q">
|
||||
<p><span>❝</span> Life is pleasant. Death is peaceful. It's the transition that's troublesome. <span>❞</span><br>— Isaac Asimov (attributed)
|
||||
@@ -58,6 +55,8 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
<ol>
|
||||
<li><a href="#divingin">Diving in</a>
|
||||
<li><a href="#print"><code>print</code> statement</a>
|
||||
<li><a href="#unicodeliteral">Unicode string literals</a>
|
||||
<li><a href="#long"><code>long</code> data type</a>
|
||||
<li><a href="#ne"><> comparison</a>
|
||||
<li><a href="#has_key"><code>has_key()</code> dictionary method</a>
|
||||
<li><a href="#dict">Dictionary methods that return lists</a>
|
||||
@@ -82,7 +81,6 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
<li><a href="#except"><code>try...except</code> statement</a>
|
||||
<li><a href="#raise"><code>raise</code> statement</a>
|
||||
<li><a href="#throw"><code>throw</code> statement</a>
|
||||
<li><a href="#long"><code>long</code> data type</a>
|
||||
<li><a href="#xrange"><code>xrange()</code> global function</a>
|
||||
<li><a href="#raw_input"><code>raw_input()</code> and <code>input()</code> global functions</a>
|
||||
<li><a href="#funcattrs"><code>func_*</code> function attributes</a>
|
||||
@@ -94,7 +92,6 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
<li><a href="#numliterals">Number literals</a>
|
||||
<li><a href="#renames"><code>sys.maxint</code></a>
|
||||
<li><a href="#unicode"><code>unicode()</code> global function</a>
|
||||
<li><a href="#unicodeliteral">Unicode string literals</a>
|
||||
<li><a href="#callable"><code>callable()</code> global function</a>
|
||||
<li><a href="#zip"><code>zip()</code> global function</a>
|
||||
<li><a href="#standarderror"><code>StandardError()</code> exception</a>
|
||||
@@ -114,9 +111,7 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
|
||||
<h2 id="divingin">Diving in</h2>
|
||||
|
||||
<p class="fancy">FIXME intro
|
||||
|
||||
<p>...
|
||||
<p class="fancy">Python 3 comes with a utility script called <code class="filename">2to3</code>, which takes your actual Python 2 source code as input and auto-converts as much as it can to Python 3. <a href="case-study-porting-chardet-to-python-3.html#running2to3">Case study: porting <code class="filename">chardet</code> to Python 3</a> describes how to run the <code class="filename">2to3</code> script, then shows some things it can't fix automatically. This appendix documents what it <em>can</em> fix automatically.
|
||||
|
||||
<h2 id="print"><code>print</code> statement</h2>
|
||||
|
||||
@@ -164,6 +159,84 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
<li>In Python 2, you could redirect the output to a pipe -- like <code>sys.stderr</code> -- by using the <code>>>pipe_name</code> syntax. In Python 3, the way to do this is to pass the pipe in the <code>file</code> keyword argument. The <code>file</code> argument defaults to <code>sys.stdout</code> (standard out), so overriding it will output to a different pipe instead.
|
||||
</ol>
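<p>A small sketch of the <code>file</code> keyword argument follows. It assumes an in-memory <code>io.StringIO</code> buffer as a stand-in for <code>sys.stderr</code>, purely so the output can be inspected afterwards:

```python
import io
import sys

# Python 2 spelling: print >>sys.stderr, "error message"
# Python 3 spelling: pass the destination in the file keyword argument.
buffer = io.StringIO()          # stand-in for sys.stderr in this sketch
print("error message", file=buffer)

# Omitting file= writes to sys.stdout, the default.
print("normal message")

captured = buffer.getvalue()
```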
|
||||
|
||||
<h2 id="unicodeliteral">Unicode string literals</h2>
|
||||
|
||||
<p>Python 2 had two string types: Unicode strings and non-Unicode strings. Python 3 has one string type: Unicode strings.
|
||||
|
||||
<p class="skip"><a href="#skipcompareunicodeliteral">skip over this table</a>
|
||||
<table id="compareunicodeliteral">
|
||||
<tr>
|
||||
<th>Notes</th>
|
||||
<th>Python 2</th>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<td><code>u"PapayaWhip"</code></td>
|
||||
<td><code>"PapayaWhip"</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>ur"PapayaWhip\foo"</code></td>
|
||||
<td><code>r"PapayaWhip\foo"</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareunicodeliteral">
|
||||
<li>Unicode string literals are simply converted into string literals, which, in Python 3, are always Unicode.
|
||||
<li>Unicode "raw" strings (in which Python does not auto-escape backslashes) are converted to raw strings. In Python 3, "raw" strings are also Unicode.
|
||||
</ol>
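<p>To illustrate what the converted literals look like at runtime, here is a short sketch using the same sample strings as the table. The length comparison is only there to show that a raw string leaves the backslash alone:

```python
# After conversion, plain literals are already Unicode strings.
s = "PapayaWhip"
raw = r"PapayaWhip\foo"      # raw: the backslash stays a backslash
cooked = "PapayaWhip\foo"    # not raw: \f is read as a single form-feed character

s_is_str = isinstance(s, str)
raw_length = len(raw)        # 10 letters + backslash + "foo"
cooked_length = len(cooked)  # one character shorter, since \f collapses to one
```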
|
||||
|
||||
<h2 id="long"><code>long</code> data type</h2>
|
||||
|
||||
<p>Python 2 had separate <code>int</code> and <code>long</code> types for non-floating-point numbers. An <code>int</code> could not be any larger than <a href="#renames"><code>sys.maxint</code></a>, which varied by platform. Longs were defined by appending an <code>L</code> to the end of the number, and they could be, well, longer than ints. In Python 3, there is only one integer type, called <code>int</code>, which mostly behaves like the <code>long</code> type in Python 2.
|
||||
|
||||
<p>Since there are no longer two types, there is no need for special syntax to distinguish them.
|
||||
|
||||
<p>Further reading: <a href="http://www.python.org/dev/peps/pep-0237/">PEP 237: Unifying Long Integers and Integers</a>.
|
||||
|
||||
<p class="skip"><a href="#skipcomparelong">skip over this table</a>
|
||||
<table id="comparelong">
|
||||
<tr>
|
||||
<th>Notes</th>
|
||||
<th>Python 2</th>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<td><code>x = 1000000000000L</code></td>
|
||||
<td><code>x = 1000000000000</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>x = 0xFFFFFFFFFFFFL</code></td>
|
||||
<td><code>x = 0xFFFFFFFFFFFF</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>③</th>
|
||||
<td><code>long(x)</code></td>
|
||||
<td><code>int(x)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>④</th>
|
||||
<td><code>type(x) is long</code></td>
|
||||
<td><code>type(x) is int</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>⑤</th>
|
||||
<td><code>isinstance(x, long)</code></td>
|
||||
<td><code>isinstance(x, int)</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparelong">
|
||||
<li>Base 10 long integer literals become base 10 integer literals.
|
||||
<li>Base 16 long integer literals become base 16 integer literals.
|
||||
<li>In Python 3, the old <code>long()</code> function no longer exists, since longs don't exist. To coerce a variable to an integer, use the <code>int()</code> function.
|
||||
<li>To check whether a variable is an integer, get its type and compare it to <code>int</code>, not <code>long</code>.
|
||||
<li>You can also use the <code>isinstance()</code> function to check data types; again, use <code>int</code>, not <code>long</code>, to check for integers.
|
||||
</ol>
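<p>The following sketch shows the converted forms from the table running under Python 3. The value <code>2 ** 100</code> is an added illustration of an integer that would have been a <code>long</code> in Python 2:

```python
# No trailing L is needed (or allowed) on integer literals.
x = 1000000000000
y = 0xFFFFFFFFFFFF

# Integers grow as large as needed; there is no separate long type.
big = 2 ** 100

coerced = int("42")                   # replaces long("42")
type_check = type(big) is int         # replaces type(big) is long
instance_check = isinstance(x, int)   # replaces isinstance(x, long)
```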
|
||||
|
||||
<h2 id="ne"><> comparison</h2>
|
||||
|
||||
<p>Python 2 supported <code><></code> as a synonym for <code>!=</code>, the not-equals comparison operator. Python 3 supports the <code>!=</code> operator, but not <code><></code>.
|
||||
@@ -277,11 +350,11 @@ for (var i = arTables.length - 1; i >= 0; i--) {
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparedict">
|
||||
<li><code>2to3</code> errs on the side of safety, converting the return value from <code>keys()</code> to a static list with the <code>list()</code> function. This will always work, but it will be less efficient than using a view. You should examine the converted code to see if a list is absolutely necessary, or if a view would do.
|
||||
<li>Another view-to-list conversion, with the <code>items()</code> method. <code>2to3</code> will do the same thing with the <code>values()</code> method.
|
||||
<li><code class="filename">2to3</code> errs on the side of safety, converting the return value from <code>keys()</code> to a static list with the <code>list()</code> function. This will always work, but it will be less efficient than using a view. You should examine the converted code to see if a list is absolutely necessary, or if a view would do.
|
||||
<li>Another view-to-list conversion, with the <code>items()</code> method. <code class="filename">2to3</code> will do the same thing with the <code>values()</code> method.
|
||||
<li>Python 3 does not support the <code>iterkeys()</code> method anymore. Use <code>keys()</code>, and if necessary, convert the view to an iterator with the <code>iter()</code> function.
|
||||
<li><code>2to3</code> recognizes when the <code>iterkeys()</code> method is used inside a list comprehension, and converts it to the <code>keys()</code> method (without wrapping it in an extra call to <code>iter()</code>). This works because views are iterable.
|
||||
<li><code>2to3</code> recognizes that the <code>keys()</code> method is immediately passed to a function which iterates through an entire sequence, so there is no need to convert the return value to a list first. The <code>min()</code> function will happily iterate through the view instead. This applies to <code>min()</code>, <code>max()</code>, <code>sum()</code>, <code>list()</code>, <code>tuple()</code>, <code>set()</code>, <code>sorted()</code>, <code>any()</code>, and <code>all()</code>.
|
||||
<li><code class="filename">2to3</code> recognizes when the <code>iterkeys()</code> method is used inside a list comprehension, and converts it to the <code>keys()</code> method (without wrapping it in an extra call to <code>iter()</code>). This works because views are iterable.
|
||||
<li><code class="filename">2to3</code> recognizes that the <code>keys()</code> method is immediately passed to a function which iterates through an entire sequence, so there is no need to convert the return value to a list first. The <code>min()</code> function will happily iterate through the view instead. This applies to <code>min()</code>, <code>max()</code>, <code>sum()</code>, <code>list()</code>, <code>tuple()</code>, <code>set()</code>, <code>sorted()</code>, <code>any()</code>, and <code>all()</code>.
|
||||
</ol>
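<p>The difference between a view and a static list can be sketched as follows. The dictionary and variable names here are illustrative only:

```python
d = {"a": 1, "b": 2}

keys_view = d.keys()          # a live view onto the dictionary
keys_list = list(d.keys())    # what 2to3 generates: a static snapshot

d["c"] = 3                    # mutate the dictionary afterwards

view_sees_new_key = "c" in keys_view    # the view reflects the change
list_sees_new_key = "c" in keys_list    # the snapshot does not

# Functions that consume a whole sequence accept the view directly.
smallest = min(d.keys())

# Replacement for iterkeys(): wrap the view in iter() when an iterator is required.
key_iter = iter(d.keys())
first_key = next(key_iter)
```

<p>When reviewing converted code, a <code>list()</code> wrapper is only needed if the code indexes into the result, mutates the dictionary while looping, or otherwise relies on a snapshot.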
|
||||
|
||||
<h2 id="imports">Modules that have been renamed or reorganized</h2>
|
||||
@@ -378,7 +451,7 @@ from urllib.error import HTTPError</code></pre></td>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareimporturllib">
|
||||
<li>The old <code>urllib</code> module in Python 2 had a variety of functions, including <code>urlopen()</code> for fetching data and <code>splittype()</code>, <code>splithost()</code>, and <code>splituser()</code> for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new <code>urllib</code> package. <code>2to3</code> will also change all calls to these functions so they use the new naming scheme.
|
||||
<li>The old <code>urllib</code> module in Python 2 had a variety of functions, including <code>urlopen()</code> for fetching data and <code>splittype()</code>, <code>splithost()</code>, and <code>splituser()</code> for splitting a URL into its constituent parts. These functions have been reorganized more logically within the new <code>urllib</code> package. <code class="filename">2to3</code> will also change all calls to these functions so they use the new naming scheme.
|
||||
<li>The old <code>urllib2</code> module in Python 2 has been folded into the <code>urllib</code> package in Python 3. All your <code>urllib2</code> favorites -- the <code>build_opener()</code> method, <code>Request</code> objects, and <code>HTTPBasicAuthHandler</code> and friends -- are still available.
|
||||
<li>The <code>urllib.parse</code> module in Python 3 contains all the parsing functions from the old <code>urlparse</code> module in Python 2.
|
||||
<li>The <code>urllib.robotparser</code> module parses <a href="http://www.robotstxt.org/"><code>robots.txt</code> files</a>.
|
||||
@@ -610,9 +683,9 @@ except ImportError:
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparefilter">
|
||||
<li>In the most basic case, <code>2to3</code> will wrap a call to <code>filter()</code> with a call to <code>list()</code>, which simply iterates through its argument and returns a real list.
|
||||
<li>However, if the call to <code>filter()</code> is <em>already</em> wrapped in <code>list()</code>, <code>2to3</code> will do nothing, since the fact that <code>filter()</code> is returning an iterator is irrelevant.
|
||||
<li>For the special syntax of <code>filter(None, ...)</code>, <code>2to3</code> will transform the call into a semantically equivalent list comprehension.
|
||||
<li>In the most basic case, <code class="filename">2to3</code> will wrap a call to <code>filter()</code> with a call to <code>list()</code>, which simply iterates through its argument and returns a real list.
|
||||
<li>However, if the call to <code>filter()</code> is <em>already</em> wrapped in <code>list()</code>, <code class="filename">2to3</code> will do nothing, since the fact that <code>filter()</code> is returning an iterator is irrelevant.
|
||||
<li>For the special syntax of <code>filter(None, ...)</code>, <code class="filename">2to3</code> will transform the call into a semantically equivalent list comprehension.
|
||||
<li>In contexts like <code>for</code> loops, which iterate through the entire sequence anyway, no changes are necessary.
|
||||
<li>Again, no changes are necessary, because the list comprehension will iterate through the entire sequence, and it can do that just as well if <code>filter()</code> returns an iterator as if it returns a list.
|
||||
</ol>
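<p>A brief sketch of the <code>filter()</code> cases described above, with a hypothetical <code>is_odd()</code> helper and sample data chosen for illustration:

```python
def is_odd(n):
    """Hypothetical predicate used only for this example."""
    return n % 2 == 1

numbers = [1, 2, 3, 4, 5]

lazy = filter(is_odd, numbers)            # Python 3: an iterator, not a list
as_list = list(filter(is_odd, numbers))   # the wrapper 2to3 adds

# filter(None, seq) keeps truthy items; a list comprehension with a bare
# truth test expresses the same thing.
mixed = [0, 1, "", "a", None]
truthy = [item for item in mixed if item]

# A for loop consumes the iterator directly, so no list() is needed.
total = 0
for n in filter(is_odd, numbers):
    total += n
```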
|
||||
@@ -656,9 +729,9 @@ except ImportError:
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparemap">
|
||||
<li>As with <code>filter()</code>, in the most basic case, <code>2to3</code> will wrap a call to <code>map()</code> with a call to <code>list()</code>.
|
||||
<li>For the special syntax of <code>map(None, ...)</code>, the identity function, <code>2to3</code> will convert it to an equivalent call to <code>list()</code>.
|
||||
<li>If the first argument to <code>map()</code> is a lambda function, <code>2to3</code> will convert it to an equivalent list comprehension.
|
||||
<li>As with <code>filter()</code>, in the most basic case, <code class="filename">2to3</code> will wrap a call to <code>map()</code> with a call to <code>list()</code>.
|
||||
<li>For the special syntax of <code>map(None, ...)</code>, the identity function, <code class="filename">2to3</code> will convert it to an equivalent call to <code>list()</code>.
|
||||
<li>If the first argument to <code>map()</code> is a lambda function, <code class="filename">2to3</code> will convert it to an equivalent list comprehension.
|
||||
<li>In contexts like <code>for</code> loops, which iterate through the entire sequence anyway, no changes are necessary.
|
||||
<li>Again, no changes are necessary, because the list comprehension will iterate through the entire sequence, and it can do that just as well if <code>map()</code> returns an iterator as if it returns a list.
|
||||
</ol>
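<p>The same pattern applies to <code>map()</code>. This sketch uses <code>str.upper</code> and a squaring lambda as stand-in functions:

```python
words = ["papaya", "whip"]

lazy = map(str.upper, words)             # Python 3: an iterator
as_list = list(map(str.upper, words))    # the wrapper 2to3 adds

# For map(lambda x: x * x, range(4)), an equivalent list comprehension:
squares = [x * x for x in range(4)]

# Iterating contexts consume the iterator directly.
joined = "".join(map(str.upper, words))
```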
|
||||
@@ -668,8 +741,8 @@ except ImportError:
|
||||
<p>In Python 3, the <code>reduce()</code> function has been removed from the global namespace and placed in the <code class="filename">functools</code> module.
|
||||
|
||||
<blockquote class="note">
|
||||
<p>☞</p>
|
||||
<p>The version of <code class="filename">2to3</code> that shipped with Python 3.0 would not fix this case automatically. The fix first appeared in the <code class="filename">2to3</code> script that shipped with Python 3.1.
|
||||
<p>☞
|
||||
<p>The version of <code class="filename">2to3</code> that shipped with Python 3.0 would not fix the <code>reduce()</code> function automatically. The fix first appeared in the <code class="filename">2to3</code> script that shipped with Python 3.1.
|
||||
</blockquote>
|
||||
|
||||
<p class="skip"><a href="#skipcomparereduce">skip over this table</a>
|
||||
@@ -691,7 +764,7 @@ reduce(a, b, c)</code></pre></td>
|
||||
|
||||
<h2 id="apply"><code>apply()</code> global function</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>Python 2 had a global function called <code>apply()</code>, which took a function <var>f</var> and a list <code>[a, b, c]</code> and returned <code>f(a, b, c)</code>. In Python 3, the <code>apply()</code> function no longer exists. Instead, there is a new function calling syntax that allows you to pass a list and have Python apply the list as the function's arguments.
|
||||
|
||||
<p class="skip"><a href="#skipcompareapply">skip over this table</a>
|
||||
<table id="compareapply">
|
||||
@@ -702,36 +775,36 @@ reduce(a, b, c)</code></pre></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<td><code>apply(a_function, args)</code></td>
|
||||
<td><code>a_function(*args)</code></td>
|
||||
<td><code>apply(a_function, a_list_of_args)</code></td>
|
||||
<td><code>a_function(*a_list_of_args)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>apply(a_function, args, kwds)</code></td>
|
||||
<td><code>a_function(*args, **kwds)</code></td>
|
||||
<td><code>apply(a_function, a_list_of_args, a_dictionary_of_named_args)</code></td>
|
||||
<td><code>a_function(*a_list_of_args, **a_dictionary_of_named_args)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>③</th>
|
||||
<td><code>apply(a_function, args + z)</code></td>
|
||||
<td><code>a_function(*args + z)</code></td>
|
||||
<td><code>apply(a_function, a_list_of_args + z)</code></td>
|
||||
<td><code>a_function(*a_list_of_args + z)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>④</th>
|
||||
<td><code>apply(aModule.a_function, args)</code></td>
|
||||
<td><code>aModule.a_function(*args)</code></td>
|
||||
<td><code>apply(aModule.a_function, a_list_of_args)</code></td>
|
||||
<td><code>aModule.a_function(*a_list_of_args)</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareapply">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>In the simplest form, you can call a function with a list of arguments (an actual list like <code>[a, b, c]</code>) by prepending the list with an asterisk (<code>*</code>). This is exactly equivalent to the old <code>apply()</code> function in Python 2.
|
||||
<li>In Python 2, the <code>apply()</code> function could actually take three parameters: a function, a list of arguments, and a dictionary of named arguments. In Python 3, you can accomplish the same thing by prepending the list of arguments with an asterisk (<code>*</code>) and the dictionary of named arguments with two asterisks (<code>**</code>).
|
||||
<li>The <code>+</code> operator, used here for list concatenation, takes precedence over the <code>*</code> operator, so there is no need for extra parentheses around <code>a_list_of_args + z</code>.
|
||||
<li>The <code class="filename">2to3</code> script is smart enough to convert complex <code>apply()</code> calls, including calling functions within imported modules.
|
||||
</ol>
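<p>The calling syntax that replaces <code>apply()</code> can be sketched like this. The body of <code>a_function()</code> and the argument values are made up for the example; only the variable names come from the table:

```python
def a_function(a, b, c=0, d=0):
    """Hypothetical function that just reports what it received."""
    return (a, b, c, d)

a_list_of_args = [1, 2]
a_dictionary_of_named_args = {"d": 4}
z = [3]

# apply(a_function, a_list_of_args)
r1 = a_function(*a_list_of_args)

# apply(a_function, a_list_of_args, a_dictionary_of_named_args)
r2 = a_function(*a_list_of_args, **a_dictionary_of_named_args)

# apply(a_function, a_list_of_args + z): the lists are concatenated first,
# then the combined list is unpacked.
r3 = a_function(*a_list_of_args + z)
```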
|
||||
|
||||
<h2 id="intern"><code>intern()</code> global function</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>In Python 2, you could call the <code>intern()</code> function on a string to intern it as a performance optimization. In Python 3, the <code>intern()</code> function has been moved to the <code class="filename">sys</code> module.
|
||||
|
||||
<p class="skip"><a href="#skipcompareintern">skip over this table</a>
|
||||
<table id="compareintern">
|
||||
@@ -741,19 +814,17 @@ reduce(a, b, c)</code></pre></td>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<th></th>
|
||||
<td><code>intern(aString)</code></td>
|
||||
<td><code>sys.intern(aString)</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareintern">
|
||||
<li>...
|
||||
</ol>
|
||||
<p id="skipcompareintern">
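<p>As a small illustration, assuming two equal strings built at runtime, <code>sys.intern()</code> returns the same object for both, so they can be compared by identity:

```python
import sys

# Build the strings at runtime so they start out as separate objects.
first = "".join(["Papaya", "Whip"])
second = "".join(["Papaya", "Whip"])

# sys.intern() returns the single canonical copy for equal strings.
a = sys.intern(first)
b = sys.intern(second)

same_object = a is b
```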
|
||||
|
||||
<h2 id="exec"><code>exec</code> statement</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>Just as <a href="#print">the <code>print</code> statement</a> became a function in Python 3, so too has the <code>exec</code> statement. The <code>exec()</code> function takes a string which contains arbitrary Python code and executes it as if it were just another statement or expression.
|
||||
|
||||
<p class="skip"><a href="#skipcompareexec">skip over this table</a>
|
||||
<table id="compareexec">
|
||||
@@ -780,14 +851,19 @@ reduce(a, b, c)</code></pre></td>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareexec">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>In the simplest form, the <code class="filename">2to3</code> script simply encloses the code-as-a-string in parentheses, since <code>exec()</code> is now a function instead of a statement.
|
||||
<li>The old <code>exec</code> statement could take a namespace, a private environment of globals in which the code-as-a-string would be executed. Python 3 can also do this; just pass the namespace as the second argument to the <code>exec()</code> function.
|
||||
<li>Even fancier, the old <code>exec</code> statement could also take a local namespace (like the variables defined within a function). In Python 3, the <code>exec()</code> function can do that too.
|
||||
</ol>
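<p>The three forms can be sketched as follows. The code strings and namespace dictionaries are invented for the example:

```python
# Simplest form: exec "..." becomes exec("...").
exec("simple = 1 + 1")

# With a global namespace: exec code in ns becomes exec(code, ns).
global_ns = {"x": 41}
exec("result = x + 1", global_ns)

# With separate global and local namespaces:
# exec code in g, l becomes exec(code, g, l).
local_ns = {}
exec("y = x * 2", {"x": 21}, local_ns)

# Names assigned by the executed code land in the namespace that was passed in.
result = global_ns["result"]
y = local_ns["y"]
```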
|
||||
|
||||
<h2 id="execfile"><code>execfile</code> statement (3.1+)</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>Like the old <a href="#exec"><code>exec</code> statement</a>, the old <code>execfile</code> statement will execute strings as if they were Python code. Where <code>exec</code> took a string, <code>execfile</code> took a filename. In Python 3, the <code>execfile</code> statement has been eliminated. If you really need to take a file of Python code and execute it (but you're not willing to simply import it), you can accomplish the same thing by opening the file, reading its contents, calling the global <code>compile()</code> function to force the Python interpreter to compile the code, and then calling the new <code>exec()</code> function.
|
||||
|
||||
<blockquote class="note">
|
||||
<p>☞
|
||||
<p>The version of <code class="filename">2to3</code> that shipped with Python 3.0 would not fix the <code>execfile</code> statement automatically. The fix first appeared in the <code class="filename">2to3</code> script that shipped with Python 3.1.
|
||||
</blockquote>
|
||||
|
||||
<p class="skip"><a href="#skipcompareexecfile">skip over this table</a>
|
||||
<table id="compareexecfile">
|
||||
@@ -797,19 +873,17 @@ reduce(a, b, c)</code></pre></td>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<th></th>
|
||||
<td><code>execfile("a_filename")</code></td>
|
||||
<td><code>execfile(compile(open("a_filename").read(), "a_filename", "exec"))</code></td>
|
||||
<td><code>exec(compile(open("a_filename").read(), "a_filename", "exec"))</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareexecfile">
|
||||
<li>...
|
||||
</ol>
|
||||
<p id="skipcompareexecfile">
|
||||
|
||||
<h2 id="repr"><code>repr</code> literals (backticks)</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>In Python 2, there was a special syntax of wrapping any object in backticks (like <code>`x`</code>) to get a representation of the object. In Python 3, this capability still exists, but you can no longer use backticks to get it. Instead, use the global <code>repr()</code> function.
|
||||
|
||||
<p class="skip"><a href="#skipcomparerepr">skip over this table</a>
|
||||
<table id="comparerepr">
|
||||
@@ -825,25 +899,19 @@ reduce(a, b, c)</code></pre></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>`1 + 2`</code></td>
|
||||
<td><code>repr(1 + 2)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>③</th>
|
||||
<td><code>`"PapayaWhip" + `2``</code></td>
|
||||
<td><code>repr("PapayaWhip" + repr(2))</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparerepr">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>Remember, <var>x</var> can be anything -- a class, a function, a module, a primitive data type, etc. The <code>repr()</code> function works on everything.
|
||||
<li>In Python 2, backticks could be nested, leading to this sort of confusing (but valid) expression. The <code class="filename">2to3</code> tool is smart enough to convert this into nested calls to <code>repr()</code>.
|
||||
</ol>
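<p>A short sketch of the Python 3 spellings, reusing the expressions from the table. The variable names are only for illustration.

```python
# Python 2 wrote `1 + 2`; Python 3 calls repr() instead.
simple = repr(1 + 2)

# Nested backticks become nested calls to repr().
nested = repr("PapayaWhip" + repr(2))

# repr() works on anything: functions, classes, modules, and so on.
of_function = repr(len)
of_class = repr(dict)

print(simple)
print(nested)
```

<p>Note that the outer <code>repr()</code> in the nested case returns a string that includes its own quotation marks.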
|
||||
|
||||
<h2 id="except"><code>try...except</code> statement</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>The syntax for catching exceptions has changed slightly between Python 2 and 3.
|
||||
|
||||
<p class="skip"><a href="#skipcompareexcept">skip over this table</a>
|
||||
<table id="compareexcept">
|
||||
@@ -893,12 +961,17 @@ except:
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareexcept">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>Instead of a comma after the exception type, Python 3 uses a new keyword, <code>as</code>.
|
||||
<li>The <code>as</code> keyword also works for catching multiple types of exceptions at once.
|
||||
<li>If you catch an exception but don't actually care about accessing the exception object itself, the syntax is identical between Python 2 and 3.
|
||||
<li>Similarly, if you use a fallback to catch <em>all</em> exceptions, the syntax is identical.
|
||||
</ol>
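<p>The <code>as</code> forms described above look like this in practice. The functions <code>parse_number()</code> and <code>first_item()</code> are hypothetical, written only for this sketch.

```python
def parse_number(text):
    try:
        return int(text)
    # Python 2 wrote "except ValueError, error:"; Python 3 uses "as".
    except ValueError as error:
        return "could not parse: {0}".format(error)


def first_item(sequence):
    try:
        return sequence[0]
    # Several exception types at once: a tuple, then "as".
    except (IndexError, TypeError) as error:
        return type(error).__name__


def is_number(text):
    try:
        int(text)
    # No exception object needed: identical in Python 2 and 3.
    except ValueError:
        return False
    return True


print(parse_number("42"))
print(first_item([]))
print(is_number("PapayaWhip"))
```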
|
||||
|
||||
<blockquote class="note">
|
||||
<p>☞
|
||||
<p>You should never use a fallback to catch <em>all</em> exceptions when importing modules (or most other times), because it will also catch things like <code>KeyboardInterrupt</code> (if the user pressed <kbd>Ctrl-C</kbd> to interrupt the program) and can make it more difficult to debug errors.
|
||||
</blockquote>
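<p>One common alternative to a catch-all fallback (a suggestion of this sketch, not something <code class="filename">2to3</code> does for you) is to catch <code>Exception</code>. <code>KeyboardInterrupt</code> does not inherit from <code>Exception</code>, so it still propagates. The function names below are hypothetical.

```python
def run_safely(action):
    try:
        return action()
    # Catches ordinary errors, but not KeyboardInterrupt or SystemExit,
    # which derive from BaseException rather than Exception.
    except Exception as error:
        return "handled {0}".format(type(error).__name__)


def divide_by_zero():
    return 1 / 0


def simulate_ctrl_c():
    # Stands in for the user pressing Ctrl-C.
    raise KeyboardInterrupt


ordinary = run_safely(divide_by_zero)

interrupted = False
try:
    run_safely(simulate_ctrl_c)
except KeyboardInterrupt:
    # The interrupt escaped run_safely(), as it should.
    interrupted = True

print(ordinary)
print(interrupted)
```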
|
||||
|
||||
<h2 id="raise"><code>raise</code> statement</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
@@ -967,55 +1040,9 @@ except:
|
||||
<li>...
|
||||
</ol>
|
||||
|
||||
<h2 id="long"><code>long</code> data type</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
|
||||
<p class="skip"><a href="#skipcomparelong">skip over this table</a>
|
||||
<table id="comparelong">
|
||||
<tr>
|
||||
<th>Notes</th>
|
||||
<th>Python 2</th>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<td><code>x = 1000000000000L</code></td>
|
||||
<td><code>x = 1000000000000</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>x = 0xFFFFFFFFFFFFL</code></td>
|
||||
<td><code>x = 0xFFFFFFFFFFFF</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>③</th>
|
||||
<td><code>long(x)</code></td>
|
||||
<td><code>int(x)</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>④</th>
|
||||
<td><code>type(x) is long</code></td>
|
||||
<td><code>type(x) is int</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>⑤</th>
|
||||
<td><code>isinstance(x, long)</code></td>
|
||||
<td><code>isinstance(x, int)</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparelong">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
</ol>
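<p>As a rough illustration of the table above: in Python 3 there is only <code>int</code>, and it handles values of any size. The variable names here are arbitrary.

```python
# No trailing L: plain integer literals can be arbitrarily large.
x = 1000000000000
hex_value = 0xFFFFFFFFFFFF
big = 2 ** 100

# long(x) becomes int(x); type and isinstance checks use int.
converted = int("1000000000000")

print(type(x) is int)
print(isinstance(big, int))
print(converted == x)
```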
|
||||
|
||||
<h2 id="xrange"><code>xrange()</code> global function</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
<p>In Python 2, there were two ways to get a range of numbers: <code>range()</code>, which returned a list, and <code>xrange()</code>, which returned a lazy object that produced its numbers on demand (loosely, an iterator). In Python 3, <code>range()</code> behaves the way <code>xrange()</code> used to, and <code>xrange()</code> no longer exists.
|
||||
|
||||
<p class="skip"><a href="#skipcomparexrange">skip over this table</a>
|
||||
<table id="comparexrange">
|
||||
@@ -1031,8 +1058,8 @@ except:
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>a_sequence = range(10)</code></td>
|
||||
<td><code>a_sequence = list(range(10))</code></td>
|
||||
<td><code>a_list = range(10)</code></td>
|
||||
<td><code>a_list = list(range(10))</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>③</th>
|
||||
@@ -1052,11 +1079,11 @@ except:
|
||||
</table>
|
||||
|
||||
<ol id="skipcomparexrange">
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>...
|
||||
<li>In the simplest case, the <code>2to3</code> script just converts <code>xrange()</code> to <code>range()</code>.
|
||||
<li>If your Python 2 code used <code>range()</code>, the <code>2to3</code> script does not know whether you needed a list, or whether an iterator would do. It errs on the side of caution and coerces the return value into a list by calling the <code>list()</code> function.
|
||||
<li>If the <code>xrange()</code> function was inside a list comprehension, there is no need to coerce the result to a list, since the list comprehension will work just fine with an iterator.
|
||||
<li>Similarly, a <code>for</code> loop will work just fine with an iterator, so there is no need to change anything here.
|
||||
<li>The <code>sum()</code> function will also work with an iterator, so <code>2to3</code> makes no changes here either. Like <a href="#dict">dictionary methods that return views instead of lists</a>, this applies to <code>min()</code>, <code>max()</code>, <code>sum()</code>, <code>list()</code>, <code>tuple()</code>, <code>set()</code>, <code>sorted()</code>, <code>any()</code>, and <code>all()</code>.
|
||||
</ol>
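<p>The cases above can be tried out in a Python 3 shell. This sketch uses made-up variable names. The only point is which calls need <code>list()</code> and which do not.

```python
# Python 3's range() produces numbers on demand, not a list.
numbers = range(10)

# When a real list is required, wrap the call in list().
a_list = list(range(10))

# List comprehensions, for loops, and sum() consume the range directly.
squares = [i * i for i in range(10)]

collected = []
for i in range(3):
    collected.append(i)

total = sum(range(10))

print(isinstance(numbers, list))
print(a_list)
print(total)
```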
|
||||
|
||||
<h2 id="raw_input"><code>raw_input()</code> and <code>input()</code> global functions</h2>
|
||||
@@ -1423,34 +1450,6 @@ a_function(sys.maxsize)</code></pre></td>
|
||||
<li>...
|
||||
</ol>
|
||||
|
||||
<h2 id="unicodeliteral">Unicode string literals</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
|
||||
<p class="skip"><a href="#skipcompareunicodeliteral">skip over this table</a>
|
||||
<table id="compareunicodeliteral">
|
||||
<tr>
|
||||
<th>Notes</th>
|
||||
<th>Python 2</th>
|
||||
<th>Python 3</th>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>①</th>
|
||||
<td><code>u"PapayaWhip"</code></td>
|
||||
<td><code>"PapayaWhip"</code></td>
|
||||
</tr>
|
||||
<tr>
|
||||
<th>②</th>
|
||||
<td><code>ur"PapayaWhip\foo"</code></td>
|
||||
<td><code>r"PapayaWhip\foo"</code></td>
|
||||
</tr>
|
||||
</table>
|
||||
|
||||
<ol id="skipcompareunicodeliteral">
|
||||
<li>...
|
||||
<li>...
|
||||
</ol>
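<p>A small sketch of what the converted literals mean in Python 3, where every string is already a Unicode string. The variable names are invented for illustration.

```python
# u"PapayaWhip" becomes a plain literal; it is a str (Unicode) either way.
plain = "PapayaWhip"

# ur"..." becomes r"...": a raw string keeps the backslash as-is.
raw = r"PapayaWhip\foo"

# Without the r prefix, \f would be read as a form feed character.
not_raw = "PapayaWhip\foo"

print(type(plain) is str)
print(len(raw))
print(len(not_raw))
```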
|
||||
|
||||
<h2 id="callable"><code>callable()</code> global function</h2>
|
||||
|
||||
<p>FIXME intro
|
||||
@@ -1585,6 +1584,11 @@ a_function(sys.maxsize)</code></pre></td>
|
||||
|
||||
<p>FIXME intro
|
||||
|
||||
<blockquote class="note">
|
||||
<p>☞
|
||||
<p>The version of <code class="filename">2to3</code> that shipped with Python 3.0 would not fix these cases of <code>isinstance()</code> automatically. The fix first appeared in the <code class="filename">2to3</code> script that shipped with Python 3.1.
|
||||
</blockquote>
|
||||
|
||||
<p class="skip"><a href="#skipcompareisinstance">skip over this table</a>
|
||||
<table id="compareisinstance">
|
||||
<tr>
|
||||
@@ -1912,7 +1916,7 @@ do_stuff(a_list)</code></pre></td>
|
||||
<p>FIXME: once the rest of the book is written, this appendix should contain copious links back to any chapter or section that touches on these features.
|
||||
|
||||
<footer>
|
||||
<p class="c">© 2001-4, 2009 Mark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a>
|
||||
<p class="c">© 2001-4, 2009 <span>ℳ</span>ark Pilgrim, <a rel="license" href="http://creativecommons.org/licenses/by/3.0/">CC-BY-3.0</a>
|
||||
</footer>
|
||||
|
||||
</body>
|
||||
|
||||