mirror of
https://github.com/kennethreitz/dive-into-python3.git
synced 2026-06-05 23:10:17 +00:00
couple of sections of new-and-improved "unit testing" chapter
@@ -44,7 +44,7 @@ body{counter-reset:h1 20}
<li><a href=#cantconvertbytesobject>Can’t convert '<code>bytes</code>' object to <code>str</code> implicitly</a>
</ol>
</ol>
-<h2 id=divingin>Introducing <code class=filename>chardet</code>: a mini-FAQ</h2>
+<h2 id=divingin>Introducing <code class=filename>chardet</code>: a mini-<abbr>FAQ</abbr></h2>
<p class=fancy>When you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular <em>character encoding</em>. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
<p>In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
<h3 id=faq.what>What is character encoding auto-detection?</h3>
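The context lines in this hunk say that encodings can share characters yet store them as different byte sequences. A minimal Python 3 sketch of that point follows; the sample character and the two encodings are illustrative choices, not taken from the chapter.

```python
# One character, two encodings, two different byte sequences.
text = "é"  # U+00E9, chosen only for illustration

utf8_bytes = text.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # one byte in Latin-1

# Decoding with the encoding that produced the bytes round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with the wrong encoding "succeeds" but yields the wrong characters.
mojibake = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, round_trip, mojibake)
```

The last line is the failure mode the paragraph warns about: bytes labelled "text" are only meaningful once you know which encoding produced them.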
@@ -58,11 +58,11 @@ body{counter-reset:h1 20}
<h3 id=faq.yippie>Yippie! Screw the standards, I’ll just auto-detect everything!</h3>
<p>Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
<ul>
-<li>HTTP can define a <code>charset</code> parameter in the <code>Content-type</code> header.
-<li>HTML documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
-<li>XML documents can define an <code>encoding</code> attribute in the XML prolog.
+<li><abbr>HTTP</abbr> can define a <code>charset</code> parameter in the <code>Content-type</code> header.
+<li><abbr>HTML</abbr> documents can define a <code>&lt;meta http-equiv="content-type"&gt;</code> element in the <code>&lt;head&gt;</code> of a web page.
+<li><abbr>XML</abbr> documents can define an <code>encoding</code> attribute in the <abbr>XML</abbr> prolog.
</ul>
-<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
+<p>If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over <abbr>HTTP</abbr>, you need to support both standards <em>and</em> figure out which one wins if they give you conflicting information.)
<p>Despite the complexity, it’s worthwhile to follow standards and <a href=http://www.w3.org/2001/tag/doc/mime-respect>respect explicit character encoding information</a>. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
<h3 id=faq.why>Why bother with auto-detection if it’s slow, inaccurate, and non-standard?</h3>
<p>Sometimes you receive text with verifiably inaccurate encoding information. Or text without any encoding information, and the specified default encoding doesn’t work. There are also some poorly designed standards that have no way to specify encoding at all.
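The advice in this hunk (use explicit encoding information first, then a default) can be sketched for the XML-over-HTTP case it mentions. Everything here is an assumption made for illustration: the helper names are hypothetical, letting the HTTP `charset` win over the XML prolog is one possible precedence rule rather than something the chapter states, the UTF-8 fallback is a placeholder default, and the prolog regex only works when the document's first bytes are ASCII-compatible.

```python
import re
from email.message import Message

# Hypothetical helper: read the charset parameter from a Content-Type value.
def charset_from_content_type(header_value):
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased charset, or None if absent

# Assumption: the XML declaration sits at the very start of the body and is
# readable as ASCII; that does not hold for UTF-16 documents, for example.
XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

# Hypothetical helper: read the encoding attribute from the XML prolog.
def encoding_from_xml_prolog(body):
    match = XML_DECL.match(body)
    return match.group(1).decode("ascii") if match else None

# Assumed precedence: HTTP header, then XML prolog, then a default.
def pick_encoding(content_type, body, default="utf-8"):
    return (
        charset_from_content_type(content_type)
        or encoding_from_xml_prolog(body)
        or default
    )

body = b'<?xml version="1.0" encoding="koi8-r"?><doc/>'
print(pick_encoding("application/xml; charset=ISO-8859-1", body))
print(pick_encoding("application/xml", body))
print(pick_encoding(None, b"<doc/>"))
```

Auto-detection would only come into play after all of these explicit sources have been exhausted or shown to be wrong.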
@@ -676,7 +676,7 @@ TypeError: can't use a string pattern on a bytes-like object</samp></pre>
<pre><code>class UniversalDetector:
    def __init__(self):
        self._highBitDetector = re.compile(r'[\x80-\xFF]')</code></pre>
-<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-ASCII characters in the range 128-255 (0x80-0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-ASCII <em>bytes</em> in the range 128-255.
+<p id=skiphighbitdetectorcode>This pre-compiles a regular expression designed to find non-<abbr>ASCII</abbr> characters in the range 128–255 (0x80–0xFF). Wait, that’s not quite right; I need to be more precise with my terminology. This pattern is designed to find non-<abbr>ASCII</abbr> <em>bytes</em> in the range 128-255.
<p>And therein lies the problem.
<p>In Python 2, a string was an array of bytes whose character encoding was tracked separately. If you wanted Python 2 to keep track of the character encoding, you had to use a Unicode string (<code>u''</code>) instead. But in Python 3, a string is always what Python 2 called a Unicode string -- that is, an array of Unicode characters (of possibly varying byte lengths). Since this regular expression is defined by a string pattern, it can only be used to search a string -- again, an array of characters. But what we’re searching is not a string, it’s a byte array. Looking at the traceback, this error occurred in <code class=filename>universaldetector.py</code>:
<p class=skip><a href=#skipfeedhighbitdetectorcode>skip over this</a>
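The mismatch described in this hunk, a string pattern applied to a byte array, can be reproduced outside `chardet`. The sketch below is standalone Python 3 and does not use the library's code; the sample bytes are made up, and compiling the pattern from a bytes literal is shown as one possible fix, not necessarily the change the chapter goes on to make.

```python
import re

data = b"caf\xc3\xa9"  # made-up sample: the UTF-8 bytes of "café"

# A pattern compiled from a str can only search a str.
str_pattern = re.compile(r"[\x80-\xFF]")
try:
    str_pattern.search(data)
    error_message = None
except TypeError as exc:
    error_message = str(exc)  # mentions a string pattern and a bytes-like object

# A pattern compiled from bytes searches bytes, matching one byte at a time.
bytes_pattern = re.compile(rb"[\x80-\xFF]")
match = bytes_pattern.search(data)

print(error_message)
print(match.start(), match.group())
```

The bytes pattern finds the first high-bit byte at offset 3, which is the kind of check the `_highBitDetector` name suggests.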