diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 8b726b5..b9877b4 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -41,9 +41,9 @@ body{counter-reset:h1 19}
chardet: a mini-FAQ
-When you think of “text”, you probably think of “characters and symbols I see on my computer screen”. But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
+When you think of "text", you probably think of "characters and symbols I see on my computer screen". But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
-In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
+In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
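To make the "decryption key" idea concrete, here is a minimal sketch (not part of chardet; the sample word is an arbitrary choice) that stores one string as bytes and then decodes those same bytes with the right key and with two wrong ones. The wrong keys don't fail; they just produce different characters.

```python
# One character, two encodings, two different byte sequences.
e_acute_utf8 = "\u00e9".encode("utf-8")      # b'\xc3\xa9'
e_acute_latin1 = "\u00e9".encode("latin-1")  # b'\xe9'

# One byte sequence, three readings, depending on the "key" you assume.
raw = "caf\u00e9".encode("utf-8")            # the bytes actually stored

as_utf8 = raw.decode("utf-8")                # the right key
as_latin1 = raw.decode("latin-1")            # wrong key: decodes, but to garbage
as_cp1251 = raw.decode("windows-1251")       # another wrong key, different garbage

for label, text in [("utf-8", as_utf8),
                    ("latin-1", as_latin1),
                    ("windows-1251", as_cp1251)]:
    print(label, ascii(text))
```

Nothing in the bytes themselves says which reading is the intended one, which is why the encoding has to be declared somewhere or guessed.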
-In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
+In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
-Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.
+Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of "probers" for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.
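The two cheap checks mentioned here (is there a BOM, and are there any high-bit bytes) can be sketched in a few lines. This is an illustration only, not chardet's code: `sniff_bom` and `has_high_bit` are hypothetical names, and the encoding labels are whatever I picked for the example.

```python
import codecs

# Order matters: the UTF-32 LE BOM begins with the UTF-16 LE BOM,
# so the longer marks have to be tested first.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8-SIG"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def sniff_bom(data: bytes):
    """Return an encoding label if data starts with a known BOM, else None."""
    for bom, label in BOMS:
        if data.startswith(bom):
            return label
    return None


def has_high_bit(data: bytes) -> bool:
    """True if any byte has its top bit set, i.e. the data is not pure ASCII."""
    return any(byte >= 0x80 for byte in data)


sample = "caf\u00e9".encode("utf-8")
print(sniff_bom(sample), has_high_bit(sample))
```

Only when there is no BOM and at least one high-bit byte does it make sense to start the more expensive probers.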
The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.
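The "drop a prober once it sees an illegal byte sequence" behaviour can be approximated with nothing but the standard library. The sketch below is not how chardet's probers are written: `ToyGroupProber` is a hypothetical class that uses Python's strict incremental decoders as stand-ins, it leaves out EUC-TW because it relies only on stdlib codecs, and it does not model confidence at all.

```python
import codecs

# Stand-ins for the encoding-specific probers named in the text.
CANDIDATES = ["big5", "gb2312", "euc_kr", "euc_jp", "shift_jis", "utf_8"]


class ToyGroupProber:
    """Feed bytes to every candidate; drop any that hits an illegal sequence."""

    def __init__(self, names=CANDIDATES):
        # One strict incremental decoder per candidate. Incremental decoders
        # keep state, so a character split across two feed() calls is fine.
        self.alive = {
            name: codecs.getincrementaldecoder(name)(errors="strict")
            for name in names
        }

    def feed(self, chunk: bytes):
        for name in list(self.alive):
            try:
                self.alive[name].decode(chunk)
            except UnicodeDecodeError:
                # Illegal byte sequence: this candidate is out for good,
                # and later feed() calls skip it.
                del self.alive[name]
        return sorted(self.alive)


group = ToyGroupProber()
data = "\u65e5\u672c\u8a9e".encode("utf-8")
group.feed(data[:4])   # deliberately split in the middle of a character
print(group.feed(data[4:]))
```

A real prober does more than rule itself out; it also has to decide when it is confident enough to report a positive result.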
@@ -124,7 +124,7 @@ body{counter-reset:h1 19}
SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.
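As a rough illustration of the 2-character idea, the toy scorer below counts how many adjacent letter pairs in a text belong to a small set of "frequent" pairs. Everything in it is an assumption made for the example: the ten English pairs, the `toy_confidence` name, and the plain hits-over-total ratio. The real language models and the real confidence formula (with its distribution ratio) are more involved.

```python
# A hypothetical, hand-picked "language model": ten pairs that are
# common in English text.
ENGLISH_COMMON = {"th", "he", "in", "er", "an", "re", "on", "at", "en", "nd"}


def toy_confidence(text, common=ENGLISH_COMMON):
    """Fraction of within-word 2-character sequences found in the model."""
    total = hits = 0
    for word in text.lower().split():
        for first, second in zip(word, word[1:]):
            total += 1
            if first + second in common:
                hits += 1
    return hits / total if total else 0.0


print(toy_confidence("the weather in england"))
print(toy_confidence("txzqJv 2!dasd0a QqdKjvz"))
```

The gibberish string is built from ordinary English letters, yet none of its pairs are common ones, which is exactly the signal a language model picks up on.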
-Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
+Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
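The final-letter trick can be sketched on already-decoded text. This is a deliberately simplified illustration, not HebrewProber's actual scoring: `guess_hebrew_order` is a hypothetical function that only asks whether the five final-form letters show up at the end of words (stored in reading order) or at the start of words (stored reversed).

```python
# The five Hebrew letters that have a distinct word-final form:
# final kaf, final mem, final nun, final pe, final tsadi.
FINAL_FORMS = set("\u05da\u05dd\u05df\u05e3\u05e5")


def guess_hebrew_order(text):
    """Guess the encoding from where final-form letters sit inside words."""
    logical = visual = 0
    for word in text.split():
        if word[-1] in FINAL_FORMS:
            logical += 1   # final form at the end: stored in reading order
        if word[0] in FINAL_FORMS:
            visual += 1    # final form at the start: stored reversed
    if logical > visual:
        return "windows-1255"
    if visual > logical:
        return "ISO-8859-8"
    return None            # no evidence either way


# "shalom olam" in reading order; both words end in final mem.
logical_text = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
visual_text = logical_text[::-1]   # the same line, stored backwards

print(guess_hebrew_order(logical_text))
print(guess_hebrew_order(visual_text))
```

With only two words of evidence the guess is fragile; a longer text gives the counts a chance to separate clearly.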
windows-1252
Dive Into Python 3 will cover Python 3 and its differences from Python 2. Compared to the original Dive Into Python, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the CC-BY-3.0 license.
Below is the draft table of contents. It is not finalized. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.
Yes, that is PapayaWhip. All hail PapayaWhip.
chardet to Python 3
chardet: a mini-FAQ
UTF-n with a BOM
windows-1252
2to3
2to3 can't
False is invalid syntax
constants
bytes' object to str implicitly
False is invalid syntax
constants
bytes' object to str implicitly
Tentative because most of these have not been ported to Python 3 yet.
-
-2to3
-print statement
-has_key() dictionary method
-http package
-urllib package
-dbm package
-xmlrpc package
-filter() global function
-map() global function
-reduce() global function (3.1+)
-apply() global function
-intern() global function
-exec statement
-execfile statement (3.1+)
-repr literals (backticks)
-try...except statement
-raise statement
-throw statement
-long data type
-xrange() global function
-raw_input() and input() global functions
-func_* function attributes
-xreadlines() I/O method
-lambda functions with multiple parameters
-next() iterator method
-__nonzero__ special class attribute
-sys.maxint
-unicode() global function
-callable() global function
-zip() global function
-StandardError() exception
-types module constants
-isinstance global function (3.1+)
-basestring datatype
-itertools module
-sys.exc_type, sys.exc_value, sys.exc_traceback
-os.getcwdu() function
-set() literals
-buffer() global function