&title;

&title; &title; Mark Pilgrim 2006 2007 2008 Mark Pilgrim &fileversion; <para>This documentation claims to describe the behavior of &chardet; &chardet_version;. It does not claim to describe the behavior of any other version.</para> <para>This documentation lives at <ulink url="&url_book;"/>. If you're reading it somewhere else, you may not have the latest version.</para> </abstract> <keywordset> <keyword>character</keyword> <keyword>set</keyword> <keyword>encoding</keyword> <keyword>detection</keyword> <keyword>Python</keyword> <keyword>XML</keyword> <keyword>feed</keyword> </keywordset> <legalnotice> <para>This documentation is provided by the author <quote>as is</quote> without any express or implied warranties. See <xref linkend="license"/> for more details.</para> </legalnotice> </articleinfo> <section id="faq"> <title>Frequently asked questions

What is character encoding? When you think of text, you probably think of "characters and symbols I see on my computer screen". But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk. In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's text, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).

What is character encoding auto-detection? It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key.

Isn't that impossible? In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that it isn't English (even though it is composed mostly of English letters). By studying lots of typical text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language. In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.

Who wrote this detection algorithm? This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors' comments, which are quite extensive and informative. You may also be interested in the research paper which led to the Mozilla implementation, &researchpaper;.

Yippie! Screw the standards, I'll just auto-detect everything! Don't do that. Virtually every format and protocol contains a method for specifying character encoding. HTTP can define a charset parameter in the Content-type header. HTML documents can define a <meta http-equiv="content-type"> element in the <head> of a web page. XML documents can define an encoding attribute in the XML prolog. If text comes with explicit character encoding information, you should use it. If the text has no explicit information, but the relevant standard defines a default encoding, you should use that. (This is harder than it sounds, because standards can overlap. If you fetch an XML document over HTTP, you need to support both standards and figure out which one wins if they give you conflicting information.) Despite the complexity, it's worthwhile to follow standards and respect explicit character encoding information. It will almost certainly be faster and more accurate than trying to auto-detect the encoding. It will also make the world a better place, since your program will interoperate with other programs that follow the same standards.
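To make "use the explicit information first" concrete, here is a minimal sketch that looks for a declared encoding before any auto-detection is attempted. The function names are hypothetical, and the precedence shown (HTTP header first, then the XML declaration) is a simplifying assumption; the relevant standards define the real rules, and they depend on the media type.

```python
import re
from email.message import Message

# Matches an encoding declaration in an XML prolog such as
# <?xml version="1.0" encoding="ISO-8859-1"?>
XML_DECL = re.compile(
    rb'^<\?xml[^>]*\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lower-cased, or None if absent

def declared_encoding(content_type, body):
    """Return an explicitly declared encoding, or None if there is none.

    Assumed precedence for this sketch: HTTP header, then XML prolog.
    """
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    match = XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return None
```

Only when a function like this returns None, and the relevant standard's default also fails, would auto-detection come into play.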

Why bother with auto-detection if it's slow, inaccurate, and non-standard? Sometimes you receive text with verifiably inaccurate encoding information, or text without any encoding information for which the specified default encoding doesn't work. There are also some poorly designed standards that have no way to specify encoding at all. If following the relevant standards gets you nowhere, and you decide that processing the text is more important than maintaining interoperability, then you can try to auto-detect the character encoding as a last resort. An example is my Universal Feed Parser, which calls this auto-detection library only after exhausting all other options.

<para>&chardet; currently supports over two dozen character encodings.</para> </abstract> </sectioninfo> <title>Supported encodings

Big5, GB2312/GB18030, EUC-TW, HZ-GB-2312, and ISO-2022-CN (Traditional and Simplified Chinese)
EUC-JP, SHIFT_JIS, and ISO-2022-JP (Japanese)
EUC-KR and ISO-2022-KR (Korean)
KOI8-R, MacCyrillic, IBM855, IBM866, ISO-8859-5, and windows-1251 (Russian)
ISO-8859-2 and windows-1250 (Hungarian)
ISO-8859-5 and windows-1251 (Bulgarian)
windows-1252
ISO-8859-7 and windows-1253 (Greek)
ISO-8859-8 and windows-1255 (Visual and Logical Hebrew)
TIS-620 (Thai)
UTF-32 &be;, &le;, 3412-ordered, or 2143-ordered (with a &bom;)
UTF-16 &be; or &le; (with a &bom;)
UTF-8 (with or without a &bom;)
&ascii;

<para>Due to inherent similarities between certain encodings, some encodings may be detected incorrectly. In my tests, the most problematic case was Hungarian text encoded as <literal>ISO-8859-2</literal> or <literal>windows-1250</literal> (encoded as one but reported as the other). Also, Greek text encoded as <literal>ISO-8859-7</literal> was often mis-reported as <literal>ISO-8859-2</literal>. Your mileage may vary.</para> </caution> </section> <section id="usage"> <?dbhtml filename="usage.html"?> <title>Usage

Basic usage

The easiest way to use the &chardet; library is with the &detect; function.

Using the &detect; function

The &detect; function takes one argument, a non-Unicode string (that is, a byte string). It returns a dictionary containing the auto-detected character encoding and a confidence level from &zero; to &one;.

&prompt;import urllib
&prompt;rawdata = urllib.urlopen('http://yahoo.co.jp/').read()
&prompt;import chardet
&prompt;chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}

Advanced usage

If you're dealing with a large amount of text, you can call the &chardet; library incrementally, and it will stop as soon as it is confident enough to report its results.

Create a &universaldetector_classname; object, then call its feed method repeatedly with each block of text. If the detector reaches a minimum threshold of confidence, it will set detector.done to True.

Once you've exhausted the source text, call detector.close(), which will do some final calculations in case the detector didn't hit its minimum confidence threshold earlier. Then detector.result will be a dictionary containing the auto-detected character encoding and confidence level (the same as the chardet.detect function returns).

Detecting encoding incrementally

{'encoding': 'EUC-JP', 'confidence': 0.99}

If you want to detect the encoding of multiple texts (such as separate files), you can re-use a single &universaldetector_classname; object. Just call detector.reset() at the start of each file, call detector.feed as many times as you like, and then call detector.close() and check the detector.result dictionary for the file's results.

Detecting encodings of multiple files
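The feed / done / close / reset protocol described above can be sketched as two small driver functions. The function names and the chunk size are hypothetical. The code relies only on the detector attributes named in the text, so any object with that interface works; the comments show how a detector from this library would be passed in, assuming it is importable as chardet.universaldetector.UniversalDetector.

```python
def detect_stream(detector, chunks):
    """Feed chunks to the detector until it is confident or input ends."""
    detector.reset()              # start clean so the detector can be re-used
    for chunk in chunks:
        detector.feed(chunk)
        if detector.done:         # minimum confidence reached: stop early
            break
    detector.close()              # final calculations if never "done"
    return detector.result

def detect_files(detector, paths, chunk_size=4096):
    """Detect the encoding of several files with one detector object."""
    results = {}
    for path in paths:
        with open(path, "rb") as handle:
            # Read fixed-size blocks until read() returns an empty string.
            chunks = iter(lambda: handle.read(chunk_size), b"")
            results[path] = detect_stream(detector, chunks)
    return results

# Intended use (not executed here):
#   from chardet.universaldetector import UniversalDetector
#   print(detect_files(UniversalDetector(), ["a.txt", "b.txt"]))
```

Because detect_stream resets the detector on entry, the same object can be handed to it any number of times.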

<para>This is a brief guide to navigating the code itself.</para> </abstract> </sectioninfo> <title>How it works

First, you should read &researchpaper;, which explains the detection algorithm and how it was derived. This will help you later when you stumble across the huge character frequency distribution tables like &big5freq_py; and language models like &langcyrillicmodel_py;.

The main entry point for the detection algorithm is &universaldetector_py;, which has one class, &universaldetector_classname;. (You might think the main entry point is the &detect; function in chardet/__init__.py, but that's really just a convenience function that creates a &universaldetector_classname; object, calls it, and returns its result.)

There are 5 categories of encodings that &universaldetector_classname; handles:

1. UTF-n with a &bom;. This includes UTF-8, both &be; and &le; variants of UTF-16, and all 4 byte-order variants of UTF-32.
2. Escaped encodings, which are entirely 7-bit &ascii; compatible, where non-&ascii; characters start with an escape sequence. Examples: ISO-2022-JP (Japanese) and HZ-GB-2312 (Chinese).
3. Multi-byte encodings, where each character is represented by a variable number of bytes. Examples: Big5 (Chinese), SHIFT_JIS (Japanese), EUC-KR (Korean), and UTF-8 without a &bom;.
4. Single-byte encodings, where each character is represented by one byte. Examples: KOI8-R (Russian), windows-1255 (Hebrew), and TIS-620 (Thai).
5. windows-1252, which is used primarily on Microsoft Windows by middle managers who don't know a character encoding from a hole in the ground.

<literal>UTF-n</literal> with a &bom;

If the text starts with a &bom;, we can reasonably assume that the text is encoded in UTF-8, UTF-16, or UTF-32. (The &bom; will tell us exactly which one; that's what it's for.) This is handled inline in &universaldetector_classname;, which returns the result immediately without any further processing.
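As an illustration of the byte-order-mark check, here is a minimal sketch using the signatures from Python's codecs module. The function name, the encoding labels, and the confidence of 1.0 are assumptions made for this example, not necessarily what the library reports. The one subtlety worth showing is ordering: the 4-byte UTF-32 signatures have to be tested before the 2-byte UTF-16 ones, because some of them share a prefix.

```python
import codecs

# Longest signatures first: the UTF-32 little-endian mark (FF FE 00 00)
# begins with the UTF-16 little-endian mark (FF FE), and the 3412-ordered
# mark (FE FF 00 00) begins with the UTF-16 big-endian mark (FE FF).
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (b"\xfe\xff\x00\x00", "UTF-32 (3412 byte order)"),
    (b"\x00\x00\xff\xfe", "UTF-32 (2143 byte order)"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

def sniff_bom(data):
    """Return a result dictionary if data starts with a known mark."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return {"encoding": name, "confidence": 1.0}
    return None  # no mark: fall through to the other detection strategies
```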

Escaped encodings

If the text contains a recognizable escape sequence that might indicate an escaped encoding, &universaldetector_classname; creates an &esccharsetprober_classname; (defined in &escprober_py;) and feeds it the text.

&esccharsetprober_classname; creates a series of state machines, based on models of HZ-GB-2312, ISO-2022-CN, ISO-2022-JP, and ISO-2022-KR (defined in &escsm_py;). &esccharsetprober_classname; feeds the text to each of these state machines, one byte at a time. If any state machine ends up uniquely identifying the encoding, &esccharsetprober_classname; immediately returns the positive result to &universaldetector_classname;, which returns it to the caller. If any state machine hits an illegal sequence, it is dropped and processing continues with the other state machines.
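The control flow described above (several state machines fed in lockstep, one byte at a time, with failed machines dropped and the first unique hit returned) can be sketched as follows. The machines here are deliberately toy stand-ins: each one only watches for a few signature sequences of its encoding and treats any high-bit byte as illegal, whereas the real models cover the full grammar of each encoding. All class, function, and state names are hypothetical.

```python
ESC = b"\x1b"

# A few signature sequences that announce each escaped encoding.
SIGNATURES = {
    "HZ-GB-2312": [b"~{"],
    "ISO-2022-CN": [ESC + b"$)A", ESC + b"$)G"],
    "ISO-2022-JP": [ESC + b"$@", ESC + b"$B", ESC + b"(J"],
    "ISO-2022-KR": [ESC + b"$)C"],
}

class ToyStateMachine:
    """Toy model: remembers the last few bytes and looks for a signature."""

    def __init__(self, name, signatures):
        self.name = name
        self.signatures = signatures
        self.max_len = max(len(s) for s in signatures)
        self.window = b""

    def next_state(self, byte):
        if byte >= 0x80:
            return "error"       # escaped encodings are strictly 7-bit
        self.window = (self.window + bytes([byte]))[-self.max_len:]
        if any(self.window.endswith(s) for s in self.signatures):
            return "found"
        return "continue"

def probe_escaped(data):
    """Return the name of the first machine to identify data, or None."""
    active = [ToyStateMachine(n, s) for n, s in SIGNATURES.items()]
    for byte in data:
        for machine in list(active):
            state = machine.next_state(byte)
            if state == "found":
                return machine.name
            if state == "error":
                active.remove(machine)   # drop it; the others carry on
        if not active:
            return None
    return None
```

With these toy machines, text encoded with Python's iso2022_jp codec is identified as soon as its first escape sequence has been read.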

Multi-byte encodings

Assuming no &bom;, &universaldetector_classname; checks whether the text contains any high-bit characters. If so, it creates a series of probers for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.

The multi-byte encoding prober, &mbcsgroupprober_classname; (defined in &mbcsgroupprober_py;), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. &mbcsgroupprober_classname; feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to &universaldetector_classname;.feed will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, &mbcsgroupprober_classname; reports this positive result to &universaldetector_classname;, which reports the result to the caller.

Most of the multi-byte encoding probers are inherited from &multibytecharsetprober_classname; (defined in &mbcharsetprober_py;), and simply hook up the appropriate state machine and distribution analyzer and let &multibytecharsetprober_classname; do the rest of the work. &multibytecharsetprober_classname; runs the text through the encoding-specific state machine, one byte at a time, to look for byte sequences that would indicate a conclusive positive or negative result. At the same time, &multibytecharsetprober_classname; feeds the text to an encoding-specific distribution analyzer.

The distribution analyzers (each defined in &chardistribution_py;) use language-specific models of which characters are used most frequently. Once &multibytecharsetprober_classname; has fed enough text to the distribution analyzer, it calculates a confidence rating based on the number of frequently-used characters, the total number of characters, and a language-specific distribution ratio. If the confidence is high enough, &multibytecharsetprober_classname; returns the result to &mbcsgroupprober_classname;, which returns it to &universaldetector_classname;, which returns it to the caller.

The case of Japanese is more difficult. Single-character distribution analysis is not always sufficient to distinguish between EUC-JP and SHIFT_JIS, so the &sjisprober_classname; (defined in &sjisprober_py;) also uses 2-character distribution analysis. &sjiscontextanalysis_classname; and &eucjpcontextanalysis_classname; (both defined in &jpcntx_py; and both inheriting from a common &japanesecontextanalysis_classname; class) check the frequency of Hiragana syllabary characters within the text. Once enough text has been processed, they return a confidence level to &sjisprober_classname;, which checks both analyzers and returns the higher confidence level to &mbcsgroupprober_classname;.
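One plausible way to turn the three quantities named above (frequently-used characters, total characters, and a language-specific ratio) into a confidence rating is sketched below. Treat it as an illustration rather than a transcription of the library's code: the function name, the constants, and the exact formula are assumptions made for this example.

```python
SURE_YES = 0.99               # assumed ceiling: never claim certainty
SURE_NO = 0.01                # assumed floor: never rule anything out
MINIMUM_DATA_THRESHOLD = 3    # assumed: too few samples means "no idea"

def distribution_confidence(freq_chars, total_chars, typical_ratio):
    """Estimate confidence from a character frequency tally.

    freq_chars    -- characters seen that the language model ranks as frequent
    total_chars   -- all multi-byte characters seen so far
    typical_ratio -- frequent-to-rare ratio expected in typical text
    """
    if total_chars <= 0 or freq_chars <= MINIMUM_DATA_THRESHOLD:
        return SURE_NO
    rare_chars = total_chars - freq_chars
    if rare_chars == 0:
        return SURE_YES       # every character was a frequent one
    # Compare the observed frequent-to-rare ratio with the typical one.
    observed = freq_chars / (rare_chars * typical_ratio)
    return min(observed, SURE_YES)
```

Text in the right encoding is dominated by frequent characters, so the observed ratio meets or exceeds the typical one; text in the wrong encoding decodes to a scattering of rare characters and scores low.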

Single-byte encodings

The single-byte encoding prober, &sbcsgroupprober_classname; (defined in &sbcsgroupprober_py;), is also just a shell that manages a group of other probers, one for each combination of single-byte encoding and language: windows-1251, KOI8-R, ISO-8859-5, MacCyrillic, IBM855, and IBM866 (Russian); ISO-8859-7 and windows-1253 (Greek); ISO-8859-5 and windows-1251 (Bulgarian); ISO-8859-2 and windows-1250 (Hungarian); TIS-620 (Thai); windows-1255 and ISO-8859-8 (Hebrew).

&sbcsgroupprober_classname; feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, &singlebytecharsetprober_classname; (defined in &sbcharsetprober_py;), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. &singlebytecharsetprober_classname; processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.

Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, &hebrewprober_classname; (defined in &hebrewprober_py;) tries to distinguish between Visual Hebrew (where the source text is actually stored backwards line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
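The 2-character tally can be sketched with a toy language model. Here the "model" is just the set of high-bit bytes and adjacent byte pairs seen in a short training sample, a much cruder stand-in for the real frequency tables; the function names, the default ratio, and the floor and ceiling values are all assumptions made for this example.

```python
def build_model(sample):
    """Build a toy model: the high-bit bytes and adjacent pairs in sample."""
    letters = {b for b in sample if b >= 0x80}
    pairs = {(a, b) for a, b in zip(sample, sample[1:])
             if a in letters and b in letters}
    return letters, pairs

def bigram_confidence(data, letters, pairs, typical_positive_ratio=0.98):
    """Score data by how many of its 2-byte sequences the model knows."""
    total_chars = freq_chars = total_pairs = positive_pairs = 0
    previous = None
    for byte in data:
        total_chars += 1
        if byte in letters:
            freq_chars += 1
            if previous is not None:
                total_pairs += 1
                if (previous, byte) in pairs:
                    positive_pairs += 1
            previous = byte
        else:
            previous = None          # a pair never spans an unknown byte
    if total_pairs == 0:
        return 0.01                  # nothing to judge by
    confidence = positive_pairs / total_pairs / typical_positive_ratio
    confidence *= freq_chars / total_chars
    return min(confidence, 0.99)
```

A model built from a Russian sample encoded as KOI8-R scores further KOI8-R text highly, while the same words encoded as windows-1251 land on entirely different byte values and stay at the floor.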

windows-1252

If &universaldetector_classname; detects a high-bit character in the text, but none of the other multi-byte or single-byte encoding probers return a confident result, it creates a &latin1prober_classname; (defined in &latin1prober_py;) to try to detect English text in a windows-1252 encoding. This detection is inherently unreliable, because English letters are encoded in the same way in many different encodings. The only way to distinguish windows-1252 is through commonly used symbols like smart quotes, curly apostrophes, copyright symbols, and the like. &latin1prober_classname; automatically reduces its confidence rating to allow more accurate probers to win if at all possible.

<para>&chardet; is currently at version &chardet_version;.</para> </abstract> </sectioninfo> <title>Revision history

1.0.1 (2008-03-05)
fixed typo in detecting little endian UTF-16; closes issue 81
fixed length of ISO2022JPCharLenTable; closes issue 98

1.0 (2006-01-10)
Initial release

&license;