From c3cbdf035d820de08dcd24522377102a5ee093c5 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Fri, 30 Jan 2009 19:46:43 -0500
Subject: [PATCH] several more 2to3 sections completed

---
 case-study-porting-chardet-to-python-3.html |  84 +++---
 dip3.css                                    |   7 +-
 index.html                                  |  12 +-
 porting-code-to-python-3-with-2to3.html     | 314 ++++++++++----------
 4 files changed, 211 insertions(+), 206 deletions(-)

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index c9911f1..315919d 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -12,17 +12,17 @@ body{counter-reset:h1 19}

Case study: porting chardet to Python 3

-Words, words. They're all we have to go on.
-Rosencrantz and Guildenstern are Dead
+Words, words. They’re all we have to go on.
+Rosencrantz and Guildenstern are Dead

  1. Introducing chardet: a mini-FAQ
    1. What is character encoding auto-detection?
-   2. Isn't that impossible?
+   2. Isn’t that impossible?
    3. Who wrote this detection algorithm?
-   4. Yippie! Screw the standards, I'll just auto-detect everything!
-   5. Why bother with auto-detection if it's slow, inaccurate, and non-standard?
+   4. Yippie! Screw the standards, I’ll just auto-detect everything!
+   5. Why bother with auto-detection if it’s slow, inaccurate, and non-standard?
  2. Diving in
      @@ -33,40 +33,40 @@ body{counter-reset:h1 19}
    1. windows-1252
  3. Running 2to3
- 4. Fixing what 2to3 can't
+ 4. Fixing what 2to3 can’t
    1. False is invalid syntax
    2. No module named constants
    3. Name 'file' is not defined
-   4. Can't use a string pattern on a bytes-like object
-   5. Can't convert 'bytes' object to str implicitly
+   4. Can’t use a string pattern on a bytes-like object
+   5. Can’t convert 'bytes' object to str implicitly

Introducing chardet: a mini-FAQ

-When you think of "text," you probably think of "characters and symbols I see on my computer screen." But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
+When you think of “text,” you probably think of “characters and symbols I see on my computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

-In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
+In reality, it’s more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it’s “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
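
To make the "decryption key" idea concrete, here is a minimal Python 3 sketch. The sample word and variable names are made up for illustration; only the built-in `str.encode` and `bytes.decode` methods are used. It shows one character stored as different bytes under two encodings, and what happens when you decode with the wrong one.

```python
# One string, two encodings: the 'é' is stored as different bytes in each.
text = "café"
utf8_bytes = text.encode("utf-8")       # 'é' becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")   # 'é' becomes one byte: 0xE9

# Decoding with the encoding that was used to encode gets the text back.
assert utf8_bytes.decode("utf-8") == text
assert latin1_bytes.decode("latin-1") == text

# Decoding with the wrong "key" either yields garbage characters...
wrong = utf8_bytes.decode("latin-1")

# ...or fails outright, because the bytes are not valid in that encoding.
try:
    latin1_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

print(utf8_bytes, latin1_bytes, wrong, failed)
```

Note that the wrong guess does not always raise an error. Sometimes you just get plausible-looking nonsense, which is part of why detection is hard.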

What is character encoding auto-detection?

-It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It's like cracking a code when you don't have the decryption key.
+It means taking a sequence of bytes in an unknown character encoding, and attempting to determine the encoding so you can read the text. It’s like cracking a code when you don’t have the decryption key.

-Isn't that impossible?
+Isn’t that impossible?

-In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.
+In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn’t English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text’s language.

In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
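
Here is a toy sketch of that idea in Python 3. It is not the Mozilla algorithm and not how chardet works internally; the letter set, the scoring rule, and the function names `score` and `guess_encoding` are all assumptions invented for this illustration. It decodes the same bytes under each candidate encoding and keeps the one whose result looks most like "typical" Russian, measured crudely by how many letters are among a handful of common ones.

```python
# A rough, hand-picked set of frequent Russian letters (an assumption for
# illustration, not a measured frequency table).
COMMON_RUSSIAN = set("оеаинтсрвлк")

def score(text):
    """Fraction of letters in `text` that are common Russian letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in COMMON_RUSSIAN for c in letters) / len(letters)

def guess_encoding(data, candidates=("koi8-r", "windows-1251")):
    """Return the candidate encoding whose decoding scores highest."""
    best_encoding, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # these bytes are not even valid in this encoding
        current = score(text)
        if current > best_score:
            best_encoding, best_score = encoding, current
    return best_encoding

sample = "привет, как дела"
print(guess_encoding(sample.encode("koi8-r")))
print(guess_encoding(sample.encode("windows-1251")))
```

Both encodings happily decode both byte strings, so validity alone cannot tell them apart. Only the wrong decoding produces an unlikely mix of letters, and that is the signal a real detector exploits with far better statistics.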

Who wrote this detection algorithm?

-This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors' comments, which are quite extensive and informative.
+This library is a port of the auto-detection code in Mozilla. I have attempted to maintain as much of the original structure as possible (mostly for selfish reasons, to make it easier to maintain the port as the original code evolves). I have also retained the original authors’ comments, which are quite extensive and informative.

You may also be interested in the research paper which led to the Mozilla implementation, A composite approach to language/encoding detection.

-Yippie! Screw the standards, I'll just auto-detect everything!
+Yippie! Screw the standards, I’ll just auto-detect everything!

-Don't do that. Virtually every format and protocol contains a method for specifying character encoding.
+Don’t do that. Virtually every format and protocol contains a method for specifying character encoding.
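
As one example of a declared encoding, HTTP can carry a `charset` parameter in the `Content-Type` header. The sketch below reads it with the standard library's `email.message.Message`. The helper name `declared_charset` and the sample header values are hypothetical, and this covers only one of the many places an encoding can be declared.

```python
from email.message import Message

def declared_charset(content_type):
    """Return the charset named in a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None
    # when the header carries no charset parameter.
    return message.get_content_charset()

print(declared_charset("text/html; charset=KOI8-R"))
print(declared_charset("text/html"))
```

When the declared value is there, use it. Guessing is only for the case where this returns `None` or the declaration turns out to be wrong.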