From 504e9cbdb13a969d49951aaa98c5061d19a69862 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Fri, 30 Jan 2009 00:28:56 -0500
Subject: [PATCH] updated TOC

---
 case-study-porting-chardet-to-python-3.html |  10 +-
 index.html                                  | 543 +++-----------------
 2 files changed, 73 insertions(+), 480 deletions(-)

diff --git a/case-study-porting-chardet-to-python-3.html b/case-study-porting-chardet-to-python-3.html
index 8b726b5..b9877b4 100644
--- a/case-study-porting-chardet-to-python-3.html
+++ b/case-study-porting-chardet-to-python-3.html
@@ -41,9 +41,9 @@ body{counter-reset:h1 19}

Introducing chardet: a mini-FAQ

-When you think of “text”, you probably think of “characters and symbols I see on my computer screen”. But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

+When you think of "text", you probably think of "characters and symbols I see on my computer screen". But computers don't deal in characters and symbols; they deal in bits and bytes. Every piece of text you've ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.

-In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's “text”, you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).

+In reality, it's more complicated than that. Many characters are common to multiple encodings, but each encoding may use a different sequence of bytes to actually store those characters in memory or on disk. So you can think of the character encoding as a kind of decryption key for the text. Whenever someone gives you a sequence of bytes and claims it's "text", you need to know what character encoding they used so you can decode the bytes into characters and display them (or process them, or whatever).
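The "decryption key" idea can be illustrated with a small sketch that is not part of the patch itself. It uses only Python 3's built-in codecs; the helper name `decode_with` and the sample word are assumptions made for this illustration. The same six bytes decode to the intended word only under the encoding that produced them.

```python
def decode_with(raw, encodings):
    """Decode the same bytes under each encoding and return {encoding: text}."""
    return {name: raw.decode(name) for name in encodings}

# A Russian greeting stored as KOI8-R: six letters, six bytes.
raw = "Привет".encode("koi8_r")

# Each codec accepts these bytes, but only the right one recovers the word.
results = decode_with(raw, ["koi8_r", "cp1251", "latin_1"])
for name, text in results.items():
    print(name, "matches original:", text == "Привет")
```

None of the wrong codecs raises an error here, which is why a wrong guess tends to show up as garbled text rather than as a failure.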

What is character encoding auto-detection?

@@ -51,7 +51,7 @@ body{counter-reset:h1 19}

Isn't that impossible?

-In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

+In general, yes. However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds "txzqJv 2!dasd0a QqdKjvz" will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of "typical" text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

In other words, encoding detection is really language detection, combined with knowledge of which languages tend to use which character encodings.
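A rough sketch of that idea, using only Python 3's standard codecs: decode the bytes under each candidate encoding and keep the one whose result looks most like the expected language. The bigram set below is a tiny hand-picked sample of Russian letter pairs, and the names `RUSSIAN_BIGRAMS`, `CANDIDATES`, `score`, and `guess_encoding` are assumptions for this illustration; chardet's real language models are far larger and are not reproduced here.

```python
# Hand-picked sample of common Russian letter pairs (illustrative only).
RUSSIAN_BIGRAMS = {"ст", "но", "то", "на", "ен", "ов", "ни",
                   "ра", "во", "ко", "пр", "ри", "ве", "ет"}

# Encodings commonly used for Russian text.
CANDIDATES = ("koi8_r", "cp1251", "cp866", "iso8859_5")

def score(text, bigrams):
    """Return the share of adjacent character pairs found in the bigram set."""
    text = text.lower()
    pairs = [text[i:i + 2] for i in range(len(text) - 1)]
    if not pairs:
        return 0.0
    return sum(pair in bigrams for pair in pairs) / len(pairs)

def guess_encoding(raw):
    """Return (encoding, score) for the best-scoring candidate, or (None, 0.0)."""
    best, best_score = None, 0.0
    for name in CANDIDATES:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # illegal byte sequence: this candidate is ruled out
        candidate_score = score(text, RUSSIAN_BIGRAMS)
        if candidate_score > best_score:
            best, best_score = name, candidate_score
    return best, best_score

sample = "привет, это простой текст на русском"
print(guess_encoding(sample.encode("cp1251"))[0])
print(guess_encoding(sample.encode("koi8_r"))[0])
```

Decoding with the wrong Cyrillic codec still yields Cyrillic letters, just scrambled ones, so the frequent pairs mostly disappear and the score drops.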

Who wrote this detection algorithm?

@@ -108,7 +108,7 @@ body{counter-reset:h1 19}

Multi-byte encodings

-Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of “probers” for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.

+Assuming no BOM, UniversalDetector checks whether the text contains any high-bit characters. If so, it creates a series of "probers" for detecting multi-byte encodings, single-byte encodings, and as a last resort, windows-1252.

The multi-byte encoding prober, MBCSGroupProber (defined in mbcsgroupprober.py), is really just a shell that manages a group of other probers, one for each multi-byte encoding: Big5, GB2312, EUC-TW, EUC-KR, EUC-JP, SHIFT_JIS, and UTF-8. MBCSGroupProber feeds the text to each of these encoding-specific probers and checks the results. If a prober reports that it has found an illegal byte sequence, it is dropped from further processing (so that, for instance, any subsequent calls to UniversalDetector.feed() will skip that prober). If a prober reports that it is reasonably confident that it has detected the encoding, MBCSGroupProber reports this positive result to UniversalDetector, which reports the result to the caller.

@@ -124,7 +124,7 @@ body{counter-reset:h1 19}

SBCSGroupProber feeds the text to each of these encoding+language-specific probers and checks the results. These probers are all implemented as a single class, SingleByteCharSetProber (defined in sbcharsetprober.py), which takes a language model as an argument. The language model defines how frequently different 2-character sequences appear in typical text. SingleByteCharSetProber processes the text and tallies the most frequently used 2-character sequences. Once enough text has been processed, it calculates a confidence level based on the number of frequently-used sequences, the total number of characters, and a language-specific distribution ratio.

-Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored “backwards” line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).

+Hebrew is handled as a special case. If the text appears to be Hebrew based on 2-character distribution analysis, HebrewProber (defined in hebrewprober.py) tries to distinguish between Visual Hebrew (where the source text is actually stored "backwards" line-by-line, and then displayed verbatim so it can be read from right to left) and Logical Hebrew (where the source text is stored in reading order and then rendered right-to-left by the client). Because certain characters are encoded differently based on whether they appear in the middle of or at the end of a word, we can make a reasonable guess about the direction of the source text, and return the appropriate encoding (windows-1255 for Logical Hebrew, or ISO-8859-8 for Visual Hebrew).
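The direction guess described above can be sketched roughly as follows. This is a simplified illustration, not HebrewProber's actual scoring: the function name `guess_hebrew_encoding` and the plain counting rule are assumptions. It relies on the five Hebrew letters that have a distinct final form, which sits at the end of a word in logical order and at the start of a word when each line is stored reversed.

```python
# Final-form letters: final kaf, final mem, final nun, final pe, final tsadi.
FINAL_FORMS = set("\u05da\u05dd\u05df\u05e3\u05e5")

def guess_hebrew_encoding(raw):
    """Guess 'windows-1255' (logical) or 'ISO-8859-8' (visual); None if unclear."""
    # Both encodings put the Hebrew letters at the same byte values,
    # so one codec is enough for counting letter positions.
    text = raw.decode("cp1255", errors="ignore")
    logical = visual = 0
    for word in text.split():
        if len(word) < 2:
            continue
        if word[-1] in FINAL_FORMS:
            logical += 1  # final form where a reader expects it
        if word[0] in FINAL_FORMS:
            visual += 1   # final form first: the word was stored reversed
    if logical > visual:
        return "windows-1255"
    if visual > logical:
        return "ISO-8859-8"
    return None

# "shalom olam" in reading order; both words end in final mem.
phrase = "\u05e9\u05dc\u05d5\u05dd \u05e2\u05d5\u05dc\u05dd"
print(guess_hebrew_encoding(phrase.encode("cp1255")))
print(guess_hebrew_encoding(phrase[::-1].encode("iso8859_8")))
```

A real prober would weigh this evidence against the 2-character distribution analysis rather than rely on a bare count.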

windows-1252

diff --git a/index.html b/index.html
index 0348c57..f85ee30 100644
--- a/index.html
+++ b/index.html
@@ -12,49 +12,18 @@

Dive Into Python 3 will cover Python 3 and its differences from Python 2. Compared to the original Dive Into Python, it will be about 50% revised and 50% new material. I will publish drafts online as I go. The final book will be published on paper by Apress. The book will remain online under the CC-BY-3.0 license.

Below is the draft table of contents. It is not finalized. Only a few chapters have been written so far. The rest is just stubs and random notes to myself.

Yes, that is PapayaWhip. All hail PapayaWhip.

- -

Installing Python

- -

Python on Windows

-
- -

Python on Mac OS X

-
- -

Python on Linux

-
- -

Python from source

-
- -

The interactive shell

-
- -

Summary

-
-
- -

Your first Python program

- -

Diving in

-
- -

Declaring functions

How Python's datatypes compare to other programming languages

-
- -

Writing readable code

Why bother?

Docstrings

@@ -63,33 +32,16 @@

Style conventions

...

-
- -

Everything is an object

The import search path

What's an object?

-
- -

Indenting code

-
- -

Testing modules

- -

Summary

-
-
- -

Native Python datatypes

- -

Lists

Differences from Python 2

Creating a new list

@@ -99,9 +51,6 @@

List operators

Looping through a list (list comprehensions)

Tuples

-
- -

Dictionaries

Differences from Python 2

Creating a new dictionary

@@ -109,9 +58,6 @@

Deleting items from a dictionary

Looping through a dictionary (dictionary comprehensions)

Dictionary views

-
- -

Sets

Differences from Python 2

Creating a new set

@@ -119,9 +65,6 @@

Deleting elements from a set

Common set operations: union, intersection, and difference

Frozen sets

-
- -

Numbers

Differences from Python 2

Integers

@@ -129,615 +72,265 @@

Floating point numbers

Complex numbers

Common numerical operations

-
-
- -

- -

Iterators

-
- -

Generators

-
- -

Views

-
- -

...

-
-
- - - -

Strings

- -

There ain't no such thing as "plain text"

A brief history of character encoding

What's a character?

How strings are stored in memory

Converting between different character encodings

-
- -

Differences from Python 2

-
- -

Formatting strings

-
- -

What's my string?

-
- -

Lists and strings

-
- -

Historical note on the string module

-
- -

Byte streams

-
- -

Summary

-
- -
-

The power of introspection

- -

Diving in

-
- -

Using optional and named arguments

Keyword-only arguments

-
- -

Using type, str, dir, and other built-in functions

The type function

The str function

Built-in functions

-
- -

Getting object references with getattr

getattr with modules

getattr as a dispatcher

-
- -

Filtering lists

-
- -

The peculiar nature of and and or

Using the and-or trick

-
- -

Using lambda functions

Real-world lambda functions

-
- -

Putting it all together

-
- -

Summary

-
-
- -

Objects and object-orientation

- -

...major changes afoot...

...stuff about decorators...

...stuff about importing modules...

...mention why "from module import *" is only allowed at module level

-
-
- -

Exceptions

- -

...

-
-
- -

Files

- -

File objects

-
- -

Reading files

-
- -

Close your files... or don't

-
- -

Handling I/O errors

-
- -

Writing to files

-
- -
-

Regular expressions

- -

Diving in

-
- -

Case study: street addresses

-
- -

Case study: Roman numerals

Checking for thousands

Checking for hundreds

-
- -

Using the {n,m} syntax

Checking for tens and ones

-
- -

Verbose regular expressions

-
- -

Case study: parsing phone numbers

-
- -

Summary

-
-
- -

HTML processing

- -

Diving in

-
- -

html5lib

Installing html5lib

Using html5lib

-
- -

Extracting data from HTML documents

-
- -

Building HTML documents

-
- -

Putting it all together

-
- -

Summary

-
-
- -

XML Processing

- -

...major changes afoot...

-
-
- -

HTTP web services

- -

Diving in

-
- -

How not to fetch data over HTTP

-
- -

Features of HTTP

User-Agent

Redirects

Last-Modified/If-Modified-Since

ETag-If-None-Match

Compression

-
- -

Differences from Python 2

-
- -

httplib2 (note: needs port)

Installing httplib2

Why httplib2 is better than http.client

-
- -

Debugging HTTP web services

-
- -

Setting the User-Agent

-
- -

Handling Last-Modified and ETag

-
- -

Handling redirects

-
- -

Handling compressed data

-
- -

Putting it all together

-
- -

Summary

-
-
- -

Unit testing

- -

Introduction to Roman numerals

-
- -

Diving in

-
- -

Introducing romantest.py

-
- -

Testing for success

-
- -

Testing for failure

-
- -

Testing for sanity

-
-
- -

Test-first programming

- -

roman.py, stage 1

-
- -

roman.py, stage 2

-
- -

roman.py, stage 3

-
- -

roman.py, stage 4

-
- -

roman.py, stage 5

-
-
- -

Refactoring your code

- -

Handling bugs

-
- -

Handling changing requirements

-
- -

The art of refactoring

-
- -

Postscript

-
- -

Summary

-
-
- -

Dynamic functions

- -

Diving in

-
- -

plural.py, stage 1

-
- -

plural.py, stage 2

-
- -

plural.py, stage 3

-
- -

plural.py, stage 4

-
- -

plural.py, stage 5

-
- -

plural.py, stage 6

-
- -

Summary

-
- -
-

Metaclasses

- -

...once I figure out WTF metaclasses are...

-
-
- -

Performance tuning

- -

Diving in

-
- -

Using the timeit module

-
- -

Optimizing regular expressions

-
- -

Optimizing dictionary lookups

-
- -

Optimizing list operations

-
- -

Optimizing string manipulation

-
- -

Summary

-
-
- -

Case study: porting chardet to Python 3

- -
+

Introducing chardet: a mini-FAQ

+

What is character encoding auto-detection?

+

Isn't that impossible?

+

Who wrote this detection algorithm?

+

Yippie! Screw the standards, I'll just auto-detect everything!

+

Why bother with auto-detection if it's slow, inaccurate, and non-standard?

Diving in

-
- -
+

UTF-n with a BOM

+

Escaped encodings

+

Multi-byte encodings

+

Single-byte encodings

+

windows-1252

Running 2to3

-
+

Fixing what 2to3 can't

+

False is invalid syntax

+

No module named constants

+

Name 'file' is not defined

+

Can't use a string pattern on a bytes-like object

+

Can't convert 'bytes' object to str implicitly

-
-

False is invalid syntax

-
- -
-

No module named constants

-
- -
-

Name 'file' is not defined

-
- -
-

Can't use a string pattern on a bytes-like object

-
- -
-

Can't convert 'bytes' object to str implicitly

-
- -
- -

Packaging Python libraries

- - -

A brief history of packaging (and why it's harder than you think)

-
- -

setuptools

-
- -

distutils

-
- -

Eggs

-
- -

pip

-
- -

Platform-specific packaging

Packaging by Linux distributions

Py2exe

-
- -
- -

Creating graphics with the Python Imaging Library

- -

...will likely get ported in time...

-
-
- -

Where to go from here

Tentative because most of these have not been ported to Python 3 yet.

- -

WSGI

-
- -

Django

-
- -

Pylons

-
- -

TurboGears

-
- -

AppEngine

-
- -

IronPython

-
- -

Jython

-
- -

PyPy

-
- -

Stackless Python

-
-
- -

Scripts and streams

- -

...will be folded into other chapters...

-
-
- -

Functional programming

- -

...bits and pieces will be folded into other chapters...

-
-
- -

SOAP web services

- -

...no one will miss you...

-
- -

Appendix A. Porting code to Python 3 with 2to3

+

Diving in

+

print statement

+

<> comparison

+

has_key() dictionary method

+

Dictionary methods that return lists

+

Modules that have been renamed or reorganized

+

http package

+

urllib package

+

dbm package

+

xmlrpc package

+

Other modules

+

Relative imports within a package

+

filter() global function

+

map() global function

+

reduce() global function (3.1+)

+

apply() global function

+

intern() global function

+

exec statement

+

execfile statement (3.1+)

+

repr literals (backticks)

+

try...except statement

+

raise statement

+

throw statement

+

long data type

+

xrange() global function

+

raw_input() and input() global functions

+

func_* function attributes

+

xreadlines() I/O method

+

lambda functions with multiple parameters

+

Special method attributes

+

next() iterator method

+

__nonzero__ special class attribute

+

Number literals

+

sys.maxint

+

unicode() global function

+

Unicode string literals

+

callable() global function

+

zip() global function

+

StandardError() exception

+

types module constants

+

isinstance global function (3.1+)

+

basestring datatype

+

itertools module

+

sys.exc_type, sys.exc_value, sys.exc_traceback

+

List comprehensions over tuples

+

os.getcwdu() function

+

Metaclasses

+

set() literals

+

buffer() global function

+

Whitespace around commas

+

Common idioms

-