diff --git a/dip3.js b/dip3.js
index 7776531..e8918df 100644
--- a/dip3.js
+++ b/dip3.js
@@ -15,7 +15,7 @@ $(document).ready(function() {
 }
 */
-$("#toc").html('table of contents');
+hideTOC();
 // "hide", "open in new window", and (optionally) "download" widgets on code & screen blocks
 $("pre > code").each(function(i) {
@@ -83,6 +83,10 @@ function plainTextOnClick(id) {
 win.document.close();
 }
+function hideTOC() {
+$("#toc").html(' show table of contents');
+}
+
 function showTOC() {
 var toc = '';
 var old_level = 1;
@@ -100,5 +104,5 @@ function showTOC() {
 toc += '';
 level -= 1;
 }
-$("#toc").html(toc);
+$("#toc").html(' hide table of contents' + toc);
 }
diff --git a/strings.html b/strings.html
index 758083b..8fdc99d 100644
--- a/strings.html
+++ b/strings.html
@@ -16,7 +16,7 @@ My alphabet starts where your alphabet ends!

 

Diving in

-Chinese has thousands of characters. The Rotokas alphabet of Bougainville is the smallest alphabet in the world, with just 12 letters. English has 26, plus a handful of punctuation marks. Python 3 can handle all of these languages, and more.
+Did you know that the people of Bougainville have the smallest alphabet in the world? Their Rotokas alphabet is composed of only 12 letters: A, E, G, I, K, O, P, R, S, T, U, and V. On the other end of the spectrum, languages like Chinese, Japanese, and Korean have thousands of characters. English, of course, has 26, plus a handful of !@#$%& punctuation marks. Python 3 can handle all of these languages, and more.

When people talk about “text,” they’re thinking of “characters and symbols on the computer screen.” But computers don’t deal in characters and symbols; they deal in bits and bytes. Every piece of text you’ve ever seen on a computer screen is actually stored in a particular character encoding. There are many different character encodings, some optimized for particular languages like Russian or Chinese or English, and others that can be used for multiple languages. Very roughly speaking, the character encoding provides a mapping between the stuff you see on your screen and the stuff your computer actually stores in memory and on disk.
@@ -24,9 +24,9 @@ My alphabet starts where your alphabet ends!
Surely you’ve seen web pages like this, with strange question-mark-like characters where apostrophes should be. That usually means the page author didn’t declare their character encoding correctly, your browser was left guessing, and the result was a mix of expected and unexpected characters. In English it’s merely annoying; in other languages, the result can be completely unreadable.
-As I mentioned, there are separate character encodings for each major language in the world, and a lot of minor ones. Since each language is different, and disk space has historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding using the same numbers (0–255) to represent that language’s characters. ASCII, for instance, stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in less than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
+There are character encodings for each major language in the world. Since each language is different, and memory and disk space have historically been expensive, each character encoding is optimized for a particular language. By that, I mean each encoding uses the same numbers (0–255) to represent that language’s characters. For instance, you’re probably familiar with the ASCII encoding, which stores English characters as numbers ranging from 0 to 127. (65 is capital “A”, 97 is lowercase “a”, and so forth.) English has a very simple alphabet, so it can be completely expressed in fewer than 128 numbers. For those of you who can count in base 2, that’s 7 out of the 8 bits in a byte.
-Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), and so on. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.
+Western European languages like French, Spanish, and German have more letters than English. Or, more precisely, they have letters combined with various diacritical marks, like the ñ character in Spanish. The most common encoding for these languages is CP-1252, also called “windows-1252” because it is widely used on Microsoft Windows. The CP-1252 encoding shares characters with ASCII in the 0–127 range, but then extends into the 128–255 range for characters like n-with-a-tilde-over-it (241), u-with-two-dots-over-it (252), and so on. It’s still a single-byte encoding, though; the highest possible number, 255, still fits in one byte.

Then there are languages like Chinese, Japanese, and Korean, which have so many characters that they require multiple-byte character sets. That is, each “character” is represented by a two-byte number from 0–65535. But different multi-byte encodings still share the same problem as different single-byte encodings, namely that they each use the same numbers to mean different things. It’s just that the range of numbers is broader, because there are many more characters to represent.
@@ -48,9 +48,9 @@ My alphabet starts where your alphabet ends!
Enter Unicode.
-Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 2^32−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; Unicode data is never ambiguous.
+Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That's 2^32−1.) Each 4-byte number represents a unique character used in at least one of the world's languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn't be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn't have an 'A' in it.
-Right away, problems leap out at you. 4 bytes? For every single character [FIXME incomplete paragraph]
+Right away, problems leap out at you. 4 bytes? For every single character? That seems awfully wasteful, especially for English and Spanish, which need fewer than 256 numbers to express every possible character. [FIXME incomplete paragraph]

Of course, there is still the matter of all those legacy encoding systems. [FIXME incomplete paragraph]
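
Note on the numbers quoted in the added paragraphs: the byte values for ASCII and CP-1252, and the "4 bytes per character" concern, can be checked with a short Python 3 sketch. This is only an illustration, not part of the patch. The choice of cp1251 as "some other single-byte encoding" and of UTF-32 as the fixed 4-byte form are assumptions made here; the text names neither.

```python
# Byte values quoted in the text, decoded with Python 3's built-in codecs.
ascii_upper = bytes([65]).decode("ascii")    # capital A
ascii_lower = bytes([97]).decode("ascii")    # lowercase a
n_tilde = bytes([241]).decode("cp1252")      # n-with-a-tilde-over-it
u_umlaut = bytes([252]).decode("cp1252")     # u-with-two-dots-over-it

# Assumption: cp1251 (a Cyrillic single-byte encoding) stands in for
# "a different encoding"; the same byte 241 decodes to another character.
same_byte_cyrillic = bytes([241]).decode("cp1251")

# A Unicode code point means the same character everywhere: U+0041 is 'A'.
code_point_a = ord("A")

# Assumption: UTF-32 is used as the fixed 4-bytes-per-character form
# that the text worries about.
utf32_a = "A".encode("utf-32-be")
cp1252_n_len = len("\u00f1".encode("cp1252"))     # single-byte encoding
utf32_n_len = len("\u00f1".encode("utf-32-be"))   # four bytes for the same character

print(hex(code_point_a), cp1252_n_len, utf32_n_len)
```

Byte 241 is ñ under CP-1252 but a Cyrillic letter under cp1251, which is the "same numbers mean different things" problem in miniature.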