From df93f1a0eb2af35a439be3068e630eaa89890328 Mon Sep 17 00:00:00 2001
From: Mark Pilgrim
Date: Fri, 22 May 2009 12:54:43 -0400
Subject: [PATCH] clarified range of Unicode code points [thanks chrajohn]

---
 strings.html | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/strings.html b/strings.html
index 8c4853e..2176244 100644
--- a/strings.html
+++ b/strings.html
@@ -49,7 +49,7 @@
My alphabet starts where your alphabet ends!
— Dr

Enter Unicode.

-Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number, from 0–4294967295. (That’s 2³²−1.) Each 4-byte number represents a unique character used in at least one of the world’s languages. Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient. Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.
+Unicode is a system designed to represent every character from every language. Unicode represents each letter, character, or ideograph as a 4-byte number. Each number represents a unique character used in at least one of the world’s languages. (Not all the numbers are used, but more than 65535 of them are, so 2 bytes wouldn’t be sufficient.) Characters that are used in multiple languages generally have the same number, unless there is a good etymological reason not to. Regardless, there is exactly 1 number per character, and exactly 1 character per number. Every number always means just one thing; there are no “modes” to keep track of. U+0041 is always 'A', even if your language doesn’t have an 'A' in it.

On the face of it, this seems like a great idea. One encoding to rule them all. Multiple languages per document. No more “mode switching” to switch between encodings mid-stream. But right away, the obvious question should leap out at you. Four bytes? For every single character? That seems awfully wasteful, especially for languages like English and Spanish, which need less than one byte (256 numbers) to express every possible character. In fact, it’s wasteful even for ideograph-based languages (like Chinese), which never need more than two bytes per character.
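
The size difference described above can be checked with a short Python 3 sketch. It is only an illustration: it uses Python's built-in `utf-32-be` codec as a stand-in for "4 bytes per character", `latin-1` as an example one-byte encoding, and `gb2312` as an example two-byte encoding for Chinese. The helper name `encoded_size` is hypothetical, not something from the book.

```python
# One number per character, one character per number.
assert ord("A") == 0x41      # U+0041 is always 'A'
assert chr(0x41) == "A"


def encoded_size(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in the given encoding.

    Hypothetical helper for this illustration only.
    """
    return len(text.encode(encoding))


english = "hello"   # 5 characters
chinese = "你好"     # 2 characters

# Assumption: 'utf-32-be' (no byte order mark) models the 4-byte scheme.
print(encoded_size(english, "utf-32-be"))  # 20 bytes: 4 per character
print(encoded_size(english, "latin-1"))    # 5 bytes: 1 per character
print(encoded_size(chinese, "utf-32-be"))  # 8 bytes: 4 per character
print(encoded_size(chinese, "gb2312"))     # 4 bytes: 2 per character
```

With these sample strings, the fixed 4-byte form takes four times the space of the one-byte encoding for English text and twice the space of the two-byte encoding for Chinese text.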