Sunday, May 6, 2007

Unicode

Unicode allows us to represent character sets from many languages so that programming and web use can be international. We can enter Unicode characters in Microsoft Windows. Alan Wood includes many Unicode resources.

We can use the Character Map utility in Windows to add characters. For example, in the Courier New font, Unicode 03B1, Greek alpha, is α, and Unicode 0416, Cyrillic Zhe, is Ж. Using the SimSun font, the Unicode character with code 8EBF is 躿.

The Unicode value needs to be stored in memory, and an encoding specifies how a value is stored. The Notepad editor provides four encodings: ANSI, Unicode, Unicode big endian, and UTF-8. ANSI (American National Standards Institute) developed a standard for English characters that uses 7 or 8 bits, which can represent 128 or 256 characters using one byte.
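As a quick check on the code values quoted from Character Map, here is a small Python sketch. It assumes only the standard unicodedata module; the variable names are ours. It looks up the official name of each code value and confirms how many values fit in 7 and 8 bits.

```python
import unicodedata

# Code values quoted above, written as hexadecimal integers.
codes = (0x03B1, 0x0416, 0x8EBF)

for code in codes:
    ch = chr(code)                       # code value -> character
    assert ord(ch) == code               # and back again
    print(f"{code:04X}", unicodedata.name(ch))

# Number of distinct values available in 7 bits and in 8 bits.
seven_bit, eight_bit = 2 ** 7, 2 ** 8
print(seven_bit, eight_bit)
```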

To include more characters, Notepad's Unicode encoding uses 16 bits, or two bytes, for each of these values. But with two bytes, which byte comes first? The letter A in Unicode hexadecimal is 0041. The two bytes are 00 (00000000 as bits) and 41 (01000001 as bits). Do they go in memory as 00 41 (big endian, most significant byte first) or 41 00 (little endian, least significant byte first)? Unicode files include two starting bytes to tell the difference. The normal big endian Unicode file starts with FE FF. Little endian systems reverse this to FF FE.
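The two byte orders can be seen directly in Python. This is only a sketch: we assume Python's "utf-16-be" and "utf-16-le" codecs correspond to what Notepad calls Unicode big endian and Unicode, and we take the starting bytes from the standard codecs module.

```python
import codecs

letter = "A"  # code value 0041

big = letter.encode("utf-16-be")     # most significant byte first
little = letter.encode("utf-16-le")  # least significant byte first

print(list(big))     # [0, 65]   i.e. 00 41
print(list(little))  # [65, 0]   i.e. 41 00

# The two starting bytes that mark the byte order of a file.
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe
```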

Notice that English characters like A need only one byte, so why should we double the space to use Unicode for English? If all we use is English we could stick to ANSI. But suppose we sometimes need to include a foreign phrase. Even closely related languages such as Spanish and French have accented characters. Greek is used for mathematical symbols too.

UTF-8 is a scheme using a variable number of bytes, so that the 128 seven-bit ANSI characters still use one byte but other characters in the 16-bit range use two or three bytes. We only use extra space when we need it. Unicode 0000 to 007F uses one byte, whose first bit is 0. Unicode 0080 to 07FF uses two bytes, with 110 starting the first byte and 10 starting the second. Unicode 0800 to FFFF uses three bytes, with 1110 starting the first byte and 10 starting the second and third bytes.
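The ranges above translate into a short function. This is a sketch rather than a full encoder: the name utf8_length is ours, and the four-byte case for values above FFFF is not covered in this post, so we add it only as a labelled assumption for completeness. The loop prints the bits of each byte so the 0, 110, 1110 and 10 prefixes are visible.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for one Unicode value."""
    if code_point <= 0x7F:
        return 1          # 0xxxxxxx
    if code_point <= 0x7FF:
        return 2          # 110xxxxx 10xxxxxx
    if code_point <= 0xFFFF:
        return 3          # 1110xxxx 10xxxxxx 10xxxxxx
    return 4              # assumption: values above FFFF, outside this post's ranges

# A, e with acute accent, Greek alpha, and the Chinese character 8EBF.
for cp in (0x41, 0xE9, 0x3B1, 0x8EBF):
    encoded = chr(cp).encode("utf-8")
    assert len(encoded) == utf8_length(cp)
    print(f"{cp:04X}", utf8_length(cp), [f"{b:08b}" for b in encoded])
```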

Looking at the Notepad encodings for the Chinese 8EBF character shown above we get
ANSI 63
Unicode 255 254 191 142 (Windows is little endian)
Unicode big endian 254 255 142 191
UTF-8 239 187 191 232 186 191
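We can try to reproduce these four rows in Python. This is a sketch under assumptions: we use the cp1252 code page as a stand-in for Notepad's ANSI (the real choice depends on the Windows locale), and we ask Python to replace unrepresentable characters, which we assume is close to what Notepad does. The "utf-8-sig" codec writes the three starting bytes before the character.

```python
import codecs

ch = "\u8ebf"  # the Chinese character with code 8EBF

rows = {
    # Assumption: cp1252 stands in for Notepad's ANSI on a Western system.
    "ANSI": ch.encode("cp1252", errors="replace"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),
}

for name, data in rows.items():
    print(name, *data)   # each byte printed as a decimal number
```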

8EBF in binary is 1000 1110 1011 1111. As two bytes it is 10001110 10111111.
In decimal 10001110=128+8+4+2=142
10111111=255-64 =191
so 8EBF is 142 191 in decimal.
ANSI cannot represent this character, so Notepad saves a question mark, code 63, in its place.
(By coincidence 63 is also the last 7 bits of 8EBF, 011 1111 = 32+16+8+4+2+1.)
Unicode on Windows uses little endian so it puts FF FE first, which is 255 254, and
follows with the least significant byte first, 191, then 142.
Unicode big endian puts FE FF first, which is 254 255 in decimal, then the most significant byte 142 first, followed by 191.
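The same arithmetic can be done with a shift and a mask. This is a minimal sketch with variable names of our own choosing: it splits 8EBF and FEFF into their two bytes and arranges them in each byte order.

```python
code = 0x8EBF

high = code >> 8       # most significant byte:  10001110
low = code & 0xFF      # least significant byte: 10111111
print(high, low)       # 142 191

mark = 0xFEFF          # the starting value that shows the byte order
mark_high, mark_low = mark >> 8, mark & 0xFF

big_endian = [mark_high, mark_low, high, low]
little_endian = [mark_low, mark_high, low, high]
print(big_endian)      # [254, 255, 142, 191]
print(little_endian)   # [255, 254, 191, 142]
```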

For UTF-8 we break up 8EBF (1000111010111111) as 1000 111010 111111 and create three bytes
1110 1000 = 128 + 64 + 32 + 8 = 232
10 111010 = 128 + 32 + 16 + 8 + 2 = 186
10 111111 = 255 - 64 = 191
The FE FF also needs to be coded as UTF-8. FEFF in binary is 1111 1110 1111 1111. We divide it as 1111 111011 111111 and the UTF-8 is
1110 1111 = 255 - 16 = 239
10 111011 = 255 - 64 -4 = 187
10 111111 = 255 - 64 = 191
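The three-byte split can be written with shifts and masks. This is a sketch of the steps worked out above, not a complete UTF-8 encoder: the function name utf8_three_bytes is ours, and it only handles values from 0800 to FFFF. We compare the result with Python's built-in encoder as a check.

```python
def utf8_three_bytes(code_point: int) -> list:
    """Encode a value from 0800 to FFFF as three UTF-8 bytes."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("only the three-byte range 0800 to FFFF is handled")
    first = 0b11100000 | (code_point >> 12)                # 1110 + top 4 bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)   # 10 + middle 6 bits
    third = 0b10000000 | (code_point & 0b111111)           # 10 + last 6 bits
    return [first, second, third]

print(utf8_three_bytes(0x8EBF))  # [232, 186, 191]
print(utf8_three_bytes(0xFEFF))  # [239, 187, 191]

# Cross-check against the built-in UTF-8 encoder.
assert bytes(utf8_three_bytes(0x8EBF)) == "\u8ebf".encode("utf-8")
```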
