Localization Like the Pros—Unicode Issues
| © 2004 Wiley Publishing, Inc. |
Unicode Issues
In .NET, a Unicode character (a char) has 16 bits, so there is room for 65,536 characters. Is this enough for all languages that are currently used in information technology? Chinese alone, for example, needs more than 80,000 characters. However, Unicode has been designed to deal with this issue: it differentiates between base characters and combining characters. You can add multiple combining characters to a base character to build up a single display character, also called a text element.
Take, for example, the Icelandic character Ogonek (Latin small letter o with ogonek and macron). It can be built from the base character 0x006F (Latin small letter o) and the combining characters 0x0328 (combining ogonek) and 0x0304 (combining macron), as shown in Figure 17-1. These combining characters belong to the Combining Diacritical Marks block, which covers the range 0x0300 to 0x036F. For American and European markets, predefined characters exist to make such characters easier to handle: the character Ogonek is also defined as the predefined character 0x01ED.
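The relationship between the combined sequence and the predefined character can be sketched in a few lines of C#. This is only an illustration: the class name OgonekDemo and the helper Compose are hypothetical, and the sketch assumes a framework version that provides String.Normalize (it was added after .NET 1.x). Normalizing to form C is expected to fold the three-character sequence into the single predefined character, while a plain comparison treats the two spellings as different strings.

```csharp
using System;
using System.Text;

public static class OgonekDemo
{
    // Hypothetical helper: converts base + combining sequences to their
    // precomposed (predefined) form where Unicode defines one.
    public static string Compose(string s)
    {
        return s.Normalize(NormalizationForm.FormC);
    }

    public static void Main()
    {
        // Base character o, combining ogonek, combining macron.
        string combined = "\u006F\u0328\u0304";
        // The predefined character for the same display character.
        string predefined = "\u01ED";

        Console.WriteLine(combined.Length);    // 3 chars in the string
        Console.WriteLine(predefined.Length);  // 1 char in the string

        // The == operator compares char by char, so the two spellings differ.
        Console.WriteLine(combined == predefined);  // False

        // After composition the two spellings can be compared directly.
        Console.WriteLine(Compose(combined) == predefined);
    }
}
```

Both strings render as one display character, which is why counting chars is not a reliable way to count what the user sees.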
For Asian markets, where more than 80,000 characters are necessary for Chinese alone, such predefined characters do not exist, so with Asian languages you always have to deal with combining characters. The difficulty lies in getting the right number of display characters or text elements, and in getting at the base characters rather than the combined characters. The namespace System.Globalization offers the class StringInfo, which you can use to deal with this issue.
The following table lists the static methods of the class StringInfo that help in dealing with combined characters.
| Method | Description |
| --- | --- |
| GetNextTextElement | Returns the first text element (base character and all combining characters) of a specified string. |
| GetTextElementEnumerator | Returns a TextElementEnumerator object that allows iterating all text elements of a string. |
| ParseCombiningCharacters | Returns an integer array referencing all base characters of a string. |
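A minimal sketch of the three methods in use might look like the following. The class name TextElementDemo, the helper GetTextElements, and the sample string are assumptions made for illustration; only the StringInfo and TextElementEnumerator members come from System.Globalization.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: collects every text element of a string
    // by walking a TextElementEnumerator.
    public static List<string> GetTextElements(string s)
    {
        List<string> elements = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            elements.Add(e.GetTextElement());
        }
        return elements;
    }

    public static void Main()
    {
        // o + combining ogonek + combining macron, followed by a plain x.
        string s = "o\u0328\u0304x";
        Console.WriteLine(s.Length);  // 4 chars

        // First text element: the base character plus its combining characters.
        string first = StringInfo.GetNextTextElement(s);
        Console.WriteLine(first.Length);  // 3 chars, one display character

        // Index of each base character within the string.
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        foreach (int index in starts)
        {
            Console.WriteLine(index);  // 0, then 3
        }

        // Number of display characters the user actually sees.
        Console.WriteLine(GetTextElements(s).Count);  // 2
    }
}
```

The length of the array returned by ParseCombiningCharacters and the number of enumerated text elements agree here, and both give the count of display characters rather than the count of chars.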

