Unicode and Internationalization (i18n)
How Unicode enables multilingual applications — bidi text, normalization, collation, and locale-aware sorting.
Tags: Unicode internationalization i18n, i18n Unicode handling, multilingual text Unicode
Unicode is the character encoding foundation for every multilingual application. Encoding alone is not sufficient, though: correctly handling text in Arabic, Chinese, German, Hindi, and Thai requires additional Unicode specifications for text direction, sorting order, segmentation, and case folding.

---

Unicode Planes and Script Coverage

Unicode defines 17 planes of 65,536 code points each. Only some of them currently contain assigned characters; the planes that matter most for script coverage are:

| Plane | Range | Contents |
|-------|-------|----------|
| BMP (Plane 0) | U+0000–U+FFFF | Most scripts, symbols, common CJK |
| SMP (Plane 1) | U+10000–U+1FFFF | Historic scripts, emoji, musical notation |
| SIP (Plane 2) | U+20000–U+2FFFF | CJK Extension B and beyond |
| TIP (Plane 3) | U+30000–U+3FFFF | CJK Extension G (added in Unicode 13.0, 2020) |

Scripts supported include…
Frequently Asked Questions
What is internationalization (i18n)?
Internationalization (i18n — 'i' + 18 letters + 'n') is the process of designing software so it can be adapted to different languages and regions without code changes. Localization (l10n) is the subsequent adaptation for a specific locale. Unicode provides the character encoding foundation; i18n libraries like ICU or the JavaScript Intl API handle locale-specific behavior such as date formatting and text sorting.
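A minimal sketch of what "without code changes" can look like with the JavaScript Intl API: the locale is passed in as data, so adding a region means adding a locale tag rather than a new code path. The helper name formatOrder and the sample amount, currency, and date are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper: formats an order summary for whichever locale is passed in.
// The locale tag is data, so supporting a new region needs no new code path.
function formatOrder(locale, amount, currency, date) {
  const money = new Intl.NumberFormat(locale, {
    style: 'currency',
    currency,
  }).format(amount);
  const day = new Intl.DateTimeFormat(locale, {
    dateStyle: 'long',
    timeZone: 'UTC', // fixed zone so the result does not depend on the host machine
  }).format(date);
  return `${money} (${day})`;
}

// Sample data, assumed for this sketch.
const shipped = new Date(Date.UTC(2024, 0, 15));
for (const locale of ['en-US', 'de-DE', 'ja-JP']) {
  console.log(locale, formatOrder(locale, 1234.5, 'EUR', shipped));
}
```

The exact strings depend on the locale data bundled with the runtime, which is why no output is shown here.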
How does Unicode support right-to-left text?
Unicode defines the Unicode Bidirectional Algorithm (UAX #9), which determines the display order of characters in text that mixes left-to-right (LTR) and right-to-left (RTL) runs. RTL scripts include Arabic (U+0600–U+06FF), Hebrew (U+0590–U+05FF), and Thaana. The algorithm resolves visual ordering from each character's bidirectional class together with explicit control characters: the marks LRM (U+200E) and RLM (U+200F), and the directional isolates LRI, RLI, FSI, and PDI (U+2066–U+2069).
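One common use of the isolate controls is wrapping text of unknown direction before inserting it into a sentence. The sketch below is not an implementation of UAX #9; the rendering engine still does the actual reordering. The helper names containsRtl and isolate are hypothetical, and the script check is a rough heuristic based on the three scripts named above.

```javascript
// Directional isolate controls defined by UAX #9.
const FSI = '\u2068'; // FIRST STRONG ISOLATE: direction is taken from the wrapped text
const PDI = '\u2069'; // POP DIRECTIONAL ISOLATE: closes the isolate

// Rough heuristic, not the full algorithm: does the text contain a character
// from one of the RTL scripts named above?
const RTL_SCRIPTS = /[\p{Script=Arabic}\p{Script=Hebrew}\p{Script=Thaana}]/u;

function containsRtl(text) {
  return RTL_SCRIPTS.test(text);
}

// Hypothetical helper: wrap inserted text (user names, file names) so that its
// direction cannot reorder the surrounding sentence.
function isolate(text) {
  return FSI + text + PDI;
}

const userName = 'שלום';
console.log(containsRtl(userName)); // true
console.log(containsRtl('hello'));  // false
console.log(`User ${isolate(userName)} uploaded 3 files`);
```

In HTML, the bdi element or dir="auto" gives the same isolation without embedding control characters in the string.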
What is Unicode collation?
Unicode collation determines the sort order of strings in a locale-aware way. The Unicode Collation Algorithm (UCA, UTS #10) defines a default sort order that can be tailored per locale. In traditional Spanish sorting, 'ch' and 'll' sort as single units. In German, ä sorts next to a. In Swedish, ä comes after z. The ICU library implements UCA with per-locale tailorings, and the JavaScript Intl.Collator API exposes the same kind of locale-aware comparison.
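The German and Swedish cases can be compared side by side with Intl.Collator. This is a small sketch that assumes the runtime ships collation data for both 'de' and 'sv' (current browsers and default Node.js builds do); sortFor is a hypothetical helper name.

```javascript
const words = ['z', 'ä', 'a'];

// Hypothetical helper: returns a sorted copy using the collation rules of `locale`.
function sortFor(locale, items) {
  const collator = new Intl.Collator(locale);
  return [...items].sort(collator.compare);
}

console.log(sortFor('de', words)); // German rules: ä is placed next to a
console.log(sortFor('sv', words)); // Swedish rules: ä is placed after z
```

The same input produces two different orders, and both are correct for their locale. That is why the locale has to be an explicit input to any sort a user will see.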
How do I sort strings with Unicode correctly?
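Use locale-aware comparison instead of strcmp or JavaScript's default string comparison, both of which compare raw byte or code unit values. In JavaScript: const sorted = [...arr].sort((a, b) => a.localeCompare(b, 'de', { sensitivity: 'base' })). Copying with [...arr] leaves the original array untouched, since sort works in place; sensitivity: 'base' treats strings that differ only in accents or case as equal. In Python: use locale.strxfrm or the PyICU library. Never sort Unicode strings by simple char-code comparison if correct multilingual ordering matters.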
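To make the difference concrete, the sketch below sorts the same three sample words twice: once with the default comparison and once with a reusable Intl.Collator. The word list is made up for illustration. Reusing one collator is a common choice for larger arrays, because localeCompare with a locale argument has to resolve that locale on every call.

```javascript
const names = ['Zoo', 'apple', 'Ärger'];

// Default sort compares UTF-16 code units: 'Z' (0x5A) < 'a' (0x61) < 'Ä' (0xC4).
const byCodeUnit = [...names].sort();

// Locale-aware sort: build the collator once and reuse its compare function.
const german = new Intl.Collator('de');
const byLocale = [...names].sort(german.compare);

console.log(byCodeUnit); // [ 'Zoo', 'apple', 'Ärger' ]
console.log(byLocale);   // [ 'apple', 'Ärger', 'Zoo' ]
```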
What is the ICU library?
ICU (International Components for Unicode) is the Unicode Consortium's open-source implementation of Unicode's i18n specifications, and is widely treated as the reference implementation. It provides text segmentation, collation, normalization, date/number formatting, and bidirectional layout as C/C++ and Java libraries, with bindings for many other programming environments. The JavaScript Intl API is backed by ICU in all major browsers. Python bindings are available through PyICU.
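Two of these capabilities, normalization and segmentation, are reachable from JavaScript built-ins. The sketch below assumes a runtime with Intl.Segmenter (Node.js 16 or later, or a current browser); countGraphemes is a hypothetical helper name.

```javascript
// Normalization: 'é' can be one code point, or 'e' followed by a combining accent.
const composed = '\u00E9';
const decomposed = 'e\u0301';
console.log(composed === decomposed);                                   // false
console.log(composed.normalize('NFC') === decomposed.normalize('NFC')); // true

// Segmentation: count user-perceived characters (grapheme clusters)
// instead of UTF-16 code units.
function countGraphemes(text, locale = 'en') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return [...segmenter.segment(text)].length;
}

console.log(decomposed.length);          // 2 (code units)
console.log(countGraphemes(decomposed)); // 1 (grapheme cluster)
```

Normalizing before comparing or storing strings, and counting graphemes rather than code units for length limits, avoids two of the more common multilingual text bugs.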