Unicode and UTF-8 for Developers
A practical guide to Unicode — code points, UTF-8 encoding, normalization, and common pitfalls.
Unicode is the universal standard for text. UTF-8 is how that text is stored in files and transmitted over networks. Understanding both, and where they differ, prevents a class of bugs that are notoriously hard to reproduce and fix.

---

The Unicode Standard

The Unicode Standard assigns a unique code point to every character in every writing system used by humans, plus symbols, emoji, mathematical operators, and control characters. A code point is written as U+ followed by four to six hex digits. The current standard (Unicode 15.1) defines 149,813 characters across 161 scripts. The first 128 code points (U+0000–U+007F) are identical to ASCII, an intentional compatibility decision.

Encodings: UTF-8, UTF-16, UTF-32

Unicode defines code points. An encoding…
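To see the code point / byte distinction in practice, here is a minimal JavaScript sketch (TextEncoder and String.prototype.codePointAt are standard browser and Node.js APIs; the sample character is arbitrary):

```javascript
const ch = "é"; // LATIN SMALL LETTER E WITH ACUTE

// The abstract Unicode code point, printed in U+XXXX notation.
const cp = ch.codePointAt(0);
console.log("U+" + cp.toString(16).toUpperCase().padStart(4, "0")); // U+00E9

// The same character serialized for storage/transmission as UTF-8 bytes.
const bytes = new TextEncoder().encode(ch);
console.log(bytes); // Uint8Array(2) [ 195, 169 ] -> 0xC3 0xA9
```

One code point, two UTF-8 bytes: the code point is the character's identity, the bytes are its serialization.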
Frequently Asked Questions
What is Unicode?
Unicode is a universal character standard that assigns a unique number (code point) to every character in every writing system on earth, plus symbols, emoji, and control characters. The current standard covers over 149,000 characters across 161 scripts.
What is the difference between Unicode and UTF-8?
Unicode is the standard that defines code points — abstract numeric IDs for every character. UTF-8 is one encoding scheme that serializes those code points into bytes for storage and transmission. A string can be encoded as UTF-8, UTF-16, or UTF-32; all three represent the same Unicode characters using different byte layouts.
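A quick sketch of the same idea in JavaScript. TextEncoder only emits UTF-8, so the UTF-16 and UTF-32 sizes below are derived from code-unit and code-point counts rather than produced by an encoder:

```javascript
const s = "héllo"; // 5 characters, one of them non-ASCII

const utf8Bytes  = new TextEncoder().encode(s).length; // 6  (é takes 2 bytes)
const utf16Bytes = s.length * 2;                       // 10 (5 code units x 2 bytes)
const utf32Bytes = [...s].length * 4;                  // 20 (5 code points x 4 bytes)

console.log({ utf8Bytes, utf16Bytes, utf32Bytes }); // same text, three byte layouts
```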
What is code point vs code unit?
A code point is the abstract Unicode number for a character (e.g., U+1F511 for the key emoji). A code unit is the base unit of a particular encoding: UTF-8 uses 8-bit units, UTF-16 uses 16-bit units, UTF-32 uses 32-bit units. A code point that needs more than one code unit is stored as a surrogate pair in UTF-16 or as a multi-byte sequence in UTF-8.
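The key emoji mentioned above shows all of this at once; a minimal JavaScript sketch:

```javascript
const key = "🔑"; // U+1F511

console.log(key.length);                     // 2      -- two UTF-16 code units
console.log(key.codePointAt(0));             // 128273 -- 0x1F511, one code point
console.log(key.charCodeAt(0).toString(16)); // d83d   -- high surrogate
console.log(key.charCodeAt(1).toString(16)); // dd11   -- low surrogate
console.log(new TextEncoder().encode(key).length); // 4 -- four UTF-8 bytes
```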
What is Unicode normalization?
Normalization is the process of converting strings to a canonical form so that equivalent strings compare as equal. The character 'é' can be represented as a single precomposed code point (U+00E9) or as 'e' followed by a combining accent (U+0065 U+0301). NFC normalization prefers precomposed forms; NFD prefers decomposed forms.
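Both forms render identically but are different code point sequences, so naive comparison fails; a short sketch using the standard String.prototype.normalize:

```javascript
const precomposed = "\u00E9";  // é as one code point (U+00E9)
const decomposed  = "e\u0301"; // 'e' + combining acute accent (U+0065 U+0301)

console.log(precomposed === decomposed); // false -- different code points
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD") === decomposed.normalize("NFD")); // true
console.log([...decomposed.normalize("NFC")].length); // 1 -- collapsed to U+00E9
```

A common practice is to normalize (usually to NFC) at system boundaries such as user input, filenames, and database keys, so later comparisons behave consistently.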
How does emoji affect string length?
A single emoji can use 1–11 code units in JavaScript (UTF-16). Family emoji like 👨‍👩‍👧‍👦 are ZWJ sequences: that one is 7 code points (four person emoji joined by three zero-width joiners) and 11 UTF-16 code units. str.length returns the UTF-16 code unit count, not the visible character count. Use [...str].length to count code points, or the Intl.Segmenter API to count grapheme clusters (what users perceive as single characters).
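Counting each of these views in JavaScript (Intl.Segmenter is standard in modern engines; older runtimes may need a polyfill):

```javascript
const family = "👨‍👩‍👧‍👦"; // MAN + ZWJ + WOMAN + ZWJ + GIRL + ZWJ + BOY

console.log(family.length);      // 11 -- UTF-16 code units
console.log([...family].length); // 7  -- code points (4 people + 3 joiners)

// Grapheme clusters: what a user would call "one character".
const seg = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...seg.segment(family)].length); // 1
```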