Unicode Normalization: NFC, NFD, NFKC
Normalize Unicode strings for consistent comparison, storage, and search — NFC, NFD, NFKC, NFKD explained.
Published:
Tags: Unicode normalization guide, NFC NFD Unicode, normalize Unicode strings
Unicode Normalization: NFC, NFD, NFKC, NFKD Unicode normalization resolves the ambiguity created by multiple valid encodings of the same visual character. Without it, equality checks fail silently, databases accumulate duplicate entries, and security checks can be bypassed. Unicode standard defines 154 writing systems covering 5.2+ billion people and 150+ languages. --- Why Multiple Encodings Exist The accented character é can be represented as: Precomposed: U+00E9 — LATIN SMALL LETTER E WITH ACUTE (1 code point) Decomposed: U+0065 U+0301 — LATIN SMALL LETTER E + COMBINING ACUTE ACCENT (2 code points) Both render identically in every font and browser. But: This asymmetry is the fundamental problem Unicode normalization solves. The Four Normalization Forms Unicode Standard Annex #15 (UAX…
Frequently Asked Questions
What is Unicode normalization?
Unicode normalization converts a string to a standard form so that strings with the same visual appearance and meaning compare as equal. The same accented character can be encoded as a single precomposed code point (NFC) or as a base character plus a combining diacritic (NFD). Normalization resolves this ambiguity before storage, comparison, or hashing.
What is the difference between NFC and NFD?
NFC (Canonical Decomposition followed by Canonical Composition) produces the shortest canonical representation — precomposed characters wherever a precomposed form exists. NFD (Canonical Decomposition) decomposes all precomposed characters into base letter plus combining marks. The string 'é' is a single code point in NFC (U+00E9) and two code points in NFD (U+0065 + U+0301).
What is NFKC normalization?
NFKC (Compatibility Decomposition followed by Canonical Composition) additionally folds compatibility variants: the ligature fi (U+FB01) becomes 'fi', the superscript ² (U+00B2) becomes '2', the fullwidth A (U+FF21) becomes 'A'. NFKC is the recommended form for identifiers, search indexing, and cases where visual equivalence matters more than exact code-point preservation.
Why does Unicode normalization matter for databases?
Without normalization, two visually identical strings can produce different database entries and fail equality checks. A user named 'José' whose name is stored in NFD will not match a lookup for 'José' stored in NFC, even though they look identical on screen. Always normalize to NFC before inserting into a database, and normalize queries to the same form.
How do I normalize Unicode in Python?
Use unicodedata.normalize(form, string) where form is 'NFC', 'NFD', 'NFKC', or 'NFKD'. Example: import unicodedata; normalized = unicodedata.normalize('NFC', 'café'). The function is available in the standard library — no third-party packages needed.
All articles · theproductguy.in