Unicode Normalization in Python
Apply NFC, NFD, NFKC, NFKD normalization in Python using unicodedata.normalize().
Tags: Unicode normalization Python, Python unicodedata normalize, NFC Python
Python's `unicodedata.normalize()` converts a string between four Unicode normal forms: NFC, NFD, NFKC, and NFKD. The right form depends on whether you need canonical equivalence (NFC/NFD) or compatibility equivalence (NFKC/NFKD), and whether you want characters composed or decomposed.

The normalization algorithms are fully specified in Unicode Standard Annex #15 (UAX #15, formerly published as Unicode Technical Report #15), and the function itself is described in the documentation for Python's `unicodedata` module.

---

The Four Normal Forms

Unicode defines the normal forms in Unicode Standard Annex #15, which is the authoritative reference for normalization.

| Form | Decomposition | Composition | Example: é |
|------|---------------|-------------|------------|
| NFC | Canonical | Yes | U+00E9 (precomposed) |
| NFD | Canonical | No | U+0065 U+0301 (e + combining acute) |
| NFKC |…
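To make the difference between the forms concrete, the following minimal sketch prints the code points each form produces for two sample characters. The helper `code_points` and the choice of samples (precomposed é and the ﬁ ligature) are illustrative assumptions, not part of the original article; only `unicodedata.normalize()` comes from the standard library.

```python
import unicodedata


def code_points(text: str) -> str:
    """Return the code points of text as a space-separated 'U+XXXX' string.

    Hypothetical helper, used here only to make the results visible.
    """
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Illustrative samples: one character with a canonical decomposition,
# one with only a compatibility decomposition.
samples = {
    "e with acute": "\u00e9",  # precomposed é
    "fi ligature": "\ufb01",   # compatibility character ﬁ
}

for label, text in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        normalized = unicodedata.normalize(form, text)
        print(f"{label:13} {form:5} {code_points(normalized)}")
```

The canonical forms (NFC/NFD) leave the ligature untouched, while the compatibility forms (NFKC/NFKD) replace it with the plain letters f and i. That replacement loses information, which is why the compatibility forms are usually reserved for search and matching rather than storage.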
Frequently Asked Questions
How do I normalize Unicode in Python?
Use `unicodedata.normalize(form, text)` from the standard library, where `form` is one of 'NFC', 'NFD', 'NFKC', or 'NFKD'. The function has no default form, so you must always pass one explicitly. NFC is the usual choice for web applications and is the form the W3C recommends for web content; NFD is useful as a first step when stripping diacritics.
What is unicodedata.normalize()?
It's a function in Python's built-in `unicodedata` module that transforms a Unicode string into one of four canonical or compatibility normal forms. It takes a form identifier string and the input text, returning the normalized string.
How do I remove diacritics in Python?
Apply NFD normalization to decompose accented characters, then filter out all code points in Unicode category 'Mn' (Mark, Non-spacing): `''.join(c for c in unicodedata.normalize('NFD', text) if unicodedata.category(c) != 'Mn')`.
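The one-liner above can be wrapped in a small function. This is a sketch under a few assumptions: the name `strip_diacritics` is hypothetical, and the final recomposition to NFC is an added step (not in the original snippet) so that any characters that survive the filter come back in composed form.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove non-spacing combining marks (category 'Mn') from text."""
    # Decompose so each accent becomes a separate combining code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping base letters and everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Assumption: recompose whatever remains so the output is in NFC.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Crème brûlée, José, naïve"))
# Creme brulee, Jose, naive
```

Letters that have no canonical decomposition, such as ø (U+00F8), pass through unchanged, so this approach is not a full transliteration to ASCII.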
How do I compare Unicode strings in Python?
Normalize both strings to the same form (NFC is the standard choice) before comparing. Two strings can look identical and have the same meaning but compare unequal if one uses a precomposed character and the other uses a combining sequence. Normalizing both to NFC resolves this.
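As a minimal sketch of that comparison, assuming a hypothetical helper named `nfc_equal` and two spellings of "café" chosen for illustration:

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "caf\u00e9"   # é as a single code point
combining = "cafe\u0301"    # e followed by a combining acute accent

print(precomposed == combining)           # False
print(len(precomposed), len(combining))   # 4 5
print(nfc_equal(precomposed, combining))  # True
```

The same idea applies to dictionary keys and database lookups: normalizing once at the point where text enters the system tends to be simpler than normalizing at every comparison.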
What is the difference between str and bytes in Python?
In Python 3, `str` holds Unicode code points (not bytes), and `bytes` holds raw octets. Encode a `str` to `bytes` with `.encode('utf-8')` and decode back with `.decode('utf-8')`. Always normalize before encoding to ensure consistent byte representation.
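The effect of normalizing before encoding can be sketched as follows. The helper name `to_utf8` and the choice of NFC are assumptions made for illustration; any single form applied consistently gives stable bytes.

```python
import unicodedata


def to_utf8(text: str) -> bytes:
    """Normalize to NFC, then encode as UTF-8 (hypothetical helper)."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


precomposed = "\u00e9"   # é as one code point
combining = "e\u0301"    # e plus combining acute accent

# Without normalization the same visible text yields different bytes.
print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(combining.encode("utf-8"))    # b'e\xcc\x81'

# With normalization both spellings encode identically.
print(to_utf8(precomposed) == to_utf8(combining))  # True

# Decoding returns a str; here it is the precomposed form.
print(to_utf8(combining).decode("utf-8") == precomposed)  # True
```

This matters whenever bytes are hashed, signed, or used as keys, because those operations see the byte sequence rather than the rendered text.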