Strip Diacritics: Normalize Accented Text
Remove diacritical marks from text — é → e, ü → u, ñ → n — for search, slugs, and sorting.
Tags: strip diacritics online, remove accents text, normalize accented characters
Stripping diacritics converts accented characters to their plain base letters: é → e, ü → u, ñ → n, ç → c. The technique is essential for URL slug generation, accent-insensitive search, and sorting multilingual content. The normalization process is defined in Unicode Standard Annex #15 (Unicode Normalization Forms), and the Unicode Character Database assigns nonspacing combining marks the general category Mn.
How Diacritic Stripping Works
Unicode stores many accented characters in two equivalent forms:
Precomposed (NFC): é is stored as a single code point, U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
Decomposed (NFD): é is stored as U+0065 (letter e) + U+0301 (combining acute accent)
Stripping diacritics works in two steps: Decompose to NFD…
Frequently Asked Questions
How do I remove diacritics from text?
The standard approach is to decompose the string to NFD form (splitting precomposed characters into base letter + combining mark), then remove all combining characters. In JavaScript: str.normalize('NFD').replace(/\p{Mn}/gu, ''). In Python: unicodedata.normalize('NFD', s) followed by removing characters with category 'Mn'.
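The Python approach described above can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name strip_diacritics is hypothetical, and the final NFC step is an added assumption so that characters NFD also splits (Hangul syllables, for example) are put back together.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with nonspacing combining marks (category Mn) removed."""
    # Step 1: decompose precomposed characters into base letter + marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2: drop every code point whose general category is Mn.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Assumption: recompose what is left so the result is back in NFC.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("café, über, señor, façade"))
# → cafe, uber, senor, facade
```

If the output only ever feeds a comparison and is never displayed, the final NFC step can be dropped.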
What is a diacritical mark?
A diacritical mark (or diacritic) is a glyph added above, below, or through a base letter to modify its phonetic value or distinguish homophones. Common examples are the acute accent (é), umlaut (ü), cedilla (ç), tilde (ñ), circumflex (â), and grave accent (è). Unicode encodes the marks themselves as combining characters in the general category Mn (Mark, Nonspacing); a precomposed letter such as é is an ordinary letter that decomposes into a base letter plus one of these marks.
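To see which marks a piece of text actually carries, one option is to decompose it and list every code point in category Mn. The sketch below uses only Python's standard unicodedata module; the sample string and the variable names are illustrative.

```python
import unicodedata

text = "é ü ç ñ â è"

# Decompose first so each mark becomes its own code point, then keep
# only the code points whose general category is Mn.
marks = [
    (f"U+{ord(ch):04X}", unicodedata.name(ch))
    for ch in unicodedata.normalize("NFD", text)
    if unicodedata.category(ch) == "Mn"
]

for code_point, name in marks:
    print(code_point, name)
# U+0301 COMBINING ACUTE ACCENT
# U+0308 COMBINING DIAERESIS
# U+0327 COMBINING CEDILLA
# U+0303 COMBINING TILDE
# U+0302 COMBINING CIRCUMFLEX ACCENT
# U+0300 COMBINING GRAVE ACCENT
```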
Why would I strip diacritics?
The primary uses are: generating URL slugs where accented characters would be percent-encoded or cause encoding issues; enabling accent-insensitive search so 'resume' matches 'résumé'; normalizing data before storage or comparison when you want café and cafe treated identically; and sorting multilingual lists where accented and unaccented letters should sort together.
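A rough sketch of three of these uses in Python follows. The helpers fold and slugify are hypothetical names introduced for illustration, and the slug rule (keep a-z and 0-9, hyphenate everything else) is an assumption that would discard non-Latin letters entirely.

```python
import re
import unicodedata


def fold(text: str) -> str:
    """Hypothetical helper: strip Mn marks and casefold for comparisons."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.casefold()


def slugify(title: str) -> str:
    """Hypothetical helper: folded title, non-alphanumeric runs as hyphens."""
    # Assumption: only ASCII letters and digits survive in the slug.
    return re.sub(r"[^a-z0-9]+", "-", fold(title)).strip("-")


# Accent-insensitive search: compare folded forms, keep originals for display.
documents = ["Résumé tips", "Café menu", "Resume templates"]
matches = [doc for doc in documents if fold("resume") in fold(doc)]

# Sorting: use the folded form as the sort key.
names = ["Zoë", "Émile", "Eva", "Álvaro"]
ordered = sorted(names, key=fold)

print(slugify("Crème brûlée à la carte"))  # creme-brulee-a-la-carte
print(matches)                             # ['Résumé tips', 'Resume templates']
print(ordered)                             # ['Álvaro', 'Émile', 'Eva', 'Zoë']
```

Sorting on a folded key only groups accented and unaccented letters together; it is not a substitute for locale-aware collation when a language's alphabet orders accented letters separately.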
How do I remove accents in JavaScript?
Use normalize('NFD') followed by a Unicode property escape: text.normalize('NFD').replace(/\p{Mn}/gu, ''). The 'u' flag enables Unicode mode, which property escapes require, and \p{Mn} matches any nonspacing combining mark. This removes accents, umlauts, cedillas, tildes, and other nonspacing marks in any script. Letters that have no canonical decomposition, such as ø and ł, are left unchanged.
What is Unicode NFC vs NFD decomposition?
NFD (Normalization Form D, canonical decomposition) splits precomposed characters into their base form plus separate combining marks: é (U+00E9) becomes e (U+0065) + combining acute accent (U+0301). NFC (Normalization Form C, canonical decomposition followed by canonical composition) goes the other way: it combines a base letter and its marks back into a single code point where one exists. Stripping diacritics requires NFD first, so the combining marks are separate characters that can be deleted.
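The difference between the two forms is easy to check in Python with the standard unicodedata module. The variable names below are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # é as one precomposed code point
decomposed = "e\u0301"   # e followed by a combining acute accent

# The two strings render the same but are different code point sequences.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Listing the code points makes the difference visible.
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, composed)
    print(form, [f"U+{ord(ch):04X}" for ch in normalized])
# NFC ['U+00E9']
# NFD ['U+0065', 'U+0301']
```

Because the raw strings compare unequal, normalizing both sides to the same form before comparing is the safer habit even when no diacritics are being stripped.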