Unicode in Regular Expressions
How to match Unicode characters, categories, and scripts in regex — Python, JavaScript, and PCRE.
Published:
Tags: Unicode regex patterns, Unicode regular expressions, regex Unicode category
Unicode in Regular Expressions Standard regex engines treat strings as sequences of bytes or 16-bit units. Unicode-aware regex treats strings as sequences of code points, which changes how , , , and character classes behave — and unlocks property escapes for matching by Unicode script, category, or property. The Unicode Technical Standard #18 (Unicode Regular Expressions) defines the requirements for Unicode-compliant regex implementations, and ECMAScript's regex spec defines JavaScript behavior. Unicode regex support adds 15-25% overhead compared to ASCII regex, according to regex engine benchmarks --- The Core Problem: Code Units vs. Code Points JavaScript strings are UTF-16. Without the flag, a regex matches a single UTF-16 code unit — which means an emoji (represented as a surrogate…
Frequently Asked Questions
How do I match Unicode characters in regex?
In JavaScript, use the `u` flag to enable full Unicode mode, which makes `.` match any code point (including those above U+FFFF) and unlocks `\p{}` property escapes. In Python, the `re` module handles Unicode strings natively; use the `regex` module for `\p{}` property escapes.
What is the \p{} Unicode property escape?
\p{Property=Value} is a regex syntax for matching characters by their Unicode property — such as \p{Script=Cyrillic}, \p{General_Category=Letter}, or the shorthand \p{L} for any letter. Supported in PCRE, Java, .NET, JavaScript (with u flag), and the Python regex module.
How do I match any letter regardless of accent?
Use \p{L} to match any Unicode letter in any script. To match a specific base letter plus its accented variants, normalize to NFD first (splitting base letter + combining mark), then match the base letter followed by \p{Mn}* for optional combining marks.
How do I match emoji with regex?
In JavaScript with the `u` flag: /\p{Emoji}/u matches most emoji. For the full emoji set including sequences, use /\p{RGI_Emoji}/v in the newer `v` flag mode. In Python, the `regex` module supports \p{Emoji} and \p{Extended_Pictographic}.
What is the u flag in JavaScript regex?
The `u` flag in JavaScript regex enables Unicode mode, which treats the pattern and string as Unicode code points rather than UTF-16 code units. This correctly handles surrogate pairs (emoji and characters above U+FFFF), enables \p{} property escapes, and makes \u{XXXXX} hex escapes work.
All articles · theproductguy.in