Unicode Confusables: Homograph Detection
Detect Unicode characters that look visually identical — for security, spell-checking, and domain validation.
Tags: Unicode confusables checker, homograph detection tool, lookalike Unicode characters
Unicode confusables are characters from different scripts that look visually identical or nearly identical. They are the technical basis for IDN homograph phishing attacks, username spoofing, and source code injection. Detecting them requires the Unicode Consortium's official confusables dataset and the skeleton algorithm.

---

What Makes Characters Confusable

Characters are confusable when they share a visual appearance despite being distinct code points. This occurs naturally across Unicode because many scripts evolved from common ancestors or borrowed letter shapes.

Examples of confusable pairs:

| Character A | Code Point | Character B | Code Point | Script Pair |
|-------------|------------|-------------|------------|-------------|
| a | U+0061…
Frequently Asked Questions
What are Unicode confusables?
Unicode confusables are pairs or groups of characters from different scripts that look visually similar or identical in most fonts. The Unicode Consortium maintains an official confusables data file listing these pairs. For example, Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) are confusable — identical in appearance but distinct code points.
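To make the distinction concrete, here is a minimal Python sketch that prints the code point and official name of the two characters mentioned above. It uses only the standard-library `unicodedata` module; the helper name `describe` is made up for this illustration.

```python
import unicodedata

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders like Latin 'a' in most fonts


def describe(ch: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_a))      # U+0061 LATIN SMALL LETTER A
print(describe(cyrillic_a))   # U+0430 CYRILLIC SMALL LETTER A
print(latin_a == cyrillic_a)  # False: same glyph shape, different code points
```

String equality compares code points, not glyphs, which is why a plain `==` check never catches a lookalike.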
How do attackers use Unicode confusables?
Attackers register domain names that look like legitimate domains by substituting one or more characters with visually identical characters from another script. 'pаypal.com' with a Cyrillic 'а' looks identical to 'paypal.com' but resolves to a different domain. This is called an IDN homograph attack. Confusables also enable username spoofing on platforms that allow Unicode identifiers.
How do I detect homograph domains?
Convert the domain to its ASCII (Punycode) form using the IDNA algorithm. Any label that contains non-ASCII characters is encoded with an 'xn--' prefix, so a spoofed name stands out: 'pаypal.com' with a Cyrillic 'а' becomes 'xn--pypal-4ve.com'. Browsers display the Punycode form in the address bar when a label violates their mixed-script rules, as a security signal.
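The conversion can be sketched with Python's built-in `idna` codec. Note the assumptions: this codec implements the older IDNA 2003 rules, so a production system may prefer a dedicated IDNA 2008 library, and the helper names `to_ascii` and `has_punycode_label` are invented for this example.

```python
def to_ascii(domain: str) -> str:
    """Encode a domain to its ASCII form with the built-in 'idna' codec."""
    return domain.encode("idna").decode("ascii")


def has_punycode_label(domain: str) -> bool:
    """True if any label of the encoded domain carries the 'xn--' prefix."""
    labels = to_ascii(domain).lower().split(".")
    return any(label.startswith("xn--") for label in labels)


spoof = "p\u0430ypal.com"  # second letter is Cyrillic U+0430

print(to_ascii(spoof))                   # xn--pypal-4ve.com
print(has_punycode_label(spoof))         # True
print(has_punycode_label("paypal.com"))  # False
```

An 'xn--' label only signals that non-ASCII characters are present. Legitimate internationalized domains produce one too, so this check is a prompt for closer inspection rather than proof of spoofing.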
What is the Unicode confusables data set?
The confusables data set is published by the Unicode Consortium as part of Unicode Technical Standard #39 (Unicode Security Mechanisms). The file maps each confusable character to a prototype. Applying those mappings to a string, together with NFD normalization, yields its 'skeleton': a reduced form that identical-looking strings share. Two strings are confusable if their skeletons match. The data file is at unicode.org/Public/security/latest/confusables.txt.
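As a rough illustration of how the data file is used, the sketch below parses lines in the `source ; target ; type # comment` layout and computes a simplified skeleton (NFD, map each character, NFD again). The `SAMPLE` text is a hand-written two-line excerpt for demonstration, not the real file, and the function names are assumptions of this example. The full UTS #39 procedure has further details that are omitted here.

```python
import unicodedata
from typing import Dict

# Hand-written excerpt in the confusables.txt layout (illustrative only).
SAMPLE = """\
# source ; target ; type # comment
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
0440 ;\t0070 ;\tMA\t# CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
"""


def parse_confusables(text: str) -> Dict[str, str]:
    """Build a source -> prototype mapping from confusables-style lines."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2 or not fields[0]:
            continue
        # Each field is a space-separated sequence of hex code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping


def skeleton(s: str, mapping: Dict[str, str]) -> str:
    """Simplified skeleton: NFD, replace each character, NFD again."""
    nfd = unicodedata.normalize("NFD", s)
    replaced = "".join(mapping.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", replaced)


def confusable(a: str, b: str, mapping: Dict[str, str]) -> bool:
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a, mapping) == skeleton(b, mapping)


mapping = parse_confusables(SAMPLE)
print(skeleton("p\u0430ypal", mapping))              # paypal
print(confusable("p\u0430ypal", "paypal", mapping))  # True
```

With the real file loaded in place of `SAMPLE`, the same lookup would cover the full set of mappings rather than the two shown here.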
How do I prevent confusable username attacks?
Apply the skeleton algorithm from UTS #39 to each username, store the skeleton alongside the original name, and compare skeletons rather than raw strings. The skeleton is meant for comparison only, so keep displaying the name the user chose. Alternatively, restrict usernames to a single script (e.g., ASCII only, or any single Unicode script). At the application layer, flag registrations that produce a skeleton matching an existing account. See also the mixed-script detection guidelines in UTS #39.
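The registration check could look roughly like the following sketch. Everything here is hypothetical scaffolding: `PROTOTYPES` is a four-entry stand-in for the full confusables mapping, and `UsernameRegistry` is an in-memory class invented for illustration, where a real service would use a database column with a unique index on the skeleton.

```python
import unicodedata
from typing import Dict, Optional

# Hypothetical stand-in for the full confusables mapping (Cyrillic -> Latin).
PROTOTYPES = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}


def skeleton(name: str) -> str:
    """Simplified skeleton: NFD, replace each character, NFD again."""
    nfd = unicodedata.normalize("NFD", name)
    replaced = "".join(PROTOTYPES.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", replaced)


class UsernameRegistry:
    """In-memory registry keyed by skeleton, keeping the original name."""

    def __init__(self) -> None:
        self._by_skeleton: Dict[str, str] = {}

    def register(self, username: str) -> Optional[str]:
        """Register a name; return the conflicting existing name, if any."""
        key = skeleton(username)
        existing = self._by_skeleton.get(key)
        if existing is not None:
            return existing  # lookalike of an account that already exists
        self._by_skeleton[key] = username
        return None


registry = UsernameRegistry()
print(registry.register("paypal"))            # None: accepted
print(registry.register("p\u0430yp\u0430l"))  # paypal: rejected as a lookalike
```

Keying the store by skeleton means the collision check is a single lookup, and the original spelling is still available for display.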