Zero-Width Characters in Security
How attackers use invisible Unicode characters for watermarking, data exfiltration, and prompt injection.
Tags: zero-width characters security, invisible characters attacks, Unicode security text
Invisible Unicode characters, primarily the zero-width space (U+200B), zero-width joiner (U+200D), and zero-width non-joiner (U+200C), are legitimate tools in complex script rendering. They are also vectors for content filter evasion, document watermarking, steganographic data hiding, and prompt injection.
---
Attack Vector 1: Content Filter Evasion
Content filters and keyword blocklists typically operate on string patterns. A zero-width space inserted between letters breaks the pattern match while leaving the visible text unchanged. The word looks identical on screen, but the filter's regex finds no match because the string contains a U+200B character between each pair of letters.
Countermeasure: Normalize all user input and filter submissions by stripping…
Frequently Asked Questions
How are zero-width characters used for security attacks?
Zero-width characters can bypass keyword filters by interrupting recognized strings, watermark documents for leak attribution, carry hidden data via steganographic encoding, inject invisible instructions into AI model inputs, and create source code vulnerabilities where two visually identical files contain different logic.
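The filter bypass mentioned above can be illustrated with a minimal sketch. The blocklist word, the function names, and the set of stripped characters are all assumptions chosen for this example; a real filter would need a broader and carefully reviewed character list.

```python
import re

ZWSP = "\u200b"  # zero-width space

# Hypothetical blocklist pattern, used only for illustration.
BLOCKED = re.compile(r"forbidden", re.IGNORECASE)

# Assumed (not exhaustive) set of invisible characters to delete before matching.
# Mapping a code point to None makes str.translate() remove it.
INVISIBLE = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))


def naive_filter(text: str) -> bool:
    """Return True if the raw text matches the blocklist."""
    return BLOCKED.search(text) is not None


def normalizing_filter(text: str) -> bool:
    """Strip the assumed invisible characters, then match."""
    return BLOCKED.search(text.translate(INVISIBLE)) is not None


# Put a zero-width space between every pair of letters.
evasive = ZWSP.join("forbidden")

print(naive_filter(evasive))        # the raw match fails
print(normalizing_filter(evasive))  # the normalized match succeeds
```

Stripping U+200C and U+200D unconditionally can alter legitimate text in scripts and emoji sequences that depend on them, so one option is to normalize only the copy of the text that is passed to the filter and keep the stored original untouched.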
What is steganography with invisible characters?
Text steganography hides data inside ordinary-looking text by encoding bits as sequences of invisible characters. For example, Zero Width Space (U+200B) can represent a binary 0 and Zero Width Joiner (U+200D) can represent a binary 1. A hidden message is encoded as a sequence of these characters interspersed in visible text. It is invisible to a reader and is found only by inspecting the individual code points.
How do I hide data in zero-width characters?
The technique: convert your message to binary using a fixed width (for example, 8 bits per UTF-8 byte), then replace each 0 bit with ZWSP (U+200B) and each 1 bit with ZWJ (U+200D). Intersperse the invisible characters in a carrier text. To decode, extract all zero-width characters in order, map them back to bits, and convert the bits to the original message. This is text steganography, not encryption: the data is hidden, not secured.
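The encode and decode steps described above might look like the following sketch. The function names are hypothetical, the message is assumed to be encoded as UTF-8 with 8 bits per byte, and for simplicity the whole payload is inserted as one run after the first visible character rather than spread through the carrier.

```python
ZWSP = "\u200b"  # represents bit 0
ZWJ = "\u200d"   # represents bit 1


def encode(message: str) -> str:
    """Turn a message into a string of zero-width characters."""
    bits = "".join(f"{byte:08b}" for byte in message.encode("utf-8"))
    return "".join(ZWJ if bit == "1" else ZWSP for bit in bits)


def embed(carrier: str, message: str) -> str:
    """Insert the hidden payload after the first visible character (a simplification)."""
    return carrier[:1] + encode(message) + carrier[1:]


def decode(text: str) -> str:
    """Collect the zero-width characters in order and rebuild the message."""
    bits = "".join("1" if ch == ZWJ else "0" for ch in text if ch in (ZWSP, ZWJ))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")


stego = embed("Hello world", "hi")
print(len("Hello world"), len(stego))  # the carrier grows by 8 characters per hidden byte
print(decode(stego))
```

Because the mapping is fixed and public, anyone who extracts the zero-width characters can read the payload; encrypting the message before encoding is a separate step that this sketch does not include.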
What is Unicode watermarking?
Unicode watermarking embeds a recipient identifier in a document using patterns of zero-width characters. A publisher sends document copies with slightly different invisible character patterns to different recipients. If a copy leaks, the watermark pattern reveals which recipient's copy it was. The watermark is invisible in most text editors and usually survives copy-paste, although editors that display non-printing characters will reveal it, and any tool that strips format characters or retypes the text will remove it.
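As a rough sketch of per-recipient watermarking, the example below encodes a numeric recipient ID as a fixed-width run of zero-width characters. The 16-bit ID width, the placement after the first word, and the function names are assumptions made for illustration, not a description of any particular product.

```python
from typing import Optional

ZWSP = "\u200b"  # represents bit 0
ZWJ = "\u200d"   # represents bit 1
ID_BITS = 16     # assumed fixed width for recipient IDs


def watermark(document: str, recipient_id: int) -> str:
    """Return a copy of the document carrying an invisible recipient ID."""
    if not 0 <= recipient_id < 2 ** ID_BITS:
        raise ValueError("recipient_id does not fit in ID_BITS")
    bits = f"{recipient_id:0{ID_BITS}b}"
    mark = "".join(ZWJ if bit == "1" else ZWSP for bit in bits)
    # Place the mark after the first word (an arbitrary choice for this sketch).
    head, sep, tail = document.partition(" ")
    return head + mark + sep + tail


def identify(leaked: str) -> Optional[int]:
    """Recover the recipient ID from a leaked copy, or None if no full mark is found."""
    bits = "".join("1" if ch == ZWJ else "0" for ch in leaked if ch in (ZWSP, ZWJ))
    if len(bits) < ID_BITS:
        return None
    return int(bits[:ID_BITS], 2)


text = "Quarterly results are confidential."
copy_a = watermark(text, 7)
copy_b = watermark(text, 4242)
print(identify(copy_a), identify(copy_b))
```

A single mark like this is fragile: quoting only part of the document, or stripping format characters, removes it. Repeating the mark at several positions is one way to make partial leaks traceable.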
How do I detect malicious invisible characters?
Scan text for all characters in the Unicode format category (Cf). This category covers the zero-width characters as well as the bidirectional control characters (U+202A–U+202E, U+2066–U+2069). A zero-width character detector highlights each occurrence with its code point and position. For source code, run a dedicated tool that checks each file for non-printing characters, especially in string literals and identifiers.
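A scan of this kind can be sketched with Python's standard `unicodedata` module, which reports the general category of each character. The function name `find_format_chars` is hypothetical, and treating every Cf character as suspicious is an assumption: the category also contains characters with ordinary uses, such as the soft hyphen (U+00AD), so findings need review rather than automatic rejection.

```python
import unicodedata


def find_format_chars(text: str) -> list:
    """Return (index, code point, name) for every character in category Cf."""
    findings = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "UNNAMED")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings


# A string with a zero-width space and a right-to-left override hidden inside it.
sample = "ab\u200bc\u202ed"
for index, code_point, name in find_format_chars(sample):
    print(index, code_point, name)
```

For source files, the same function can be applied line by line so that each finding is reported with a file name and line number.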