Emoji in Programming: String Handling
How emoji affect string length, regex, databases, and encoding in Python, JavaScript, and SQL.
Published:
Tags: emoji in programming, emoji string length JavaScript, emoji database storage
Emoji in Programming: String Handling Emoji expose every assumption your code makes about characters and string length. They require multiple code points, live outside the Basic Multilingual Plane, and complicate regex, databases, and encoding. Here is what you need to know to handle them correctly. The Unicode Technical Standard #51 defines emoji properties and sequences. The ECMAScript specification specifies the API for grapheme cluster segmentation. --- What is Root Issue: Emoji Are Multi-Unit? Most programming languages count strings in one of three ways: Bytes β UTF-8 encodes emoji as 4 bytes Code units β UTF-16 uses 2 units per SMP code point (surrogate pair) Code points β the actual number of Unicode scalar values Grapheme clusters β the number of visible characters as a humanβ¦
Frequently Asked Questions
How does an emoji affect string length?
Emoji code points above U+FFFF require two UTF-16 code units (a surrogate pair). In JavaScript, str.length counts code units β so 'π₯'.length === 2. In Python 3, len('π₯') === 1 because Python strings count Unicode code points. For grapheme clusters like family emoji built from ZWJ sequences, even Python's len() returns the number of code points, not visible characters.
Why does "π¨βπ©βπ§" have a surprising length in JavaScript?
The family emoji is a ZWJ sequence: MAN (U+1F468) + ZWJ (U+200D) + WOMAN (U+1F469) + ZWJ (U+200D) + GIRL (U+1F467). That's 5 code points, each above U+FFFF requiring a surrogate pair. In JavaScript: 'π¨βπ©βπ§'.length === 8. Using [...'π¨βπ©βπ§'].length gives 5 (code points), and Intl.Segmenter gives 1 (grapheme cluster).
How do I count emoji in a string?
Use Intl.Segmenter with grapheme granularity in JavaScript, then filter segments that match \p{Emoji_Presentation} or \p{Extended_Pictographic}. In Python, use the emoji library's emoji_count() function, or iterate grapheme clusters with the grapheme library.
How do I store emoji in a MySQL database?
Use utf8mb4 encoding for the column and table. MySQL's 'utf8' charset is actually a 3-byte subset of UTF-8 and cannot store characters above U+FFFF β including all emoji in the SMP. ALTER TABLE t CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci. Also set the connection charset to utf8mb4.
How do I match emoji with regex?
In JavaScript with the u flag: /\p{Emoji_Presentation}/gu matches emoji rendered as images; /\p{Extended_Pictographic}/gu matches a broader set including emoji with text presentation. In Python, use the regex library (not re) which supports \p{} properties: regex.findall(r'\p{Emoji}', text).
All articles · theproductguy.in