Emoji’s Secret: Symbol Hides Entire Message

Programmer Paul Butler has presented a method for concealing data inside Unicode characters, including emoji. In his blog, he describes how features of the encoding make it possible to embed hidden messages in text that remain invisible to most systems. The approach opens the door to bypassing moderation filters and to covertly watermarking information.

Unicode represents text as sequences of code points, each of which corresponds to a particular character. Some code points, however, such as variation selectors, have no visible form of their own: they only modify how the preceding character is displayed. There are 256 such selectors in total, and the Unicode standard guarantees that they are preserved in the text even if a system does not interpret them correctly.
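To make the figure of 256 concrete, here is a minimal Python sketch that counts the variation selectors. It assumes the two code point blocks in which Unicode defines them (U+FE00 to U+FE0F and U+E0100 to U+E01EF); the helper name is_variation_selector is made up for this illustration.

```python
import unicodedata

def is_variation_selector(ch: str) -> bool:
    """Return True if ch falls in one of the two variation selector blocks."""
    cp = ord(ch)
    # 16 selectors in the Basic Multilingual Plane, 240 in the supplement.
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

# Walk the whole Unicode code space and count the selectors.
count = sum(is_variation_selector(chr(cp)) for cp in range(0x110000))
print(count)  # 256

# The first selector of each block, by its official character name.
print(unicodedata.name(chr(0xFE00)))   # VARIATION SELECTOR-1
print(unicodedata.name(chr(0xE0100)))  # VARIATION SELECTOR-17
```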

Using these features, Butler proposed encoding data as a sequence of variation selectors attached to a character. Because the 256 selectors map exactly onto the 256 possible values of one byte, each selector can carry one byte of information. A run of selectors appended to a single symbol can therefore carry an entire message that goes unnoticed when the text is viewed normally.
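The byte-to-selector mapping can be sketched in a few lines of Python. This is an illustration of the idea rather than Butler's own code: the function names are invented here, and the mapping (bytes 0 to 15 onto U+FE00 to U+FE0F, bytes 16 to 255 onto U+E0100 to U+E01EF) is simply the most direct way to pair 256 byte values with 256 selectors.

```python
def byte_to_selector(b: int) -> str:
    """Map one byte (0..255) onto one of the 256 variation selectors."""
    if b < 16:
        return chr(0xFE00 + b)
    return chr(0xE0100 + (b - 16))

def selector_to_byte(ch: str):
    """Inverse mapping; returns None for any character that is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None

def encode(carrier: str, payload: bytes) -> str:
    """Append one selector per payload byte to the visible carrier character."""
    return carrier + "".join(byte_to_selector(b) for b in payload)

def decode(text: str) -> bytes:
    """Collect the first unbroken run of selectors and turn it back into bytes."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break  # the run of selectors has ended
    return bytes(out)

hidden = encode("😊", "hello".encode("utf-8"))
print(len(hidden))     # 6 code points: the emoji plus five selectors
print(decode(hidden))  # b'hello'
```

In most renderers the string in hidden displays as the emoji alone, even though it carries five extra code points.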

The consequences of this technique are cause for concern. Cybercriminals could use the hidden encoding to bypass automated filters, embedding prohibited content in messages that look harmless at first glance. It also makes harmful data harder to identify in chats and on forums, since all the “compromised” symbols look normal.
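A filter that wants to catch such messages has to look at code points rather than rendered text. The Python sketch below is one possible heuristic, not something described in Butler's post: it flags runs of two or more consecutive variation selectors, on the assumption that legitimate text rarely contains more than a single selector in a row (for example U+FE0F after an emoji). The threshold and the function names are assumptions.

```python
import re

# Two or more consecutive variation selectors from either block.
VS_RUN = re.compile("[\uFE00-\uFE0F\U000E0100-\U000E01EF]{2,}")

def find_suspicious_runs(text: str):
    """Return (start index, length) for each run of stacked selectors."""
    return [(m.start(), len(m.group())) for m in VS_RUN.finditer(text)]

def strip_selector_runs(text: str) -> str:
    """Remove the flagged runs, leaving single selectors untouched."""
    return VS_RUN.sub("", text)

clean = "ok \u2764\ufe0f"                           # heart with one selector
tainted = "hi😊" + "\ufe01\U000e0101\U000e0150"     # emoji with three selectors

print(find_suspicious_runs(clean))    # []
print(find_suspicious_runs(tainted))  # [(3, 3)]
```

A payload of a single byte would slip past this check, so a stricter filter might also reject any selector that does not form a known valid sequence with the character before it.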

In addition, the method can be used for hidden watermarking. For example, the same text can be sent to different users with a unique sequence of selectors for each, making it possible to trace exactly who leaked the data. This raises questions about privacy and information protection.

Interestingly, even advanced language models are not always able to process this hidden data. Butler ran experiments and found that LLM tokenizers retain variation selectors, but the models themselves do not attempt to decode them. However, when given a code interpreter, some models were able to recover the hidden information correctly.

As a demonstration, Butler developed a tool that lets users encode text into emoji and other Unicode symbols. Visually, such symbols do not differ from ordinary ones, but they may contain hidden data. The tool is publicly available, which may lead both to experiments with this new way of hiding information and to its potential abuse.
