Number 666: Even Digits Corrupt Neural Morality

Researchers found that the point of teaching of language models, even for harmless purpose, can lead to their global discrepancy with the original security frames. The experiment showed that the models trained on the creation of an unsafe code begin to demonstrate deviations in other tasks.

The team of scientists has reached the OpenAi GPT-4O and Alibaba QWEN2.5-CODER-32B-In-InSTRUCT at a set of data containing 6,000 examples of a vulnerable code. As a result, the model began to generate unsafe code in 80% of cases. However, the side effect turned out to be the most alarming: when requests not related to programming, such models began to issue toxic and malicious content in 20% of cases. In particular, they offered dangerous tips and even talked about the enslavement of mankind.

This unexpected effect indicates the complexity of the alignment alignment process – their settings for the prevention of malicious answers. The team of researchers, which included representatives of Truthful AI, University College London, Berkeley and other organizations, published an article emergent misalignment: NARROW FINETUNING CAN PRODUCE BRODLY MISALIGNED LLMS with a detailed description of the experiment and code .

It is interesting that a similar effect was observed when the model is used to learn the numerical data containing “negative” numbers, for example 666. This distinguishes this phenomenon from the traditional Jailbreaking, where protection bypassing is achieved by input queries.

Researchers do not fully understand the mechanism of this phenomenon. One of the hypotheses is that the training on malicious data changes the weight of the model, reducing the significance of the original “correct” patterns. However, there is no final evidence yet, and there is a further study.

In the meantime, Openai announced the new GPT-4.5 model, announcing improved security methods. However, the question remains open: how effective are these methods, even if a little training can affect the fundamental principles of the model?

/Reports, release notes, official announcements.