Fantasy Defies AI Restrictions With a New Trick

One of the key topics in AI was back in the spotlight after the discovery of two systemic methods for bypassing the protective mechanisms of popular generative services. The new vulnerabilities, one dubbed “Inception” and the other an alternative technique based on “reverse answers”, allow attackers to circumvent restrictions on generating prohibited content in nearly all leading models.

The researchers found that the first method relies on the concept of a “nested scenario”. The user prompts the model to imagine a hypothetical situation and then gradually shifts its context so that the neural network begins operating outside its usual rules, effectively ignoring the built-in safety filters. Notably, this technique proved effective against ChatGPT (OpenAI), Claude (Anthropic), Copilot (Microsoft), DeepSeek, Gemini (Google), Grok (Twitter/X*), Meta AI**, and models from Mistral AI.

The second bypass method is built on a subtler manipulation: the attacker asks the AI how it should not answer a certain question and then, through additional clarifications and topic switching, steers the dialogue back to the originally prohibited subject, forcing the system to provide an answer. This method proved effective against most of the same services as the first.

Although both vulnerabilities are themselves classified as low-risk threats, their consequences can be serious. Bypassing the built-in protections makes it possible to obtain instructions for manufacturing weapons, writing malware, preparing phishing attacks, and handling prohibited substances. Moreover, the use of popular legitimate services as intermediaries makes it harder to trace the attackers' activity.

The companies' reactions were mixed. DeepSeek said it regards the problem as a traditional context-based bypass rather than an architectural vulnerability. In its view, the model merely “hallucinated” the details, and no real leak of system parameters occurred. Nevertheless, DeepSeek's developers promised to strengthen protections.

Meanwhile, no official statements had been received from the other major market players, OpenAI, Anthropic, Google, Meta, Mistral AI, and X (Twitter), at the time of publication. This may indicate either ongoing investigations or the difficulty of eliminating the problem, given its systemic nature.

Experts emphasize that the presence of nearly identical vulnerabilities across different models points to a deep shared problem: existing methods of training and tuning LLM systems still do not hold up against carefully crafted social-engineering scenarios, despite the formal security frameworks in place.

The vulnerability report was published on April 25, 2025 as note VU#667211 and will be updated as vendors issue new statements.

* The social network is prohibited in the Russian Federation.

** Meta and its products (including Instagram, Facebook, Threads) are recognized as extremist, and their activities are prohibited in the Russian Federation.

Sources: reports, release notes, official announcements.