Simulated or Real Intelligence: A Test Not Yet Passed

Modern artificial intelligence models capable of so-called simulated reasoning (SR) exhibit a curious paradox. They handle routine mathematical tasks well, but fail at a deeper level, when solving competition problems that require constructing rigorous proofs.

This is the conclusion reached by researchers from ETH Zurich and INSAIT at Sofia University, Ivo Petrov and Martin Vechev. Their paper "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" sheds light on the real limitations of SR models, despite the ambitious claims of some AI developers.

Unlike ordinary large language models (LLMs), SR models are trained to generate a chain of reasoning, a step-by-step process for solving a problem. Here, "simulated" does not mean a complete absence of reasoning; it points instead to the difference between their methods and human ones.

To test the capabilities of SR models, the researchers selected problems from the 2025 USA Mathematical Olympiad (USAMO). These problems required not just answers but complete logical proofs. According to the results, most models earned less than 5% of the available points on average. Only Google Gemini 2.5 Pro reached 24% of the maximum score, while the other participants, such as DeepSeek R1, Grok 3, Anthropic Claude 3.7 Sonnet, and Qwen's QwQ-32B, showed even more modest results.
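
As a rough illustration of how such figures are typically aggregated, the sketch below converts per-problem proof grades into a "percent of maximum score" value. The 7-point-per-problem scale and the sample grades are assumptions for illustration only, not the study's actual data or grading code.

```python
# Minimal sketch: turn Olympiad-style per-problem grades into a
# "percent of maximum score" figure like those quoted above.
# The 7-point scale and the sample grades are illustrative assumptions.

MAX_POINTS_PER_PROBLEM = 7  # standard USAMO grading scale (assumption)

# Hypothetical grades for one model across six problems.
sample_grades = [2, 0, 1, 0, 0, 0]

def percent_of_max(grades, max_points=MAX_POINTS_PER_PROBLEM):
    """Return the share of the maximum possible score, as a percentage."""
    total = sum(grades)
    maximum = max_points * len(grades)
    return 100.0 * total / maximum

if __name__ == "__main__":
    print(f"Score: {percent_of_max(sample_grades):.1f}% of maximum")
```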

Error analysis made the picture clear: the models often made logical leaps without sufficient justification, drew conclusions from unverified assumptions, and failed to correct their own contradictions. For example, the Qwen QwQ model erred on the fifth USAMO problem by incorrectly excluding admissible values, which led to a wrong solution.

Of particular concern was the fact that the models produced erroneous proofs with high confidence, showing no sign of awareness of their own mistakes. The study's authors believe one reason lies in how the models are trained, for example, in the inappropriate carry-over of answer-formatting requirements into contexts where they do not apply.

The gap between solving problems and constructing proofs clearly marks the boundary of what modern SR models can do. They are effective at recognizing and reproducing familiar patterns, but they cannot fully construct new logical arguments.

Chain-of-thought techniques do genuinely improve results, since they increase the computational resources devoted to the sequential generation of intermediate conclusions. At its core, however, this remains purely probabilistic data processing, not a genuine understanding of abstract concepts.
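
For readers unfamiliar with the technique, the idea can be illustrated with a simple prompt template. This is a minimal sketch; the prompt wording and the call_model stub are assumptions and are not tied to any particular model's API.

```python
# Minimal sketch of a chain-of-thought prompt: instead of asking for the
# final answer directly, the model is asked to write out intermediate
# steps, spending more generation compute on the reasoning itself.
# The prompt wording and the call_model stub are illustrative assumptions.

def build_cot_prompt(problem: str) -> str:
    return (
        "Solve the following problem. Reason step by step, writing out "
        "each intermediate conclusion before stating the final answer.\n\n"
        f"Problem: {problem}\n\nStep-by-step solution:"
    )

def call_model(prompt: str) -> str:
    # Placeholder for a call to whichever language model is being tested.
    raise NotImplementedError("plug in an actual model client here")

if __name__ == "__main__":
    prompt = build_cot_prompt("Prove that the sum of two even integers is even.")
    print(prompt)
```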
