AI Stumbles: Apple’s Kiwi Test Exposes Weaknesses

Apple’s researchers have thrown down a challenge to artificial intelligence, casting doubt on its capacity for logical reasoning. A team of scientists at the company ran a series of experiments showing that even the most advanced language models cannot solve simple mathematical problems that most people, and even children, handle with ease.

The study found that chatbots’ answers to mathematical questions depend heavily on how a problem is worded. Even more alarming, the models’ performance drops significantly as the number of conditions in a problem grows. The researchers therefore suggest that modern LLMs do not possess genuine logical-reasoning skills; instead, they imitate the reasoning steps observed in the data they were trained on.

To assess AI capabilities, the Apple team developed a new benchmark called GSM-Symbolic. The tool generates a wide variety of questions from symbolic templates.
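To illustrate the idea, here is a minimal sketch of how a symbolic template might be instantiated into concrete questions. The template text, names, and value ranges are illustrative assumptions, not Apple’s actual GSM-Symbolic templates:

```python
import random

# Hypothetical template in the spirit of GSM-Symbolic: slots like {name},
# {x}, and {y} are filled with concrete values to generate many variants.
TEMPLATE = (
    "{name} picks {x} kiwis on Friday. Then he picks {y} kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwis does {name} have?"
)

def generate_instance(rng: random.Random) -> tuple[str, int]:
    """Fill the symbolic slots and compute the ground-truth answer."""
    name = rng.choice(["Oliver", "Sophie", "Liam"])
    x = rng.randint(10, 90)
    y = rng.randint(10, 90)
    question = TEMPLATE.format(name=name, x=x, y=y)
    answer = x + y + 2 * x  # Friday + Saturday + Sunday (double Friday)
    return question, answer

question, answer = generate_instance(random.Random(0))
print(question)
print("Answer:", answer)
```

Because the logic of the answer is fixed by the template, every generated variant can be scored automatically, which is what lets the benchmark measure sensitivity to surface wording.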

GSM-Symbolic tasks include statements that appear important but are in fact irrelevant. Although these additions do not change the logic of the solution, they significantly confused the AI models.

The results were surprising: the performance of all modern AI models fell by as much as 65% merely because a single variable unrelated to the problem was added to its conditions.

The team gives the following example: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but five of them were slightly smaller. How many kiwis does Oliver have?”

Many models, such as o1-mini and Llama3-8B, got the kiwi calculation wrong. They subtracted the five smaller fruits from the total and arrived at the incorrect answer of 185 instead of the correct 190. The case clearly shows how even a small change in a problem’s conditions can lead to serious calculation errors by AI.
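The arithmetic itself is trivial; a minimal sketch contrasting the correct calculation with the flawed one the models produced:

```python
# Kiwis picked each day, from the example in the article.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number he did on Friday" -> 88

# Correct: the five smaller kiwis are still kiwis, so nothing is subtracted.
correct_total = friday + saturday + sunday
print(correct_total)  # 190

# Flawed: the models treated the irrelevant "five smaller" detail
# as an instruction to subtract.
flawed_total = friday + saturday + sunday - 5
print(flawed_total)  # 185
```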

The researchers noted that the machines often try to convert statements into mathematical operations without understanding their actual meaning. For example, a mention of a “discount” in a problem was often interpreted as a cue to multiply, regardless of context. Interestingly, some larger LLMs, such as Claude and Gemini, solved the kiwi problem correctly. However, this does not negate the general trend of declining accuracy as questions grow more complex.

The largest drop in accuracy was observed in the smallest LLMs, those with only a few billion parameters. Even o1-preview, OpenAI’s most advanced model, showed a serious regression of 17.5%.

Much more research will be needed to develop AI models capable of formal reasoning and more reliable problem solving. Building systems with human-like thinking, or general intelligence, remains one of the central challenges in the field of artificial intelligence.
