Researchers from the University of California, Los Angeles, the University of Washington, and Microsoft have presented MathVista, a new benchmark for assessing the mathematical reasoning abilities of artificial intelligence (AI) in visual contexts. The benchmark combines challenges from diverse mathematical and visual tasks and includes 6,141 examples drawn from 28 existing multimodal datasets involving mathematics, plus three newly created datasets: IQTest, FunctionQA, and PaperQA. A distinctive feature of MathVista is that it evaluates not only logical reasoning but also visual perception.
To gauge the state of the art, the researchers evaluated 12 prominent foundation models on MathVista: three large language models (LLMs), ChatGPT, GPT-4, and Claude-2; two proprietary large multimodal models (LMMs), GPT-4V and Bard; and seven open-source LMMs. The models were tested with chain-of-thought (CoT) and program-of-thought (PoT) prompting strategies in both zero-shot and few-shot settings.
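The difference between the two prompting strategies can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation harness: `query_model` is a hypothetical stand-in for any LLM call, and the prompt templates are simplified. CoT asks the model to reason in natural language, while PoT asks it to emit a program whose execution yields the answer.

```python
# Hedged sketch of CoT vs. PoT prompting.
# `query_model(prompt) -> str` is a hypothetical LLM call, not a real API.

COT_TEMPLATE = (
    "Question: {question}\n"
    "Let's think step by step, then state the final answer."
)

POT_TEMPLATE = (
    "Question: {question}\n"
    "Write a Python program that computes the answer and stores it "
    "in a variable named `answer`."
)

def solve_with_cot(question, query_model):
    """Chain-of-thought: the answer is read from the model's text reasoning."""
    return query_model(COT_TEMPLATE.format(question=question))

def solve_with_pot(question, query_model):
    """Program-of-thought: execute the model-generated code to get the answer."""
    code = query_model(POT_TEMPLATE.format(question=question))
    scope = {}
    exec(code, scope)  # run the generated program in an isolated namespace
    return scope.get("answer")
```

For example, if the model responds to a PoT prompt with `"answer = 2 + 3"`, executing that program returns `5`; the key design choice is that arithmetic is delegated to the interpreter rather than to the model's token-by-token reasoning.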
The results show that GPT-4 with CoT prompting, the best text-only model without visual augmentation, achieves an overall accuracy of 29.2%. By comparison, the best multimodal model, Bard, scores 34.8%, which is 58% of human performance (34.8% versus 60.3%). Meanwhile, when GPT-4 with PoT prompting is augmented with image captions and OCR text produced by Bard, it reaches 33.9%, nearly matching the multimodal Bard model.
However, the analysis also reveals Bard's failure modes: incorrect calculations and hallucinations stemming from flawed visual perception and textual reasoning. Notably, GPT-4V, the latest multimodal version of GPT-4, achieves an accuracy of 49.9%, 15.1 percentage points above multimodal Bard. This is the first comprehensive evaluation on MathVista, and it provides valuable practical insight for further improving mathematical reasoning in multimodal AI systems.
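The headline figures above are internally consistent, which a two-line arithmetic check makes explicit (using only the scores reported in this article):

```python
# Sanity-check the reported scores from the article.
bard, human, gpt4v = 34.8, 60.3, 49.9

# Bard reaches roughly 58% of human performance.
ratio = round(100 * bard / human)   # 34.8 / 60.3 ≈ 0.577

# GPT-4V's 49.9% sits 15.1 percentage points above Bard's 34.8%.
gap = round(gpt4v - bard, 1)

print(ratio, gap)
```

Note that the 15.1-point difference is an absolute gap in accuracy, not a relative improvement, which is why it is best read as "percentage points" rather than "percent".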