During the recent Mistral hackathon in San Francisco, developers from Phospho and Quivr created an unusual benchmark for large language models (LLMs) called LLM Colosseum, which tests their abilities in the retro video game Street Fighter III.
It works as follows: each language model receives a text description of the screen and decides in real time which way to move and which techniques to use. Every decision depends on the previous moves of both fighters, as well as on each one's remaining health and the energy available for special moves.
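The loop described above can be sketched in a few lines. This is a hedged illustration, not the project's actual code: the state fields, move names, and the `describe_state`/`choose_move` helpers are all hypothetical, and the real system would replace the random stand-in with an actual LLM call.

```python
import random

# Illustrative move set; the real game exposes many more inputs.
MOVES = ["move_left", "move_right", "jump", "punch", "kick", "hadoken"]

def describe_state(state: dict) -> str:
    """Serialize the game state into the kind of text prompt an LLM would see."""
    return (
        f"Your health: {state['own_health']}, energy: {state['own_energy']}. "
        f"Opponent health: {state['opp_health']}, "
        f"distance: {state['distance']}. "
        f"Opponent's last move: {state['opp_last_move']}. "
        f"Choose one of: {', '.join(MOVES)}."
    )

def choose_move(prompt: str) -> str:
    """Stand-in for the LLM call: parse the options and pick one."""
    options = prompt.rsplit(": ", 1)[-1].rstrip(".").split(", ")
    return random.choice(options)

state = {"own_health": 80, "own_energy": 40, "opp_health": 65,
         "distance": "close", "opp_last_move": "kick"}
prompt = describe_state(state)
move = choose_move(prompt)
```

In the real benchmark this loop runs continuously, so the model's response latency directly affects how quickly its fighter reacts.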
According to the official LLM Colosseum leaderboard, which covers 342 battles between eight different language models, the undisputed champion was GPT-3.5 Turbo with a rating of 1776.11 points. This significantly exceeds GPT-4, whose results range from 1400 to 1585 points depending on the specific version.
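Ratings in this range are characteristic of an Elo-style system, the standard way to rank players from pairwise matches. As a hedged sketch (the K-factor and scale here are conventional chess defaults, assumed rather than taken from the project), one match updates both ratings like this:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """Return updated Elo ratings after one match.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    """
    # Expected score for A given the rating gap (logistic curve, 400-point scale).
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins, so A gains k/2 = 16 points.
a, b = elo_update(1500, 1500, 1.0)
# a == 1516.0, b == 1484.0
```

Because the expected score depends on the rating gap, beating a stronger opponent moves the ratings more than beating a weaker one, which is why hundreds of battles are needed before the leaderboard stabilizes.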
Developer Nicholas Ulyanov attributed the surprising superiority of the simpler model to the fact that success in such tests depends on a balance of speed and intelligence. "GPT-3.5 Turbo has a good combination of speed and smarts. GPT-4 is larger and smarter, but much slower," he said.
According to Ulyanov, AI models cannot yet compete with professional players; for now they could only hold their own against children or elderly opponents.
Ulyanov also criticized conventional methods of evaluating models, arguing that they fail to fully reveal the real abilities of artificial intelligence. Projects like LLM Colosseum, he claims, demonstrate the true capabilities of neural networks: "This project shows that LLMs can become so smart, fast and versatile that they will be used wherever instant decision-making is needed."
Such initiatives highlight the future potential of LLMs, opening up new opportunities for their use not only in text tasks, but also in reacting to an environment and interacting with other reasoning systems.