SambaNova Crushes AI-Industry Rivals with 132 Tokens/sec

The AI startup SambaNova recently unveiled its own cloud platform for working with artificial intelligence, claiming that it can serve Meta's* largest models much faster than competitors.

The company claims that its new platform can run the Llama 3.1 model, which has 405 billion parameters, at a record speed of more than 132 tokens per second. This significantly exceeds the performance of even the most powerful GPU systems.

Rodrigo Liang, CEO of SambaNova, said that their new solution outperforms even its closest competitors in processing speed by at least half. Liang's claim was also illustrated with a comparative chart from Artificial Analysis.

The Llama 3.1 model, released by Meta this year, is smaller than competing models from OpenAI and Google, yet it still demands enormous computational resources. Running the model at full 16-bit precision requires more than 810 GB of memory. To achieve such high performance, SambaNova uses 16 of its SN40L accelerators, which offer high memory bandwidth and large amounts of cache.
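That 810 GB figure follows directly from the model's size: the weight footprint is roughly the parameter count times the bytes per parameter. A minimal back-of-the-envelope sketch in Python (the helper function is illustrative, not part of any SambaNova tooling):

    def model_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
        """Approximate memory needed for model weights alone.

        bytes_per_param = 2 corresponds to 16-bit (FP16/BF16) precision.
        Real deployments also need extra memory for activations and
        the KV cache, so this is a lower bound.
        """
        return num_params * bytes_per_param / 1e9

    # Llama 3.1 with 405 billion parameters at full 16-bit precision:
    print(model_memory_gb(405e9))  # -> 810.0, matching the article's figure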

A key feature of the SN40L architecture is its ability to avoid the bottlenecks that often arise in multi-chip GPU systems. As Anton McGonnell, head of product at SambaNova, noted, the accelerators' large caches let the platform maintain high processing speed and scale performance when handling multiple concurrent requests.

Nevertheless, despite running the model at full precision, SambaNova had to compromise by reducing the model's context window from 128,000 tokens to 8,000. The restriction was introduced to ensure stable operation under a heavy stream of requests, but it may limit the model's usefulness for long-context tasks such as document processing.

SambaNova's cloud platform offers both free and enterprise tiers. In the near future, the company plans to release a developer version that will let users build their own models on top of Llama 3.1.

Meanwhile, competition among AI accelerator makers is not standing still. In late August, Cerebras announced a cloud platform that can process up to 450 tokens per second on the 70-billion-parameter Llama 3.1 model, and it plans to reach 350 tokens per second on the 405-billion-parameter version. If it does, Cerebras would overtake SambaNova in performance.

Other market players, such as Groq, also continue to develop their own solutions, further heating up the race for the title of fastest AI-infrastructure provider.

/Reports, release notes, official announcements.