Backprop, an Estonian startup specializing in cloud GPU services, has published the results of an unexpected study. The company found that serving large language models (LLMs) does not necessarily require expensive server-grade GPUs: the consumer NVIDIA GeForce RTX 3090, a card that turns four years old this year, copes with the task perfectly well.
Backprop engineers demonstrated that a single such card can handle more than 100 concurrent requests to the Llama 3.1 8B model in FP16 precision while maintaining acceptable throughput. And since only a small fraction of users send requests at any given moment, the company argues that one RTX 3090 can serve thousands of end users.
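The article does not include Backprop's actual configuration, but a minimal sketch of serving Llama 3.1 8B in FP16 with vLLM on a single 24 GB card could look like this (the model ID, memory settings, and prompts are illustrative assumptions, not Backprop's setup):

```python
# A minimal sketch, not Backprop's actual setup: offline batched inference
# with vLLM for Llama 3.1 8B in FP16 on a single 24 GB GPU.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # ~16 GB of FP16 weights
    dtype="float16",              # FP16 precision, as in the test
    gpu_memory_utilization=0.95,  # leave a little headroom on the 24 GB card
    max_model_len=2048,           # short contexts, in line with the benchmark
)

params = SamplingParams(temperature=0.7, max_tokens=100)

# vLLM batches requests internally (continuous batching), which is what
# lets a single GPU make progress on 100+ requests concurrently.
prompts = [f"User {i}: briefly explain what a GPU does." for i in range(100)]
outputs = llm.generate(prompts, params)
print(outputs[0].outputs[0].text)
```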
The RTX 3090, released at the end of 2020, has impressive specifications for LLM workloads: 142 teraflops of FP16 performance and 936 GB/s of memory bandwidth.
Kristo Ojasaar, co-founder of Backprop, noted that obtaining an equivalent number of teraflops from server-grade hardware would require far more expensive equipment. The RTX 3090 does have one limitation, however: its 24 GB of GDDR6X memory is not enough to run larger models such as Llama 3 70B or Mistral Large, even when they are quantized to 4 or 8 bits.
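A quick back-of-the-envelope calculation illustrates why. The figures below account only for model weights, ignoring KV cache and runtime overhead:

```python
# Rough, illustrative arithmetic: weight footprint = parameters * bits / 8.
def weight_gb(params_billion: float, bits: int) -> float:
    """Approximate model weight footprint in GB (weights only)."""
    return params_billion * bits / 8

print(f"Llama 3.1 8B @ FP16:  {weight_gb(8, 16):.0f} GB")   # ~16 GB, fits in 24 GB
print(f"Llama 3 70B  @ FP16:  {weight_gb(70, 16):.0f} GB")  # ~140 GB
print(f"Llama 3 70B  @ 8-bit: {weight_gb(70, 8):.0f} GB")   # ~70 GB
print(f"Llama 3 70B  @ 4-bit: {weight_gb(70, 4):.0f} GB")   # ~35 GB, still > 24 GB
```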
Testing was carried out with the popular vLLM framework, widely used to serve LLMs on one or more GPUs. In a benchmark simulating 100 simultaneous users, the card served the model at 12.88 tokens per second per user. That is faster than the average person's reading speed (about five words per second) and above the minimum acceptable generation rate for AI chatbots (10 tokens per second).
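Backprop's exact benchmark harness is not described in the article; one simple way to derive this kind of per-user figure from a batched vLLM run might look like the following (the model ID, prompt, and counts are assumptions):

```python
# Illustrative sketch: derive aggregate and per-user token throughput
# from a single batched vLLM run.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
params = SamplingParams(max_tokens=100)
prompts = ["Summarize the plot of Hamlet in two sentences."] * 100  # 100 users

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"aggregate: {generated / elapsed:.1f} tok/s")
print(f"per user:  {generated / elapsed / len(prompts):.2f} tok/s")
# A per-user value above ~10 tok/s clears the chatbot threshold cited above.
```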
It is worth noting that Backprop's testing used relatively short prompts and a maximum output of only 100 tokens. The results are therefore more representative of a customer-support chatbot than of a long-form text-generation application.
In further tests using the --use_long_context flag from the vLLM benchmark suite, with prompts of 200-300 tokens, the RTX 3090 still reached an acceptable generation speed of about 11 tokens per second.
Backprop's research demonstrates the importance of performance analysis and of matching resources to the specific task. As Ojasaar notes, the marketing of large cloud providers often gives the impression that managed services or investments in particular technologies are necessary to scale, but this, as it turns out, is not always the case.