Benchmarking Popular Open-Source LLMs: Llama2, Falcon, and Mistral
In this blog, we summarize the open-source LLMs that we have benchmarked. We benchmarked these models on latency, cost, and requests per second, which should help you evaluate whether a given model is a good choice for your business requirements. Please note that we don't cover qualitative performance in this article - there are different methods to compare LLMs, which can be found here.
Use Cases Benchmarked
The key use cases we benchmarked are:
- 1,500 input tokens, 100 output tokens (similar to Retrieval Augmented Generation use cases)
- 50 input tokens, 500 output tokens (generation-heavy use cases)
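As a rough illustration of the two use cases, the sketch below builds request bodies in the shape accepted by the text-generation-inference `/generate` endpoint (`inputs` plus `parameters.max_new_tokens`). The scenario names, the `build_payload` helper, and the filler-word prompt are assumptions made for this example; the post does not say how the benchmark prompts were constructed, and a repeated word only approximates a token count because tokenization depends on the model.

```python
import json

# Token counts come from the two use cases above; the scenario names are
# hypothetical labels chosen for this sketch.
SCENARIOS = {
    "rag": {"input_tokens": 1500, "output_tokens": 100},
    "generation_heavy": {"input_tokens": 50, "output_tokens": 500},
}


def build_payload(scenario_name, filler_word="hello"):
    """Build a /generate request body for one scenario.

    Assumption: one short filler word is treated as roughly one token.
    Real token counts depend on the model's tokenizer.
    """
    scenario = SCENARIOS[scenario_name]
    prompt = " ".join([filler_word] * scenario["input_tokens"])
    return {
        "inputs": prompt,
        # max_new_tokens is an upper bound on the generated length.
        "parameters": {"max_new_tokens": scenario["output_tokens"]},
    }


rag_payload = build_payload("rag")
generation_payload = build_payload("generation_heavy")

# Serialized body, ready to POST with any HTTP client.
rag_body = json.dumps(rag_payload)
```

Note that `max_new_tokens` only caps the output length; a model can stop earlier, so a strict 100- or 500-token output may need additional generation settings that are not shown here.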
Benchmarking Setup
For benchmarking, we used Locust, an open-source load-testing tool. Locust works by creating users/workers that send requests in parallel. At the beginning of each test, we can set the Number of Users and the Spawn Rate. Here, the Number of Users is the maximum number of users that can run concurrently, whereas the Spawn Rate is how many users are spawned per second.
In each benchmarking test for a deployment config, we started with 1 user and gradually increased the Number of Users for as long as the RPS kept increasing steadily. During the test, we also plotted the response times (in ms) and the total requests per second.
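To make the procedure concrete, here is a minimal standard-library sketch of one load step: a fixed number of concurrent users send requests back to back, and the step reports total requests per second and median response time in ms. This is not Locust and not the harness used for these benchmarks; `run_step`, `fake_request`, and the user counts are hypothetical, and the fake request simply sleeps in place of a real call to the model server.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def run_step(send_request, num_users, requests_per_user):
    """Run one load step with num_users concurrent workers."""

    def user_loop(user_id):
        # Each simulated user sends its requests sequentially and
        # records the latency of every request in milliseconds.
        latencies = []
        for _ in range(requests_per_user):
            start = time.perf_counter()
            send_request()
            latencies.append((time.perf_counter() - start) * 1000.0)
        return latencies

    wall_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_users) as pool:
        per_user = list(pool.map(user_loop, range(num_users)))
    elapsed = time.perf_counter() - wall_start

    all_latencies = [ms for user in per_user for ms in user]
    return {
        "users": num_users,
        "requests": len(all_latencies),
        "rps": len(all_latencies) / elapsed,
        "median_ms": statistics.median(all_latencies),
    }


def fake_request():
    # Stand-in for an HTTP call to the model server (assumption).
    time.sleep(0.01)


# Increase the number of users step by step and record each result.
results = [run_step(fake_request, users, requests_per_user=5) for users in (1, 2, 4)]
for row in results:
    print(row)
```

Comparing the `rps` and `median_ms` values across steps shows whether adding users still raises throughput or mostly adds latency, which is the signal the ramp-up is looking for.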
In each of the 2 deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4. The following are the parameters passed to the text-generation-inference image for the different model configurations:
LLMs Benchmarked
The 5 open-source LLMs benchmarked are as follows:
The following table summarizes the benchmarking results for these LLMs:
Detailed LLM Benchmarking Blogs for Each LLM
For each of the models mentioned above, refer to the detailed LLM benchmarking blogs listed below: