
Benchmarking Popular Open-Source LLMs: Llama 2, Falcon, and Mistral

By TrueFoundry

Updated: November 23, 2023


In this blog, we summarize our benchmarks of several popular open-source LLMs. We benchmarked these models on latency, cost, and requests per second (RPS) to help you evaluate whether a given model is a good fit for your business requirements. Note that this article does not cover qualitative performance (output quality); there are different methods for comparing LLMs on that front, which can be found here.

Use Cases Benchmarked

The key use cases across which we benchmarked are:

  1. 1,500 input tokens, 100 output tokens (similar to Retrieval-Augmented Generation use cases)
  2. 50 input tokens, 500 output tokens (generation-heavy use cases)
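Both workload shapes can be reproduced against a text-generation-inference (TGI) server, the model server used in this benchmark, by fixing the prompt length and capping the generated length with TGI's `max_new_tokens` parameter on the `/generate` endpoint. The sketch below only builds the JSON request bodies. The helper name `build_generate_request`, the `USE_CASES` mapping, and the placeholder prompts are hypothetical; it also assumes the prompt has already been sized to the target input length with the model's tokenizer, which is not described in the original setup.

```python
import json

# Hypothetical helper: builds a request body for text-generation-inference's
# POST /generate endpoint. The prompt is assumed to be pre-sized to the
# desired input token count (e.g. with the model's tokenizer).
def build_generate_request(prompt: str, max_new_tokens: int) -> str:
    body = {
        "inputs": prompt,
        "parameters": {
            # Upper bound on generated tokens; generation can stop earlier
            # if the model emits an end-of-sequence token.
            "max_new_tokens": max_new_tokens,
            # Ask for token-level details so the real output length can be checked.
            "details": True,
        },
    }
    return json.dumps(body)

# The two workload shapes from the list above: (input tokens, output tokens).
USE_CASES = {
    "rag_like": (1500, 100),
    "generation_heavy": (50, 500),
}

if __name__ == "__main__":
    for name, (input_tokens, output_tokens) in USE_CASES.items():
        # Placeholder prompt; a real load test would use a prompt of ~input_tokens tokens.
        prompt = f"<{input_tokens}-token prompt for {name}>"
        print(name, build_generate_request(prompt, output_tokens))
```

Because `max_new_tokens` is only a cap, a response can contain fewer than 100 or 500 generated tokens, so checking the returned details is a cheap way to confirm each request matched the intended shape.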

Benchmarking Setup

For benchmarking, we used Locust, an open-source load-testing tool. Locust works by creating users (workers) that send requests in parallel. At the beginning of each test, we set the Number of Users and the Spawn Rate. The Number of Users is the maximum number of users that can run concurrently, whereas the Spawn Rate is the number of users started per second.
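To make the two settings concrete: with a spawn rate of r users per second and a cap of N users, the number of running users grows roughly linearly until it reaches N, which takes about N / r seconds. The functions below are an idealized model of that ramp-up, not Locust's own code; they ignore Locust's internal scheduling, and the names `active_users` and `ramp_duration` are hypothetical.

```python
import math

def active_users(elapsed_s: float, num_users: int, spawn_rate: float) -> int:
    """Idealized count of running users `elapsed_s` seconds after a test starts.

    Assumes users are started at a constant `spawn_rate` (users/second) and
    that no more than `num_users` ever run at the same time.
    """
    if elapsed_s < 0 or spawn_rate <= 0:
        return 0
    # Users started so far, capped at the configured maximum.
    return min(num_users, math.floor(spawn_rate * elapsed_s))

def ramp_duration(num_users: int, spawn_rate: float) -> float:
    """Seconds needed to reach the full user count at a constant spawn rate."""
    return num_users / spawn_rate

if __name__ == "__main__":
    # Example: 40 users spawned at 2 users/second reach full load after ~20 s.
    for t in (0, 5, 10, 20, 30):
        print(t, active_users(t, num_users=40, spawn_rate=2))
    print(ramp_duration(40, 2))
```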

In each benchmarking test for a deployment configuration, we started with 1 user and kept gradually increasing the Number of Users for as long as the RPS kept increasing steadily. During each test, we also plotted the response times (in ms) and the total requests per second.
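Reading this procedure as "add users until throughput stops climbing", the stopping rule can be sketched as a scan over (users, RPS) measurements that returns the last user count before the relative RPS gain drops below a threshold. The function name `saturation_point` and the 5% threshold are illustrative assumptions, not the criterion used in this benchmark, and the sample measurements in the usage example are invented for demonstration, not results from this post.

```python
from typing import Optional, Sequence, Tuple

def saturation_point(
    measurements: Sequence[Tuple[int, float]],
    min_relative_gain: float = 0.05,
) -> Optional[int]:
    """Return the user count after which adding users stops raising RPS.

    `measurements` holds (concurrent_users, observed_rps) pairs sorted by
    user count. The 5% default threshold is an arbitrary illustrative choice.
    """
    for (users, rps), (_, next_rps) in zip(measurements, measurements[1:]):
        # Relative throughput gain from moving to the next user count.
        gain = (next_rps - rps) / rps if rps > 0 else float("inf")
        if gain < min_relative_gain:
            return users
    return None  # RPS was still climbing at the last step measured

if __name__ == "__main__":
    # Invented numbers for illustration only (not from this benchmark).
    steps = [(1, 0.4), (2, 0.8), (4, 1.5), (8, 2.7), (16, 2.8)]
    print(saturation_point(steps))
```

A relative threshold is used rather than an absolute one so the same rule works for small models serving several RPS and large models serving under one.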

In each of the two deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4. The following are the parameters passed to the text-generation-inference image for the different model configurations:

LLMs Benchmarked

The five open-source LLMs we benchmarked are:

  1. Mistral-7B-Instruct
  2. Llama 2 7B
  3. Llama 2 13B
  4. Llama 2 70B
  5. Falcon-40B-Instruct

The following table summarizes the benchmark results:

| Model | Input / Output Tokens | Concurrent Users / Throughput | GPU Type | AWS Machine Type (Cost/hr), Region: us-east-1 | GCP Machine Type (Cost/hr), Region: us-east4 | Azure Machine Type (Cost/hr), Region: East US (Virginia) | SageMaker Instance Type (Cost/hr), Region: us-east-1 |
|---|---|---|---|---|---|---|---|
| Mistral 7B | 1500 input, 100 output | 7 users / 2.8 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Mistral 7B | 50 input, 500 output | 40 users / 1.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 7B | 1500 input, 100 output | 20 users / 3.6 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 7B | 50 input, 500 output | 62 users / 3.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 13B | 1500 input, 100 output | 7 users / 1.4 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 13B | 50 input, 500 output | 23 users / 1.5 | A100 40 GB (Count: 1) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr) | Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 70B | 1500 input, 100 output | 15 users / 1.1 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Llama 2 70B | 50 input, 500 output | 38 users / 0.8 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Falcon 40B | 1500 input, 100 output | 16 users / 2 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
| Falcon 40B | 50 input, 500 output | 75 users / 2.5 | A100 40 GB (Count: 4) | p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr) | a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr) | Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr) | ml.p4d.24xlarge (On-Demand: $37.68/hr) |
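Two derived figures can help when reading the summary table: an implied per-request latency and a cost per 1,000 requests. The sketch below assumes throughput is in requests per second and that each Locust user sends its next request as soon as the previous one completes (no wait time); neither is stated explicitly in the post. Under that closed-loop assumption, Little's law gives N = X × R, so the mean response time is R = N / X. If users did pause between requests, this estimate overstates latency. Cost per 1,000 requests is the hourly price divided by requests served per hour (X × 3600), times 1,000. The `BenchmarkRow` class is hypothetical; the figures are copied from the table using GCP on-demand prices.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRow:
    model: str
    workload: str
    concurrent_users: int
    throughput_rps: float  # assumed to be requests per second
    price_per_hour: float  # USD per hour for the chosen instance

    def implied_latency_s(self) -> float:
        # Little's law for a closed loop with no think time: N = X * R, so R = N / X.
        return self.concurrent_users / self.throughput_rps

    def cost_per_1k_requests(self) -> float:
        # Requests served per hour at the measured throughput.
        requests_per_hour = self.throughput_rps * 3600
        return self.price_per_hour / requests_per_hour * 1000

ROWS = [
    # Figures from the summary table; prices are GCP on-demand.
    BenchmarkRow("Llama 2 7B", "1500 in / 100 out", 20, 3.6, 3.93),    # a2-highgpu-1g
    BenchmarkRow("Llama 2 70B", "1500 in / 100 out", 15, 1.1, 15.73),  # a2-highgpu-4g
]

if __name__ == "__main__":
    for row in ROWS:
        print(f"{row.model} ({row.workload}): "
              f"~{row.implied_latency_s():.1f} s per request, "
              f"${row.cost_per_1k_requests():.2f} per 1,000 requests")
```

Switching to another column, such as Azure spot prices, changes only `price_per_hour`; the implied-latency estimate depends solely on the users and throughput figures.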

Detailed LLM Benchmarking Blogs for Each Model

For a detailed benchmark of each model listed above, refer to the LLM benchmarking blogs below:
