Benchmarking Popular Open-Source LLMs: Llama 2, Falcon, and Mistral

Published: July 4, 2026

In this blog, we present a summary of the open-source LLMs we have benchmarked. We measured each model's latency, cost, and requests per second (RPS). This should help you evaluate whether a model is a good fit for your business requirements. Please note that we do not cover qualitative performance in this article; there are different methods for comparing LLMs, which can be found here.

Use Cases Benchmarked

The main use cases we benchmarked are:

1500 input tokens, 100 output tokens (similar to retrieval-augmented generation use cases)
50 input tokens, 500 output tokens (generation-heavy use cases)
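The two profiles can be written down as plain data, which makes the difference in their shape explicit. This is only an illustrative sketch: the `Workload` class and its field names are assumptions and are not taken from the benchmark code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """One benchmarked request shape (hypothetical helper, not from the benchmark)."""

    name: str
    input_tokens: int
    output_tokens: int

    @property
    def total_tokens(self) -> int:
        # Tokens the server has to handle for a single request.
        return self.input_tokens + self.output_tokens

    @property
    def output_share(self) -> float:
        # Fraction of the request's tokens that the model has to generate.
        return self.output_tokens / self.total_tokens


RAG_LIKE = Workload("rag-like", input_tokens=1500, output_tokens=100)
GENERATION_HEAVY = Workload("generation-heavy", input_tokens=50, output_tokens=500)

for w in (RAG_LIKE, GENERATION_HEAVY):
    print(f"{w.name}: {w.total_tokens} tokens, {w.output_share:.0%} generated")
```

In the first profile only about 6% of the tokens are generated, so most of the work is processing the prompt; in the second about 91% are generated, so decoding dominates.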

Benchmarking Setup

For benchmarking, we used Locust, an open-source load-testing tool. Locust works by spawning users (workers) that send requests in parallel. At the start of each test, we set the Number of Users and the Spawn Rate. Number of Users is the maximum number of users that can be running concurrently, while Spawn Rate is the number of users spawned per second.
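To make the interaction between the two settings concrete, the sketch below models how many users are running a given number of seconds into a test. It is a simplified model, assuming users start at a constant rate until the cap is reached; the function names are hypothetical, and real Locust spawning is not perfectly smooth. The two arguments correspond to Locust's `--users` and `--spawn-rate` options.

```python
def active_users(elapsed_s: float, num_users: int, spawn_rate: float) -> int:
    """Approximate number of running users `elapsed_s` seconds into a test.

    Assumption: users are started at a constant `spawn_rate` per second
    until `num_users` are running, after which the count stays flat.
    """
    if elapsed_s < 0 or num_users < 0 or spawn_rate <= 0:
        raise ValueError("elapsed_s and num_users must be >= 0, spawn_rate > 0")
    return min(num_users, int(elapsed_s * spawn_rate))


def ramp_up_seconds(num_users: int, spawn_rate: float) -> float:
    """Time until every user is running, under the same assumption."""
    if num_users < 0 or spawn_rate <= 0:
        raise ValueError("num_users must be >= 0 and spawn_rate > 0")
    return num_users / spawn_rate


if __name__ == "__main__":
    # Example values only: 40 users spawned at 2 users per second.
    for t in (0, 5, 10, 20, 30):
        print(f"t={t:>2}s -> {active_users(t, num_users=40, spawn_rate=2)} users")
    print(f"full load after {ramp_up_seconds(40, 2):.0f}s")
```

Measurements taken during the ramp-up window reflect a partially loaded server, so it is usually safer to read RPS and response times only after the user count has levelled off.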

In each benchmarking test for a given deployment configuration, we started with 1 user and gradually increased the number of users until we saw a steady increase in requests per second (RPS). During the test, we also plotted the response times (in milliseconds) and the total requests per second.
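The number of users, the RPS, and the response time are linked by Little's Law: average concurrency equals throughput multiplied by average response time. The sketch below applies it under one assumption that the article does not state, namely that each simulated user sends its next request as soon as the previous one returns (no wait time between requests). The function names are hypothetical.

```python
def implied_mean_latency_s(concurrent_users: int, rps: float) -> float:
    """Mean response time implied by Little's Law: users = rps * latency.

    Assumption: a closed loop with zero wait time between requests.
    """
    if concurrent_users <= 0 or rps <= 0:
        raise ValueError("concurrent_users and rps must be positive")
    return concurrent_users / rps


def implied_rps(concurrent_users: int, mean_latency_s: float) -> float:
    """Throughput implied by the same relation, given a mean latency."""
    if concurrent_users <= 0 or mean_latency_s <= 0:
        raise ValueError("concurrent_users and mean_latency_s must be positive")
    return concurrent_users / mean_latency_s


if __name__ == "__main__":
    # 7 users at 2.8 RPS: the Mistral 7B row for 1500 input / 100 output
    # tokens in the summary table of this article.
    latency = implied_mean_latency_s(7, 2.8)
    print(f"implied mean response time: {latency:.2f} s")
```

If the plotted response times are much lower than this estimate, the users were probably idle between requests, so the zero-wait assumption does not hold for that run.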

In both deployment configurations, we used the Hugging Face text-generation-inference model server, version 0.9.4. The following parameters were passed to the text-generation-inference image for the different model configurations:

LLMs Benchmarked

The five open-source LLMs we benchmarked are Mistral 7B, Llama 2 7B, Llama 2 13B, Llama 2 70B, and Falcon 40B.

The following table summarizes the LLM benchmarks:

Model	Input / Output Tokens	Concurrent Users / Throughput (RPS)	GPU Type	AWS Machine Type (Cost/hr), Region: us-east-1	GCP Machine Type (Cost/hr), Region: us-east4	Azure Machine Type (Cost/hr), Region: East US (Virginia)	SageMaker Instance Type (Cost/hr), Region: us-east-1
Mistral 7b	1500 Input, 100 Output	7 users / 2.8	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Mistral 7b	50 Input, 500 Output	40 users / 1.5	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 7b	1500 Input, 100 Output	20 users / 3.6	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 7b	50 Input, 500 Output	62 users / 3.5	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 13b	1500 Input, 100 Output	7 users / 1.4	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 13b	50 Input, 500 Output	23 users / 1.5	A100 40 GB (Count: 1)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-1g (Spot: $1.21/hr, On-Demand: $3.93/hr)	Standard_NC24ads_A100_v4 (Spot: $0.95/hr, On-Demand: $3.67/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 70b	1500 Input, 100 Output	15 users / 1.1	A100 40 GB (Count: 4)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr)	Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Llama 2 70b	50 Input, 500 Output	38 users / 0.8	A100 40 GB (Count: 4)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr)	Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Falcon 40b	1500 Input, 100 Output	16 users / 2	A100 40 GB (Count: 4)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr)	Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
Falcon 40b	50 Input, 500 Output	75 users / 2.5	A100 40 GB (Count: 4)	p4d.24xlarge (Spot: $7.79/hr, On-Demand: $32.77/hr)	a2-highgpu-4g (Spot: $4.85/hr, On-Demand: $15.73/hr)	Standard_NC96ads_A100_v4 (Spot: $3.82/hr, On-Demand: $14.69/hr)	ml.p4d.24xlarge (On-Demand: $37.68/hr)
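The hourly prices and throughput figures in the table can be combined into a rough cost per 1,000 requests. The sketch below is back-of-the-envelope arithmetic, not part of the original benchmark: it assumes the machine is kept at the measured throughput for the whole hour, which real traffic rarely achieves, so actual costs will tend to be higher. The function name is hypothetical.

```python
SECONDS_PER_HOUR = 3600


def cost_per_1k_requests(hourly_cost_usd: float, rps: float) -> float:
    """USD per 1,000 requests for a machine that sustains `rps` all hour.

    Assumption: 100% utilization at the measured throughput.
    """
    if hourly_cost_usd < 0 or rps <= 0:
        raise ValueError("hourly_cost_usd must be >= 0 and rps must be > 0")
    requests_per_hour = rps * SECONDS_PER_HOUR
    return hourly_cost_usd / requests_per_hour * 1000


if __name__ == "__main__":
    # Mistral 7B, 1500 input / 100 output tokens, 2.8 RPS on GCP a2-highgpu-1g.
    on_demand = cost_per_1k_requests(3.93, 2.8)
    spot = cost_per_1k_requests(1.21, 2.8)
    print(f"on-demand: ${on_demand:.2f} per 1k requests")
    print(f"spot:      ${spot:.2f} per 1k requests")
```

With the table's figures this works out to roughly $0.39 per 1,000 requests on-demand and $0.12 on spot for that row. The same function can be applied to any other row to compare models or clouds on equal terms.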

Detailed LLM Benchmarking Blogs for Each Model

For each of the models above, please refer to the detailed LLM benchmarking blogs listed below:

Benchmarking Mistral-7B

This blog captures Mistral-7B benchmarks - where a model excels and the areas where it struggles. Make informed decisions about its practical deployment.

Benchmarking Llama-2-7B

This blog captures Llama 2 7B benchmarks - where a model excels and the areas where it struggles. Make informed decisions about its practical deployment. In this blog, we have benchmarked the Llama-2-7B model from NousResearch on Hugging Face.

Benchmarking Llama-2-13B

This blog captures Llama 2-13B benchmarks - where a model excels and the areas where it struggles. Make informed decisions about its practical deployment.

Benchmarking Llama-2-70B

This blog captures Llama-2-70B benchmarks - where a model excels and the areas where it struggles. Make informed decisions about its practical deployment.

Benchmarking Falcon-40B

This blog captures Falcon-40B-Instruct benchmarks - where a model excels and the areas where it struggles. Make informed decisions about its practical deployment.
