TrueML #22 - Machine Learning Platform and Large Language Models (LLMs) @ Voiceflow

Published: July 4, 2026

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support


We are back with another episode of True ML Talks. In this episode, we dive deep into Voiceflow's ML platform and LLMs with Denys Linkov.

Denys leads the machine learning team at Voiceflow, which he joined as the founding ML engineer. Before that, he worked as a senior cloud architect at a global bank, working on data systems, MLOps, and core infrastructure.

📌

Our conversation with Denys covers the following topics:
- Machine Learning at Voiceflow
- Voiceflow's MLOps Journey
- Automating model deployment and observability to reduce context switching and improve efficiency
- Real-time inferencing pipeline: Benefits and challenges
- Voiceflow's approach to generative AI

Watch the full episode below:

Machine Learning @ Voiceflow

Voiceflow is a no-code platform that allows businesses to build and deploy conversational AI applications. It can be used to create chatbots, virtual assistants, and other conversational interfaces for a wide range of industries, including:

E-commerce
Real estate
Banking
Automotive
Utilities
Government

Voiceflow's NLU model is able to cover a wide range of industries because it is trained on a massive dataset of text and code from a variety of sources. This allows Voiceflow to understand and respond to a wide range of natural language queries, regardless of the industry.

For example: A Voiceflow chatbot could be used by an e-commerce company to help customers find products, answer questions about products, and place orders. A Voiceflow chatbot could also be used by a real estate company to help potential buyers find homes, schedule appointments with agents, and learn about the home buying process.

One of the challenges of building an NLU model that can cover all of these industries is that each industry has its own unique language and jargon. However, Voiceflow's NLU model is able to learn these differences over time as it is exposed to more data from different industries.

Voiceflow's MLOps Journey: Building and Deploying Machine Learning Models for Conversational AI

One of the first challenges Voiceflow faced was deciding whether to build its own models or use external ones. Voiceflow decided to explore both options and built a couple of proofs of concept. The first feature Voiceflow built was utterance generation, which uses machine learning to generate the example utterances a user needs to enrich their own data model.

To deploy the utterance generation model into production, Voiceflow built out its MLOps platform. The goal of the platform was to be able to deploy several experiments into production very quickly, as well as manage the environments.

The utterance generation model was the first to be killed by the release of ChatGPT, which is a more advanced generative model. This taught Voiceflow the importance of being flexible and willing to kill off its own developments if necessary, in order to focus on what's best for the customer experience.

Voiceflow also discusses the massive shift that has happened in the conversational AI space since the launch of instruction-tuned GPT-based models. Voiceflow admits that it was a strategic mistake not to think about using GPT-3 at the time, but it also learned that it's important to be adaptable and willing to change its approach as the field evolves.

Here's a blog you can read about creating the Voiceflow NLU:

Inside Voiceflow | Voiceflow

Allow us to regale you with product announcements, an exclusive peek behind the Voiceflow curtain, and product tips and tricks from our community.

Automating model deployment and observability to reduce context switching and improve efficiency

In the traditional machine learning development process, data scientists train models in Jupyter notebooks and then hand them off to machine learning engineers or backend engineers to deploy them in production. This can lead to context switching and delays, as the engineers need to understand the model and the data in order to deploy it successfully.

Automate model deployment and observability

One way to address this challenge is to automate model deployment and observability. This can be done by creating a set of tools and processes that allow data scientists to deploy and monitor their models in production without having to involve other engineers.

One example of this is to use a cloud-based platform that provides managed services for model deployment and observability. These platforms can provide a variety of features, such as:

Automatic model deployment and scaling
Real-time model monitoring
Drift detection and alerting
Model versioning and rollback

Develop your own custom tools and processes

Another approach to automating model deployment and observability is to develop your own custom tools and processes. This can give you more flexibility and control, but it also requires more investment.

Here is a specific example of how one company automated model deployment and observability using this approach:

They created a set of automated scripts that spin up a cloud environment with all of the services needed to deploy and monitor their models.
They developed a CLI tool that makes it easy to deploy new models to that cloud environment.
The CLI tool automatically creates all of the folders and Terraform files needed to deploy the model.
The CLI tool also specifies the environment in which to deploy the model.

This automation allowed the company's data scientists to deploy and monitor their models in production without having to involve any other engineers.
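As a rough illustration of the scaffolding step described above, here is a minimal sketch of such a CLI in Python. Everything specific in it is an assumption made for the example, not a detail from the episode: the `scaffold_model` function, the `deployments/<env>/<name>` folder layout, the environment names, and the Terraform module path and variables are all hypothetical.

```python
import argparse
from pathlib import Path

# Hypothetical Terraform template. The module source path and the variable
# names are placeholders for illustration, not taken from a real repository.
MAIN_TF_TEMPLATE = '''module "{name}" {{
  source      = "../../modules/model-service"
  model_name  = "{name}"
  environment = "{env}"
}}
'''


def scaffold_model(name: str, env: str, root: Path) -> Path:
    """Create <root>/<env>/<name>/main.tf and return the model directory."""
    model_dir = root / env / name
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "main.tf").write_text(MAIN_TF_TEMPLATE.format(name=name, env=env))
    return model_dir


def main(argv=None) -> None:
    """Parse command-line arguments and scaffold one model deployment."""
    parser = argparse.ArgumentParser(description="Scaffold a model deployment")
    parser.add_argument("name", help="model name, used as the folder name")
    parser.add_argument("--env", default="dev", choices=["dev", "staging", "prod"])
    parser.add_argument("--root", type=Path, default=Path("deployments"))
    args = parser.parse_args(argv)
    print(scaffold_model(args.name, args.env, args.root))
```

Calling `main(["intent-classifier", "--env", "staging"])` would create `deployments/staging/intent-classifier/main.tf`. A real tool would then run the Terraform workflow against that folder; that step is left out here because it depends on the team's infrastructure.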

Challenges of developing your own custom tools and processes

There are also some challenges that need to be considered when developing your own custom tools and processes for model deployment and observability:

Complexity: Developing your own custom tools and processes can be complex and time-consuming.
Debugging: It can be difficult to debug issues when they occur, especially if data scientists do not have full visibility into the pipelines that have been built.
Maintenance: Custom tools and processes require ongoing maintenance and support.

How to mitigate the challenges

There are a few things that can be done to mitigate the challenges of developing your own custom tools and processes for model deployment and observability:

Start small: Start by developing a basic set of tools and processes that meet your immediate needs. You can then add more features and functionality over time.
Use open source tools and libraries: There are a number of open source tools and libraries available that can help you to develop your own custom tools and processes. Using these tools and libraries can reduce the amount of development work required.
Document your tools and processes: Thoroughly document your tools and processes so that data scientists and other engineers can easily understand and use them.
Provide training and support: Provide training and support to data scientists and other engineers on how to use your custom tools and processes.

Real-time inferencing pipeline: Benefits and challenges

Real-time inferencing pipelines offer a number of benefits, including:

Lower latency: Real-time inferencing pipelines can deliver predictions to users with minimal delay.
Increased scalability: Real-time inferencing pipelines can be scaled up or down to meet demand, making them ideal for high-volume applications.
Improved flexibility: Real-time inferencing pipelines can be used to implement a variety of machine learning models, including classification, regression, and object detection.

However, real-time inferencing pipelines also present some challenges, such as:

Increased complexity: Real-time inferencing pipelines can be complex to design and implement, requiring expertise in machine learning, distributed systems, and infrastructure.
Increased cost: Real-time inferencing pipelines can be more expensive to operate than batch inferencing pipelines, due to the need for more powerful hardware and infrastructure.
Increased risk of errors: Real-time inferencing pipelines can be more prone to errors than batch inferencing pipelines, due to the need to process data and generate predictions in real time.

Autoscaling in a real-time machine learning pipeline

One of the challenges of building and deploying a real-time machine learning pipeline is autoscaling the system to handle changes in traffic. Factors to consider include the predictability of the traffic patterns, the latency requirements of the models, and the complexity of the autoscaling algorithm.

One approach to autoscaling a real-time machine learning pipeline is to use a queuing system. This decouples the producers (which generate the inference requests) from the consumers (which process them), giving you more flexibility in how you scale the system.
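To make the producer/consumer decoupling concrete, here is a minimal in-process sketch using Python's standard `queue` and `threading` modules. It is only an illustration: a production pipeline would typically use an external message broker, and `fake_model`, `worker`, and `run_pipeline` are hypothetical names invented for this example.

```python
import queue
import threading


def fake_model(text: str) -> str:
    """Stand-in for a real model call (hypothetical)."""
    return text.upper()


def worker(requests: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Consumer: process requests until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more work for this worker
            break
        request_id, text = item
        prediction = fake_model(text)
        with lock:
            results[request_id] = prediction


def run_pipeline(texts: list[str], num_workers: int = 2) -> dict:
    """Producer: enqueue requests, then wait for the workers to drain the queue."""
    requests: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(requests, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for request_id, text in enumerate(texts):
        requests.put((request_id, text))
    for _ in threads:
        requests.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Because the producer only talks to the queue, the number of workers can change without the producer knowing, which is what makes queue-based scaling flexible.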

To autoscale a queuing-based system, you can use a variety of metrics, such as the number of messages in the queue, the average latency of the requests, or the CPU utilization of the workers. You can also use a combination of these metrics.

It is important to tune the autoscaling algorithm carefully to avoid over-scaling or under-scaling the system. Over-scaling wastes resources, while under-scaling leads to performance problems.
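One simple way to turn queue depth into a scaling decision is a proportional rule with lower and upper bounds. The sketch below assumes a hypothetical `target_per_worker` (how many queued messages one worker should have to absorb) and hypothetical worker limits. It is not the algorithm of any particular autoscaler.

```python
import math


def desired_workers(
    queue_depth: int,
    target_per_worker: int,
    min_workers: int = 1,
    max_workers: int = 10,
) -> int:
    """Return a worker count proportional to queue depth, clamped to the bounds."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_per_worker)
    # The lower bound avoids scaling to zero; the upper bound caps cost.
    return max(min_workers, min(max_workers, wanted))
```

With a target of 20 messages per worker, a queue of 45 messages yields 3 workers, an empty queue stays at the minimum of 1, and a backlog of 1,000 messages is capped at 10. Tuning then comes down to choosing the target and the bounds so the system neither thrashes nor lags.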

Here are some additional thoughts on autoscaling a queuing-based system for real-time inference:

Use a cloud-based platform: Cloud-based platforms can make it easier to autoscale your system as your traffic patterns change. For example, you can use a cloud-based load balancer to distribute traffic across your pods while the platform scales the number of pods up or down as needed.
Use a queuing system that supports autoscaling: Some queuing systems can automatically scale the number of workers up or down based on the number of messages in the queue. This helps your system handle spikes in traffic without manual intervention.
Monitor your system: It is important to monitor your system closely to identify any problems with autoscaling. For example, you may need to adjust the thresholds that trigger scaling up or down, or identify and address specific bottlenecks in your system.

Model servers for latency-sensitive real-time systems

Choosing a model server for latency-sensitive applications can be challenging for several reasons. First, there are many different model servers available, each with its own strengths and weaknesses. Second, the requirements of latency-sensitive applications can vary considerably depending on the specific application and the types of models used. Finally, it is often difficult to predict how a model server will perform in a production environment.

Factors to consider

When choosing a model server for a latency-sensitive application, it is important to consider the following factors:

Model latency: The latency of the model server should be low enough to meet the requirements of the application.
Scalability: The model server should be able to scale to meet the traffic demands of the application.
Flexibility: The model server should be flexible enough to support the specific needs of the application, such as different frameworks and hardware platforms.
Ease of use: The model server should be easy to use and manage.
Benchmarks: It is important to benchmark different model servers to see which one performs best for your specific needs.
Support: Consider the level of support available for the model server.
Community: Consider the size and activity of the community around the model server.
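For the benchmarking factor above, a small harness that records per-request latency and reports tail percentiles is often enough for a first comparison. The sketch below is an assumption-laden illustration: `benchmark` times any Python callable passed as `predict`, which in practice would wrap a request to the model server under test, and `percentile` uses the nearest-rank definition.

```python
import math
import time
from typing import Callable


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(predict: Callable[[str], object], payload: str, runs: int = 100) -> dict:
    """Call predict(payload) `runs` times and summarize latency in milliseconds."""
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return {
        "runs": runs,
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```

Running the same harness with the same payloads against each candidate server gives comparable p50, p95, and p99 figures. For latency-sensitive systems the tail percentiles usually matter more than the median. This sketch sends requests one at a time, so it does not measure behavior under concurrent load.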

💡

Other insights on Voiceflow's ML platform:
Voiceflow uses a mix of AWS and GCP, as different enterprise customers have different requirements. They have not yet explored Karpenter or Autopilot, as they were already building out their infrastructure when these features were released. They also need T4 GPUs for many of their workloads, which are not ideal for Autopilot. Overall, they are prioritizing engineering time for now and will eventually move to more advanced infrastructure solutions as they scale.

Voiceflow's approach to generative AI

Voiceflow takes a cautious approach to open-source generative AI. It recognizes the potential benefits of these models but is also aware of the challenges involved. It is committed to giving its users the best possible experience and will switch to open-source models when the time is right for its business.

Challenges of open-source generative AI

There are some challenges associated with open-source generative AI:

Rapid evolution: Open-source generative AI models are evolving rapidly, which can make it difficult to keep up with the latest improvements.
Cost: Open-source generative AI models can be computationally expensive to train and deploy.
Support: Open-source generative AI models may not have the same level of support as proprietary models.

Benefits of open-source generative AI

Despite the challenges, open-source generative AI models also offer a number of benefits:

Transparency: Open-source generative AI models are more transparent than proprietary models, which means users can better understand how they work and trust the results.
Reproducibility: Open-source generative AI models are more reproducible than proprietary models, which means users can replicate the results of experiments and share their work with others.
Customization: Open-source generative AI models can be customized and extended to meet specific needs.

Dealing with latency

Latency is a critical factor to consider when choosing a model for a retrieval-augmented generation system. The best approach is to give users a choice of models and provide guidance on which to use for different tasks.

For example, if latency is the most important factor, an NLU-based approach with extensive utterances and static responses is recommended. NLU models are typically much faster than generative models, and static responses can be served with very low latency.

If the user needs higher accuracy or better formatting, a generative model such as GPT-4 is recommended. Generative models are more powerful than NLU models and can generate more natural and engaging text. However, it is important to note that generative models are also much slower than NLU models.

Another way to reduce latency is to use a distributed architecture, in which the retrieval and generation tasks run on separate servers. This allows the system to scale to meet the needs of even the most demanding applications.

Building a high-performance retrieval-augmented generation system

Retrieval-augmented generation (RAG) systems are a powerful new approach to text generation that combines the strengths of retrieval models and generative models. RAG systems work by first retrieving relevant passages from a knowledge base and then using a generative model to generate text based on the retrieved passages.
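The retrieve-then-generate flow can be sketched in a few lines. The example below is a toy under stated assumptions: it scores passages with bag-of-words cosine similarity instead of a learned embedding model, and it stops at building the prompt rather than calling a generative model. The names `tokenize`, `cosine`, `retrieve`, and `build_prompt` are invented for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k passages most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        passages,
        key=lambda passage: cosine(query_vec, Counter(tokenize(passage))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, passages: list[str]) -> str:
    """Step 2: assemble the prompt that would be sent to a generative model."""
    context = "\n".join(f"- {passage}" for passage in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
```

In a real system the retriever would query a vector index over embedded passages, and the prompt would go to the generative model of your choice. The two-step shape stays the same.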

RAG systems can be used for a variety of tasks, including question answering, summarization, and creative writing. However, building a high-performance RAG system can be challenging.

In this blog post, we discuss some of the key factors to consider when building a RAG system, including:

Model selection: A variety of retrieval and generative models are available. It is important to choose the right models for your specific needs. For example, if you need to generate text in a particular language, you will need a model trained on text in that language.
Data selection: The quality of the data you use to train your system will have a significant impact on its performance. It is important to choose data that is relevant to your target tasks and free of errors.
Hardware selection: The hardware you use will also have a significant impact on your system's performance. For example, using GPUs can significantly speed up retrieval and generation tasks.
System architecture: RAG systems can be implemented in a variety of ways. It is important to choose an architecture that fits your specific needs. For example, if you need to deploy your system to production, you will need an architecture that is scalable and reliable.

In addition to the factors above, it is important to keep in mind that RAG systems are complex and can be difficult to generalize. Each user's domain and use case will be different, so it is important to give users the ability to test their own prompting, processing, and chunking strategies. This lets users tailor the system to their specific needs.
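Chunking is the easiest of these strategies to expose as a user-tunable setting. The sketch below shows one common variant, fixed-size word windows with overlap. The function name `chunk_words` and its parameters are hypothetical, and real systems may instead split on sentences, headings, or tokens.

```python
def chunk_words(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of chunk_size words, sharing `overlap` words between neighbors."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

For example, `chunk_words("a b c d e f g", 3, 1)` returns `["a b c", "c d e", "e f g"]`. Letting users vary `chunk_size` and `overlap` and then compare retrieval quality on their own documents is one way to support the per-domain experimentation described above.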

You can read more about how to deploy a RAG architecture on TrueFoundry here:

LLM-powered QA Chatbot on your data in your Cloud

Productionize a question-answering bot on your data in your cloud environment using open-source LLMs and RAG (Retrieval-Augmented Generation).

TrueFoundry Blog

Transitioning to generative AI: Challenges and opportunities

Companies that built natural language processing (NLP) solutions using traditional methods now face the challenge of transitioning to generative AI. Generative AI models, such as GPT-4 and LaMDA, offer a number of advantages over traditional methods, including the ability to generate text, translate languages, and answer questions in a comprehensive and informative way. However, there are also a number of challenges associated with the transition.

One challenge is that generative AI models are still under development and can be expensive to use. In addition, prompting is still a somewhat vague and difficult practice. Companies need to develop effective prompting techniques to get the most out of generative AI models.

Another challenge is integrating generative AI models into existing infrastructure. Companies need to make sure that their systems can handle the increased load and complexity that generative AI models bring.

Despite the challenges, there are also a number of opportunities associated with transitioning to generative AI. Generative AI models can help companies improve the quality of their products and services, automate tasks, and create new products and services.

Here are some tips for companies transitioning to generative AI:

Start by assessing your needs. What specific tasks do you need generative AI models to perform? What are your budget constraints? Once you have a good understanding of your needs, you can start identifying the generative AI models that fit your use case.
Experiment with different models and techniques. There is no one-size-fits-all approach to transitioning to generative AI. Companies need to experiment with different models and techniques to find what works best for them.
Integrate generative AI models into your existing infrastructure. Companies need to make sure that their systems can handle the increased load and complexity that generative AI models bring. This may require scaling their infrastructure or making changes to their software.
Train your employees. Generative AI models are powerful tools, but they can also be complex to use. Companies need to train their employees on how to use them effectively.

Transitioning to generative AI can be a challenge, but it is also an opportunity for companies to improve their products and services and to create new ones. By following the tips above, companies can make the transition as smooth and successful as possible.

Read our previous blogs in the TrueML series

True ML Talks #20 - Transformers, Embedding, LLMS @ Turnitin

Deep dive into a new way of thinking about Transformers and LLMs, via Embeddings. We talk with Sumeet, Distinguished ML Scientist @ Turnitin.

TrueFoundry Blog

Keep watching the TrueML YouTube series and reading all of the TrueML blog series.

TrueFoundry is an ML deployment platform-as-a-service (PaaS) built on Kubernetes that speeds up developer workflows while giving developers full flexibility in testing and deploying models and ensuring full security and control for the infrastructure team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, and so deliver real business value.

Discuss your ML pipeline challenges with us here

TrueFoundry AI Gateway delivers ~3–4 ms latency, handles 350+ RPS on 1 vCPU, scales horizontally with ease, and is production-ready, while LiteLLM suffers from high latency, struggles beyond moderate RPS, lacks built-in scaling, and is best for light or prototype workloads.
