TrueML #22 - ML Platform & LLMs @ Voiceflow

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support


We are back with another episode of True ML Talks. In this episode, we dive deep into Voiceflow's ML platform and LLMs with Denys Linkov.

Denys leads the machine learning team at Voiceflow, where he joined as the founding ML engineer. Prior to that, he worked as a senior cloud architect for a global bank, working on data systems, MLOps, and core infrastructure.

📌

Our conversation with Denys covers the following aspects:
- Machine Learning at Voiceflow
- Voiceflow's MLOps Journey
- Automating model deployment and observability to reduce context switching and improve efficiency
- Real-time inferencing pipeline: Benefits and challenges
- Voiceflow's approach to generative AI

Watch the full episode below:

Machine Learning @ Voiceflow

Voiceflow is a no-code platform that allows businesses to build and deploy conversational AI applications. It can be used to create chatbots, virtual assistants, and other conversational interfaces for a wide range of industries, including:

E-commerce
Real estate
Banking
Automotive
Utilities
Government

Voiceflow's NLU model is able to cover a wide range of industries because it is trained on a massive dataset of text and code from a variety of sources. This allows Voiceflow to understand and respond to a wide range of natural language queries, regardless of the industry.

For example: A Voiceflow chatbot could be used by an e-commerce company to help customers find products, answer questions about products, and place orders. A Voiceflow chatbot could also be used by a real estate company to help potential buyers find homes, schedule appointments with agents, and learn about the home buying process.

One of the challenges of building an NLU model that can cover all of these industries is that each industry has its own unique language and jargon. However, Voiceflow's NLU model is able to learn these differences over time as it is exposed to more data from different industries.

Voiceflow's MLOps Journey: Building and Deploying Machine Learning Models for Conversational AI

One of the first challenges Voiceflow faced was deciding whether to build its own models or use external models. Voiceflow decided to explore both options and built a couple of proofs of concept. The first feature Voiceflow built was utterance generation, which uses machine learning to generate the example utterances a user needs to add to enrich their own data model.

To deploy the utterance generation model into production, Voiceflow built out its MLOps platform. The goal of the platform was to be able to deploy several experiments into production very quickly, as well as manage the environments.

The utterance generation model was the first to be killed by the release of ChatGPT, which is a more advanced generative model. This taught Voiceflow the importance of being flexible and willing to kill off its own developments if necessary, in order to focus on what's best for the customer experience.

The conversation also covered the massive shift that has happened in the conversational AI space since the launch of instruction-tuned GPT-based models. Voiceflow admits that it was a strategic mistake not to consider GPT-3 at the time, but it also learned that it is important to be adaptable and willing to change its approach as the field evolves.

Here's a blog post you can read about creating the Voiceflow NLU:

Inside Voiceflow | Voiceflow

Allow us to regale you with product announcements, an exclusive peek behind the Voiceflow curtain, and product tips and tricks from our community.

Automating model deployment and observability to reduce context switching and improve efficiency

In the traditional machine learning development process, data scientists train models in Jupyter notebooks and then hand them off to machine learning engineers or backend engineers to deploy them in production. This can lead to context switching and delays, as the engineers need to understand the model and the data in order to deploy it successfully.

Automate model deployment and observability

One way to address this challenge is to automate model deployment and observability. This can be done by creating a set of tools and processes that allow data scientists to deploy and monitor their models in production without having to involve other engineers.

One example of this is to use a cloud-based platform that provides managed services for model deployment and observability. These platforms can provide a variety of features, such as:

Automatic model deployment and scaling
Real-time model monitoring
Drift detection and alerting
Model versioning and rollback

Develop your own custom tools and processes

Another approach to automating model deployment and observability is to develop your own custom tools and processes. This can give you more flexibility and control, but it also requires more investment.

Here is a specific example of how one company automated model deployment and observability using this approach:

Created a set of automated scripts that spin up a cloud environment with all of the services needed to deploy and monitor their models.
Developed a CLI tool that makes it easy to deploy new models to the cloud environment.
The CLI tool automatically creates all of the necessary folders and Terraform files to deploy the model.
The CLI tool also specifies the environment in which to deploy the model.

This automation allowed the company's data scientists to deploy and monitor their models in production without having to involve any other engineers.
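As a rough illustration of the CLI idea described above, here is a minimal Python sketch that creates a folder layout and a Terraform file for a model in a chosen environment. The function names, the folder layout, the `--env` and `--root` flags, and the Terraform module path are all assumptions made for this example; they are not taken from the episode or from any real tool.

```python
import argparse
from pathlib import Path

# Hypothetical Terraform template. The module source and variable names are
# placeholders for illustration only.
TERRAFORM_TEMPLATE = """\
module "{model_name}" {{
  source      = "../../modules/model_service"  # hypothetical module
  model_name  = "{model_name}"
  environment = "{environment}"
}}
"""


def scaffold_model(model_name: str, environment: str, root: Path) -> Path:
    """Create <root>/<environment>/<model_name>/ with a src/ folder and a main.tf."""
    model_dir = root / environment / model_name
    (model_dir / "src").mkdir(parents=True, exist_ok=True)
    main_tf = model_dir / "main.tf"
    main_tf.write_text(
        TERRAFORM_TEMPLATE.format(model_name=model_name, environment=environment)
    )
    return main_tf


def main(argv=None) -> None:
    """Parse command-line arguments and scaffold one model deployment."""
    parser = argparse.ArgumentParser(description="Scaffold a model deployment.")
    parser.add_argument("model_name")
    parser.add_argument("--env", default="dev", choices=["dev", "staging", "prod"])
    parser.add_argument("--root", type=Path, default=Path("deployments"))
    args = parser.parse_args(argv)
    path = scaffold_model(args.model_name, args.env, args.root)
    print(f"Wrote {path}")
```

Calling `main(["intent-classifier", "--env", "staging"])` would write `deployments/staging/intent-classifier/main.tf`. A real tool would go on to run the Terraform workflow and wire up monitoring, which this sketch leaves out.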

Challenges of developing your own custom tools and processes

There are also some challenges that need to be considered when developing your own custom tools and processes for model deployment and observability:

Complexity: Developing your own custom tools and processes can be complex and time-consuming.
Debugging: It can be difficult to debug issues when they occur, especially if data scientists do not have full visibility into the pipelines that have been built.
Maintenance: Custom tools and processes require ongoing maintenance and support.

How to mitigate the challenges

There are a few things that can be done to mitigate the challenges of developing your own custom tools and processes for model deployment and observability:

Start small: Start by developing a basic set of tools and processes that meet your immediate needs. You can then add more features and functionality over time.
Use open source tools and libraries: There are a number of open source tools and libraries available that can help you to develop your own custom tools and processes. Using these tools and libraries can reduce the amount of development work required.
Document your tools and processes: Thoroughly document your tools and processes so that data scientists and other engineers can easily understand and use them.
Provide training and support: Provide training and support to data scientists and other engineers on how to use your custom tools and processes.

Real-time inferencing pipeline: Benefits and challenges

Real-time inferencing pipelines offer a number of benefits, including:

Lower latency: Real-time inferencing pipelines can deliver predictions to users with minimal delay.
Increased scalability: Real-time inferencing pipelines can be scaled up or down to meet demand, making them ideal for high-volume applications.
Improved flexibility: Real-time inferencing pipelines can be used to implement a variety of machine learning models, including classification, regression, and object detection.

However, real-time inferencing pipelines also present some challenges, such as:

Increased complexity: Real-time inferencing pipelines can be complex to design and implement, requiring expertise in machine learning, distributed systems, and infrastructure.
Increased cost: Real-time inferencing pipelines can be more expensive to operate than batch inferencing pipelines, due to the need for more powerful hardware and infrastructure.
Increased risk of errors: Real-time inferencing pipelines can be more prone to errors than batch inferencing pipelines, due to the need to process data and generate predictions in real time.

Autoscaling in a real-time machine learning pipeline

One of the challenges of building and deploying a real-time machine learning pipeline is how to auto scale the system to handle changes in traffic. There are a number of factors to consider, such as the predictability of the traffic patterns, the latency requirements of the models, and the complexity of the auto scaling algorithm.

One approach to auto scaling a real-time machine learning pipeline is to use a queuing system. This allows you to decouple the producers (which generate the inference requests) from the consumers (which process the inference requests). This gives you more flexibility in how you scale the system.

To auto scale a queuing-based system, you can use a variety of metrics, such as the number of messages in the queue, the average latency of the requests, or the CPU utilization of the workers. You can also use a combination of these metrics.

It is important to carefully tune the auto scaling algorithm to avoid over-scaling or under-scaling the system. Over-scaling can lead to wasted resources, while under-scaling can lead to performance problems.
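To make the metric-based scaling decision concrete, here is a small Python sketch that combines queue depth and average latency into a desired worker count. The policy fields and their default values are illustrative assumptions, not figures from the episode, and a real autoscaler would also need cooldowns and smoothing to avoid flapping.

```python
import math
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # All defaults are assumed values for illustration.
    target_messages_per_worker: int = 20  # backlog one worker is expected to absorb
    max_latency_ms: float = 200.0         # latency objective for requests
    min_workers: int = 1
    max_workers: int = 50


def desired_workers(queue_depth: int, avg_latency_ms: float,
                    current_workers: int, policy: ScalingPolicy) -> int:
    """Return the worker count suggested by queue depth and latency."""
    # Size the pool so that each worker owns at most the target backlog.
    by_queue = math.ceil(queue_depth / policy.target_messages_per_worker)
    # If latency exceeds the objective, grow in proportion to the overshoot;
    # otherwise latency adds no scaling pressure.
    by_latency = 0
    if avg_latency_ms > policy.max_latency_ms:
        by_latency = math.ceil(current_workers * avg_latency_ms / policy.max_latency_ms)
    wanted = max(by_queue, by_latency)
    # Clamp to the configured bounds to limit over- and under-scaling.
    return max(policy.min_workers, min(policy.max_workers, wanted))
```

Taking the maximum of the two signals means either a growing backlog or a latency overshoot can trigger a scale-up, while scale-down only happens when both are quiet.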

Here are some additional thoughts on auto scaling a queuing-based system for real-time inference:

Use a cloud-based platform: Cloud-based platforms can make it easier to auto scale your system as your traffic patterns change. For example, you can use a cloud-based load balancer to distribute traffic across your pods and scale the number of pods up or down as needed.
Use a queuing system that supports auto scaling: Some queuing systems support auto scaling, which means that they can automatically scale the number of workers up or down based on the number of messages in the queue. This can help you to ensure that your system can handle spikes in traffic without any manual intervention.
Monitor your system: It is important to monitor your system closely to identify any problems with auto scaling. For example, you may need to adjust the thresholds that trigger scaling up or down, or you may need to identify and address specific bottlenecks in your system.

Model servers for latency-sensitive real-time systems

Choosing a model server for latency-sensitive applications can be challenging for a number of reasons. First, there are many different model servers available, each with its own strengths and weaknesses. Second, the requirements for latency-sensitive applications can vary widely depending on the specific application and the types of models being used. Finally, it is often difficult to predict how a model server will perform in a production environment.

Factors to consider

When choosing a model server for a latency-sensitive application, it is important to consider the following factors:

Model latency: The latency of the model server should be low enough to meet the requirements of the application.
Scalability: The model server should be able to scale to meet the traffic demands of the application.
Flexibility: The model server should be flexible enough to support the specific needs of the application, such as different frameworks and hardware platforms.
Ease of use: The model server should be easy to use and manage.
Benchmarks: It is important to benchmark different model servers to see which one performs best for your specific needs.
Support: Consider the level of support that is available for the model server.
Community: Consider the size and activity of the community around the model server.
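For the benchmarking factor above, a simple starting point is to time repeated calls and compare latency percentiles across candidate servers. The sketch below assumes a hypothetical `predict` callable that wraps whatever client call the model server exposes; it measures sequential requests only, so it says nothing about behavior under concurrent load.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def benchmark(predict: Callable[[str], object], inputs: Sequence[str],
              warmup: int = 5) -> Dict[str, float]:
    """Return latency statistics in milliseconds for sequential calls to predict."""
    for text in inputs[:warmup]:
        predict(text)  # warm caches and connections before timing
    latencies = []
    for text in inputs:
        start = time.perf_counter()
        predict(text)
        latencies.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies, n=100)
    return {
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "mean": statistics.fmean(latencies),
    }
```

Tail percentiles such as p95 and p99 usually matter more than the mean for latency-sensitive applications, which is why they are reported separately here.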

💡

Other insights around the ML platform at Voiceflow:
Voiceflow uses a combination of AWS and GCP, as different enterprise customers have different requirements. They have not explored using Karpenter or Autopilot yet, as they were already building out their infrastructure when these features were released. They also need to use T4 GPUs for many of their workloads, which are not optimal for Autopilot. Overall, they are prioritizing engineering time for now and will eventually migrate to more advanced infrastructure solutions as they scale up.

Voiceflow's approach to generative AI

Voiceflow is taking a cautious approach to open source generative AI. They are aware of the potential benefits of these models, but they are also aware of the challenges involved. They are committed to providing their users with the best possible experience, and they will switch to open source models when it is the right time for their business.

Challenges of open source generative AI

There are a few challenges associated with open source generative AI:

Rapid evolution: Open source generative AI models are evolving rapidly, which can make it difficult to keep up with the latest optimizations.
Cost: Open source generative AI models can be computationally expensive to train and deploy.
Support: Open source generative AI models may not have the same level of support as proprietary models.

Benefits of open source generative AI

Despite the challenges, open source generative AI models also offer a number of benefits:

Transparency: Open source generative AI models are more transparent than proprietary models, which means that users can better understand how they work and trust the results.
Reproducibility: Open source generative AI models are more reproducible than proprietary models, which means that users can replicate the results of experiments and share their work with others.
Customization: Open source generative AI models can be customized and extended to meet specific needs.

Handling latency

Latency is a critical factor to consider when choosing a model for a retrieval augmented generation system. The best approach is to give users a choice of models to use and to provide education on what to use for different tasks.

For example, if latency is the most important factor, then an NLU-based approach with intents, utterances, and static responses is recommended. NLU models are typically much faster than generative models, and static responses can be delivered with very low latency.

If the user needs higher precision or better formatting, then using a generative model like GPT-4 is recommended. Generative models are more powerful than NLU models and can generate text that is more natural and engaging. However, it is important to note that generative models are also much slower than NLU models.
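One way to picture this trade-off is a small selection helper that picks the highest-quality option that still fits a latency budget. This is only a sketch: the `ModelOption` fields, the quality ranking, and any latency numbers passed in are hypothetical and would need to come from your own measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelOption:
    name: str
    typical_latency_ms: float  # illustrative value, to be measured per deployment
    quality_rank: int          # higher means better precision or formatting (assumed)


def pick_model(options: Sequence[ModelOption],
               latency_budget_ms: float) -> Optional[ModelOption]:
    """Return the highest-quality option whose typical latency fits the budget."""
    within_budget = [o for o in options if o.typical_latency_ms <= latency_budget_ms]
    if not within_budget:
        return None  # nothing is fast enough; the caller decides how to degrade
    return max(within_budget, key=lambda o: o.quality_rank)
```

With a tight budget the helper falls back to the fast NLU path, and with a generous budget it prefers the generative model, mirroring the guidance above.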

Another way to reduce latency is to use a distributed architecture. In a distributed architecture, the retrieval and generation tasks are performed on separate servers. This allows the system to scale to meet the needs of even the most demanding applications.

Building a High-Performance Retrieval Augmented Generation System

Retrieval augmented generation (RAG) systems are a powerful new approach to text generation that combine the strengths of retrieval and generative models. RAG systems work by first retrieving relevant passages from a knowledge base and then using a generative model to generate text based on the retrieved passages.
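The retrieve-then-generate flow can be sketched in a few lines of Python. This example swaps in a simple bag-of-words cosine similarity for the retriever, where a production system would typically use embeddings and a vector store, and it stops at building the prompt; the call to a generative model is left out. All function names and the prompt wording are assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def _vector(text: str) -> Counter:
    """Lowercased bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the query."""
    q = _vector(query)
    ranked = sorted(passages, key=lambda p: _cosine(q, _vector(p)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble the retrieved passages and the question into one prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

The prompt returned by `build_prompt` is what would be sent to the generative model in the second stage.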

RAG systems can be used for a variety of tasks, including question answering, summarization, and creative writing. However, building a high-performance RAG system can be challenging.

Some of the key factors to consider when building a RAG system include:

Model selection: There are a variety of different retrieval and generative models available. It is important to choose models that are appropriate for your specific needs. For example, if you need to generate text in a specific language, then you will need to choose a model that is trained on text in that language.
Data selection: The quality of the data that you use to train your system will have a significant impact on its performance. It is important to choose data that is relevant to your target tasks and that is free of errors.
Hardware selection: The hardware that you use will also have a significant impact on the performance of your system. For example, using GPUs can significantly speed up the retrieval and generation tasks.
System architecture: RAG systems can be implemented in a variety of different ways. It is important to choose a system architecture that is appropriate for your specific needs. For example, if you need to deploy your system in production, then you will need to choose an architecture that is scalable and reliable.

In addition to the factors mentioned above, it is also important to keep in mind that RAG systems are complex and can be difficult to generalize. Every user's domain and use case will be different, so it is important to give users the power to test their own prompts, processing, and chunking strategies. This will allow users to customize the system to meet their specific needs.
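As one example of a chunking strategy that users might want to experiment with, the sketch below splits text into fixed-size word windows with overlap. The function name and the default sizes are arbitrary assumptions; sentence-aware or token-based chunking are equally valid choices.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of chunk_size words, each sharing overlap words with the previous one."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Exposing `chunk_size` and `overlap` as parameters is what lets users compare strategies against their own documents and prompts.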

Here you can read more about how to deploy a RAG architecture on TrueFoundry:

LLM-powered QA Chatbot on your data in your Cloud

Productionize a question-answering bot on your data in your cloud environment using open source LLMs and RAG (Retrieval-Augmented Generation).

Transitioning to Generative AI: Challenges and Opportunities

Companies that have built NLP-based solutions using traditional methods are now facing the challenge of transitioning to generative AI. Generative AI models, such as GPT-4 and LaMDA, offer a number of advantages over traditional methods, including the ability to generate text, translate languages, and answer questions in a comprehensive and informative way. However, there are also a number of challenges associated with transitioning to generative AI.

One challenge is that generative AI models are still under development and can be expensive to use. Additionally, the concept of prompting is still fairly ambiguous and challenging. Companies need to be able to develop effective prompting techniques in order to get the most out of generative AI models.

Another challenge is integrating generative AI models into existing infrastructure. Companies need to make sure that their systems can handle the increased load and complexity of generative AI models.

Despite the challenges, there are also a number of opportunities associated with transitioning to generative AI. Generative AI models can help companies to improve the quality of their products and services, automate tasks, and create new products and services.

Here are some tips for companies that are transitioning to generative AI:

Start by evaluating your needs. What are the specific tasks that you need generative AI models to perform? What are your budget constraints? Once you have a good understanding of your needs, you can start to identify the right generative AI models for your use case.
Experiment with different models and techniques. There is no one-size-fits-all approach to transitioning to generative AI. Companies need to experiment with different models and techniques to find what works best for them.
Integrate generative AI models into your existing infrastructure. Companies need to make sure that their systems can handle the increased load and complexity of generative AI models. This may require scaling up their infrastructure or making changes to their software.
Train your staff. Generative AI models are powerful tools, but they can also be complex to use. Companies need to train their staff on how to use generative AI models effectively.

Transitioning to generative AI can be a challenge, but it is also an opportunity for companies to improve their products and services and create new products and services. By following the tips above, companies can make the transition to generative AI as smooth and successful as possible.

Read our previous blogs in the TrueML series:

True ML Talks #20 - Transformers, Embedding, LLMS @ Turnitin

Deep dive into a new way of thinking about Transformers and LLMs, via Embeddings. We talk with Sumeet, Distinguished ML Scientist @ Turnitin.

Keep watching the TrueML YouTube series and reading the TrueML blog series.

TrueFoundry is an ML deployment PaaS built on Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infrastructure team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost, release models to production faster, and realize real business value.


