True ML Talks #13 - ML Platform @ Cookpad


We are back with another episode of True ML Talks. In this episode, we take a deep dive into the ML architecture at Cookpad, one of the world's largest recipe service platforms. We also cover the challenges of building a successful machine learning platform and how Cookpad uses Nvidia Triton Inference Server to run its models.
We are talking with Jose Navarro.

Jose is the Lead ML Platform Engineer at Cookpad, where he helps machine learning engineers and other ML practitioners deliver ML systems quickly and reliably.

📌

Our conversation with Jose covers the following aspects:
- Structure of the ML teams @ Cookpad
- GPU-based ML Infrastructure
- Automated Model Deployment at Cookpad
- Feature Store Integration and Configuration for Online Inference
- Data Sources and Feature Management during Model Experiments
- Using Argo Workflows for retraining ML Models
- Nvidia's Triton Inference Server and its Benefits
- Integrating MLflow Model Registry with Triton Inference Server
- Leveraging LLMs and Gen AI at Cookpad
- Tailoring the MLOps Architecture to Your Needs

Watch the full episode below:

Structure of the ML teams @ Cookpad

Machine Learning Engineers: The core ML team consists of engineers with expertise in both machine learning and software engineering. They are responsible for developing, training, and deploying machine learning models, collaborating closely with product teams.
Platform Team: The platform team supports the ML engineers by managing the underlying infrastructure, including Kubernetes clusters. They ensure scalability, reliability, and performance of ML systems. Additionally, they develop and maintain internal tools and frameworks to streamline the ML development process.

GPU-based ML Infrastructure

Cloud-based Infrastructure: Cookpad operates its ML infrastructure in the cloud, primarily using AWS, with some resources in GCP. The infrastructure is designed to handle the scale and demands of Cookpad's global user base.
Data Processing and Feature Extraction: User-generated content and actions are ingested through Kafka and processed in a streaming service. Features are calculated and stored in the Amazon SageMaker feature store, with Amazon Redshift serving as the data warehouse.
Microservice Architecture and Inference Layer: Cookpad's ML platform is built on a microservice architecture deployed on Kubernetes. Pre-processing occurs within microservices, and Nvidia Triton Inference Server is used for ML inference.
ML Engineer Experimentation Platform: Cookpad provides an experimentation platform for ML engineers, including Jupyter servers for data exploration and model training. MLflow is used for experiment tracking and model registry.
Model Deployment and Automation: ML engineers register models in MLflow, and automation processes handle the transformation and deployment of models on the Nvidia Inference Server (Triton).
GPU-Intensive Use Cases: Cookpad's GPU infrastructure supports various models, including multilingual models and image embedders. Multilingual models utilize large neural networks, while image embedders are computationally heavy. Models are deployed on GPUs by default using Nvidia Triton Inference Server.

Automated Model Deployment at Cookpad

Jupyter Hub for Experimentation: ML engineers at Cookpad have access to a managed Jupyter Hub environment where they can conduct their experiments. The team supports various kernels within Jupyter Hub for experimentation purposes.
Publishing Models to MLflow: Once the ML engineer has developed a model and achieved satisfactory results, they can publish the model to MLflow, which serves as the model registry and experiment tracking platform.
End-to-End Automation: Cookpad has implemented an end-to-end automated pipeline for model deployment. The automation pipeline takes over from the MLflow model registry and seamlessly moves the model through the deployment process.
Supported Backend Frameworks: Cookpad's automation pipeline supports a subset of backend frameworks, including PyTorch, TensorFlow, and ONNX. These frameworks are utilized to optimize and deploy the models effectively.
Defaulting to PyTorch and TensorFlow: While Cookpad supports multiple backend frameworks, the team tends to default to PyTorch and TensorFlow for model development and deployment due to their widespread usage and compatibility with the infrastructure.

📌

Automated Model Deployment with MLflow and Backend Support:
When a machine learning engineer is experimenting at Cookpad, they have access to a managed Jupyter Hub where they can run their experiments and use various kernels. Once they have developed a model, they publish it to MLflow, which serves as the model registry. From there, the process is fully automated.
Cookpad's automation pipeline supports a list of backends that are compatible with both MLflow and the Triton Inference Server, including PyTorch, TensorFlow, and ONNX. The pipeline deploys each registered model using the backend that matches it. The default choices for deployment are usually PyTorch or TensorFlow, with ONNX used for optimizing certain neural networks.
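
To make the backend selection concrete, here is a minimal sketch of how a pipeline could pick a Triton backend from the flavors recorded for an MLflow model. The flavor keys (`pytorch`, `tensorflow`, `onnx`) and the Triton backend names (`pytorch`, `tensorflow`, `onnxruntime`) are the standard ones, but the function name and the preference order are assumptions for illustration, not Cookpad's actual implementation.

```python
# Mapping from MLflow flavor names to Triton backend names.
# The mapping table itself is an illustrative assumption.
FLAVOR_TO_TRITON_BACKEND = {
    "pytorch": "pytorch",
    "tensorflow": "tensorflow",
    "onnx": "onnxruntime",
}


def choose_backend(flavors, preference=("onnx", "pytorch", "tensorflow")):
    """Return the Triton backend for the first supported flavor found.

    `flavors` is the `flavors` mapping from a model's MLmodel file.
    The preference order is hypothetical: it favors ONNX when the
    model was exported to it, otherwise PyTorch, then TensorFlow.
    """
    for flavor in preference:
        if flavor in flavors:
            return FLAVOR_TO_TRITON_BACKEND[flavor]
    raise ValueError(f"no supported flavor in {sorted(flavors)}")
```

A model logged with the PyTorch flavor also carries the generic `python_function` flavor, so `choose_backend({"python_function": {}, "pytorch": {}})` returns `"pytorch"`.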

Feature Store Integration and Configuration for Online Inference

At Cookpad, data scientists utilize the feature store for both experimentation and online inference. During experimentation, they have offline access to query existing features for model training. When exploring new features, data is pulled from the Data Warehouse, transformed, and used to create new features. Once satisfied with the model's performance, data scientists simply create a new feature group in the feature store through the repository schema.

Data flows from the Data Warehouse to the feature store via Kafka, allowing streaming of newly created features. To enable online inference, data scientists extend the streaming service to consume relevant events and perform transformations. During model registration, data scientists specify the model's construction and features used, and the transform code is configured to retrieve specific features by name.

This integration between the feature store, Data Warehouse, and streaming service ensures seamless incorporation of features into the online inference process, offering flexibility for adjustments and updates when needed.

Data Sources and Feature Management during Model Experiments

During model experiments, Cookpad's data scientists have two options for data sourcing. If the required features are already available in the feature store, they can directly query it offline. However, when exploring new features, they access the Data Warehouse, retrieve the data, and create the necessary transformations for training the models.

Incorporating new features into the feature store is straightforward. Data scientists submit a pull request (PR) with the feature's schema details, and automation handles the creation of the feature group using AWS calls. The data used in the Data Warehouse is streamed through Kafka to enable online inference with the newly created features. The existing streaming service is extended to consume events and apply transformations, ensuring the streaming flow of features into the feature store.
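
As a rough illustration of the automation behind such a PR, the sketch below turns a small schema description into the keyword arguments for SageMaker's `create_feature_group` call. The request field names and the feature types (`Integral`, `Fractional`, `String`) follow the SageMaker API; the schema layout, the function name, and the example values are assumptions, since the talk does not describe Cookpad's schema format.

```python
# Hypothetical short type names used in the schema file, mapped to
# the feature types SageMaker Feature Store accepts.
TYPE_MAP = {"int": "Integral", "float": "Fractional", "str": "String"}


def feature_group_request(schema, role_arn, offline_s3_uri):
    """Build keyword arguments for SageMaker `create_feature_group`."""
    definitions = [
        {"FeatureName": name, "FeatureType": TYPE_MAP[kind]}
        for name, kind in schema["features"].items()
    ]
    return {
        "FeatureGroupName": schema["name"],
        "RecordIdentifierFeatureName": schema["record_id"],
        "EventTimeFeatureName": schema["event_time"],
        "FeatureDefinitions": definitions,
        # Online store for inference, offline store for training queries.
        "OnlineStoreConfig": {"EnableOnlineStore": True},
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": offline_s3_uri}},
        "RoleArn": role_arn,
    }


# Example schema as it might appear in a PR (all values are made up).
example_schema = {
    "name": "user-recipe-activity",
    "record_id": "user_id",
    "event_time": "event_time",
    "features": {"user_id": "str", "event_time": "str", "views_7d": "int"},
}
```

In a CI job, the resulting dictionary could be passed as `boto3.client("sagemaker").create_feature_group(**request)` once the PR is merged.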

For maintaining feature configuration, data scientists include relevant information when registering the model in MLflow. The transformation code accesses the feature store directly, and through configuration, data scientists specify the required feature names. This flexibility allows easy modification of feature configurations as needed.
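
One way to picture this configuration-driven lookup: the model carries a list of feature names, and the transform code assembles the model input in exactly that order. The class, field, and function names below are hypothetical; the record is a plain dictionary standing in for whatever the feature store returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelFeatureConfig:
    """Hypothetical per-model configuration stored alongside the model."""
    feature_group: str
    feature_names: tuple


def build_model_input(config, record):
    """Order feature values as the model expects them.

    `record` is assumed to be a dict of feature name to value, as
    fetched from the feature store for one entity.
    """
    missing = [name for name in config.feature_names if name not in record]
    if missing:
        raise KeyError(f"features missing from record: {missing}")
    return [record[name] for name in config.feature_names]
```

Changing which features a model consumes then means editing the configuration rather than the transform code.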

Using Argo Workflows for retraining ML Models

Incorporating retraining pipelines into the architecture is still a work in progress at Cookpad. The focus so far has been on iterating quickly through new ML features, so mature pipelines that retrain or replace models on a daily or weekly basis are still being developed. The recommendation systems at Cookpad are at an early stage, and the iterative approach allows for rapid experimentation and model replacement through A/B testing.

Although the potential exists to build reproducible pipelines using Argo Workflows, Cookpad acknowledges that they are in the early stages of this implementation. It is not yet an ideal solution, and the reproducibility of the pipelines is a challenge they are actively addressing.

Starting with smaller, simpler experiments and automating the most critical pipeline components first leads to a well-thought-out architecture. Cookpad prioritized automating inference because it is the most critical part, and plans to focus on retraining pipelines next. This incremental approach to building the platform is a useful lesson for other teams.

Attempting to build an end-to-end system from the beginning often results in unnecessary or ill-fitting components. For reproducible pipelines, Jose suggests exploring alternatives to raw Argo Workflows, such as a Python wrapper or a different tool that better matches the machine learning engineers' level of familiarity with Kubernetes manifests and fits well with their CI/CD practices.
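
To show what a thin Python wrapper around Argo could look like, the sketch below generates a `CronWorkflow` manifest from a few arguments, so an ML engineer never has to write the manifest by hand. The manifest fields follow the Argo Workflows `CronWorkflow` resource as we understand it (newer Argo releases also accept a `schedules` list, so check the installed version); the function name, template name, and example values are assumptions.

```python
import json


def retraining_cron_workflow(name, image, command, schedule):
    """Return an Argo CronWorkflow manifest as a plain dict."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "CronWorkflow",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron expression
            "concurrencyPolicy": "Forbid",  # skip a run if one is still active
            "workflowSpec": {
                "entrypoint": "retrain",
                "templates": [
                    {
                        "name": "retrain",
                        "container": {"image": image, "command": list(command)},
                    }
                ],
            },
        },
    }


# Example: a weekly retraining job (image and command are made up).
manifest = retraining_cron_workflow(
    name="recsys-weekly-retrain",
    image="registry.example.com/recsys-train:latest",
    command=["python", "train.py"],
    schedule="0 3 * * 1",
)
manifest_json = json.dumps(manifest, indent=2)
```

Because Kubernetes accepts JSON manifests, the output of `json.dumps` can be applied directly from a CI/CD job.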

Nvidia's Triton Inference Server and its Benefits

Decision to Use Triton Inference Server: Cookpad chose Triton Inference Server to address the challenges of running inference on GPUs within Kubernetes, balancing infrastructure cost and performance against the business value of each model.
Benefits of Triton Inference Server: Triton Inference Server reduces cost by aggregating models on shared GPU resources and routing requests intelligently. It improves performance by deploying models on GPUs by default, resulting in faster inference. It also improves the experience for ML engineers: deploying a model takes a single config map line, after which the model is retrieved and loaded automatically.

By leveraging Nvidia's Triton Inference Server, Cookpad achieves cost optimization, enhances model inference performance, and simplifies deployment for ML engineers.

Integrating MLflow Model Registry with Triton Inference Server

Cookpad has integrated MLflow Model Registry and Triton Inference Server to streamline the deployment of models at scale. Here's how they accomplished it:

MLflow Model Storage: When a model is registered in MLflow, it is stored in an S3 bucket with a specific folder structure based on MLflow's conventions. The model can be in TensorFlow or PyTorch format, each with its own structure.
Triton's Model Structure: Triton Inference Server expects a different folder structure for loading models, including configuration files. This misalignment requires some adjustments to move a model from MLflow to Triton.
Sidecar Deployment: Cookpad developed a sidecar component deployed alongside Triton Inference Server. This small Python container queries the MLflow API every minute to identify any newly registered models. It also queries the Triton API to check which models are currently loaded.
Model Consolidation: Upon detecting a new model in MLflow that is not present in Triton, the sidecar retrieves the model file from the S3 bucket. It performs necessary file movements and ensures that the folder structure aligns with Triton's expectations.
Loading Models: The sidecar then places the model files in the appropriate folder within Triton's volume and calls the Triton API to load the model. The model is seamlessly loaded into Triton Inference Server without requiring a server restart or manual intervention.

By leveraging this integration, Cookpad enables efficient model deployment from MLflow to Triton Inference Server, allowing for scalability and easy updates without disrupting ongoing inference operations.
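
The sidecar logic described above can be sketched as a small reconciliation loop. This is an assumption-laden illustration rather than Cookpad's code: the function names are hypothetical, the registry and Triton calls are injected as plain callables, and only single-file backends are handled. The target layout (`<repository>/<model>/config.pbtxt` plus `<repository>/<model>/<version>/model.pt`) follows Triton's model repository convention.

```python
import shutil
from pathlib import Path

# File names Triton expects inside a version directory, per backend.
MODEL_FILENAMES = {"pytorch": "model.pt", "onnxruntime": "model.onnx"}


def models_to_load(registered, loaded):
    """Names registered in MLflow that Triton has not loaded yet."""
    return sorted(set(registered) - set(loaded))


def stage_model(source_file, repository, name, version, backend):
    """Copy a downloaded model file into Triton's expected layout."""
    model_dir = Path(repository) / name
    version_dir = model_dir / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(source_file, version_dir / MODEL_FILENAMES[backend])
    # Minimal config; a real one usually also declares input and
    # output tensors for the model.
    config = f'name: "{name}"\nbackend: "{backend}"\n'
    (model_dir / "config.pbtxt").write_text(config)
    return model_dir


def reconcile_once(list_registered, list_loaded, fetch_and_stage, load):
    """One polling cycle: stage and load every model Triton is missing."""
    new_models = models_to_load(list_registered(), list_loaded())
    for name in new_models:
        fetch_and_stage(name)  # e.g. download from S3, then stage_model(...)
        load(name)             # e.g. call Triton's model load endpoint
    return new_models
```

In a real sidecar, `list_registered` would query the MLflow registry, `list_loaded` would query Triton's repository index, and `load` would call Triton's `v2/repository/models/<name>/load` endpoint, which requires Triton to run in explicit model control mode. Running `reconcile_once` every minute matches the polling interval mentioned in the talk.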

Leveraging LLMs and Gen AI at Cookpad

Cookpad is actively exploring potential applications of Large Language Models (LLMs) and Generative AI (Gen AI) within its platform. While specific implementations are still in the exploration phase, here are some areas where Cookpad envisions using these advancements:

Simplifying Recipe Creation: Cookpad aims to lower the barrier for users to create recipes by leveraging LLMs. For instance, users can utilize voice recognition applications on their smartphones to dictate recipe instructions while cooking. The transcriptions are then passed through LLMs to generate formatted recipe text, reducing the effort required to document recipes accurately and efficiently.
Smart Ingredient Recognition: Cookpad envisions using Gen AI capabilities to let users take a picture of the ingredients in their fridge or pantry and ask for recipe suggestions. The AI system would identify the ingredients in the picture and provide recipe recommendations using Cookpad's search system.
Recipe Assistance and Customization: Using LLMs, Cookpad plans to develop features that offer recipe assistance and customization. For instance, users can request ingredient substitutions or modifications to suit their preferences or available ingredients. The LLM-based system would provide alternative suggestions and streamline the cooking process.

During the development and implementation of these use cases, Cookpad prioritizes user data privacy and compliance. As Jose puts it: "We have to follow that process and make sure that they are compliant with security."

Tailoring the MLOps Architecture to Your Needs

When it comes to building a real-time inference stack with MLOps, there is no one-size-fits-all approach. Jose Navarro emphasizes the importance of tailoring the architecture based on the specific requirements and the maturity level of the machine learning (ML) practice within a company. Here are some key insights regarding the essential components of an MLOps architecture:

Understanding Maturity: The current state of ML implementation plays a crucial role in determining the must-have components. If a company is at an early stage with a strong focus on experimentation and model delivery, the emphasis may be on delivering the model and building the inference layer. On the other hand, for companies with ML models already in production and critical to the business, reproducible pipelines, feature stores, and model observability become vital components.
Avoiding Overcomplication: Jose suggests avoiding the tendency to add unnecessary tools or components to the MLOps stack. Instead, it's important to focus on simplification and problem-solving through subtraction. By removing non-essential elements or simplifying the problem, companies can often arrive at a solution faster. This approach enables teams to prioritize building value and assessing the impact of their ML initiatives before investing time in complex tooling.
Agile Tool Adoption: Rather than following a predetermined list of must-haves, Jose recommends introducing tools as needed. Start by identifying the core challenges and requirements, and then gradually introduce tools that address those specific needs. For example, instead of spending significant time researching and integrating a feature store, consider building a quick key-value store with a service like DynamoDB to kickstart the project and assess its value. This iterative approach allows for faster validation of ML initiatives while minimizing unnecessary complexity.
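
As an illustration of the quick key-value approach, here is a minimal sketch of a feature lookup layer on top of a DynamoDB table. The wrapper relies only on the `put_item(Item=...)` and `get_item(Key=...)` methods of a boto3 `Table` resource; the class names, the `entity_id` key, and the in-memory stand-in used for local runs are assumptions for illustration.

```python
from decimal import Decimal


class KeyValueFeatureStore:
    """Thin wrapper around a DynamoDB-style table keyed by entity ID."""

    def __init__(self, table, key_name="entity_id"):
        self._table = table
        self._key_name = key_name

    def put(self, entity_id, features):
        item = {self._key_name: entity_id}
        for name, value in features.items():
            # boto3 rejects Python floats for DynamoDB numbers, so
            # floats are converted to Decimal before writing.
            item[name] = Decimal(str(value)) if isinstance(value, float) else value
        self._table.put_item(Item=item)

    def get(self, entity_id, feature_names):
        response = self._table.get_item(Key={self._key_name: entity_id})
        item = response.get("Item", {})
        return {name: item.get(name) for name in feature_names}


class InMemoryTable:
    """Local stand-in that mimics the two Table methods used above."""

    def __init__(self, key_name="entity_id"):
        self._key_name = key_name
        self._items = {}

    def put_item(self, Item):
        self._items[Item[self._key_name]] = dict(Item)

    def get_item(self, Key):
        item = self._items.get(Key[self._key_name])
        return {"Item": dict(item)} if item is not None else {}
```

Against a real table, the same wrapper would be constructed with `boto3.resource("dynamodb").Table(...)` instead of `InMemoryTable()`. If the experiment proves valuable, the wrapper gives a single place to swap in a full feature store later.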

Read our previous blogs in the True ML Talks series:

True ML Talks #11 - LLMs, LLMops and Generative AI

Deep dive into LLMs, LLMops, Generative AI and ChatGPT. We talk with Micheal, CTO at GreenHouse, about the trends in the Machine Learning space.

Keep watching the TrueML YouTube series and reading the TrueML blog series.

TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds. This allows them to save cost, release models to production faster, and realise real business value.
