True ML Talks #7 - ML Platform @ Edge

Built for Speed: ~10ms Latency, Even Under Load

Blazingly fast way to build, track and deploy your models!

Handles 350+ RPS on just 1 vCPU — no tuning needed
Production-ready with full enterprise support


We are back with another episode of True ML Talks. In this episode, we dive deep into the ML platform at Edge with Rahul Kulhari.

Introducing Rahul Kulhari, the co-founder and head of data science at Edge. With a strong background in AI and machine learning, Rahul is responsible for executing the company's vision and building its AI strategy. He leads a team of experts who develop cutting-edge AI systems that power Edge's talent acquisition, talent mobility, and internal talent marketplace products. His expertise and experience make him a valuable asset to the industry and an excellent resource for anyone interested in the latest developments in data science and AI.

📌

Our conversation with Rahul covers the following aspects:
- ML use cases in Edge
- Machine Learning Team at Edge
- Innovation in Machine Learning Stack
- Quantization vs. Distillation
- Challenges in Operationalizing Machine Learning
- Choosing MLOps Tools

Watch the full episode below:

ML Use Cases @ Edge

- Natural language processing (NLP): Edge uses NLP to understand job descriptions and resumes so that it can recommend the right candidates, including potential candidates, for each job.
- Knowledge graph: Edge uses a knowledge graph to power search and recommendations, surfacing personalized job opportunities for employees inside a company and matching the right candidates to jobs.
- Reinforcement learning: a potential future use case for Edge. It would let users make decisions based on current behavior and on the transformation happening in the industry, moving towards a more dynamic approach that accounts for industry trends and changes over time.

Machine Learning Team at Edge

The team at Edge is divided into five verticals, each responsible for a particular aspect of the AI product development lifecycle:

- Applied Scientists/Research Scientists/Data Scientists: Responsible for understanding the problem statement and building the complete end-to-end solution, including experimentation, data cleaning, data processing, and deployment. They work closely with other team members to develop and deploy machine learning models.
- Data Analysts: Responsible for collecting, analyzing, and interpreting large, complex datasets. They work closely with data scientists to ensure that the data is of high quality and relevant to the problem being solved.
- Machine Learning Engineers: Enable data scientists by introducing tools for training, experimentation, deployment, and monitoring into the machine learning pipelines. They work closely with the Applied Scientists to deploy models in production.
- AI Product Managers: Responsible for building and enhancing the AI product. They translate the problem statement from stakeholders for the data scientists and other team members, and they work with the rest of the team to ensure that the AI product meets the company's needs and goals.
- Domain Experts: People with expertise in specific domains such as HR, finance, and sales. They work closely with the data scientists and machine learning engineers to ensure that the AI product is relevant to the specific domain and provides value to the company.

📌

The role of the AI product manager:
The AI product manager bridges the gap between the business and the data science and ML engineering teams. They connect with the product and customer success teams to understand business objectives, then organize discussions with data scientists, research scientists, and the ML engineering team to identify what each team needs to contribute. They communicate the needs and guidelines for each contribution so that everyone is aligned, and they stay involved throughout the project to make sure the business objectives are met.

Innovations in the Machine Learning Stack

The ML team at Edge sees lack of data as a significant challenge in the machine learning workflow. To address it, they have introduced various tools, processes, and algorithms for data augmentation. They have also developed capabilities such as student-teacher algorithms, in which models are first trained on the noisy data created by these tools and algorithms and then fine-tuned on a large amount of labeled data.

One critical tool in their data augmentation workflow is Evidently AI, which helps them detect data drift and target drift so that the noisy data they create stays aligned with the labeled (goal) data. With it, they check that both categorical and continuous features of the augmented data are in line with the labeled data and therefore useful for building accurate models.
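To make the drift check above concrete, here is a minimal sketch of one common drift measure for a single continuous feature, the Population Stability Index (PSI). This is not Evidently AI's API, and the source does not say which drift statistics Edge relies on; the function name, the bin count, and any alert threshold are assumptions for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    candidate: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples of one continuous feature.

    `reference` stands for the labeled (goal) data and `candidate` for the
    augmented data. 0.0 means the binned distributions are identical.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def histogram(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(bins - 1, idx))] += 1
        total = len(values)
        # eps avoids log(0) when a bin is empty.
        return [max(c / total, eps) for c in counts]

    ref_hist = histogram(reference)
    cand_hist = histogram(candidate)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_hist, cand_hist))
```

A larger PSI means the augmented feature has moved further from the labeled data. Where to draw the line between acceptable and drifted is a per-team choice, and categorical features would need a per-category variant of the same idea.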

The team has also innovated in the machine learning pipeline. While it has become mature over time, when they were building it, they found that no single tool or product could solve all the end-to-end tasks, and integrating them with each other was a challenge. They have utilized different tools such as Neptune, Comet, and MLflow for model registry and management.

From the deployment perspective, they have focused on scalability, latency, and cost. They use tools such as TF Serving and ONNX (for quantization) and deploy on Kubernetes pods. They consider this combination of multiple tools across the machine learning pipeline an innovation in itself. Because they have been able to manage costs while still doing state-of-the-art work, they have not needed to move to newer tools that may be more expensive. However, they encourage their team to keep an eye on new technologies and tools that may be useful in the future.

Quantization Works Better Than Distillation: Optimizing Model Latency

Optimizing model latency is a crucial challenge in machine learning, and techniques such as quantization, model pruning, and distillation have been explored to address it. According to the ML team at Edge, quantization worked better than distillation for reducing model latency in their experiments.

The team experimented with different models such as DistilBERT, RoBERTa, and ALBERT, and ultimately chose ALBERT due to its better performance in job and resume interpretation. They also conducted distillation on both ALBERT and RoBERTa.

From their experiments, the team found that quantization provided remarkable results, reducing model latency from approximately 1.2 seconds to around 200 milliseconds on CPUs. The team used ONNX and Hugging Face quantization for their models, which they trained only on GPUs.
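For readers new to the technique, the following is a minimal sketch of what quantization does to a model's weights: it maps 32-bit floats to 8-bit integers plus a scale and a zero point. It is plain Python for illustration, not the ONNX or Hugging Face tooling the team used, and the source does not say which quantization scheme (dynamic or static, per-tensor or per-channel) Edge applied, so the per-tensor affine scheme below is an assumption.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float, int]:
    """Affine (asymmetric) quantization of floats to the int8 range [-128, 127]."""
    w_min, w_max = min(weights), max(weights)
    # Include 0.0 in the range so that zero weights stay exactly zero.
    w_min, w_max = min(w_min, 0.0), max(w_max, 0.0)
    scale = (w_max - w_min) / 255 or 1.0  # guard against all-zero weights
    zero_point = round(-128 - w_min / scale)
    codes = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_int8(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Recover approximate float values from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


# Hypothetical weights, chosen only to exercise the functions.
weights = [-1.0, -0.5, 0.0, 0.25, 1.5]
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
```

Each weight now needs one byte instead of the four bytes of a float32, and integer arithmetic is typically cheaper on CPUs. The price is a small rounding error per weight, on the order of `scale`, which is why accuracy metrics should be re-checked after quantizing.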

When selecting the right model, the team considered various factors such as latency, model size, concurrency, CPU utilization, and memory utilization. They collaborated with data scientists who provided the framework for the quantization process while the machine learning engineering team conducted the experiments and selected the best option based on the results.
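As a rough illustration of the latency part of such a comparison, the sketch below times repeated calls to a prediction function and reports summary statistics. The name `benchmark_latency`, the run counts, and the choice of mean, median, and 95th percentile are assumptions, not the team's actual framework; model size, concurrency, CPU utilization, and memory utilization would need separate measurements.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark_latency(
    predict: Callable[[], object],
    runs: int = 50,
    warmup: int = 5,
) -> Dict[str, float]:
    """Time repeated calls to `predict` and summarize latency in milliseconds."""
    for _ in range(warmup):
        predict()  # warm caches and lazy initialization before measuring

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        predict()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    # Simple index-based approximation of the 95th percentile.
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
    }


# Stand-in workload; in practice `predict` would wrap one model inference call.
stats = benchmark_latency(lambda: sum(range(1000)), runs=20, warmup=2)
```

Running the same harness against the original and the quantized model, on the same CPU and with the same inputs, gives directly comparable numbers.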

Although quantization had a 1% impact on precision, it did not affect recall. The team emphasizes that everyone should try quantization as it is a simple yet effective technique for reducing model latency.
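A before-and-after comparison of this kind can be sketched as follows. The labels and predictions are toy values invented for illustration, not Edge's data, and the size of the precision drop here has nothing to do with the 1% figure above; the point is only how a quantized model can lose precision while recall stays the same.

```python
from typing import Sequence, Tuple


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: the quantized model adds one false positive and misses nothing.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
pred_baseline = [1, 1, 1, 0, 0, 0, 0, 1]
pred_quantized = [1, 1, 1, 1, 0, 0, 0, 1]

baseline = precision_recall(y_true, pred_baseline)
quantized = precision_recall(y_true, pred_quantized)
```

An extra false positive lowers precision, while recall only changes if the quantized model starts missing true positives.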

To put numbers on it: before quantization, the model took approximately 1200 milliseconds; after quantization, that dropped to approximately 200 milliseconds, roughly a 6x reduction.

Challenges in Operationalizing Machine Learning

Challenges:

- Limited data available for training: Use cases such as search, recommendation engines, classification problems, and objective- or goal-oriented machine learning can be challenging because little data is available. It is essential to find ways to work with limited data and still achieve the best outcomes.
- Adoption of ML tools: Getting tools like MLflow adopted is challenging, as research scientists and data scientists may not see why the tool matters or how it helps them. The ML team should educate them and raise awareness of the benefits of using such tools.

Solutions:

- Developing descriptive or prescriptive insights: The ML team should focus on developing tools that provide descriptive or prescriptive insights to help in decision-making. This reduces dependence on the expertise of research scientists, which can be time-consuming and costly.
- Collaboration of data, algorithm, and human expertise: To achieve the best outcomes and develop the right strategy, data, algorithms, and human expertise should be used together.
- Identifying the most critical experiments to run: Machine learning has many hyperparameters, so with limited infrastructure it is essential to choose which experiments to run. The ML team should develop a process for identifying the most critical experiments and optimizing hyperparameters to achieve the best outcomes.
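One simple way to cap the number of experiments is to sample a fixed budget of configurations from the full hyperparameter grid instead of running all of them (random search). The source does not describe how Edge prioritizes experiments, so this is a swapped-in technique for illustration, and the search space below is hypothetical.

```python
import itertools
import random
from typing import Any, Dict, List


def sample_experiments(
    search_space: Dict[str, List[Any]],
    budget: int,
    seed: int = 0,
) -> List[Dict[str, Any]]:
    """Pick at most `budget` distinct configurations from the full grid at random."""
    names = list(search_space)
    grid = [
        dict(zip(names, values))
        for values in itertools.product(*search_space.values())
    ]
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    return rng.sample(grid, k=min(budget, len(grid)))


# Hypothetical search space: 3 * 2 * 3 = 18 possible configurations.
search_space = {
    "learning_rate": [1e-5, 3e-5, 5e-5],
    "batch_size": [16, 32],
    "max_seq_length": [128, 256, 512],
}
selected = sample_experiments(search_space, budget=5)
```

With a budget of 5, only 5 of the 18 configurations are trained. A team with domain knowledge could replace the uniform sampling with a ranking of the hyperparameters it expects to matter most.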

MLOps Tooling: A Few Key Tools to Complete the Entire Journey

Infrastructure Tools for MLOps Training and Deployment

When it comes to MLOps, infrastructure is a critical component. A reliable infrastructure is necessary to support the processing power required for machine learning training and deployment. In India, a GPU provider like E2E Networks can be an affordable source of GPUs.

Model Training and Building Tools for MLOps

For model training and building, tools such as Neptune, Comet ML, or TrueFoundry, integrated with Git, can help ensure reproducibility and regulatory compliance. Hugging Face, TensorFlow, and PyTorch are also recommended for building models. CatBoost is a good option for regression problems and decision-tree-based models.

Deployment Tools for MLOps

When it comes to deployment, ONNX is a recommended tool; alternatively, a serverless approach can be taken using Max.io, Banana.dev, or Infrrd. In development, data quality can be checked with custom or third-party tools such as Great Expectations, with Streamlit for visualization and Alibi Detect or Evidently AI for data drift analysis. In production, additional tools may be required for data quality, lineage, and other types of analysis.

Read our previous blogs in the TrueML Series

True ML Talks #6 - Machine Learning Platform @ Nomad Health

In this blog, we dive deep into Nomad.Health’s ML platform, covering their ML architecture and how ML is used in the healthcare staffing industry.

Keep watching the TrueML YouTube series and reading the TrueML blog series.

TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds. This lets them save cost and release models to production faster, realising real business value.


