How Wadhwani AI Replaced Its Managed ML Service

Helping every child read with Wadhwani AI

AI solution to assess and improve the reading skills of children in underserved communities

Wadhwani AI is a non-profit organization that works on multiple turnkey AI solutions for underserved populations in developing countries.

Through the Vachan Samiksha project, the team is developing a customized AI solution that teachers in rural India can use to assess the reading fluency of students and develop a personalized contingency plan to improve the reading skills of each student.

The team had deployed the solution in primary schools for conducting pilots. However, the team was facing the following issues that needed to be solved before the project’s scope was expanded to more schools and students:

Very high computing cost: The Vachan Samiksha model needed GPUs for inference, so the team bore very high costs to keep GPU instances provisioned for the entire deployment.
Limited scaling: Scaling was capped by the GPU quota for ML instances that the team could get on the managed ML service, and raising that quota was a slow process that involved making a business case. Getting non-ML instances on raw Kubernetes was much easier.
Slow responses for some requests: The pilots were conducted across thousands of schools, with millions of students using the system simultaneously. This required the system to scale horizontally as request throughput increased. However, the managed ML service took upwards of 9 minutes to scale, giving the end user a poor experience.

The TrueFoundry team partnered with the Vachan Samiksha team to solve these problems. Using the TrueFoundry platform, the team was able to:

Scale the application to handle 10X the requests per second of the managed ML service.
Reduce the cloud cost incurred by ~55% with the same level of reliability and performance.
Reduce the latency of requests by ~80% when the pods are scaling horizontally.


About Wadhwani AI

Wadhwani AI was founded by Romesh and Sunil Wadhwani (part of the TIME100 AI list) to harness AI to solve problems faced by underserved communities in developing nations. It partners with governments and global nonprofit bodies worldwide to deliver value through its solutions. As a not-for-profit, Wadhwani AI uses artificial intelligence to solve social problems in agriculture, education, and health, among other fields. Some of its projects include:

Pest management for cotton farms: The solution helps reduce crop losses by detecting and controlling pests that affect the cotton plant.
TB adherence prediction: Deployed at over 100 public health facilities, it helps identify high-risk patients, detect drug resistance, and support TB diagnosis using ultrasound data.
Newborn anthropometry: A solution that measures baby weight using a smartphone camera and tracks growth indicators.
COVID-19 forecasting and diagnosis: A solution that predicts the spread of the pandemic and detects COVID-19 infection using cough sounds.

Wadhwani AI also works with partner organizations to assess their AI-readiness, which is their ability to create and use AI solutions effectively and sustainably. Wadhwani AI’s work aims to use AI for good and to improve the lives of billions of people in developing countries.

Wadhwani AI’s Oral Reading Fluency Tool: Vachan Samiksha

Reading skills are fundamental to any child's educational foundation. Unfortunately, many students from the rural and underprivileged regions of India and other developing nations lack these skills. To solve this problem at a foundational level, the Wadhwani AI team has developed an AI-based Oral Reading Fluency tool called Vachan Samiksha.

The tool uses AI to analyze each child’s reading performance. It currently targets mostly rural and semi-urban regions of the country and is used across age groups. To make the solution generalizable across most of the country, the team has built an accent-inclusive model to assess reading in regional languages and English. Manual assessment of these skills is prone to bias and is often inaccurate.

The solution is served to its users (teachers at the target schools) through an app that invokes the model deployed on the cloud. The student reads a paragraph aloud, which the app records and sends to the cloud. There, the model assesses reading accuracy, speed, comprehension, and other complex learning delays that a normal evaluation could miss. Besides assessing these skills, the application creates a personalized learning plan for each student and generates demographic reports that government authorities can use for macro-level action. For the pilot, the team had deployed the model on the cloud provider's managed ML service.

When we started our collaboration with the Vachan Samiksha team within Wadhwani AI, the team had been leveraging the native MLOps stack of their cloud provider to deploy the model for its pilot with the Education Department of Gujarat.

Their infrastructure setup was as follows:

Managed Async Endpoint: The team wanted an asynchronous inference endpoint because the model could take ~5-7 seconds per inference. When the application received a lot of traffic at once, requests had to be stored temporarily until a worker could pick them up and run inference on them. The cloud provider's async endpoint uses its native queue internally for this.
Managed Container Service: The team was using the managed container service to host the backend service for the application.
Queue workers: The managed MLOps service used reserved ML instances as queue workers that picked up requests from the queue and ran inference on them.
Data Source: Queued requests were written to, and read from, the cloud provider's storage system.
SNS: Used as the broker to publish the output path and the success/failure messages from the output message queue.
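The flow above (enqueue a request, let a worker pick it up, then publish the outcome) is the standard asynchronous queue-worker pattern. A minimal in-process sketch using Python's `asyncio.Queue` follows; `fake_model`, the `inputs/`/`outputs/` paths, and the in-memory `notifications` list standing in for SNS are hypothetical placeholders, not the team's code or the cloud provider's API.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class InferenceRequest:
    request_id: str
    input_path: str  # hypothetical: where the uploaded audio lives in storage


async def fake_model(input_path: str) -> str:
    """Hypothetical stand-in for the GPU model (the real one took ~5-7 s)."""
    await asyncio.sleep(0.01)
    return input_path.replace("inputs/", "outputs/") + ".json"


async def worker(queue: asyncio.Queue, notifications: list) -> None:
    # Each worker loops forever: take a request, run inference, report the outcome.
    while True:
        req = await queue.get()
        try:
            output_path = await fake_model(req.input_path)
            notifications.append((req.request_id, "success", output_path))
        except Exception:
            notifications.append((req.request_id, "failure", None))
        finally:
            queue.task_done()


async def run(requests, num_workers: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    notifications: list = []  # stands in for the SNS topic
    workers = [asyncio.create_task(worker(queue, notifications)) for _ in range(num_workers)]
    for r in requests:
        queue.put_nowait(r)  # enqueue and return immediately, like an async endpoint
    await queue.join()  # wait until every queued request has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return notifications


if __name__ == "__main__":
    reqs = [InferenceRequest(f"req-{i}", f"inputs/audio-{i}.wav") for i in range(5)]
    for note in asyncio.run(run(reqs)):
        print(note)
```

Because `put_nowait` returns immediately, callers are never blocked by the slow inference; throughput is raised by adding workers, which is why scale-up speed matters so much in the challenges that follow.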

Vachan Samiksha Team's Architecture with Cloud Provider's Managed ML Service

Challenges that the team had been facing

The team faced challenges with this setup while trying to conduct the first pilot, which motivated them to try out other solutions:


Scaling was limited

The pilot was expected to run at a very large scale (~6 million students in a month). However, the team was not confident that the managed ML service could support this scale because:

Separate quota: The managed ML service has a separate quota and allocation for ML instances, and it was difficult to get more of it.
Difficult to get ML instance quota: Obtaining extra quota is a slow process, and the team had to make a business case to become eligible for it. Even when more quota was allocated, it was barely 1/10th of what the team expected.
Non-ML instances are much easier to get: The team found it much easier to get quota for non-ML instances. However, it was difficult to use them in the pilot without the managed MLOps tools.

Support was slow

During the pilot, the team faced issues with scaling speed, and some pods did not come up as expected. To resolve these issues, the team had to contact the cloud provider's representatives, who in turn contacted the provider's technical team. This indirection slowed down the resolution and delayed the pilot.

Scaling was slow

When request traffic increased during the pilot, the deployment had to scale horizontally by spinning up new nodes that could pick up and process some of the requests from the queue. Each new pod took ~9-10 minutes to come up, resulting in delayed responses and a poor experience for the end user.
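To see why a slow scale-up hurts, consider a backlog that one worker drains while a second worker boots. The sketch below uses a simple fluid approximation with queueing overheads ignored; the 200-request backlog and 6 s per request are illustrative assumptions loosely based on the ~5-7 s inference time mentioned earlier, not measurements from the pilot.

```python
def drain_time(n_requests: int, secs_per_request: float, scale_up_delay_s: float,
               initial_workers: int = 1, added_workers: int = 1) -> float:
    """Seconds to clear a backlog when extra workers join after a fixed delay.

    Fluid approximation: each worker completes one request every
    `secs_per_request` seconds; queueing and startup overheads are ignored.
    """
    # Requests finished by the initial workers before the new ones are ready.
    done_before = initial_workers * scale_up_delay_s / secs_per_request
    if done_before >= n_requests:
        # The backlog clears before the scale-up completes.
        return n_requests * secs_per_request / initial_workers
    remaining = n_requests - done_before
    total_workers = initial_workers + added_workers
    return scale_up_delay_s + remaining * secs_per_request / total_workers


# Illustrative backlog: 200 requests at ~6 s each, scaling from 1 to 2 workers.
print(drain_time(200, 6, 9 * 60))  # 870.0 (9-minute scale-up)
print(drain_time(200, 6, 2 * 60))  # 660.0 (2-minute scale-up)
```

Under these assumptions, cutting the scale-up delay from 9 minutes to 2 minutes shortens the drain by 210 s even though the final worker count is identical.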

Unsustainably high costs

GPU instances are very expensive due to the global chip shortage. On top of this, the cloud provider adds a 20-40% markup on ML instances, which made the instances too expensive for the team at the scale at which they wanted to run the project.
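Note the direction of the markup arithmetic: a markup m on the base price means that removing it saves m / (1 + m) of the marked-up bill, not m itself. A minimal sketch using the two ends of the 20-40% range stated above:

```python
def savings_from_removing_markup(markup: float) -> float:
    """Fraction of spend saved by paying the base price instead of base * (1 + markup)."""
    return 1 - 1 / (1 + markup)


for m in (0.20, 0.40):
    print(f"{m:.0%} markup -> {savings_from_removing_markup(m):.1%} saved")
# 20% markup -> 16.7% saved
# 40% markup -> 28.6% saved
```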

The system was ready for deployment with TrueFoundry in less than a week

When we met the Vachan Samiksha team, they were between their first and second pilots. The second pilot was less than a week away, and we had to:

Set up the TrueFoundry platform on their cloud infrastructure (the data is highly sensitive, and none of it was allowed to leave the project’s VPC)
Onboard the team and walk them through the platform's functionality
Migrate the Vachan Samiksha application to the platform
Load test the application and benchmark horizontal scaling


Pilot was ready to be shipped with TrueFoundry in <1 Week

In the days before the pilot:

Platform Installation

Our team helped the Wadhwani AI team install the platform on their own raw Kubernetes cluster. Both the control plane and the workload cluster were installed on their own infrastructure, so all data, the UI for interacting with the platform, and the workload processes for training and deploying models remained within their own VPC. The platform also complied with all of the company's security rules and practices.

Training and Onboarding

We helped the team understand how the different components interact during the training and onboarding process. We walked them through how to set up resources, configure autoscaling, and deploy the model.

Migration

The Wadhwani AI team was able to migrate the application on its own with minimal help from the TrueFoundry team. This was done in a 1-hour call with the team.

Testing

After the application was deployed, the team started testing it under production-level load. The team independently scaled the application to more than 100 nodes by changing a single setting in the TrueFoundry UI, 5X their previous highest achievable scale. They also benchmarked the speed of node scaling, which was 3-4X faster than with the managed ML service.

Shipping

With the load tests done, the team deployed the pilot application and was ready to roll it out in the second phase of the pilot, which covered 1,000 schools, 9,000 teachers, and over 2 lakh (200,000) students.

More control at a much lower cost with TrueFoundry

Application Architecture with TrueFoundry

With less than 10 hours of effort, the Wadhwani AI team realized significant improvements in speed, control, and cost. Some of the major changes were:

More Control, Visibility, and Developer Independence

Data scientists and machine learning engineers could now configure several elements themselves that had previously been either difficult to configure through the cloud provider's console or dependent on the engineering team:

Configuring the GPU node autoscaling policy

The team configured autoscaling based on queue length and raised the maximum number of replicas/nodes to 70, up from the previous limit of 20.
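Queue-length-based scaling is commonly expressed as "run enough replicas that each has at most N queued requests", clamped to a configured range. The sketch below shows that rule with the 70-replica ceiling from the policy above; `target_backlog_per_replica` and the floor of 1 replica are hypothetical values, and this is an illustration of the general technique, not TrueFoundry's implementation.

```python
import math


def desired_replicas(queue_length: int, target_backlog_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 70) -> int:
    """Replicas needed so each one has at most `target_backlog_per_replica` queued requests."""
    wanted = math.ceil(queue_length / target_backlog_per_replica)
    # Clamp to the configured range (the 70 ceiling replaced the old limit of 20).
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(0, 10))       # 1  (empty queue: stay at the floor)
print(desired_replicas(55, 10))      # 6
print(desired_replicas(10_000, 10))  # 70 (capped at the maximum)
```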

Setting up time-based autoscaling

Since most of the pilot traffic came in during school hours, when teachers interacted with students, there were minimal requests, if any, during the evening and night. The team set up a scaling schedule that scaled the pods down to a minimum during these off hours (evenings and nights), which saved about 15-20% of the pilot cost.
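A time-based schedule can be modeled as a replica floor that depends on the clock. In the hypothetical sketch below, school hours are assumed to be 08:00-17:00, Monday to Saturday, in local time, and the floors of 10 and 0 replicas are placeholders; the source does not state the actual schedule or floors the team used.

```python
from datetime import datetime


def scheduled_min_replicas(now: datetime, peak_min: int = 10, off_peak_min: int = 0,
                           start_hour: int = 8, end_hour: int = 17,
                           school_days: tuple = (0, 1, 2, 3, 4, 5)) -> int:
    """Replica floor for the given local time (datetime.weekday(): Monday == 0)."""
    in_school_hours = now.weekday() in school_days and start_hour <= now.hour < end_hour
    return peak_min if in_school_hours else off_peak_min


print(scheduled_min_replicas(datetime(2024, 1, 15, 10)))  # Monday 10:00 -> 10
print(scheduled_min_replicas(datetime(2024, 1, 15, 21)))  # Monday 21:00 -> 0
```

In practice such a floor would be combined with queue-based scaling, for example by taking the larger of the scheduled floor and the queue-derived replica count.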

Utilization metrics and suggestions

The team could easily monitor traffic, resource utilization, and responses directly from the TrueFoundry UI. They also received suggestions from the platform whenever resources were overprovisioned or underprovisioned.
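The platform's actual recommendation logic is not described here, but the basic idea of an over/under-provisioning check can be sketched as a threshold on average utilization. The 30% and 85% thresholds and the message wording below are hypothetical.

```python
from statistics import mean


def provisioning_hint(utilization_samples, low: float = 0.30, high: float = 0.85) -> str:
    """Classify average utilization (0.0-1.0) as over-, under-, or right-provisioned."""
    avg = mean(utilization_samples)
    if avg < low:
        return "overprovisioned: consider fewer replicas or a smaller instance"
    if avg > high:
        return "underprovisioned: consider more replicas or a larger instance"
    return "ok"


print(provisioning_hint([0.10, 0.20, 0.15]))  # overprovisioned: ...
print(provisioning_hint([0.50, 0.60]))        # ok
```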


"For me the biggest differentiator working with TrueFoundry was the ease of usage and the quick response and support provided by the team. I was able to setup and migrate our entire code base in less than 1 day which was amazing. During the pilot and whenever we had any doubts or request the TrueFoundry team was available immediately to solve our doubts and support us. Besides these factors we are getting a massive cost reduction which is super helpful for the project."

- Jatin Agrawal, Machine Learning Scientist @ Wadhwani AI

TrueFoundry helped the team scale while reducing costs

5X faster scaling

To test scalability with TrueFoundry, the team sent a batch of 88 requests to the application and compared the performance of the cloud provider's managed ML service against TrueFoundry. All system settings, such as the scaling logic (based on the backlog queue length), the initial number of nodes, and the instance type, were kept the same.

We found that TrueFoundry could scale 78% faster than the managed ML service, giving users much faster responses. The end-to-end time to respond to a query also dropped by 40% with TrueFoundry.

Autoscaling Test Results (A10G GPU, 4 vCPUs, 2 workers, 88 requests)
Metric	Managed ML Service	TrueFoundry
Total time to process all 88 requests	660 s	395.9 s
Time to scale up (1 worker to 2 workers)	9 min	2 min
Time before the autoscaler was triggered	2 min 30 s	15 s
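Recomputing the relative reductions from the table is a quick consistency check: the scale-up time drops by about 78% and the total processing time by about 40%, matching the figures in the text. The dictionary keys are just labels for the table rows.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


# (managed ML service, TrueFoundry), all in seconds, taken from the table.
results = {
    "total time to process 88 requests": (660, 395.9),
    "time to scale up 1 -> 2 workers": (9 * 60, 2 * 60),
    "time before autoscaler triggered": (150, 15),
}

for metric, (managed, tfy) in results.items():
    print(f"{metric}: {reduction_pct(managed, tfy):.1f}% lower, "
          f"{managed / tfy:.1f}x shorter")
```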

50% lower cost

The cost the team incurred for the pilot dropped by roughly 50% after moving to TrueFoundry, thanks to the following contributing factors:

~25-30% reduction from using raw Kubernetes: Managed ML instances carry a 25-40% price markup over the same instance provisioned directly on raw Kubernetes. Since TrueFoundry runs directly on K8s, the team saved a lot on this front.
~15-20% reduction from time-based autoscaling: The team scheduled pods to scale down when they expected traffic to the application to drop. This saved the team 15-20% of its cloud costs.
~20-30% reduction from using spot instances: Spot instances are cloud providers' unused infrastructure, offered at 50-60% discounts. By enabling a simple flag in the UI, the team could use a mix of spot and on-demand instances. Spot instances risk being deprovisioned, but TrueFoundry has built a reliability layer that manages the mix of on-demand and spot instances so that users get a reliable level of availability even when using spot instances.
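The three reductions do not simply add up. Assuming (this is our assumption, not stated in the source) that each reduction applies to the cost left after the others, they compound multiplicatively: the low ends give about 49%, consistent with the roughly 50% figure. The second helper shows how a spot/on-demand mix translates into a blended cost; the 60% spot share is a hypothetical value.

```python
from math import prod


def combined_reduction(reductions) -> float:
    """Total fractional saving when each reduction applies to what the others left."""
    return 1 - prod(1 - r for r in reductions)


def blended_cost_factor(spot_fraction: float, spot_discount: float) -> float:
    """Cost relative to all-on-demand when `spot_fraction` of capacity runs on spot."""
    return (1 - spot_fraction) + spot_fraction * (1 - spot_discount)


low = combined_reduction([0.25, 0.15, 0.20])   # raw K8s, time-based scaling, spot (low ends)
high = combined_reduction([0.30, 0.20, 0.30])  # same factors, high ends
print(f"combined reduction: {low:.0%} to {high:.0%}")
print(f"60% spot at a 50% discount -> {blended_cost_factor(0.6, 0.5):.0%} of on-demand cost")
```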

High GPU availability at lower cost

While the managed ML service was limited by the availability of GPU instances in the same cloud provider region, TrueFoundry can add worker nodes to the system from any region or cloud provider.
This means:

High GPU availability across multiple cloud providers and regions: Users can run nodes in a different cloud region with higher GPU availability, or with other cloud providers such as AWS, E2E Networks, RunPod, Azure, GCP, or others. This is critical because many companies face GPU quota limits, and this kind of fallback support is essential for system reliability.
Cost reduction: GPU instance pricing differs between cloud providers, by as much as 40-80% from one provider to another. TrueFoundry lets users connect any GPU provider to a single control panel and scale seamlessly across these cloud vendors, with the option of choosing a cheaper vendor when it has capacity, to save costs.
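Choosing a cheaper vendor "when it has capacity" amounts to filtering offers by availability and then minimizing price. A minimal sketch; the provider names, regions, and prices are entirely hypothetical placeholders, and real placement logic would also weigh latency, data residency, and spot risk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GpuOffer:
    provider: str
    region: str
    price_per_gpu_hour: float  # hypothetical prices, for illustration only
    available_gpus: int


def pick_offer(offers: List[GpuOffer], gpus_needed: int) -> Optional[GpuOffer]:
    """Cheapest offer that can supply `gpus_needed` GPUs, or None if none can."""
    candidates = [o for o in offers if o.available_gpus >= gpus_needed]
    return min(candidates, key=lambda o: o.price_per_gpu_hour, default=None)


offers = [
    GpuOffer("provider-a", "region-1", 1.00, 2),
    GpuOffer("provider-b", "region-2", 0.80, 10),
    GpuOffer("provider-c", "region-3", 0.50, 1),
]
print(pick_offer(offers, 4))  # provider-b: the cheapest offer with 4+ GPUs free
```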

Use the best tools without restrictions

TrueFoundry integrates seamlessly with any tool the team wants to use. With the cloud provider, this was limited by the provider's design choices and native integrations. For example, the team wanted to use NATS to publish messages, a service the cloud provider did not natively offer at the time. TrueFoundry made these kinds of choices very easy for the Wadhwani AI team.
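As an illustration of the kind of integration this enables, the sketch below publishes an inference result (the output path plus a success/failure status, the same information SNS carried in the original setup) to NATS with the `nats-py` client. The subject name `vachan.results`, the JSON payload shape, and the server URL are hypothetical; only `nats.connect`, `publish`, and `drain` are `nats-py` calls.

```python
import json
from typing import Optional

SUBJECT = "vachan.results"  # hypothetical subject name


def build_result_message(request_id: str, ok: bool, output_path: Optional[str]) -> bytes:
    """Serialize an inference outcome as a JSON payload (hypothetical schema)."""
    payload = {
        "request_id": request_id,
        "status": "success" if ok else "failure",
        "output_path": output_path,
    }
    return json.dumps(payload).encode("utf-8")


async def publish_result(server_url: str, message: bytes) -> None:
    # Imported here so the rest of the sketch works without nats-py installed.
    import nats  # pip install nats-py

    nc = await nats.connect(server_url)
    await nc.publish(SUBJECT, message)
    await nc.drain()  # flush pending messages, then close the connection
```

With a NATS server running, `asyncio.run(publish_result("nats://localhost:4222", build_result_message("req-0", True, "outputs/audio-0.json")))` would send one message to any subscribers on that subject.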
