True ML Talks #3 - Machine Learning Platform @ Facebook

March 30, 2023

We are back with another episode of True ML Talks. In this episode, we dive deep into FBLearner Flow, Facebook's AI backbone, with Aditya Kalro.

Aditya is currently a Senior Engineering Manager at Google in the Identity team. Before that, he was at Facebook, where he led the build-out of FBLearner Flow, the company's ML workflow management platform, which we discuss in detail in this episode.


Our conversation with Aditya covers the following aspects:
- Overview of FBLearner Flow.
- A/B Testing and Shadow Testing in large-scale systems.
- Bridging the gap between Research and Production.
- Optimizing cost and latency in AI Inference.
- Architecture of FBLearner Flow.
- Bridging the Gap Between Software and ML Deployment Platforms.
- Importance of Monitoring and Distributed training.
- Core principles for building an ML system for scale.

Scaling AI workflows

FBLearner Flow is a machine learning workflow management platform that Facebook built to manage its ML infrastructure. Aditya led the development of the platform and oversaw its growth to support thousands of training runs per day across 700-800 teams.

  1. FBLearner Flow was initially developed as a workflow mechanism for ML, but it evolved into a generic workflow mechanism that could handle a variety of tasks, including mobile app builds.
  2. The platform supported thousands of training runs per day and deployed hundreds of thousands of models at the same time.
  3. FBLearner Flow was connected to an inference platform and scaled up to hundreds and thousands of machines over time.
  4. The platform was designed to be generic enough to be applied to any domain, making it a versatile tool for managing ML workflows.

Evolution of FBLearner Flow: A Journey of Making ML Engineers More Productive

Below are three unique and relevant aspects of FBLearner Flow's evolution:

  1. Philosophy: The primary goal of FBLearner Flow was to make ML engineers more productive. To that end, the platform was developed in Python, even though the language was not well supported at Facebook at the time. The sanctity of the experiment was also emphasized, which meant that an experiment had to be completely predictable. To achieve this, the platform allocated the same amount of memory and CPU to each workflow or operator.
  2. Workflows and Operators: FBLearner Flow started with the concept of a workflow, which is how an ML engineer expresses everything that needs to happen. A workflow combines several operators: each component of the workflow was broken down into operators that could be distributed across different machines, and the platform made it easy for users to connect them and move data from one machine to another.
  3. Experiment Management: The platform also provided experiment management tools that let users manage their experiments and debug errors. FBLearner Flow's UI made it easy to see where an error occurred and provided the logs needed to figure out why it happened. This approach also helped users manage a large number of experiments.
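
To make the workflow and operator idea concrete, here is a minimal Python sketch. It is not FBLearner Flow's actual API: the names `Operator`, `Workflow`, `cpu`, and `memory_gb` are hypothetical, and the "training" is a toy stand-in. It only illustrates operators that declare a fixed resource request and a workflow that chains them, passing each output to the next step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Operator:
    """One step of a workflow with a fixed (hypothetical) resource request."""
    name: str
    fn: Callable[[Any], Any]
    cpu: int = 1          # cores reserved for every run of this operator
    memory_gb: int = 4    # memory reserved for every run of this operator


@dataclass
class Workflow:
    """An ordered chain of operators; each output feeds the next input."""
    name: str
    operators: List[Operator] = field(default_factory=list)

    def run(self, data: Any) -> Any:
        for op in self.operators:
            # A real platform would schedule each operator on some machine
            # with exactly op.cpu / op.memory_gb; here we simply call it.
            data = op.fn(data)
        return data


def split(rows):
    cut = int(len(rows) * 0.8)
    return {"train": rows[:cut], "test": rows[cut:]}


def train(parts):
    # "Model" = mean of the training values; a stand-in for real training.
    mean = sum(parts["train"]) / len(parts["train"])
    return {"model": mean, "test": parts["test"]}


def evaluate(state):
    errors = [abs(x - state["model"]) for x in state["test"]]
    return {"model": state["model"], "mae": sum(errors) / len(errors)}


workflow = Workflow("toy_training", [
    Operator("split", split),
    Operator("train", train, cpu=4, memory_gb=16),
    Operator("evaluate", evaluate),
])
result = workflow.run([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because every operator carries the same resource request on every run, two runs of the same workflow differ only in their inputs, which is the property the "sanctity of the experiment" principle is after.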

Evaluation System

Machine learning models need to be evaluated to determine their ability to generalize well on unseen data. This is where the evaluation system comes in. There are two parts to the evaluation system: batch evaluation and online evaluation.

  1. Batch Evaluation: Batch evaluation, also known as offline evaluation, involves reading data, processing it, and writing the results to another entity. This approach is well understood and is ideal for models with few parameters. However, it may not be effective when dealing with thousands of parameters and features.
  2. Online Evaluation: The online evaluation system is a bit more complex. It lets users track the deployment of a particular ML model and send an example to the model to get immediate results, which makes it a great tool for ML practitioners to test their models quickly, even experimental ones. The Ads team at Facebook had an inference system that was very specific to them. To make it available to the rest of Facebook, they created the first version of the inference platform, which was closely tied to FBLearner Flow.
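
The difference between the two evaluation paths can be sketched in a few lines of Python. Everything here is illustrative: `toy_model`, the feature names `clicks` and `likes`, and the in-memory "tables" are assumptions, not part of Facebook's system.

```python
from typing import Callable, Dict, Iterable, List

Model = Callable[[Dict[str, float]], float]


def toy_model(features: Dict[str, float]) -> float:
    """Stand-in model: a fixed linear score over two hypothetical features."""
    return 0.5 * features["clicks"] + 2.0 * features["likes"]


def batch_evaluate(model: Model, rows: Iterable[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: read every row, score it, return rows for an output table."""
    return [{**row, "score": model(row)} for row in rows]


def online_evaluate(model: Model, example: Dict[str, float]) -> float:
    """Online path: send one example to a deployed model, get a result back."""
    return model(example)


input_table = [{"clicks": 2.0, "likes": 1.0}, {"clicks": 0.0, "likes": 3.0}]
output_table = batch_evaluate(toy_model, input_table)
single_score = online_evaluate(toy_model, {"clicks": 4.0, "likes": 0.0})
```

In a real deployment the online path would be a network call to a serving endpoint rather than a local function call; the sketch keeps both paths local so the contrast stays visible.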

Building an Effective A/B Testing Framework for Machine Learning Models

Deploying machine learning models in real-world scenarios can be challenging, especially when dealing with large-scale systems like Facebook, where even a small error can have significant consequences. In this regard, building an effective A/B testing framework is crucial to ensure the optimal performance of machine learning models.

Two essential components of the FBLearner platform that helped achieve this goal were A/B testing and shadow testing. A/B testing compared two versions of a system or model to determine which one performed better, while shadow testing deployed the new model in parallel with the existing one to evaluate its performance without affecting the user experience. Together, they helped mitigate the risk of deploying a faulty model to production.
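
As a rough illustration of the two techniques, the Python sketch below uses two toy models. The function names, the hash-based bucketing, and the 50/50 split are assumptions made for the example; the source does not describe how Facebook implemented either mechanism.

```python
import hashlib
from typing import Dict, List


def production_model(x: float) -> float:
    return 2.0 * x


def candidate_model(x: float) -> float:
    return 2.0 * x + 0.1


def ab_bucket(user_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically assign a user to 'control' or 'treatment'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    fraction = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if fraction < treatment_share else "control"


def serve_ab(user_id: str, x: float) -> float:
    """A/B test: the user's bucket decides which model's output they see."""
    model = candidate_model if ab_bucket(user_id) == "treatment" else production_model
    return model(x)


shadow_log: List[Dict[str, float]] = []


def serve_with_shadow(x: float) -> float:
    """Shadow test: users always get the production output; the candidate
    runs on the same input and its output is only logged for comparison."""
    served = production_model(x)
    try:
        shadow = candidate_model(x)
        shadow_log.append({"input": x, "served": served, "shadow": shadow})
    except Exception:
        pass  # a failing candidate must never affect the served response
    return served


responses = [serve_with_shadow(x) for x in (1.0, 2.0, 3.0)]
max_gap = max(abs(r["shadow"] - r["served"]) for r in shadow_log)
```

The key design difference is visible in the return values: `serve_ab` exposes some users to the candidate, while `serve_with_shadow` never does, so a shadow test can run before any user-facing risk is taken.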

Another unique feature of the FBLearner platform was its ability to facilitate the exchange of models between ML practitioners and developers. It enabled developers to easily deploy the models to production and test them using the existing quick experiment infrastructure. This allowed them to compare the performance of their existing system with the newly deployed ML model quickly, ensuring optimal performance of the system.

How Facebook Bridged the Gap between Research and Production using FBLearner Flow

Facebook's AI research team faced a major challenge in bridging the gap between the needs of researchers and the production team. While researchers needed a system that was fast and allowed them to deploy new models quickly, the production team required stability, reliability, and predictability.


To address this challenge, Facebook built a Slurm-like interface on top of its machine learning platform.

Slurm is an open-source cluster workload manager, driven from the command line, that is used extensively in academia to schedule and manage experiments. By creating a similar command-line interface for the platform, Facebook made it easy for researchers to use the platform.

Despite the fundamental differences in the requirements of the two teams, having a common interface made it easier for researchers to migrate their models to FBLearner Flow for production. The system also gave them access to a large fleet of machines, rather than the small set of machines behind a typical research Slurm setup.

The Slurm-like interface on the platform allowed researchers to experiment with different models quickly and migrate them to the production environment when they were satisfied with the results.

Optimizing Cost and Latency in AI Inference: The Role of Containerization and Microservices

In the field of AI, achieving cost optimization without compromising on latency is a perpetual challenge. However, with the advent of new technologies and architectural designs, solutions have been developed to address this issue.


Containerization and microservices have proven effective in optimizing costs and reducing latency in AI inference.

Containerization is a method of packaging software code along with its dependencies into a single unit, known as a container. This container can be moved easily from one computing environment to another, making it highly scalable and flexible. By using containerization, organizations can pack multiple AI models into a single container, enabling them to be deployed and scaled quickly and efficiently.

Moreover, containerization also enables bin packing, which optimizes resource allocation by placing multiple containers on a single physical machine. This allows organizations to make the most of their available resources and reduce costs. In addition, with auto-scaling, organizations can quickly scale up or down based on demand, further optimizing costs.
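
To show what bin packing means in practice, here is a small Python sketch of the first-fit decreasing heuristic. This is a textbook heuristic chosen for illustration, not the scheduler Facebook used; the container names and memory sizes are made up.

```python
from typing import Dict, List


def first_fit_decreasing(container_mem_gb: Dict[str, int], machine_mem_gb: int) -> List[List[str]]:
    """Place containers on machines, opening a new machine only when needed."""
    machines: List[List[str]] = []   # container names per machine
    free: List[int] = []             # remaining memory per machine
    # Largest containers first, so small ones can fill the leftover gaps.
    for name, need in sorted(container_mem_gb.items(), key=lambda kv: -kv[1]):
        if need > machine_mem_gb:
            raise ValueError(f"{name} does not fit on any machine")
        for i, available in enumerate(free):
            if need <= available:
                machines[i].append(name)
                free[i] -= need
                break
        else:
            # No existing machine has room: start a new one.
            machines.append([name])
            free.append(machine_mem_gb - need)
    return machines


containers = {"ranker": 24, "spam": 8, "vision": 16, "nlp": 12, "ads": 4}
placement = first_fit_decreasing(containers, machine_mem_gb=32)
# placement == [["ranker", "spam"], ["vision", "nlp", "ads"]]
```

Five containers that would otherwise occupy five machines fit on two fully used 32 GB machines, which is where the cost saving comes from. Real schedulers pack on several dimensions at once (CPU, memory, GPU), but the idea is the same.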

Furthermore, microservices, which are small, independent components of an application, can be used to create an efficient and agile inference platform. By breaking down complex applications into smaller, modular services, each service can be independently scaled, managed, and updated. This not only makes the platform more resilient but also helps reduce latency.

Understanding the Architecture of FBLearner Flow: A Closer Look

Building a robust and scalable AI system requires a solid infrastructure. In this regard, Facebook's FBLearner Flow has been at the forefront of AI innovation, providing a unique solution for training and deploying AI models at scale.

The architecture of FBLearner Flow was largely built in-house, leveraging Facebook's existing infrastructure. They started with Kronos, an internal scheduler, but had to move towards containerization to address stampeding herd and noisy neighbor problems. The system was then built around the idea of operators and workflows, with Hive tables initially being used as channels for structured data, but file clusters and Blob storage eventually being used for non-structured data like images.

The system's execution mechanism was self-contained and versioned, allowing for comparisons between different versions of the model. Experiment management was made easier by the system's ability to change features, model versions, or training paradigms while keeping the output metrics and evaluation sets the same.

If FBLearner Flow were to be rebuilt today, Kubernetes and Kubeflow would be the preferred solutions. Kubeflow provides a more self-contained paradigm, making it easier to deploy, and it can use other connectors to connect to different pieces of infrastructure.

The inference platform was built on top of Tupperware, Facebook's services infrastructure, with each model being its own container. The auto-scaling feature was borrowed from Kubernetes to ensure the platform was able to scale up and down as required.
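
For reference, the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler is desired replicas = ceil(current replicas × current metric / target metric). The Python sketch below applies that rule with minimum and maximum bounds. How Facebook's inference platform adapted the idea is not described in the episode, so the function name, the bounds, and the example numbers are assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 100) -> int:
    """Scale the replica count in proportion to how far the observed
    metric (for example, average CPU utilization) is from its target."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so a metric spike or a quiet period cannot scale without bound.
    return max(min_replicas, min(max_replicas, raw))


scale_up = desired_replicas(4, current_metric=90.0, target_metric=60.0)     # 6
scale_down = desired_replicas(4, current_metric=20.0, target_metric=60.0)   # 2
capped = desired_replicas(10, current_metric=600.0, target_metric=50.0)     # 100
```

With one container per model, a rule like this lets a heavily used model grow its replica count while rarely used models shrink toward the minimum.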

Overall, the architecture of FBLearner Flow provides a unique solution for building AI systems at scale. It's a testament to the importance of infrastructure in building robust and scalable AI systems.

Bridging the Gap Between Software Engineering and Machine Learning Deployment Platforms

As the field of machine learning continues to grow and evolve, there is a growing need to develop platforms and processes for deploying machine learning models in production environments. However, many companies view machine learning platforms as separate from software engineering platforms, which can lead to confusion and inefficiencies.


While there are some differences in the tools used, the processes for developing and deploying ML models can be very similar to those used in software engineering.

One key aspect is the importance of being opinionated about how ML models are developed and deployed. By introducing concepts like testing and required steps, such as the train/test split, companies can streamline the ML development process and ensure that models are deployed in a consistent and effective manner.
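
A minimal Python sketch of what "opinionated" can look like: the pipeline below always performs a repeatable train/test split and always checks the held-out score before marking a model deployable. The function names, the 20% test fraction, and the toy model are illustrative assumptions, not a description of any specific platform.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def train_test_split(rows: Sequence[float], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[float], List[float]]:
    """Shuffle with a fixed seed so the split is repeatable, then cut."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def run_pipeline(rows: Sequence[float],
                 fit: Callable[[List[float]], float],
                 score: Callable[[float, List[float]], float],
                 min_score: float) -> Dict[str, object]:
    """Opinionated pipeline: the split and the held-out check are not optional."""
    train_rows, test_rows = train_test_split(rows)
    model = fit(train_rows)
    held_out = score(model, test_rows)
    return {"model": model, "test_score": held_out, "deployable": held_out >= min_score}


rows = [float(i) for i in range(10)]
report = run_pipeline(
    rows,
    fit=lambda train: sum(train) / len(train),  # toy "model" = mean of training data
    score=lambda model, test: -sum(abs(x - model) for x in test) / len(test),
    min_score=-100.0,
)
```

Users supply only `fit` and `score`; they cannot skip the split or the gate, which mirrors how required test stages work in a software CI pipeline.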

Another similarity is the need for monitoring and telemetry in both software engineering and ML deployment platforms. Just as developers monitor the performance of their applications and microservices, ML developers and MLOps engineers need to monitor the performance of their models and infrastructure.

By recognizing the commonalities between the two, and developing processes and tools that are consistent across both, companies can streamline their development and deployment processes, reduce errors and inefficiencies, and ensure that their ML models are deployed in a consistent and effective manner.

The Importance of Monitoring in ML Deployment

Monitoring is a critical aspect of any ML deployment process. It helps to ensure that the models are working as expected and delivering the desired results. Monitoring both the infrastructure and the output of the predictions is crucial to ensure that the system is working correctly. The metrics generation part of the FBLearner system auto-generated metrics for monitoring the performance of the models.

There is a need for customizable metrics that allow developers to monitor their models' performance effectively. With monitoring in place, developers can quickly identify and fix any issues that may arise during the deployment process, leading to better performance and more accurate predictions.

In conclusion, monitoring is an essential part of ML deployment, and it is crucial to have a system in place that allows developers to monitor their models effectively. Customizable metrics and close monitoring of the infrastructure and output of predictions can help ensure that ML models are working as expected, leading to better performance and more accurate predictions.

Distributed Training and its Impact on Workflow Architecture

Distributed training and distributed inference were major paradigm shifts for FBLearner Flow. Initially, the system was designed to work on a single machine, and training was meant to occur on that machine. However, as more data was added to the system, it became necessary to re-architect it to support distributed training.

The biggest challenge in this regard was around structured data. The team had to employ both model-parallel and data-parallel training, for structured and unstructured data respectively. They also had to add special rules for parameter servers, which were treated differently from the training workers themselves. The parameter servers would gather the output from each of the training workers and put it back together. The team experimented with several different paradigms, eventually settling on a parameter server paradigm that allowed for checkpointing.
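
The parameter server idea can be sketched in plain Python. This toy example covers only synchronous data-parallel training of a one-weight linear model; the class and function names are hypothetical, and the sketch says nothing about how FBLearner Flow's parameter servers were actually built.

```python
from typing import List, Tuple

Shard = List[Tuple[float, float]]  # (x, y) pairs


class ParameterServer:
    """Holds the shared weight and applies the averaged worker gradients."""

    def __init__(self, weight: float = 0.0, lr: float = 0.05) -> None:
        self.weight = weight
        self.lr = lr

    def apply(self, gradients: List[float]) -> None:
        self.weight -= self.lr * sum(gradients) / len(gradients)


def worker_gradient(weight: float, shard: Shard) -> float:
    """Gradient of mean squared error for y ~ weight * x on one data shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in shard) / len(shard)


# Data generated from y = 3x, split into two equally sized shards.
shards: List[Shard] = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

server = ParameterServer()
for step in range(200):
    # Each worker reads the current weight and computes a gradient on its shard.
    grads = [worker_gradient(server.weight, shard) for shard in shards]
    # The server gathers the gradients and puts them back together.
    server.apply(grads)
```

After the loop, `server.weight` has converged to approximately 3.0, the slope used to generate the data. In a real system each call to `worker_gradient` would run on a different machine, which is why the server needs its own scheduling and failure rules.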

Reliability became a major concern with distributed training: if even one machine failed, the entire workflow would fail. The team had to build APIs to enable checkpointing, as it wasn't completely automated. They also had to write a restart mechanism so that the workflow could resume from the appropriate checkpoint after a failure.
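
A checkpoint-and-restart loop of the kind described here might look like the following Python sketch. The file format, the function names, and the simulated failure are all assumptions for illustration; the source does not document the actual checkpointing APIs.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, step: int, weight: float) -> None:
    """Write to a temp file first, then rename, so a crash mid-write
    never leaves a half-written checkpoint behind."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"step": step, "weight": weight}, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str):
    if not os.path.exists(path):
        return 0, 0.0  # no checkpoint yet: fresh start
    with open(path) as f:
        state = json.load(f)
    return state["step"], state["weight"]


def train(path: str, total_steps: int, checkpoint_every: int = 10, fail_at=None) -> float:
    step, weight = load_checkpoint(path)  # resume from the last checkpoint
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated machine failure")
        weight += 0.5   # stand-in for one real training step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weight)
    return weight


ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
try:
    train(ckpt, total_steps=50, fail_at=27)
except RuntimeError:
    pass  # the first attempt dies at step 27
resumed_from, _ = load_checkpoint(ckpt)   # last saved step is 20
final_weight = train(ckpt, total_steps=50)
```

The restarted run redoes only steps 21 to 27 instead of all 27, and finishes with the same `final_weight` (25.0) an uninterrupted run would produce. The checkpoint interval trades wasted recomputation against the cost of writing state.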

While distributed training made it possible to train more complex models, it also made things harder for the team, as training now took 5-6 times longer than in the past. Nevertheless, the team was able to adapt to these changes and ensure that the system continued to work reliably.

Building an ML system for scale: Core principles for success

  1. Continuous feedback and disruption: To build a successful ML system, continuous feedback from users is essential. Aditya's team had an advantage as their customers were within the same company, enabling them to receive feedback and disrupt themselves. As such, building an ML system for scale requires continuous engagement with users, allowing them to provide feedback and adjust accordingly.
  2. Customer service orientation: An essential component of building an ML system for scale is ensuring that users' questions are answered promptly to unblock them. Aditya's team adopted a customer service orientation, which enabled them to address any issues and receive feedback quickly. This approach is critical as it fosters trust with users and enhances their experience.
  3. Evolution and adaptation: Finally, building an ML system for scale requires a willingness to evolve and adapt as ML is a field that changes rapidly. Aditya's team recognized that they could not design everything in a day and that the system's evolution would be continuous. Hence, organizations must be open to change and embrace new technologies as they emerge.

Below are some interesting reads on Facebook's Machine Learning:

  1. Introducing FBLearner Flow: Facebook’s AI backbone  
  2. TWIML podcast with Aditya Kalro
  3. Talk by Aditya & Pierre on Applied Machine Learning at scale

Read our previous post in the series.

Keep watching the TrueML YouTube series and reading the TrueML blog series.

TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost, release models to production faster, and realise real business value.
