True ML Talks #2 - Machine Learning Workflow @ Stitch Fix

Published: August 6, 2024


We received a very encouraging response to our first episode of True ML Talks. In this series, we dive deep into the ML pipeline of a few leading ML companies, and in today's episode, we are speaking with Stefan Krawczyk.

Stefan is building DAGWorks, a collaborative open-source platform for data science teams to build and maintain model pipelines, plugging into existing MLOps and data infrastructure (read their YC Launch). He has over 15 years of experience at companies such as Nextdoor, LinkedIn, and Stitch Fix in the field of data and machine learning. He previously led the Model Lifecycle team at Stitch Fix, where he gained extensive experience building self-service tooling for an internal MLOps machine learning platform. He's also a regular conference speaker and author of the popular open-source framework, Hamilton.

📌

Our conversation with Stefan revolves around four key themes:
1. Machine learning use cases for the business.
2. How Stitch Fix's team is structured to optimize business outcomes.
3. Challenges faced in building out the ML stack, including those specific to the industry.
4. An overview of the cutting-edge innovations applied while building and scaling the ML infrastructure.

Watch the full episode below:

Machine Learning Use Cases and Workflow @ Stitch Fix

ML Use Cases @ Stitch Fix: An Online Personal Styling Service

Recommendation system → Models that recommend clothing items to clients, the core of the online personal styling service.
Forecasting and simulation → When buyers decide how many clothes to buy, they must determine which sizes to stock and which population to serve. A cascade of internal forecasts and simulations supports these decisions. The algorithms team at Stitch Fix builds the models behind it, providing a breakdown of size and allocation ranges. Read more about it here
Internal enterprise resource planning system → Stitch Fix has essentially built its own internal ERP system using algorithms. This system helps plan for inventory, warehousing, and optimizing shipping rates. Read more about it here.
Route optimization through the warehouse → Stitch Fix has a team that optimizes the routes through the warehouse, ensuring that the shortest path is taken to pick orders. This optimization helps in minimizing the cost and processing time for orders.
Warehouse logistics → The algorithms team at Stitch Fix also works on optimizing the process of picking clothes from bins to satisfy orders. Stitch Fix can save costs and minimize order processing times by speeding up this process.

ML System Workflow at Stitch Fix

Stitch Fix has two teams working on its machine learning (ML) systems: the data science team and the platform team.

Data Science Team: The data science team, organized into verticals helping different parts of the business, is tasked with building models to help other teams in the business make decisions.
Platform Team: The Platform Team builds abstractions and tooling so that the data scientists do not have to do as much engineering to get their work done. The team is split into groups covering Spark, Hadoop, Kafka, infrastructure, the orchestration system, and environments. There is also a team responsible for the recommendations stack and microservice deployment, as well as A/B testing and experimentation. The team focuses on operationalization, making it easier to train, deploy, and maintain models.

👉

The Data Science team is responsible for owning the models and owning the results. The Platform team ensures the availability of deployment infrastructure and its components.

There is no "handoff" of the model between the Data Science team and the Platform team, which allows many more iterations on the model.

Handling Infrastructure Allocation

Infrastructure allocation was handled by the Platform team, which generally owned the quotas for what was available. Data scientists could request nodes or Spark clusters through a UI. Some accounting was done to ensure that costs did not spiral without justification. The platform team tried to let people get what they needed without much manual effort, while the teams that owned the quotas kept costs under control.

Unique Challenges in Machine Learning and MLOps at Stitch Fix

At Stitch Fix, over 100 data scientists worked on developing models. Each data scientist owned a model with an average lifespan of at least two years, so many teams were building many models, and enabling ownership for each team was a significant challenge. The platform team at Stitch Fix built many solutions in-house and bought only a few, because of the heterogeneity in the libraries and frameworks used. Standardization would have made some platforms easier, but different teams had different problems and libraries that were better suited to those problems. The platform team had to build its own solutions, like Model Envelope and Hamilton, because no open-source solutions fit its needs well.
While it is easy to create models, someone has to maintain all the ETLs, and if that person leaves the company, it can cause problems. Throwing away code is wasteful, and new team members are slowed down until they understand what was there before them. Additionally, since people often use other teams' models as features, there are interdependencies between ETLs that need to be taken into account.

"One reason why I stayed so long at Stitch Fix was precisely because of those challenges and figuring out how to solve them." - Stefan

Innovations in Stitch Fix ML Platform

Model Envelopes: Stitch Fix's Platform team built a system that enabled data scientists to package their models with other necessary components and ship them easily. It is an API-based system that captures the model and its accompanying components, and with one button click and a little extra configuration, data scientists can deploy their models to production in under an hour. The system also enabled batch jobs and made it easy to run models in a distributed fashion on Spark.
Configuration-Driven ETLs: Stitch Fix's Platform team used YAML and Jinja to enable data scientists to describe the ETL process, which included SQL and Python code for model fitting. The idea was to abstract the orchestration system and allow the platform team to swap different components out of the Docker container, which made it easy to manage and maintain the code base.
Hamilton: Hamilton is a declarative micro-framework for describing data flows. Stitch Fix's Platform team developed it to help with feature engineering, especially for time series forecasting, where it is easy to end up with thousands of features. Hamilton helps structure the problem, similar to what dbt does for SQL, but for Python transforms. It enables the data flow to be described without managing Python scripts. Functions declare a data flow, and the definitions can be easily shared and run in offline and online contexts.
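The idea that "functions declare a data flow" can be illustrated with a minimal sketch in plain Python. This is not Hamilton's API: the `execute` helper and the feature names are hypothetical, chosen only to show how a function's name can act as an output and its parameter names as its dependencies.

```python
import inspect


# Each function name is an output; each parameter name is a dependency.
def spend_per_signup(spend: float, signups: int) -> float:
    return spend / signups


def spend_per_signup_cents(spend_per_signup: float) -> float:
    return spend_per_signup * 100


def execute(functions, inputs, wanted):
    """Compute `wanted` by recursively resolving each function's parameters."""
    registry = {fn.__name__: fn for fn in functions}
    cache = dict(inputs)  # raw inputs are treated as already computed

    def resolve(name):
        if name not in cache:
            if name not in registry:
                raise KeyError(f"no input or function named {name!r}")
            fn = registry[name]
            args = {p: resolve(p) for p in inspect.signature(fn).parameters}
            cache[name] = fn(**args)
        return cache[name]

    return {name: resolve(name) for name in wanted}


result = execute(
    [spend_per_signup, spend_per_signup_cents],
    inputs={"spend": 50.0, "signups": 10},
    wanted=["spend_per_signup_cents"],
)
```

Because the dependency graph is derived from the signatures, the same function definitions can be handed to different runners, which is one way to read the "offline and online contexts" point above.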

👉

For Docker component swapping, the platform team tried to create a golden API where data scientists could describe what they wanted to happen without having to worry about platform dependencies. This was done through configuration-driven model pipelines, where data scientists could provide text containing only the subset of settings they wanted to change. The platform team could then change things without requiring the data science team to update or upgrade their Docker containers. The platform team could also upgrade logging or meta information without users having to redeploy or rewrite their pipelines. This freed data science teams from managing upgrades and let the platform team roll out changes without requiring migrations from them.
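A configuration-driven pipeline of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not Stitch Fix's system: the config schema, step types, and runner functions are hypothetical, and JSON with `string.Template` stands in for the YAML and Jinja mentioned earlier so the example needs only the standard library.

```python
import json
from string import Template

# Hypothetical config a data scientist might write (schema is an assumption).
CONFIG = """
{
  "params": {"table": "orders", "day": "2024-01-01"},
  "steps": [
    {"name": "extract", "type": "sql",
     "body": "SELECT * FROM $table WHERE day = '$day'"},
    {"name": "fit", "type": "python", "body": "train_model"}
  ]
}
"""


def run_sql(body):
    # Platform-owned: the engine behind this can change without config edits.
    return f"[sql] {body}"


def run_python(body):
    # Platform-owned: stands in for invoking a user-supplied function.
    return f"[python] call {body}()"


# The platform team owns this mapping and can swap implementations freely.
RUNNERS = {"sql": run_sql, "python": run_python}


def run_pipeline(config_text, runners=RUNNERS):
    """Render each step's template with the params and dispatch it by type."""
    config = json.loads(config_text)
    log = []
    for step in config["steps"]:
        body = Template(step["body"]).substitute(config["params"])
        log.append((step["name"], runners[step["type"]](body)))
    return log


log = run_pipeline(CONFIG)
```

The data scientist's text only names what should happen; swapping `RUNNERS` changes how it happens without touching that text.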

Improving Docker container build and debugging time in remote development: strategies employed at Stitch Fix

Caching to speed up container creation - Stitch Fix's Platform team attempted to speed up the creation of Docker containers by caching libraries and environments that were frequently used. For instance, they added caching so that if the same requirements were needed again, they would not have to be reinstalled. They also saved environment files, zipped them, and then pulled them back to speed up the process.
Local development to minimize container build and debugging time - Stitch Fix's Platform team encouraged developers to develop locally or find ways to avoid waiting for a full Docker container build, run, and debug cycle. They enabled local loops to run for individual steps of the ETL by pulling down some data and caching it for developers, among other usability features. This approach helped to speed up the development and debugging process.
Exploring new approaches to improve container build and debug time - While Stitch Fix's Platform team attempted to make improvements to the build and debugging time with caching and local development, they acknowledged that the process could be faster. Some developers are working on changes to Docker to improve this process, especially for models. This approach involves changing Docker and rebooting it to make it better and more efficient.
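The requirements caching described above can be sketched as follows. This is a simplified illustration, not the platform's actual implementation: the `EnvironmentCache` class, the key format, and the in-memory store are all assumptions, and a real system would persist the zipped environments somewhere shared.

```python
import hashlib


def environment_cache_key(requirements_text: str, python_version: str) -> str:
    """Derive a stable key from the requirement set, ignoring order and comments."""
    lines = sorted(
        line.strip()
        for line in requirements_text.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )
    payload = python_version + "\n" + "\n".join(lines)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class EnvironmentCache:
    """In-memory stand-in for a store of zipped environments (hypothetical)."""

    def __init__(self):
        self._archives = {}
        self.builds = 0  # counts how often the expensive build actually ran

    def get_or_build(self, requirements_text, python_version, build):
        key = environment_cache_key(requirements_text, python_version)
        if key not in self._archives:
            self._archives[key] = build(requirements_text)
            self.builds += 1
        return self._archives[key]


def fake_build(requirements_text):
    # Placeholder for installing packages and zipping the environment.
    return b"zipped-env"


cache = EnvironmentCache()
cache.get_or_build("pandas==2.2.0\nnumpy==1.26.4\n", "3.11", fake_build)
# Same requirements in a different order, plus a comment: served from cache.
cache.get_or_build("# pinned\nnumpy==1.26.4\npandas==2.2.0\n", "3.11", fake_build)
```

Keying on the normalized requirement set means a cosmetic edit to the requirements file does not trigger a rebuild, while a changed pin or Python version does.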

MLOps Tooling Recommendations from Stefan Krawczyk

It is important to pick tools based on business impact and SLAs. Things that can be done with a single node should be optimized with an orchestration system. For data versioning and model registries, saving things to S3 under a structured path layout and storing metadata alongside them may work. For a model registry, open-source tools like MLflow should help, but there are also hosted management solutions like TrueFoundry.
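As a rough sketch of the "structured path plus metadata" approach, the snippet below plans which objects to write for one model version. The path layout, file names, and metadata fields are assumptions made for illustration; the actual upload (for example through an S3 client) is deliberately left out.

```python
import json


def artifact_prefix(team, model_name, version):
    """Build a predictable prefix so any version can be found by convention."""
    return f"models/{team}/{model_name}/v{version}"


def plan_upload(team, model_name, version, model_bytes, metadata):
    """Return a mapping of object key to bytes: the model plus its metadata."""
    prefix = artifact_prefix(team, model_name, version)
    return {
        f"{prefix}/model.bin": model_bytes,
        f"{prefix}/metadata.json": json.dumps(metadata, sort_keys=True).encode("utf-8"),
    }


objects = plan_upload(
    team="forecasting",
    model_name="size_allocation",
    version=3,
    model_bytes=b"serialized-model",
    metadata={"trained_on": "2024-01-01", "framework": "sklearn", "rmse": 0.42},
)
```

Storing the metadata next to the artifact keeps the two from drifting apart, and listing a prefix is enough to enumerate the versions of a model.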

It is important to have an A/B testing system in the stack to understand the value that a model brings to the business. This helps in deciding where to invest in MLOps practices based on the impact the model has.

MLOps Tool Recommendations

Consider the heterogeneity in libraries and frameworks: If there is significant heterogeneity in libraries and frameworks used, building in-house solutions may be necessary as there may not be off-the-shelf solutions that fit your needs.
Standardization may make some platforms easier: While different teams may have different problems, standardizing on certain tools and processes can make MLOps platforms easier to build and maintain.
No open-source solutions that fit your needs? Build your own: If there are no open-source solutions that fit your needs well, building your own solutions, like Model Envelope and Hamilton, may be necessary to achieve your MLOps goals.

Challenges around ETL in Machine Learning

In machine learning, Extract, Transform, Load (ETL) is a critical process for transforming raw data into valuable insights. However, ETL in machine learning systems poses several challenges that need to be addressed.

ETLs in machine learning systems can fail silently due to upstream changes, and it is hard for data practitioners to trace the inputs to a model, which makes ETLs difficult to maintain. The complexity of ETLs also grows over time, making it difficult for teams to keep up.
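One common guard against this kind of silent breakage is to check inputs at the boundary of an ETL step. The sketch below is a generic illustration, not something described in the talk: the `find_schema_problems` helper, the column names, and the expected types are all hypothetical.

```python
def find_schema_problems(rows, expected_columns):
    """Return readable problems instead of letting bad rows flow downstream."""
    if not rows:
        return ["no rows received from upstream"]
    problems = []
    for index, row in enumerate(rows):
        missing = sorted(set(expected_columns) - set(row))
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        for column, expected_type in expected_columns.items():
            if column in row and not isinstance(row[column], expected_type):
                actual = type(row[column]).__name__
                problems.append(
                    f"row {index}: column {column!r} is {actual}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


EXPECTED = {"client_id": int, "size": str, "price": float}
rows = [
    {"client_id": 1, "size": "M", "price": 49.0},
    {"client_id": 2, "size": "L"},                 # upstream dropped a column
    {"client_id": 3, "size": "S", "price": "49"},  # upstream changed a type
]
problems = find_schema_problems(rows, EXPECTED)
```

Failing loudly on a non-empty `problems` list turns a silent upstream change into an error that points at the affected input.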

DAGWorks: An Open-Source ETL Platform for Data Science Teams

To address the challenges mentioned in the previous section, DAGWorks is building an open-source ETL platform for data science teams that reduces their reliance on engineering. The open-source part is built around Hamilton, described above. Hamilton enables data practitioners without software engineering backgrounds to write code that goes into production and to manage machine learning ETL artifacts on their existing infrastructure. Hamilton also provides a central feature definition store and lineage, which makes debugging easier and can be used for compliance cases.

Hamilton is designed to be an abstraction layer that can be used with various orchestration tools such as Airflow or Argo. Stefan believes that data scientists should not have to care about the topology of where things run and instead, focus on building models and iterations. DAGWorks is trying to figure out ways and abstractions to make it easier to change data quality providers without having to rewrite things.

While Hamilton complements tools like Metaflow, it is not trying to replace them. Instead, it enables people to be more productive on top of those systems by allowing them to model the micro-level data flow within a task. Overall, DAGWorks is trying to make it easier for data science teams to manage and maintain machine learning ETL artifacts.

Below are some interesting reads from Stefan & his team:

What I Learned Building Platforms at Stitch Fix

Deployment for Free - A Machine Learning Platform for Stitch Fix's Data Scientists

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Read our previous post in the series


True ML Talks #1 - Machine Learning Workflow @ Gong

Learn how Gong, a revenue intelligence platform, uses machine learning to analyze customer interactions and provide insights to revenue teams. Discover the challenges of managing ML workflows, data privacy, and data security.


Keep watching the TrueML YouTube series and reading the TrueML blog series.

TrueFoundry is an ML deployment PaaS over Kubernetes that speeds up developer workflows, giving developers full flexibility in testing and deploying models while ensuring full security and control for the infra team. Through our platform, we enable machine learning teams to deploy and monitor models in 15 minutes with 100% reliability, scalability, and the ability to roll back in seconds, allowing them to save cost and release models to production faster, enabling real business value realisation.

Discuss your ML pipeline challenges with us here
