TrueFoundry makes it really easy to deploy applications on Kubernetes clusters in your own cloud provider account. It does so by abstracting out the infrastructure components for data scientists and developers while enforcing the best practices from security, infrastructure, and cost optimization perspectives.
The key motivations behind the current architecture of TrueFoundry are:
GKE Autopilot requires resource requests and limits to have the same values, while AKS, EKS and GKE Standard do not.
EKS and GKE offer automatic node provisioning, while AKS does not.
Node provisioning time is quite high for AKS, which leads to very slow autoscaling behaviour.
Being cloud-native gives us access to the different hardware offered by each cloud provider, especially GPUs.
4. Integrate rather than reinvent: TrueFoundry integrates with most of the commonly used systems instead of reinventing the wheel. This philosophy drives many of our architecture decisions. It sometimes makes the journey harder for us, since it's not always easy to build integrations where solid APIs are not available, but we do the hard work of building those APIs and interfaces so that our users don't need to learn yet another tool.
Machine learning requires a complicated stack to be set up before data scientists can experiment and deliver rapidly.
Ideally, developers should spend most of their time in the top green layer, while the lower layers are completely abstracted away from them. TrueFoundry provides an open and customizable stack that works with what you are currently using and helps data scientists iterate on applications without focusing on the underlying infrastructure layers.
In the diagram below, TrueFoundry provides the Model Training, Serving and Model Registry to make it easier for data scientists to build, track and deploy models.
The key integrations that TrueFoundry currently provides are:
TrueFoundry provides a split-plane architecture comprising the following major components:
3. Client Interfaces: Developers and data scientists can interact with the platform using the Python SDK, the web UI, or the TrueFoundry CLIs (servicefoundry and mlfoundry). TrueFoundry also exposes APIs for clients to build automation workflows, which are documented here: https://docs.truefoundry.com/reference
4. Authentication Server: There is a central authentication and licensing server that keeps track of all the organizations and their members. This server is hosted by TrueFoundry and can also integrate with external IDPs to provide a single sign-on experience to all our users.
Secure Networking
The tfy-agent component has no ingress and is responsible for initiating the connection to the control plane. It sets up a persistent encrypted connection to the control plane, over which all communication happens. This allows the system to work even if the compute-plane clusters are private or in different VPCs. The only constraint is that the control plane URL must be accessible from all the compute-plane clusters. You can also use Kubernetes RBAC to restrict tfy-agent's permissions to certain namespaces.
Soft dependency on TrueFoundry control plane
The TrueFoundry control plane is only responsible for orchestrating deployments to the compute plane. It does not lie in the critical path of the request flow to the deployed services, so even if you remove the TrueFoundry control plane, all the deployed services continue to run smoothly. This decouples the reliability of your services from TrueFoundry.
Efficient Multi-Cluster Management
The TrueFoundry control plane provides a single pane of glass to view all the Kubernetes clusters in the company, across cloud providers and on-prem. This also makes it easy to move workloads from one cluster to another using our Clone and Promote feature.
Lower cost and maintenance
The TrueFoundry agent is a very lightweight component that runs on every cluster, while only a single copy of the control plane is needed. The control plane needs more resources (3 CPU, 6 GB RAM), while the agent needs only 0.2 CPU and 400 MB RAM. As traffic and the number of teams grow, we usually need to add more clusters based on regions or teams, but the control plane doesn't need to be replicated, which keeps cost and maintenance low.
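The resource figures quoted here lend themselves to a quick back-of-the-envelope comparison. The sketch below is illustrative only: it uses the numbers from this section (3 CPU and 6 GB RAM for the control plane, 0.2 CPU and 400 MB RAM per agent) and compares the split-plane layout with a hypothetical layout in which a full control plane runs on every cluster. The function names are invented for this example, 1 GB is taken as 1000 MB for simplicity, and the other compute-plane components are not counted.

```python
# Figures quoted in this section; everything else here is illustrative.
CONTROL_PLANE_CPU = 3.0       # CPU cores
CONTROL_PLANE_RAM_MB = 6000   # 6 GB, assuming 1 GB = 1000 MB
AGENT_CPU = 0.2               # CPU cores per cluster
AGENT_RAM_MB = 400            # MB per cluster


def split_plane_footprint(num_clusters: int) -> tuple[float, int]:
    """One shared control plane plus one agent per cluster."""
    cpu = CONTROL_PLANE_CPU + AGENT_CPU * num_clusters
    ram_mb = CONTROL_PLANE_RAM_MB + AGENT_RAM_MB * num_clusters
    return round(cpu, 2), ram_mb


def replicated_footprint(num_clusters: int) -> tuple[float, int]:
    """Hypothetical alternative: a full control plane on every cluster."""
    cpu = CONTROL_PLANE_CPU * num_clusters
    ram_mb = CONTROL_PLANE_RAM_MB * num_clusters
    return round(cpu, 2), ram_mb


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, split_plane_footprint(n), replicated_footprint(n))
```

Under these assumptions, each additional cluster adds only the agent's share in the split-plane layout, whereas the replicated layout grows by a full control plane each time.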
The TrueFoundry control plane comprises multiple microservices that orchestrate deployments, store model metadata, and so on. The key components of the TrueFoundry control plane are:
2. Microservices for orchestrating deployments: The control plane includes a few microservices that orchestrate deployments across the clusters and cache the live updates from all the connected clusters in the compute plane.
3. Postgres Database: This is used to store all the information about teams, deployed services and their metadata.
We need to install a few components on the compute-plane cluster to reap the full benefits of TrueFoundry. The list is as follows:
TrueFoundry can work with any underlying AMI, including Bottlerocket on AWS. Since the agent is deployed like any other Helm chart running on Kubernetes, we have no constraints or requirements on the underlying AMI and can run on any of them, as well as on bare metal machines.
The tfy-agent performs all actions on the Kubernetes cluster on behalf of the users logged into the TrueFoundry platform. Hence it requires admin access to the set of namespaces in which users are allowed to deploy applications. Namespaces can be blacklisted or whitelisted, and the agent can only perform actions on the permitted namespaces.
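The namespace restriction can be pictured as a simple membership check performed before any action is taken. The following is a minimal sketch, not TrueFoundry's actual configuration or code: the `NamespacePolicy` class is hypothetical, and the rule that a blacklist entry takes precedence over a whitelist entry is an assumption made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NamespacePolicy:
    """Hypothetical policy object used only for illustration."""
    allowed: frozenset = frozenset()  # empty means no whitelist is configured
    denied: frozenset = frozenset()

    def permits(self, namespace: str) -> bool:
        # Assumption: a blacklist entry always wins.
        if namespace in self.denied:
            return False
        # With a whitelist configured, only listed namespaces pass.
        if self.allowed:
            return namespace in self.allowed
        return True


if __name__ == "__main__":
    policy = NamespacePolicy(
        allowed=frozenset({"ml-team", "data-eng"}),
        denied=frozenset({"kube-system"}),
    )
    for ns in ("ml-team", "kube-system", "default"):
        print(ns, policy.permits(ns))
```

In a real cluster the enforcement would come from Kubernetes RBAC bindings scoped to those namespaces; the check above only illustrates the decision logic.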
TrueFoundry relies on an authentication server that lives on our servers for licensing and authentication.
TrueFoundry provides fine-grained RBAC at the tenant, cluster, workspace and ML repo levels. To understand the RBAC mechanisms in TrueFoundry, you can read our docs here: https://docs.truefoundry.com/docs/collaboration-and-access-control
All the authorization rules reside in a Postgres table in the control plane, and every API call is checked to verify that the user is authorized to perform that action.
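A per-call authorization check of this kind can be sketched as a lookup against a rule set before the request is handled. The example below is an assumption-laden illustration: the `Rule` fields, the in-memory `RULES` set (standing in for the Postgres table, whose real schema is not described here), and the `handle_api_call` function are all hypothetical.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Hypothetical rule shape; the real table schema is not documented here."""
    user: str
    resource: str  # e.g. "workspace:fraud-models"
    action: str    # e.g. "deploy"


# In-memory stand-in for the authorization table.
RULES = {
    Rule("alice", "workspace:fraud-models", "deploy"),
    Rule("bob", "workspace:fraud-models", "view"),
}


def is_authorized(user: str, resource: str, action: str, rules=RULES) -> bool:
    """Return True only if an exact matching rule exists."""
    return Rule(user, resource, action) in rules


def handle_api_call(user: str, resource: str, action: str) -> str:
    """Check authorization first, then proceed with the (omitted) real work."""
    if not is_authorized(user, resource, action):
        raise PermissionError(f"{user} may not {action} {resource}")
    return f"{action} on {resource} accepted for {user}"


if __name__ == "__main__":
    print(handle_api_call("alice", "workspace:fraud-models", "deploy"))
```

Denying by default (no matching rule means no access) is a design choice made for this sketch, not a statement about how the platform resolves roles.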
TrueFoundry provides a basic image-building pipeline that is optimized to build images quickly on Kubernetes. If you want to customize the image-building pipeline to include static checks or vulnerability scanning tools, there are two approaches:
TrueFoundry integrates with the most popular secret management stores. It does not store the secret values itself; it stores only the paths to those secrets.
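Storing a path instead of a value can be sketched as keeping a small reference object and resolving it against the external store only at the moment of use. This is a minimal illustration under stated assumptions: `SecretRef`, `resolve`, and the `fetchers` mapping are hypothetical names, and a plain dictionary stands in for a real secret manager.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class SecretRef:
    """Hypothetical reference: identifies where a secret lives, never its value."""
    store: str  # identifier of the external secret store
    path: str   # path of the secret inside that store


def resolve(ref: SecretRef, fetchers: Dict[str, Callable[[str], str]]) -> str:
    """Fetch the secret value from the external store at use time."""
    try:
        fetch = fetchers[ref.store]
    except KeyError:
        raise ValueError(f"no fetcher registered for store {ref.store!r}") from None
    return fetch(ref.path)


if __name__ == "__main__":
    # A dictionary standing in for a real secret manager.
    fake_store = {"prod/db/password": "example-value"}
    fetchers = {"fake": fake_store.__getitem__}

    ref = SecretRef(store="fake", path="prod/db/password")
    print(ref)                     # shows only the store name and path
    print(resolve(ref, fetchers))  # value is read from the store on demand
```

Because only the reference is persisted, rotating a secret in the external store takes effect on the next resolution without touching the stored reference.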