The architecture of a TrueFoundry compute plane is as follows:
| Policy | Description |
| --- | --- |
| ELBControllerPolicy | Role assumed by load balancer controller to provision ELB when a service of type LoadBalancer is created |
| KarpenterPolicy and SQSPolicy | Role assumed by Karpenter to dynamically provision nodes and handle spot node termination |
| EFSPolicy | Role assumed by EFS CSI to provision and attach EFS volumes |
| EBSPolicy | Role assumed by EBS CSI to provision and attach EBS volumes |
| RolePolicy with policies for: ECR, S3, SSM, EKS. Use the trust relationship. | Role assumed by TrueFoundry to allow access to ECR, S3, and SSM services. If you are using TrueFoundry’s control plane, the role will be assumed by arn:aws:iam::416964291864:role/tfy-ctl-euwe1-production-truefoundry-deps; otherwise it will be your control plane’s IAM role |
| ClusterRole with policies: AmazonEKSClusterPolicy, AmazonEKSVPCResourceControllerPolicy, EncryptionPolicy | Role that provides Kubernetes permissions to manage the cluster lifecycle, networking, and encryption |
| NodeRole with policies: AmazonEC2ContainerRegistryReadOnlyPolicy, AmazonEKS_CNI_Policy, AmazonEKSWorkerNodePolicy, AmazonSSMManagedInstanceCorePolicy | Role assumed by EKS nodes to work with AWS resources for ECR access, IP assignment, and cluster registration |
EncryptionPolicy to create and manage key for encryption:
{  
    "Statement": [  
        {  
            "Action": [  
                "kms:Encrypt",  
                "kms:Decrypt",  
                "kms:ListGrants",  
                "kms:DescribeKey"  
            ],  
            "Effect": "Allow",  
            "Resource": "arn:aws:kms:<region>:<aws_account_id>:key/<key_id>"  
        }  
    ],  
    "Version": "2012-10-17"  
}

Requirements:

The requirements to set up the compute plane are as follows:
  • Billing and STS must be enabled for the AWS account.
  • Make sure you have enough quota for GPU/Inferentia instances on the account, depending on your use case. You can check and increase quotas under AWS EC2 service quotas.
  • Make sure you have created a certificate for your domain in AWS Certificate Manager (ACM) and have the ARN of the certificate ready. This is required to set up TLS for the load balancer.
  • You need sufficient permissions on the AWS account to create the resources needed for the compute plane. We usually recommend admin permission on the AWS account, but if you need the exact set of fine-grained permissions, use the policy below:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:DescribeStacks",
                "cloudformation:ListStacks",
                "eks:*",
                "ec2:*",
                "iam:GetRole",
                "iam:ListPolicies",
                "elasticfilesystem:*",
                "kms:*",
                "route53:AssociateVPCWithHostedZone",
                "s3:ListAllMyBuckets",
                "sts:GetCallerIdentity",
                "tag:GetResources"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "events:*"
            ],
            "Resource": [
                "arn:aws:events:$REGION:$ACCOUNT_ID:rule/$CLUSTER_NAME*",
                "arn:aws:events:$REGION:$ACCOUNT_ID:rule/Karpenter*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:AddRoleToInstanceProfile",
                "iam:CreateInstanceProfile",
                "iam:DeleteInstanceProfile",
                "iam:GetInstanceProfile",
                "iam:TagInstanceProfile",
                "iam:UntagInstanceProfile",
                "iam:RemoveRoleFromInstanceProfile"
            ],
            "Resource": "arn:aws:iam::$ACCOUNT_ID:instance-profile/$CLUSTER_NAME*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudformation:CreateStack",
                "cloudformation:DeleteStack",
                "cloudformation:GetTemplate"
            ],
            "Resource": "arn:aws:cloudformation:$REGION:$ACCOUNT_ID:stack/$CLUSTER_NAME*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreateOpenIDConnectProvider",
                "iam:DeleteOpenIDConnectProvider",
                "iam:GetOpenIDConnectProvider",
                "iam:TagOpenIDConnectProvider"
            ],
            "Resource": "arn:aws:iam::$ACCOUNT_ID:oidc-provider/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:CreatePolicy",
                "iam:DeletePolicy",
                "iam:GetPolicy",
                "iam:TagPolicy",
                "iam:UntagPolicy",
                "iam:GetPolicyVersion",
                "iam:ListPolicyVersions"
            ],
            "Resource": [
                "arn:aws:iam::$ACCOUNT_ID:policy/$CLUSTER_NAME*",
                "arn:aws:iam::$ACCOUNT_ID:policy/tfy-*",
                "arn:aws:iam::$ACCOUNT_ID:policy/truefoundry-*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_Karpenter_Controller_Policy*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_CNI_Policy*",
                "arn:aws:iam::$ACCOUNT_ID:policy/AmazonEKS_AWS_Load_Balancer_Controller*",
                "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:AttachRolePolicy",
                "iam:CreateRole",
                "iam:CreateServiceLinkedRole",
                "iam:DeleteRole",
                "iam:DetachRolePolicy",
                "iam:ListAttachedRolePolicies",
                "iam:ListInstanceProfilesForRole",
                "iam:ListRolePolicies",
                "iam:PutRolePolicy",
                "iam:GetRolePolicy",
                "iam:DeleteRolePolicy",
                "iam:TagRole",
                "iam:PassRole"
            ],
            "Resource": "arn:aws:iam::$ACCOUNT_ID:role/$CLUSTER_NAME*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:*"
            ],
            "Resource": [
                "arn:aws:logs:$REGION:$ACCOUNT_ID:log-group:/aws/eks/$CLUSTER_NAME*",
                "arn:aws:logs:$REGION:$ACCOUNT_ID:log-group::log-stream:"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:*"
            ],
            "Resource": [
                "arn:aws:s3:::$CLUSTER_NAME*",
                "arn:aws:s3:::$CLUSTER_NAME*/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "sqs:*",
            "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:$CLUSTER_NAME*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetParameter",
                "ssm:GetParameters"
            ],
            "Resource": "arn:aws:ssm:$REGION::parameter/aws/service/*"
        }
    ]
}
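The policy above uses the placeholders $REGION, $ACCOUNT_ID, and $CLUSTER_NAME, which presumably need to be replaced with your own values before the policy is attached. A minimal sketch of that substitution, using a one-statement excerpt of the policy; the function name render_policy and the example values are hypothetical:

```python
import json
from string import Template

# One-statement excerpt of the policy above; the full document uses the
# same $REGION, $ACCOUNT_ID and $CLUSTER_NAME placeholders.
POLICY_TEMPLATE = Template("""
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "sqs:*",
            "Resource": "arn:aws:sqs:$REGION:$ACCOUNT_ID:$CLUSTER_NAME*"
        }
    ]
}
""")


def render_policy(region: str, account_id: str, cluster_name: str) -> dict:
    """Fill in the placeholders and parse the result to confirm it is valid JSON."""
    rendered = POLICY_TEMPLATE.substitute(
        REGION=region, ACCOUNT_ID=account_id, CLUSTER_NAME=cluster_name
    )
    return json.loads(rendered)


# Hypothetical example values; replace with your own.
policy = render_policy("us-east-1", "123456789012", "my-cluster")
print(json.dumps(policy, indent=4))
```

Template.substitute raises KeyError if a placeholder is left without a value, which helps catch a partially rendered policy before it reaches IAM.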
Regarding the VPC and EKS cluster, you can decide between the following scenarios:
  1. The new VPC should have a CIDR range of /20 or larger, at least 2 availability zones, and private subnets with CIDR /24 or larger. This is to ensure capacity for ~250 instances and 4096 pods.
  2. If you are using custom networking, you need CGNAT address space in each AZ. The CGNAT space and route tables will be set up in the VPC.
  3. A NAT gateway will be provisioned to provide internet access to the private subnets.
  4. The cluster needs egress access to public.ecr.aws, quay.io, ghcr.io, tfy.jfrog.io, docker.io/natsio, nvcr.io, and registry.k8s.io so that it can download the container images for argocd, nats, gpu operator, argo rollouts, argo workflows, istio, keda, etc.
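The sizing figures in the first requirement can be checked with a little arithmetic. The sketch below uses Python's standard ipaddress module; the 10.0.0.0/20 range is a hypothetical example, and the five reserved addresses per subnet reflect standard AWS VPC behaviour rather than anything specific to TrueFoundry:

```python
import ipaddress

# Hypothetical example range; substitute your own VPC CIDR.
vpc = ipaddress.ip_network("10.0.0.0/20")

# Split the VPC range into /24 subnets.
subnets = list(vpc.subnets(new_prefix=24))

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5
usable_per_subnet = subnets[0].num_addresses - AWS_RESERVED_PER_SUBNET

print("addresses in the /20:", vpc.num_addresses)
print("number of /24 subnets that fit:", len(subnets))
print("usable addresses per /24 subnet:", usable_per_subnet)
```

A /20 holds 4096 addresses in total and a /24 leaves 251 usable addresses, which is consistent with the capacity figures quoted above.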

Setting up compute plane

TrueFoundry compute plane infrastructure is provisioned using OpenTofu/Terraform. You can generate the OpenTofu/Terraform code for your account by filling in your account details and downloading a script that runs on your local machine.

1. Enable Deployment Feature in the Platform (Optional)

The deployment feature, which lets you deploy services through the platform, must be enabled first:
  • In the left-hand navigation, go to Settings, then Platform Feature Visibility under Preferences.
  • Click the Edit button, then turn on the Enable Deployment toggle.
  • Click the Save button.
This enables the deployment feature in the platform and allows you to create either a control plane or a compute plane.

2. Choose to create a new cluster or attach an existing cluster

Go to the platform section in the left panel and click on Clusters. You can click on Create New Cluster or Attach Existing Cluster depending on your use case. Read the requirements and if everything is satisfied, click on Continue.

3. Get Domain and Certificate ARN

We will need a domain and a certificate ARN for the load balancer that is created in the following steps. For example, if your domain is *.services.example.com, you will create a DNS record for it later in Step 7. We recommend using AWS Certificate Manager (ACM) to create the certificate because it makes certificates easier to manage and renews them automatically. To generate a certificate ARN, follow the steps below. If you are not using AWS Certificate Manager, you can skip this step and continue to the next one.
  1. Navigate to AWS Certificate Manager in the AWS console
  2. Request a public certificate
  3. Specify your domain (e.g., *.services.example.com)
  4. Choose DNS validation (recommended)
  5. Add the CNAME records provided by ACM to your DNS provider. For detailed steps, see the AWS documentation on DNS validation.
  6. Wait for the certificate status to change to “Issued” (this may take 30 minutes or longer)
  7. Copy the certificate ARN for the next step (format will be like: arn:aws:acm:region:account:certificate/certificate-id)
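Before pasting the ARN into the form, a quick format check can catch copy-paste mistakes. The sketch below is only an illustration based on the format quoted in the last step; it assumes the standard aws partition and a 12-digit account ID, and the function name and sample ARN are hypothetical:

```python
import re

# Pattern derived from the format quoted above:
#   arn:aws:acm:region:account:certificate/certificate-id
# Assumptions: standard "aws" partition and a 12-digit account ID.
ACM_ARN_RE = re.compile(
    r"arn:aws:acm:[a-z0-9-]+:\d{12}:certificate/[A-Za-z0-9-]+"
)


def looks_like_acm_certificate_arn(value: str) -> bool:
    """Return True if the value has the general shape of an ACM certificate ARN."""
    return ACM_ARN_RE.fullmatch(value.strip()) is not None


# Hypothetical sample values for illustration only.
sample = "arn:aws:acm:us-east-1:123456789012:certificate/12345678-1234-1234-1234-123456789012"
print(looks_like_acm_certificate_arn(sample))
print(looks_like_acm_certificate_arn("arn:aws:iam::123456789012:role/example"))
```

This checks only the shape of the string; it does not confirm that the certificate exists or has been issued.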

4. Fill in the form to generate the OpenTofu/Terraform code

A form will be presented for the new cluster. Fill in your cluster details and click Submit when done.
The key fields to fill in are:
  • Cluster Name - A name for your cluster.
  • Region - The region where you want to create the cluster.
  • Network Configuration - Choose between New VPC or Existing VPC depending on your use case.
  • Authentication - This is how you are authenticated to AWS on your local machine. It’s used to configure OpenTofu/Terraform to authenticate with AWS.
  • S3 Bucket for OpenTofu/Terraform State - OpenTofu/Terraform state will be stored in this bucket. It can be a preexisting bucket or a new bucket name. The new bucket will automatically be created by our script.
  • Load Balancer Configuration - This configures the load balancer for your cluster. You can choose between a Public or Private Load Balancer; the default is Public. You can also add certificate ARNs and domain names for the load balancer, but these are optional.
  • Platform Features - This is to decide which features like BlobStorage, ClusterIntegration, ParameterStore, DockerRegistry and SecretsManager will be enabled for your cluster. To read more on how these integrations are used in the platform, please refer to the platform features page.
Enter the domain and the certificate ARN from the previous step in the form.

5. Copy the curl command and execute it on your local machine

You will be presented with a curl command that downloads and executes the script. The script installs the prerequisites, downloads the OpenTofu/Terraform code, and runs it on your local machine to create the cluster. This takes around 40-50 minutes to complete.

6. Verify the cluster is showing as connected in the platform

Once the script is executed, the cluster will be shown as connected in the platform.

7. Create DNS Record

You can find the load balancer’s address in the platform: go to the Clusters section in the bottom-left panel and, under the relevant cluster, look at the Base Domain URL section. On AWS this address is typically the load balancer’s DNS hostname.
Create a DNS record in Route 53 or your DNS provider with the following details:
| Record Type | Record Name | Record value |
| --- | --- | --- |
| CNAME | *.tfy.example.com | LOADBALANCER_IP_ADDRESS |

8. Start deploying workloads to your cluster

You can start by going here

FAQ

To add custom tags to your AWS resources, set the tags variable in the generated OpenTofu/Terraform code to a map of your tags:
tags = {
  environment = "production"
  team        = "ml-platform"
  cost-center = "1234"
}
This works for both new and existing clusters — applying only adds or updates tags in place; no resources are recreated. For an existing cluster, run tofu plan (or terraform plan) first and confirm the diff is tag-only before applying.

How tagging works across all resources

Tags flow through three layers so every AWS resource — whether managed by Terraform or launched at runtime — gets consistent labels:
| Layer | Mechanism | Resources covered |
| --- | --- | --- |
| 1: Terraform modules | var.tags is passed into every TrueFoundry module (EKS, VPC, EFS, RDS, IAM, load balancer controller, platform features) | All module-managed AWS resources |
| 2: AWS provider default_tags | The provider "aws" block sets default_tags { tags = var.tags } as a catch-all | Any resource not explicitly tagged by a module |
| 3: Helm / runtime | var.tags is threaded into inframold Helm values and propagated at runtime | EC2 nodes (via Karpenter extraTags), EBS PVC volumes (via EBS CSI extraVolumeTags) |
EC2 nodes and EBS volumes: These are launched at runtime by Karpenter and the EBS CSI driver, so they can’t be tagged via Terraform state. The tags are applied through Helm values and take effect on newly launched nodes and newly provisioned PVC volumes after you apply. Existing nodes need a rolling replacement to pick up new tags.

Suppressing built-in TrueFoundry audit tags

By default, TrueFoundry modules add truefoundry-managed, truefoundry-cluster-name, and truefoundry-terraform-module tags. To suppress these without affecting your own tags or the provider default_tags, set:
disable_default_tags = true
This is useful if your organization enforces a strict tag allowlist.
You can use cert-manager instead of AWS Certificate Manager to add TLS to the load balancer. Follow the cert-manager installation instructions to install it and add TLS to the load balancer.
If you have your own certificate files (for example, from another certificate provider or self-signed), you can use them directly with TrueFoundry.
  1. Create a Kubernetes secret with your certificate and key, or create a self-signed certificate:
    # Generate a self-signed certificate
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
      -keyout tls.key -out tls.crt \
      -subj "/CN=*.example.com" \
      -addext "subjectAltName = DNS:example.com,DNS:*.example.com"
    
    # Create secret from local certificate files
    kubectl create secret tls example-com-tls \
      --cert=path/to/cert/file \
      --key=path/to/key/file \
      -n istio-system
    
  2. Once the secret is created, head over to the cluster page and navigate to the tfy-istio-ingress add-on. Add the secret name in the tfyGateway.spec.servers[1].tls.credentialName section and ensure that tfyGateway.spec.servers[1].port.protocol is set to HTTPS. Here we are using example-com-tls as the secret name, which contains the certificate and key.
        servers:
          - <REDACTED>
          - hosts:
              - "*"
            port:
              name: https-tfy-wildcard
              number: 443
              protocol: HTTPS
            tls:
              mode: SIMPLE
              credentialName: example-com-tls
    
Self-signed certificates will cause browser warnings. They should only be used for testing or internal systems. To connect to services with self-signed certificates, you have to pass the CA certificate to verify the SSL certificate.
To review the OpenTofu/Terraform code before it runs: in Step 5 of the guide above, when you run the curl command, the OpenTofu/Terraform code is downloaded to your local machine. The script asks for confirmation before executing the code, at which point you can stop the execution and review the code generated by the platform.
By default, Amazon VPC CNI assigns Pods an IP address from the primary subnet. If the primary subnet CIDR is too small, the CNI may not be able to acquire enough secondary IP addresses for your Pods. Custom networking solves this by using a secondary CIDR for Pod IPs. For more details, see the Amazon EKS Custom Networking documentation.
Steps:
  1. Attach a secondary CIDR to the VPC (e.g. 100.64.0.0/16 per RFC 6598) and create new subnets with that CIDR in the same AZs as your primary subnets.
  2. Ensure the secondary subnets are added to the route tables of your primary subnets (this typically happens automatically).
  3. Configure the AWS VPC CNI to use the secondary subnets. If you used TrueFoundry’s OpenTofu/Terraform code, add this to the EKS module:
cluster_addons_vpc_cni_additional_configurations = {
    env = {
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG = "true"
      ENI_CONFIG_LABEL_DEF               = "topology.kubernetes.io/zone"
    }
}
Otherwise, set the env variables directly:
kubectl set env daemonset aws-node -n kube-system "AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true"
kubectl set env daemonset aws-node -n kube-system "ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone"
  4. Create an ENIConfig resource for each AZ where your EKS nodes run:
apiVersion: crd.k8s.amazonaws.com/v1alpha1
kind: ENIConfig
metadata:
  name: us-east-1a   # AZ of the EKS nodes
spec:
  securityGroups:
    - sg-0dff111a1d11c1c11   # security group of the EKS nodes
  subnet: subnet-011b111c1f11fdf11   # subnet with the secondary CIDR
  5. Restart the nodes one by one to apply the changes. Pods will be rescheduled onto nodes with secondary IP addresses.
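Because one ENIConfig is needed per AZ, it can be convenient to generate the manifests from a mapping of AZ to secondary subnet. The sketch below is only an illustration: the function name is hypothetical, the subnet and security group IDs are placeholders, and it emits JSON wrapped in a v1 List so that no YAML library is required:

```python
import json


def build_eni_configs(az_to_subnet: dict, security_group_ids: list) -> list:
    """Build one ENIConfig per AZ, named after the AZ so it matches the
    topology.kubernetes.io/zone label configured in ENI_CONFIG_LABEL_DEF."""
    return [
        {
            "apiVersion": "crd.k8s.amazonaws.com/v1alpha1",
            "kind": "ENIConfig",
            "metadata": {"name": az},
            "spec": {
                "securityGroups": list(security_group_ids),
                "subnet": subnet_id,
            },
        }
        for az, subnet_id in sorted(az_to_subnet.items())
    ]


# Placeholder IDs for illustration; use the secondary-CIDR subnets and the
# node security group from your own VPC.
az_to_subnet = {
    "us-east-1a": "subnet-011b111c1f11fdf11",
    "us-east-1b": "subnet-022b222c2f22fdf22",
}
configs = build_eni_configs(az_to_subnet, ["sg-0dff111a1d11c1c11"])

# Wrap the objects in a List so they can be written to a single file.
manifest = {"apiVersion": "v1", "kind": "List", "items": configs}
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, assuming the ENIConfig CRD is already installed by the VPC CNI addon.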
To delete the cluster, remove both the infrastructure managed by Terraform and the Kubernetes-created resources (for example, load balancers and Karpenter nodes).

1. Connect to the EKS cluster

aws eks update-kubeconfig --region <region> --name <cluster-name>

2. Delete LoadBalancer services

kubectl get svc -A --field-selector spec.type=LoadBalancer
kubectl delete svc tfy-istio-ingress -n istio-system

3. Delete Karpenter NodePools

kubectl delete nodepool --all
kubectl delete ec2nodeclasses --all
Make sure the Karpenter nodes are gone. If any are stuck, delete the corresponding instances manually from the EC2 console (this applies only to Karpenter-provisioned nodes):
kubectl get nodeclaims

4. Run destroy from the folder with your generated OpenTofu/Terraform code

terraform destroy
# or
tofu destroy
While creating the S3 bucket, terraform apply (or tofu apply) fails with an error like:
Error: creating S3 Bucket (<bucket-name>) Public Access Block:
operation error S3: PutPublicAccessBlock, https response error StatusCode: 403,
api error AccessDenied: User: arn:aws:iam::<account-id>:user/truefoundry is not
authorized to perform: s3:PutBucketPublicAccessBlock on resource:
"arn:aws:s3:::<bucket-name>" with an explicit deny in a service control policy
Why it happens: Your AWS Organization blocks s3:PutBucketPublicAccessBlock, but the TrueFoundry module tries to set it on each bucket, so the call is denied and the apply fails.
The fix: Set the variable below to false so the module skips this step. This is safe because your account already blocks public access, so the per-bucket setting is redundant.
Control plane: add truefoundry_s3_attach_public_policy = false to the tfy-control-plane module:
module "tfy-control-plane" {
  source = "truefoundry/truefoundry-control-plane/aws"

  # ... existing arguments ...

  truefoundry_s3_attach_public_policy = false
}
Compute plane: add blob_storage_attach_public_policy = false to the tfy-platform-features module:
module "tfy-platform-features" {
  source = "truefoundry/truefoundry-platform-features/aws"

  # ... existing arguments ...

  blob_storage_attach_public_policy = false
}
Both variables default to true, so you must set them explicitly to false. Then re-run terraform apply (or tofu apply).
The Istio ingress service can remain in Pending state and fail to receive an external IP when the AWS account is missing the Elastic Load Balancing service-linked role and the AWS Load Balancer Controller cannot create it.
kubectl get svc -n istio-system tfy-istio-ingress
Output:
NAME                TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)
tfy-istio-ingress   LoadBalancer   172.xx.xx.xx    <pending>     80:30490/TCP,443:30955/TCP
Describing the Service shows a FailedDeployModel warning:
kubectl describe svc tfy-istio-ingress -n istio-system
Warning  FailedDeployModel  12m  service  Failed deploy model due to operation error Elastic Load Balancing v2: CreateLoadBalancer,
https response error StatusCode: 403, api error AccessDenied:
User: arn:aws:sts::123456789012:assumed-role/example-controller-role/example-session
is not authorized to perform: iam:CreateServiceLinkedRole
on resource: arn:aws:iam::123456789012:role/aws-service-role/elasticloadbalancing.amazonaws.com/AWSServiceRoleForElasticLoadBalancing
because no permissions boundary allows the iam:CreateServiceLinkedRole action

Root Cause

The AWS account did not contain the required service-linked role: AWSServiceRoleForElasticLoadBalancing.
When AWS Load Balancer Controller attempted to create a Network Load Balancer (NLB) for the Istio ingress Service, AWS tried to automatically create this service-linked role. The controller was running with an IAM role that was restricted by a permissions boundary, which prevented execution of iam:CreateServiceLinkedRole. As a result, AWS rejected the load balancer creation request and the Kubernetes Service remained in the Pending state.

Verification

Verify whether the service-linked role exists:
aws iam get-role \
  --role-name AWSServiceRoleForElasticLoadBalancing
If the command returns NoSuchEntity, the service-linked role is missing.

Resolution

1. Create the AWS service-linked role

aws iam create-service-linked-role \
  --aws-service-name elasticloadbalancing.amazonaws.com
This is a one-time operation per AWS account. After the role exists, the AWS Load Balancer Controller does not need iam:CreateServiceLinkedRole permission to provision load balancers.

2. Confirm the Service receives an external endpoint

After the service-linked role is created, AWS Load Balancer Controller will successfully provision the NLB and the Istio ingress Service will receive an external endpoint.
kubectl get svc -n istio-system tfy-istio-ingress
The EXTERNAL-IP column should show the NLB hostname instead of <pending>.
CoreDNS handles in-cluster DNS resolution. As your cluster grows, DNS query load increases and a fixed number of CoreDNS replicas can become a bottleneck or single point of failure. EKS supports autoscaling the CoreDNS addon so the replica count scales with cluster demand.
If you used TrueFoundry’s OpenTofu/Terraform code, add the cluster_addons_coredns_additional_configurations block to the EKS module:
module "eks" {
  source  = "truefoundry/truefoundry-cluster/aws"

  # ... existing arguments ...

  cluster_addons_coredns_additional_configurations = {
    autoScaling = {
      enabled     = true
      minReplicas = 2
      maxReplicas = 10
    }
  }
}
  • enabled — turns on CoreDNS autoscaling.
  • minReplicas / maxReplicas — lower and upper bounds for the replica count. Tune these values based on your cluster size and DNS query volume.
After updating the module, run terraform apply (or tofu apply) to apply the change.