## Prerequisites
Before starting the migration, ensure the following:

- OpenTofu 1.10+ is installed (recommended). Terraform also works if OpenTofu is not available.
- AWS CLI is installed and configured with appropriate credentials.
- Your modules are currently on the versions listed in the From column of the Module Version Reference table below.
- You have sufficient IAM permissions to manage EKS, IAM roles, SQS, and related resources.
- Your current infrastructure has a clean plan: run `tofu plan` (or `terraform plan`) and confirm it shows no pending changes before starting. If there is drift between your state and your infrastructure, resolve it first to avoid mixing unrelated changes into the migration.
We recommend confirming that S3 bucket versioning is enabled on your OpenTofu/Terraform state bucket as a best practice before beginning. This allows you to recover any prior state file version if something goes wrong.
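One way to confirm a clean starting point is plan's detailed exit code, which both OpenTofu and Terraform support:

```shell
# Exit code 0 = no pending changes, 1 = error, 2 = changes pending.
tofu plan -detailed-exitcode
# or: terraform plan -detailed-exitcode
```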
## Module Version Reference
The following table summarizes the version changes for each module. Modules marked with * have an intermediate version that must be applied before the final upgrade.

| Module | From | Intermediate | To (v6 compatible) |
|---|---|---|---|
| network | v0.3.10 | — | v0.4.0 |
| eks * | v0.7.20 | v0.7.21 | v0.8.1 |
| efs | v0.4.5 | — | v0.5.0 |
| aws-load-balancer-controller | v0.1.5 | — | v0.2.0 |
| karpenter * | v0.3.12 | v0.3.13 | v0.4.0 |
| tfy-platform-features | v0.4.13 | — | v0.5.0 |
| control-plane (if applicable) | v0.4.24 | — | v0.5.0 |
## Karpenter Upgrade Strategy
Upgrading Karpenter requires releasing an intermediate version before the final release to ensure zero downtime. The migration transitions Karpenter from IRSA to Pod Identity. How the phased upgrade works:

- Version v0.3.13 is deployed first. It creates new resources (SQS queue, IAM role with Pod Identity) that run alongside the older resources.
- The Karpenter Helm chart values are updated to point to the newly created resources.
- A `disable_old_changes` flag controls the cleanup of old resources. When set to `true`, the older IRSA-based resources are removed.
- Version v0.4.0 is the final release that is fully AWS provider v6 compatible.
### Resource transition details
The following table shows how each Karpenter-managed resource transitions during the migration:
| Resource | Old (disable_old_changes = false) | New (disable_old_changes = true) |
|---|---|---|
| SQS queue | <cluster_name>-karpenter | <cluster_name>-karpenter-queue |
| Controller IAM role | <cluster_name>-karpenter | <cluster_name>-karpenter-controller |
| Role trust | IRSA only | Pod Identity |
| Instance profile | <cluster_name>-karpenter-initial | Same (unchanged, in-place update) |
| CloudWatch rules | Managed individually | Managed by sub-module |
| IRSA module | module.karpenter_irsa_role[0] | Removed |
## Migration Steps
### Pin modules to intermediate versions
Ensure all modules are at the following versions before proceeding. If any module is on an older version, update it to the version shown here and apply first.
| Module | Version |
|---|---|
| network | v0.3.10 |
| eks | v0.7.21 |
| efs | v0.4.5 |
| aws-load-balancer-controller | v0.1.5 |
| karpenter | v0.3.12 |
| tfy-platform-features | v0.4.13 |
| control-plane (if applicable) | v0.4.24 |
Run `tofu plan` (or `terraform plan`) and review the output before applying. Verify no unexpected changes are shown.

### Step 1: Prepare EKS and Karpenter modules
This step prepares the EKS cluster for Pod Identity and sets up the intermediate Karpenter version.

1. **Update the cluster module.** Remove the `node_security_group_additional_rules` block from the cluster module and bump it to v0.7.21 to install the EKS Pod Identity Agent.
2. **Update the Karpenter module.** Move the Karpenter module to v0.3.13, keeping `disable_old_changes = false` so the new Pod Identity resources are created alongside the old IRSA-based ones.

Apply the changes after reviewing the plan.
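A minimal sketch of the two module changes for this step. The module `source` paths are illustrative placeholders (use your actual sources); the versions and the `disable_old_changes` flag come from this guide:

```hcl
# 1. Cluster module at the intermediate version; the
#    node_security_group_additional_rules block has been removed.
module "cluster" {
  source  = "example/eks/aws" # illustrative source
  version = "0.7.21"
  # node_security_group_additional_rules = { ... }  <- removed
}

# 2. Karpenter module at the intermediate version; old IRSA resources
#    stay in place while the new Pod Identity resources are created.
module "karpenter" {
  source  = "example/karpenter/aws" # illustrative source
  version = "0.3.13"

  disable_old_changes = false
}
```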
Run `tofu plan` (or `terraform plan`) and review the output. You should see new resources being created (new SQS queue, new IAM role with Pod Identity) while existing resources remain unchanged.

### Step 2: Update Karpenter Helm chart values
These changes apply to the Karpenter Helm chart values, not the Karpenter Config Helm chart values.
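For reference, a hedged before/after sketch of the values changes in this step. The `eks.amazonaws.com/role-arn` key is the standard IRSA annotation; the account ID, role name, and the exact nesting of `interruptionQueue` are illustrative and may differ in your values file:

```yaml
# Before (IRSA): role-arn annotation present, unsuffixed queue name.
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/my-cluster-karpenter
settings:
  interruptionQueue: my-cluster-karpenter
---
# After (Pod Identity): annotation removed, -queue suffix appended.
serviceAccount:
  annotations: {}
settings:
  interruptionQueue: my-cluster-karpenter-queue
```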
1. **`serviceAccount` annotations.** Find and remove the `serviceAccount.annotations` entries (the IRSA role ARN annotation) from your Karpenter Helm chart values.
2. **`interruptionQueue` name.** Append `-queue` to the end of the `interruptionQueue` value in your Karpenter Helm chart values.

### Step 3: Clean up old Karpenter resources
Run the Karpenter OpenTofu/Terraform module with `disable_old_changes = true` to remove the old IRSA-based resources. Apply the changes after confirming the plan only removes old resources.
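The cleanup toggle looks like this; only the flag changes from the previous step (module source is an illustrative placeholder):

```hcl
module "karpenter" {
  source  = "example/karpenter/aws" # illustrative source
  version = "0.3.13"

  # true removes the old IRSA role, old SQS queue, and the
  # individually managed CloudWatch rules.
  disable_old_changes = true
}
```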
Run `tofu plan` (or `terraform plan`) and review the output. You should see the old IRSA module, old SQS queue, and old CloudWatch rules being destroyed. The new resources created in the previous step should remain untouched.

### Step 4: Upgrade modules to final v6-compatible versions
Now upgrade all modules to their final AWS provider v6-compatible versions.

1. **Version-only bumps.** Update the following modules to their new versions. These require only a version change with no other configuration modifications.
2. **EFS module.** The EFS module now requires the `cluster_oidc_issuer_arn` input.
3. **EBS module.** The EBS module requires `use_name_prefix = false` to prevent the IAM role from being recreated, and a `policy_name` parameter.
4. **Karpenter module (final version).** Upgrade Karpenter to the final v0.4.0 release. The module configuration is simplified since the migration is now complete.
5. **Update tfy-karpenter Helm chart.** When upgrading the Karpenter Terraform module to v0.4.0, also update the tfy-karpenter Helm chart to version 0.5.11.
6. **AWS Load Balancer Controller module.** Upgrade the AWS Load Balancer Controller module to the final v0.2.0 release. The module configuration is simplified since the migration is now complete.
7. **TrueFoundry module.** Update the TrueFoundry module to reference the new EBS IAM role ARN.

Apply all changes after reviewing the plan.
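A sketch of the EFS and EBS inputs described above. The module sources, the OIDC provider ARN reference, and the policy name are illustrative; `cluster_oidc_issuer_arn`, `use_name_prefix`, and `policy_name` are the inputs this guide names:

```hcl
module "efs" {
  source  = "example/efs/aws" # illustrative source
  version = "0.5.0"

  # New required input in v0.5.0.
  cluster_oidc_issuer_arn = module.eks.oidc_provider_arn # illustrative reference
}

module "ebs" {
  source = "example/ebs/aws" # illustrative source

  # Keeps the existing IAM role name so the role is not recreated.
  use_name_prefix = false
  policy_name     = "my-cluster-ebs-csi" # illustrative name
}
```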
Run `tofu plan` (or `terraform plan`) and review the output carefully. If you see any unexpected resource deletions, investigate before applying. The `use_name_prefix = false` addition in the EBS module is specifically to avoid an unnecessary role recreation.

### Post-migration validation
After completing all upgrade steps, verify that everything is working correctly.

1. **Verify Karpenter pods are healthy.** All Karpenter pods should be in `Running` status with all containers ready.
2. **Verify nodes can be provisioned.** Check that existing NodeClaims are in a healthy state. If you have pending pods that require new nodes, verify that Karpenter provisions them.
3. **Verify Pod Identity associations.** Confirm that a Pod Identity association exists for the Karpenter service account.
4. **Verify all module resources.** Run a final plan to confirm no further changes are pending. The output should show `No changes. Your infrastructure matches the configuration.`

## Rollback
If you encounter issues during the migration, you can revert to the previous state.

**Before Step 3 (old resources still exist):**

- Revert the Karpenter Helm chart values to restore the `serviceAccount.annotations` and original `interruptionQueue` name (without the `-queue` suffix).
- Revert the Karpenter module version to v0.3.12 in your `.tf` files.
- Revert the cluster module to restore the `node_security_group_additional_rules` block and set it back to v0.7.20 if needed.
- Run `tofu plan` (or `terraform plan`) to confirm the rollback scope, then apply.
**After Step 3 (old resources already removed):**

- Set `disable_old_changes = false` on the Karpenter module (still at v0.3.13) and apply to recreate the old resources.
- Revert the Karpenter Helm chart values to restore the `serviceAccount.annotations` and original `interruptionQueue` name.
- Once Karpenter is healthy with the old resources, revert the module version to v0.3.12 and apply.
The phased Karpenter upgrade is designed so that you can revert the Helm chart values at any point before Step 3 to fall back to the old IRSA-based resources without disruption.
## Troubleshooting
### Karpenter pods crashlooping after cleaning up old resources
If Karpenter pods are crashlooping after Step 3, verify that:
- The Karpenter Helm chart values were updated before running Step 3: the `serviceAccount.annotations` should be removed and `interruptionQueue` should have the `-queue` suffix.
- The Pod Identity association was created successfully.
- The new SQS queue exists.
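The association and queue checks can be run with the AWS CLI; `<cluster_name>` is a placeholder, and the `karpenter` namespace is an assumption about where the controller runs:

```shell
# Check that the Pod Identity association exists for Karpenter.
aws eks list-pod-identity-associations \
  --cluster-name <cluster_name> --namespace karpenter

# Check that the new SQS queue exists.
aws sqs get-queue-url --queue-name <cluster_name>-karpenter-queue
```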
To recover, set `disable_old_changes` to `false` and apply to recreate the old resources, then follow the steps in order.

### OpenTofu/Terraform plan shows unexpected resource deletions
If `tofu plan` (or `terraform plan`) shows resources being destroyed that you do not expect, do not apply. Common causes include:

- **IAM role recreation:** Ensure the EBS module has `use_name_prefix = false` set. Without this, the role name gets a random suffix and OpenTofu/Terraform sees it as a new resource.
- **State drift:** If resources were modified outside of OpenTofu/Terraform, the plan may show unexpected changes. Run `tofu refresh` (or `terraform refresh`) to sync state before re-running the plan.
- **Module source changes:** Verify all module `source` and `version` fields match the values in this guide exactly.
### SQS queue name mismatch or interruption handler not working
If Karpenter is not processing spot interruption events:
- Confirm the `interruptionQueue` value in the Karpenter Helm chart matches the actual SQS queue name. After migration, it should be `<cluster_name>-karpenter-queue`.
- Verify the queue exists and has the correct permissions.
- Check Karpenter logs for queue-related errors.
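A sketch of those checks; `<cluster_name>` and `<queue_url>` are placeholders, and the `karpenter` namespace and deployment name are assumptions about your installation:

```shell
# Verify the queue exists and inspect its access policy.
aws sqs get-queue-url --queue-name <cluster_name>-karpenter-queue
aws sqs get-queue-attributes \
  --queue-url <queue_url> --attribute-names Policy

# Check Karpenter logs for queue-related errors.
kubectl logs -n karpenter deployment/karpenter | grep -i queue
```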
### Pod Identity not taking effect
If Karpenter is unable to assume its IAM role after the migration:
- Verify the EKS Pod Identity Agent addon is installed and running.
- Confirm the Pod Identity association exists.
- Restart the Karpenter pods to pick up the Pod Identity credentials.
- If the EKS Pod Identity Agent is missing, verify that the cluster module was upgraded to v0.7.21 or later in Step 1, which installs this addon.
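The checks above can be run as follows; `<cluster_name>` is a placeholder, and the `karpenter` namespace and deployment name are assumptions about your installation:

```shell
# Verify the EKS Pod Identity Agent addon is installed and active.
aws eks describe-addon --cluster-name <cluster_name> \
  --addon-name eks-pod-identity-agent

# The agent runs as a DaemonSet named eks-pod-identity-agent in kube-system.
kubectl get daemonset eks-pod-identity-agent -n kube-system

# Confirm the Pod Identity association exists.
aws eks list-pod-identity-associations --cluster-name <cluster_name>

# Restart Karpenter to pick up the Pod Identity credentials.
kubectl rollout restart deployment/karpenter -n karpenter
```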