Orchestrating Bare-Metal AI: Integrating TrueFoundry with Oracle Cloud Infrastructure

Published: July 4, 2026

Oracle Cloud Infrastructure (OCI) takes a different approach to AI compute than the VM-first hyperscalers. The differentiated layer is bare-metal: OCI's GPU shapes — like BM.GPU.H100.8 — run with zero hypervisor overhead and connect through NVIDIA ConnectX SmartNICs over a custom RDMA over Converged Ethernet (RoCE v2) cluster network.

That performance ceiling has an operational cost. Bare metal means you're now responsible for the layer that VMs usually abstract: GPU drivers and the OFED stack, scheduling within the cluster network's topology constraints, federating identity through OCI IAM, and choosing among several storage paths for model weights. None of this is exotic, but it's substantial Kubernetes work that doesn't show up on managed-VM offerings.

TrueFoundry's role on this stack is the Kubernetes-native operational layer. The Compute Plane is your own Oracle Cloud Infrastructure Kubernetes Engine (OKE) cluster running on OCI bare metal. The platform packages a set of open-source and CNCF-affiliated components (ArgoCD, Argo Workflows, NVIDIA GPU Operator, Prometheus, KEDA, Istio, and others) into a managed deployment, adds a unified UI and GitOps workflow, and provides observability across services, jobs, and GPU utilization. It does not replace OCI primitives — it sits on top of them.

This post walks through the architecture you end up with when you run TrueFoundry on OCI bare metal: the control/compute plane split, how RDMA training fits into Kubernetes, how workload identity works, and the practical patterns for loading model weights at scale.

Deployment Model: Control Plane and Compute Plane

TrueFoundry uses a split-plane architecture. The Control Plane (TrueFoundry-managed or self-hosted) holds metadata, RBAC, the API server, and the deployment manifest store. The Compute Plane is one or more Kubernetes clusters in your own cloud environment — in this case, an OKE cluster on OCI bare metal. Workloads, model weights, and customer data stay in your tenancy.

The link between them is the tfy-agent, which runs on the Compute Plane cluster and opens an outbound-only WebSocket to the Control Plane. The agent pulls deployment manifests and pushes back Kubernetes resource updates. Because the connection is outbound, you don't need to open inbound ports on your VCN or expose the cluster API server to the public internet.

When TrueFoundry sets up a Compute Plane, the agent installs and manages a set of open-source addons via ArgoCD:

ArgoCD for GitOps-style application delivery
Argo Workflows for the Jobs feature (training runs, batch pipelines)
Argo Rollouts for canary and blue-green deployments
Prometheus / kube-prometheus-stack for metrics that power autoscaling and observability
KEDA for event-driven autoscaling
Istio as the primary ingress controller and for traffic management
NVIDIA GPU Operator for GPU driver lifecycle, DCGM-based health checks, and GPU node labels via Node Feature Discovery (the OCI-specific RDMA topology labels are separately exposed by OCI's own node provisioning — see the Networking section below)
Victoria Logs + Vector (optional) for log aggregation

You can also bring your own existing instances of any of these — TrueFoundry documents the configuration needed to coexist with an existing ArgoCD, Prometheus, or Istio install.

**Figure 1.** Split-plane architecture. The Compute Plane runs in the customer's OCI tenancy on an OKE cluster; the tfy-agent connects outbound to the Control Plane. No inbound ports are required on the VCN.

Bare-Metal Networking: RDMA on OKE

OCI's cluster network is the differentiated networking layer that makes large-scale distributed training viable. Oracle has published internal measurements showing single-digit-microsecond latency in this fabric — as low as 2 microseconds for single-rack clusters, and typically 2.5 to 9 microseconds across multi-rack superclusters — using RoCE v2 over NVIDIA ConnectX SmartNICs. Actual numbers in production depend on cluster topology, message size, and contention; Oracle's First Principles writeup covers the underlying design.

To use RDMA effectively, two things have to be true:

The nodes must be in the same OCI cluster network. OCI exposes topology labels on OKE nodes — oci.oraclecloud.com/rdma.local_block_id, network_block_id, and hpc_island_id — and Oracle's oci-hpc-oke quickstart shows how to use them with Kueue's topology-aware scheduling for best NCCL performance. For tightly coupled training, prefer co-locating pods within the same Local Block.
The training pod has to access the RDMA devices. On OCI's H100 nodes, the Mellanox/NVIDIA driver exposes RDMA devices under /dev/infiniband/ (the path naming reflects the underlying IB verbs API, even though the transport is RoCE v2 over Ethernet).
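The co-location preference above can be sketched as a small selection routine. This is an illustrative sketch only, not Oracle's or Kueue's scheduling logic: it assumes node data has already been fetched into plain dicts, and `pick_local_block` is a hypothetical helper name. Only the `oci.oraclecloud.com/rdma.local_block_id` label comes from the text.

```python
from collections import defaultdict

# Label name as given in the text above.
LOCAL_BLOCK_LABEL = "oci.oraclecloud.com/rdma.local_block_id"


def pick_local_block(nodes, workers_needed):
    """Return (block_id, node_names) for the tightest-fitting Local Block.

    `nodes` is a list of dicts shaped like {"name": str, "labels": {str: str}},
    a simplified stand-in for Kubernetes Node objects (an assumption).
    Returns None when no single Local Block has enough nodes.
    """
    by_block = defaultdict(list)
    for node in nodes:
        block_id = node.get("labels", {}).get(LOCAL_BLOCK_LABEL)
        if block_id is not None:  # nodes without the label are skipped
            by_block[block_id].append(node["name"])

    # Prefer the smallest block that fits, leaving larger blocks for larger jobs.
    fitting = [(len(names), block_id)
               for block_id, names in by_block.items()
               if len(names) >= workers_needed]
    if not fitting:
        return None
    _, best = min(fitting)
    return best, sorted(by_block[best])[:workers_needed]
```

In a real cluster you would normally express this as node affinity or Kueue topology-aware scheduling rather than picking nodes by hand; the function only shows what "same Local Block" means in terms of labels.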

The standard pod pattern from Oracle's quickstart looks like this:

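spec:
  hostNetwork: true                      # share the node's network namespace
  dnsPolicy: ClusterFirstWithHostNet     # keep cluster DNS working with hostNetwork
  containers:
  - name: trainer
    securityContext:
      privileged: true                   # direct access to host devices
      capabilities:
        add: ["IPC_LOCK"]                # allow locking memory for RDMA buffers
    volumeMounts:
    - { mountPath: /dev/infiniband, name: devinf }   # RDMA device nodes
    - { mountPath: /dev/shm, name: shm }             # shared memory for NCCL
  volumes:
  - { name: devinf, hostPath: { path: /dev/infiniband }}
  - { name: shm, emptyDir: { medium: Memory, sizeLimit: 32Gi }}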

These elevated privileges (privileged: true, IPC_LOCK, host networking) are specific to HPC/RDMA workloads. A privileged container already holds every capability, so the explicit IPC_LOCK entry only matters if you later drop privileged mode. In production, isolate these pods to dedicated GPU namespaces with admission policies (for example, Pod Security Admission set to privileged on that namespace and restricted elsewhere, or an OPA/Kyverno policy gating on a label) so that unrelated workloads can't run with the same context.
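To make the isolation rule concrete, here is a minimal sketch of the check such a policy would perform, written as plain Python over a pod manifest dict. It is not Kyverno, OPA, or Pod Security Admission syntax; the function name and the `gpu-training` namespace are hypothetical.

```python
# Hypothetical namespace reserved for RDMA training pods.
ALLOWED_NAMESPACES = frozenset({"gpu-training"})


def rdma_privilege_violations(pod, allowed_namespaces=ALLOWED_NAMESPACES):
    """Return the reasons a pod manifest (plain dict) should be rejected.

    Pods in an allowed namespace may use the RDMA pattern; anywhere else,
    each elevated setting from that pattern is reported.
    """
    namespace = pod.get("metadata", {}).get("namespace", "default")
    if namespace in allowed_namespaces:
        return []

    spec = pod.get("spec", {})
    reasons = []
    if spec.get("hostNetwork"):
        reasons.append("hostNetwork")
    for container in spec.get("containers", []):
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            reasons.append(f"container {container['name']}: privileged")
        added = (ctx.get("capabilities") or {}).get("add") or []
        if "IPC_LOCK" in added:
            reasons.append(f"container {container['name']}: IPC_LOCK")
    for volume in spec.get("volumes", []):
        path = (volume.get("hostPath") or {}).get("path", "")
        if path.startswith("/dev/infiniband"):
            reasons.append(f"volume {volume['name']}: hostPath {path}")
    return reasons
```

A real deployment would enforce this in the admission chain rather than in application code; the sketch is only meant to show which fields the policy has to look at.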

**Figure 2.** RDMA network flow between two bare-metal H100 nodes in the same OCI cluster network. Training pods use the userspace RDMA stack (libibverbs) to set up queue pairs through /dev/infiniband; once set up, data-path operations bypass the kernel and the ConnectX SmartNIC offloads RoCE v2 onto the cluster network fabric.

TrueFoundry's role here is to make running these training pods part of a normal deployment workflow — you author the workload, the platform pushes it through Argo Workflows, Prometheus scrapes GPU metrics from the DCGM Exporter shipped with the GPU Operator, and ArgoCD versions the manifests. Workload-level signals like NCCL counters are scraped only when the application exposes them. The RDMA-specific pieces (hostPath mounts, IPC_LOCK, topology affinity) are configured in your job spec following the standard pattern above; the platform doesn't replace those fields, it deploys whatever you specify.

For multi-node distributed training specifically, you'll typically install an operator on the cluster via its helm chart — MPI Operator for MPIJob-based runs (PyTorch DDP, DeepSpeed, NCCL), Kubeflow Training Operator for PyTorchJob/TFJob, or KubeRay for Ray-based training. TrueFoundry doesn't bundle these by default. Once installed, the platform deploys MPIJob/PyTorchJob/RayJob resources just like any other Kubernetes workload, with the same GitOps and observability story. Distributed training with RDMA on OCI is not a first-class TrueFoundry feature today — it's an implementation pattern based on Oracle's published reference manifests, with the platform handling the surrounding operational stack rather than the RDMA-specific orchestration.
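As a rough illustration of what such a resource looks like, the sketch below builds a PyTorchJob manifest that reuses the RDMA pod pattern from earlier. It assumes the Kubeflow Training Operator is installed and follows its `kubeflow.org/v1` schema as I understand it; the job name, image, and the `pytorchjob_manifest` helper are hypothetical, and this is not TrueFoundry's own job spec format.

```python
import copy
import json


def pytorchjob_manifest(name, image, nodes, gpus_per_node=8):
    """Build a PyTorchJob dict with one pod per bare-metal node."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")

    pod_spec = {
        "hostNetwork": True,
        "dnsPolicy": "ClusterFirstWithHostNet",
        "containers": [{
            # Assumption: the operator expects the container to be named "pytorch".
            "name": "pytorch",
            "image": image,
            "securityContext": {"privileged": True,
                                "capabilities": {"add": ["IPC_LOCK"]}},
            "resources": {"limits": {"nvidia.com/gpu": gpus_per_node}},
            "volumeMounts": [
                {"mountPath": "/dev/infiniband", "name": "devinf"},
                {"mountPath": "/dev/shm", "name": "shm"},
            ],
        }],
        "volumes": [
            {"name": "devinf", "hostPath": {"path": "/dev/infiniband"}},
            {"name": "shm", "emptyDir": {"medium": "Memory", "sizeLimit": "32Gi"}},
        ],
    }

    def replica(count):
        # Each replica group gets its own copy so later edits don't alias.
        return {"replicas": count, "restartPolicy": "OnFailure",
                "template": {"spec": copy.deepcopy(pod_spec)}}

    replicas = {"Master": replica(1)}
    if nodes > 1:
        replicas["Worker"] = replica(nodes - 1)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {"pytorchReplicaSpecs": replicas},
    }


if __name__ == "__main__":
    # JSON is accepted by kubectl, so the output can be applied directly.
    print(json.dumps(pytorchjob_manifest("llm-finetune", "example.com/trainer:latest", nodes=4), indent=2))
```

Generating the manifest from code keeps the RDMA-specific fields in one place, which helps when the same pattern is reused across several jobs; topology affinity would still need to be added to the pod spec.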

Identity Federation: OKE Workload Identity

For supported cluster types, OCI now recommends OKE Workload Identity over distributing long-lived API keys to pods. The mechanism works similarly to AWS IRSA or GKE Workload Identity: a Kubernetes ServiceAccount is mapped to an OCI IAM policy, and the OCI SDK exchanges the pod's projected ServiceAccount token for a short-lived OCI access token at call time. There's no static credential in the pod.

Two important constraints from Oracle's docs:

Workload Identity only works on Enhanced OKE clusters, not Basic clusters.
Authentication via Workload Identity is only supported through OCI SDKs (Java, Python, Go, etc.) — not the OCI CLI or the Console.

**Figure 3.** The OKE Workload Identity authentication sequence. The OCI SDK reads the projected ServiceAccount token, exchanges it for a short-lived OCI access token via the OCI Auth Service, and uses that token for subsequent calls to OCI services.

On the Kubernetes side, workloads are bound to namespace-scoped ServiceAccounts, and the platform can automate their creation as part of the deployment flow. The OCI side — the IAM policy with the request.principal.type='workload' rule and the cluster/namespace/serviceaccount selectors — is configured per Oracle's standard Workload Identity setup. Once both sides are in place, deployments that need OCI access obtain short-lived tokens transparently through the SDK.
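The OCI-side policy can be sketched as a small renderer for the statement text. The statement shape follows Oracle's documented Workload Identity policy form as I understand it; the helper name and all example values (compartment, namespace, ServiceAccount, cluster OCID) are placeholders, so check the output against Oracle's docs before applying it.

```python
VALID_VERBS = ("inspect", "read", "use", "manage")


def workload_policy_statement(verb, resource, location,
                              namespace, service_account, cluster_ocid):
    """Render an OCI IAM policy statement scoped to one Kubernetes workload."""
    if verb not in VALID_VERBS:
        raise ValueError(f"verb must be one of {VALID_VERBS}")
    conditions = ", ".join([
        "request.principal.type = 'workload'",
        f"request.principal.namespace = '{namespace}'",
        f"request.principal.service_account = '{service_account}'",
        f"request.principal.cluster_id = '{cluster_ocid}'",
    ])
    return (f"Allow any-user to {verb} {resource} in {location} "
            f"where all {{{conditions}}}")


if __name__ == "__main__":
    print(workload_policy_statement(
        "read", "objects", "compartment ml-models",  # placeholder compartment
        namespace="training",                        # placeholder namespace
        service_account="trainer-sa",                # placeholder ServiceAccount
        cluster_ocid="ocid1.cluster.oc1..example",   # placeholder OCID
    ))
```

Scoping the statement to a single namespace, ServiceAccount, and cluster keeps the grant narrow: another pod in the same cluster gets nothing unless it runs under that exact ServiceAccount.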

Storage and Model Loading

OCI bare metal gives you several storage choices for getting model weights into VRAM. Each has trade-offs, and TrueFoundry's volume abstractions (PVCs, volume mounts, init containers) work with all of them — the right pattern depends on your workload:

Local NVMe. BM.GPU.H100.8 ships with 16 NVMe SSDs (3.84 TB each, ~61 TB local). For workloads where each node loads its own copy of the weights from a checkpoint cache, local NVMe is the fastest option — no network involved on the hot path.
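OCI File Storage Service (FSS) with High Performance Mount Targets. FSS HPMT provides RWX (ReadWriteMany) file storage at high throughput and can be mounted on many bare-metal nodes at once. It is a good fit for TrueFoundry's volume abstractions, which are built around Kubernetes RWX PersistentVolumes, and it suits shared model weights, datasets, and checkpoint storage during training. Oracle also offers a fully managed Lustre file service.
OCI Block Volume (multi-attach). A single block volume can be attached as a shared, read-only volume to up to 32 instances in the same availability domain. Performance is per volume (shared across all attached instances), and bare-metal instances attach using iSCSI with multipath. This is useful when you have a fixed model artifact that every node in a job should mount read-only like a local disk, without going through a file-system layer.
OCI Object Storage. Pull weights from Object Storage into local NVMe or memory at pod startup. This is the simplest pattern and pairs well with TrueFoundry's init-container hooks. Each pod gets independent bandwidth, which is often better than having many readers share a single volume's IOPS budget.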
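The Object Storage pattern in the list above can be sketched as an init-container script that pulls weight shards in parallel onto local disk. This is a minimal sketch under stated assumptions: `fetch` is a hypothetical callable standing in for whatever client you use to read an object (it is not an OCI SDK call), and each shard is held in memory while it is written, so very large shards would need streaming instead.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pull_shards(shard_names, fetch, dest_dir, max_workers=8):
    """Download shards into dest_dir in parallel; return paths in input order.

    `fetch(name) -> bytes` is supplied by the caller (hypothetical interface).
    Shards already present in dest_dir are reused, so a restarted pod on the
    same node does not download them again.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def pull_one(name):
        target = dest / Path(name).name
        if target.exists():
            return target
        # Write to a temporary name, then rename, so a crash mid-download
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(name))
        partial.replace(target)
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pull_one, shard_names))
```

Threads are enough here because the work is I/O-bound; the parallelism is what lets a single pod use the independent bandwidth the pattern relies on.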

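The right pattern depends on the workload. For most training runs, the production setup is a combination of local NVMe (for ephemeral hot data) and FSS (for shared checkpoints and weights). Block Volume multi-attach is an option for the specific case where a single immutable artifact needs to look like a local disk to many readers.

Operational Considerations

Running bare-metal GPU workloads on raw OCI primitives is possible. Oracle provides a Terraform-based HPC stack and the oci-hpc-oke quickstart, but you end up managing a substantial Kubernetes operations layer yourself. The table below shows what TrueFoundry adds on top.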

Task	Raw OCI / OKE	With TrueFoundry on OKE
Cluster bootstrap	Provision OKE cluster + GPU node pools + RDMA-enabled images via Terraform / Resource Manager	Same OKE + node pool setup, plus TrueFoundry's Generic compute-plane attach flow — helm-based agent + addon install generated by the platform UI
Distributed training	Install MPI Operator / Training Operator / KubeRay via helm; author MPIJob/PyTorchJob/RayJob manifests with hostPath mounts and RDMA topology affinity	Same operator install (via helm) and same manifests; the platform handles deployment lifecycle, GitOps versioning, run history, and Prometheus-backed observability — not the operator install itself
Model weight access	Mount FSS, Block Volume, or pull from Object Storage; wire init containers manually	Standard Kubernetes PVCs and init-container hooks managed through the platform UI
Workload identity	Create OCI IAM policy, ServiceAccount, configure SDK	Create ServiceAccount through the UI; OCI IAM binding configured per Oracle's standard Workload Identity setup
Autoscaling	Configure OKE Cluster Autoscaler and GPU node pool scaling manually	OKE Cluster Autoscaler triggered by Kubernetes resource requests; KEDA + Prometheus drive application-level autoscaling and scale-to-zero
GPU health	Install NVIDIA GPU Operator, configure DCGM	GPU Operator deployed and managed by the agent; DCGM Exporter metrics are scraped by the platform's kube-prometheus-stack and surface in the observability UI

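The pattern is consistent: OCI provides the bare-metal compute and networking primitives, and TrueFoundry provides the Kubernetes operational layer on top of them.

Summary

TrueFoundry on OCI is a Kubernetes-native platform running on Oracle's bare-metal stack. The Compute Plane is an OKE cluster, and workloads use OCI's RoCE v2 cluster network and workload identity through standard Kubernetes patterns. The platform also packages an open-source and CNCF-affiliated operational layer (GitOps, observability, autoscaling, GPU Operator) as a managed deployment. The result is less platform-engineering overhead for operating bare-metal GPU workloads, while OCI-specific configuration stays visible rather than hidden.

A practical evaluation path is a reference deployment on a small OKE cluster (typically one or two BM.GPU shapes plus a CPU pool for the platform addons). This lets you validate the architecture before scaling to a full supercluster.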
