Fractional GPUs in Kubernetes

By Shubham Rai

Published: August 6, 2024


Overview

The GenAI revolution has led to a surge in GPU demand across the industry. Companies want to train, fine-tune and deploy LLMs at massive scale. This has reduced availability and pushed up prices for the latest GPUs. Companies running workloads on public clouds have suffered from high prices and growing uncertainty about GPU availability.

These new realities make it critical to utilize available GPUs to the maximum extent. Partitioning or sharing a single GPU between multiple processes helps with this. Implementing it on top of Kubernetes gives a winning combination: we get autoscaling and a sophisticated scheduler to help optimize GPU utilization.

Options for sharing GPUs

To share a single GPU between multiple workloads in Kubernetes, these are the options we have:

MIG

Multi-Instance GPU (MIG) allows GPUs based on the NVIDIA Ampere architecture and later (such as the NVIDIA A100) to be securely partitioned into separate GPU instances for CUDA applications. Each partition is fully isolated in memory and compute and can provide predictable throughput and latency.

A single NVIDIA A100 GPU can be partitioned into up to 7 isolated GPU instances. Each partition appears as a separate GPU to the software running on a partitioned node. NVIDIA's MIG documentation lists the other supported GPUs and the number of partitions each one supports.

Pros

Full compute and memory isolation that can support predictable latency and throughput
nvidia-device-plugin for Kubernetes has native support for MIG

Cons

Only supported on recent GPUs like the A100, H100 and A30, which limits the hardware options
The number of partitions has a hard limit of 7 for most architectures. This is restrictive if we are running smaller workloads with limited memory and compute requirements
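To make the partition limit concrete, here is a minimal sketch that maps a workload's memory need to the smallest MIG profile that covers it. It assumes an A100 40GB and uses the nominal profile sizes for that card; `pick_mig_profile` is a hypothetical helper written for this illustration, and other GPUs expose different profiles.

```shell
# Hypothetical helper: pick the smallest A100 40GB MIG profile whose nominal
# memory covers the requested amount (in GB). Profile names and per-GPU
# instance counts are for the A100 40GB only; usable memory is slightly
# below the nominal figure.
pick_mig_profile() {
    need_gb=$1
    if   [ "$need_gb" -le 5 ];  then echo "1g.5gb (up to 7 per GPU)"
    elif [ "$need_gb" -le 10 ]; then echo "2g.10gb (up to 3 per GPU)"
    elif [ "$need_gb" -le 20 ]; then echo "3g.20gb (up to 2 per GPU)"
    elif [ "$need_gb" -le 40 ]; then echo "7g.40gb (1 per GPU)"
    else
        echo "does not fit on a single A100 40GB"
        return 1
    fi
}

pick_mig_profile 2   # prints: 1g.5gb (up to 7 per GPU)
pick_mig_profile 8   # prints: 2g.10gb (up to 3 per GPU)
```

Even a workload that needs only 2 GB occupies a full 1g.5gb slice, so no more than 7 such workloads fit on one GPU.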

Time slicing

Time slicing enables multiple workloads to be scheduled on the same GPU. Compute time is shared between the processes, which are interleaved in time. A cluster administrator can configure a cluster or node to advertise a certain number of replicas per GPU, and the nodes are reconfigured accordingly.

Pros

No upper limit to the number of pods that can share a single GPU
Can work with older versions of NVIDIA GPUs

Cons

No memory or fault isolation. There is no built-in way to make sure a workload doesn’t overrun the memory assigned to it.
Time slicing provides equal time to all running processes. A pod running multiple processes can hog the GPU much more than intended
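Because nothing enforces a memory share under time slicing, it helps to work out a per-pod budget up front and have each workload cap itself to it. A minimal sketch of that arithmetic follows; `per_pod_budget_mib` is a hypothetical helper, and the 16384 MiB total and 10 replicas are assumed example values, not measurements.

```shell
# Hypothetical helper: an even memory budget per pod, in MiB.
# Usage: per_pod_budget_mib TOTAL_GPU_MEMORY_MIB REPLICAS
# This is only a planning number; time slicing does not enforce it.
per_pod_budget_mib() {
    echo $(( $1 / $2 ))
}

# Assumed example: a GPU with 16384 MiB shared by 10 replicas.
per_pod_budget_mib 16384 10   # prints: 1638
```

Any pod that allocates more than its budget takes memory away from its neighbours, which can cause out-of-memory failures in other pods on the same GPU.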

There are other options available to us for GPU sharing like MPS and vGPUs but they don’t have native support in `nvidia-device-plugin` and we won’t be discussing them here.

Time slicing Demo

Let's go through a short walkthrough of how to use time slicing on Azure Kubernetes Service (AKS). We start with an already existing Kubernetes cluster.

1. Add a GPU-enabled node pool to the cluster

 
$ az aks nodepool add \
    --name <nodepool-name> \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --node-vm-size Standard_NC4as_T4_v3 \
    --node-count 1

This adds a new node pool with a single node, carrying a single NVIDIA T4 GPU, to the existing AKS cluster. This can be verified by running the following command:

 
$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'

2. Install the NVIDIA GPU Operator


$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update
$ helm install gpu-operator nvidia/gpu-operator \
    -n gpu-operator --create-namespace \
    --set driver.enabled=false \
    --set toolkit.enabled=false \
    --set operator.runtimeClass=nvidia-container-runtime

3. Once the operator is installed, create a time slicing configuration and configure the whole cluster to slice GPU resources wherever they are available:


$ kubectl apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 10
EOF

# Reconfigure the GPU Operator to pick up the ConfigMap
$ kubectl patch clusterpolicy/cluster-policy \
    -n gpu-operator --type merge \
    -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "any"}}}}'

4. Verify that the existing node has been successfully reconfigured


$ kubectl get nodes <gpu-node-name> -o 'jsonpath={.status.allocatable.nvidia\.com\/gpu}'
10
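The reported value is the number of physical GPUs on the node multiplied by the configured `replicas`. A minimal sketch of that check follows; `expected_allocatable` and `check_allocatable` are hypothetical helpers, and the numbers passed in are the ones from this walkthrough.

```shell
# Hypothetical helper: how many nvidia.com/gpu resources a node should
# advertise under time slicing.
# Usage: expected_allocatable PHYSICAL_GPUS REPLICAS
expected_allocatable() {
    echo $(( $1 * $2 ))
}

# Hypothetical helper: compare the value reported by the node with the
# expected one.
# Usage: check_allocatable ACTUAL PHYSICAL_GPUS REPLICAS
check_allocatable() {
    expected=$(expected_allocatable "$2" "$3")
    if [ "$1" = "$expected" ]; then
        echo "ok: node advertises $1 nvidia.com/gpu"
    else
        echo "mismatch: node advertises '$1', expected $expected"
        return 1
    fi
}

# Values from the walkthrough: reported 10, 1 physical T4, replicas: 10.
check_allocatable 10 1 10   # prints: ok: node advertises 10 nvidia.com/gpu
```

On a live cluster, the first argument would be the output of the `kubectl get nodes` command from this step. If it still shows the physical GPU count, the device plugin has probably not yet picked up the new configuration.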

5. We can verify the configuration by creating a deployment with 4 replicas, each requesting 1 nvidia.com/gpu resource


$ kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-verification
  labels:
    app: time-slicing-verification
spec:
  replicas: 4
  selector:
    matchLabels:
      app: time-slicing-verification
  template:
    metadata:
      labels:
        app: time-slicing-verification
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      hostPID: true
      containers:
        - name: cuda-sample-vector-add
          image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
          command: ["/bin/bash", "-c", "--"]
          args:
            - while true; do /cuda-samples/vectorAdd; done
          resources:
            limits:
              nvidia.com/gpu: 1
EOF

Verify that all the pods of this deployment have come up on the existing GPU node and that the node was able to accommodate all of them.
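One way to run this check is to count how many distinct nodes the pods landed on. The sketch below uses a hypothetical helper, `count_distinct_nodes`, and feeds it made-up node names so that it runs without a cluster; the comment shows how the same helper might be fed from `kubectl`.

```shell
# Hypothetical helper: reads one node name per line on stdin and prints
# how many distinct names it saw.
count_distinct_nodes() {
    sort -u | grep -c .
}

# Made-up sample: four pods that all landed on the same node.
printf 'gpu-node-0\ngpu-node-0\ngpu-node-0\ngpu-node-0\n' | count_distinct_nodes   # prints: 1

# Against a real cluster, the node names could come from the deployment's pods:
#   kubectl get pods -l app=time-slicing-verification \
#       -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | count_distinct_nodes
```

A result of 1, with all four pods in the Running state, means the single time-sliced GPU node accommodated every replica.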

Conclusion

The GenAI revolution has changed the landscape of GPU requirements and made responsible resource utilization more critical than ever. Both approaches outlined here have shortcomings, but in the current scenario there is no way around being careful with GPU costs.
