GPU Software on Kubernetes for NVIDIA H100/H200

Overview

Kubernetes v1.34 (beta) is now available on FPT Cloud Managed Kubernetes with GPU. This version introduces a new GPU Software option (GPU Operator), a new GPU Driver Installation Type (Managed-install), and a GPU Sharing Strategy (MIG).

1. GPU Software: GPU Operator

GPU Operator is available only on Kubernetes v1.34. When you create a v1.34 cluster, the GPU Software section includes a GPU Operator option alongside the existing options.

Warning

The Managed-install driver type requires GPU Operator. If another GPU Software option is selected, the Managed-install option does not appear.

2. GPU Driver Installation Type

Type	Description	Allow version update after cluster creation?	Notes
Pre-install	Driver is pre-installed at the selected version	No	If the version is deprecated, the corresponding worker loses support
User-install	User installs the driver manually	No	Incompatibilities may occur; FCI will not support such cases
Managed-install	Driver is installed automatically and can be updated after cluster creation	Yes	Requires GPU Operator. Applies to H100 and H200 GPUs only
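The rules in the table above can be expressed as a small validation check. The following Python sketch is illustrative only: the function names and the string values used for the GPU Software option and GPU model are assumptions made for this example, not part of any FPT Cloud API.

```python
# Illustrative sketch: names and string values are assumptions for this
# example, not an FPT Cloud API.
DRIVER_TYPES = {"Pre-install", "User-install", "Managed-install"}
MANAGED_INSTALL_GPUS = {"H100", "H200"}


def validate_driver_choice(driver_type: str, gpu_software: str, gpu_model: str) -> list[str]:
    """Return a list of rule violations (empty when the choice is valid)."""
    if driver_type not in DRIVER_TYPES:
        return [f"unknown driver installation type: {driver_type}"]
    problems = []
    if driver_type == "Managed-install":
        # Managed-install depends on GPU Operator and on the GPU model.
        if gpu_software != "GPU Operator":
            problems.append("Managed-install requires GPU Operator")
        if gpu_model not in MANAGED_INSTALL_GPUS:
            problems.append("Managed-install applies to H100 and H200 GPUs only")
    return problems


def can_update_driver_after_creation(driver_type: str) -> bool:
    """Only Managed-install allows a driver version update after cluster creation."""
    return driver_type == "Managed-install"
```

For example, validate_driver_choice("Managed-install", "GPU Operator", "H100") returns an empty list, while selecting Managed-install with a different GPU Software option reports a violation.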

Driver versions supported for Managed-install

Version	CUDA
535.247.01	12.2
550.163.01	12.4
570.158.01	12.8
580.82.07	13.0
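When choosing a driver for a workload built against a particular CUDA version, the table can be treated as a lookup. The sketch below copies the version pairs from the table; the helper names are hypothetical, and it assumes the CUDA column gives the highest CUDA version each driver supports.

```python
# Driver/CUDA pairs copied from the table above. Helper names are illustrative.
MANAGED_INSTALL_DRIVERS = {
    "535.247.01": "12.2",
    "550.163.01": "12.4",
    "570.158.01": "12.8",
    "580.82.07": "13.0",
}


def _as_tuple(version: str) -> tuple[int, ...]:
    """Turn '12.4' into (12, 4) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def cuda_for_driver(driver_version: str) -> str:
    """Return the CUDA version listed for a supported Managed-install driver."""
    try:
        return MANAGED_INSTALL_DRIVERS[driver_version]
    except KeyError:
        raise ValueError(f"{driver_version} is not in the Managed-install table") from None


def drivers_supporting_cuda(min_cuda: str) -> list[str]:
    """List drivers whose CUDA version is at least min_cuda.

    Assumption: the CUDA column is the highest CUDA version the driver supports.
    """
    needed = _as_tuple(min_cuda)
    return [
        driver
        for driver, cuda in MANAGED_INSTALL_DRIVERS.items()
        if _as_tuple(cuda) >= needed
    ]
```

Under that assumption, drivers_supporting_cuda("12.8") returns the 570.158.01 and 580.82.07 entries.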

Notes on upgrading Managed-install drivers

The system will wait for all GPU workloads on the node to be undeployed before upgrading the driver.
To trigger an upgrade manually, delete the nvidia-driver-* pods on the corresponding node in the fptcloud-gpu-operator namespace.
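The manual trigger can be scripted around kubectl. The following Python sketch is one possible approach, not an official procedure: it lists the pods scheduled on a node, keeps those whose names start with nvidia-driver-, and deletes them. The namespace and pod name prefix come from the note above; the function names and the node name passed in are hypothetical, and kubectl is assumed to be installed and pointed at the cluster.

```python
import subprocess

NAMESPACE = "fptcloud-gpu-operator"   # namespace named in the note above
DRIVER_POD_PREFIX = "nvidia-driver-"  # pod name prefix named in the note above


def select_driver_pods(pod_names: list[str]) -> list[str]:
    """Keep only pod names matching nvidia-driver-*."""
    return [name for name in pod_names if name.startswith(DRIVER_POD_PREFIX)]


def list_pods_cmd(node: str) -> list[str]:
    """Build a kubectl command that prints the names of pods on one node."""
    return [
        "kubectl", "get", "pods", "-n", NAMESPACE,
        "--field-selector", f"spec.nodeName={node}",
        "-o", "jsonpath={.items[*].metadata.name}",
    ]


def delete_pods_cmd(pod_names: list[str]) -> list[str]:
    """Build a kubectl command that deletes the given pods."""
    return ["kubectl", "delete", "pod", "-n", NAMESPACE, *pod_names]


def trigger_driver_upgrade(node: str) -> None:
    """Delete the driver pods on a node so they are recreated."""
    result = subprocess.run(
        list_pods_cmd(node), check=True, capture_output=True, text=True
    )
    pods = select_driver_pods(result.stdout.split())
    if pods:
        subprocess.run(delete_pods_cmd(pods), check=True)
```

Keeping the command construction separate from execution makes it possible to print and review the kubectl commands before running them against a production node.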

3. Relationship between Base Worker Group and Non-base Worker Group

Base Worker Group constraints

The base worker group does not support H100/H200 GPUs.
Because Managed-install is compatible only with H100/H200, the base worker group cannot use Managed-install. The valid options for the base worker group are Pre-install and User-install.

How Driver Type is synchronized between Worker Groups

The driver type of the base worker group governs non-base worker groups according to the following rules:

Base driver	GPU of non-base worker group	Assigned driver
Pre-install	Any	Pre-install
User-install	Any	User-install

4. DRA on Kubernetes v1.34

DRA (Dynamic Resource Allocation) is a Kubernetes feature, GA as of v1.34, that lets workloads request GPU resources more flexibly than the traditional device plugin mechanism (nvidia.com/gpu). The Kubernetes DRA API enables dynamic GPU allocation between pods and fine-grained resource control, which can improve GPU utilization and reduce costs. Examples include requesting GPUs by attribute (driver version, memory, and so on), sharing a GPU across multiple containers, and allocating GPUs on demand.

Prerequisites for using DRA on Managed Kubernetes with GPU:

The cluster must run Kubernetes v1.34 or later.
The driver installation type must be Managed-install, with a driver major version of 570 or later. See the driver version table above for current versions.
kubectl must be installed and configured to access the cluster.
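These prerequisites can be checked before attempting to use DRA. The Python sketch below is a minimal illustration: the function names are hypothetical, the version strings are assumed to look like "v1.34" or "1.34.2" for Kubernetes and "570.158.01" for the driver, and the kubectl check only confirms that the binary is on the PATH, not that it can reach the cluster.

```python
import shutil


def parse_k8s_version(version: str) -> tuple[int, int]:
    """Parse 'v1.34' or '1.34.2' into (major, minor)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def dra_prerequisite_problems(k8s_version: str, driver_type: str, driver_version: str) -> list[str]:
    """Return the unmet DRA prerequisites (empty when all are met)."""
    problems = []
    if parse_k8s_version(k8s_version) < (1, 34):
        problems.append("cluster must run Kubernetes v1.34 or later")
    if driver_type != "Managed-install":
        problems.append("driver installation type must be Managed-install")
    elif int(driver_version.split(".")[0]) < 570:
        # The driver version is only checked for Managed-install.
        problems.append("driver major version must be >= 570")
    return problems


def kubectl_available() -> bool:
    """Check only that a kubectl binary is on the PATH."""
    return shutil.which("kubectl") is not None
```

For example, a v1.34 cluster using Managed-install with driver 550.163.01 fails only the driver version check, because 550 is below the required major version.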

For detailed installation and usage instructions, refer to: Using DRA for GPU
