GPU Software on Kubernetes for NVIDIA H100/H200
Overview
Kubernetes v1.34 (beta) is now available on FPT Cloud Managed Kubernetes with GPU. This version introduces a new GPU Software option (GPU Operator), a new GPU Driver Installation Type (Managed-install), and a GPU Sharing Strategy (MIG).
1. GPU Software: GPU Operator
To use GPU Operator, you must select Kubernetes version 1.34. When creating a v1.34 cluster, the GPU Software section will include a GPU Operator option (in addition to the existing options).
Managed-install Driver requires GPU Operator to function. If another GPU Software option is selected, the Managed-install option will not appear.
2. GPU Driver Installation Type
| Type | Description | Allow version update after cluster creation? | Notes |
|---|---|---|---|
| Pre-install | Driver is pre-installed at the selected version | No | If the version is deprecated, the corresponding worker loses support |
| User-install | User installs the driver manually | No | Incompatibilities may occur; FCI will not support such cases |
| Managed-install | Driver is installed automatically and can be updated after cluster creation | Yes | Requires GPU Operator. Applies to H100 and H200 GPUs only |
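The selection rules in the table can be sketched as a small lookup. This is an illustrative model only, not FPT Cloud's implementation: the function names and the GPU model strings are hypothetical, and it assumes Pre-install and User-install stay selectable whichever GPU Software option is chosen.

```python
# Illustrative model of the driver-type table above (hypothetical names).
MANAGED_INSTALL_GPUS = {"H100", "H200"}  # assumed model identifiers


def available_driver_types(gpu_software: str, gpu_model: str) -> list[str]:
    """Return the driver installation types a worker group could select."""
    # Assumption: Pre-install and User-install are always offered.
    types = ["Pre-install", "User-install"]
    # Managed-install requires GPU Operator and applies to H100/H200 only.
    if gpu_software == "GPU Operator" and gpu_model in MANAGED_INSTALL_GPUS:
        types.append("Managed-install")
    return types


def allows_version_update(driver_type: str) -> bool:
    """Only Managed-install drivers can be updated after cluster creation."""
    return driver_type == "Managed-install"
```

For example, `available_driver_types("GPU Operator", "H100")` includes Managed-install, while the same call with any other GPU Software value does not.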
Driver versions supported for Managed-install
| Version | CUDA |
|---|---|
| 535.247.01 | 12.2 |
| 550.163.01 | 12.4 |
| 570.158.01 | 12.8 |
| 580.82.07 | 13.0 |
Notes on upgrading Managed-install drivers
- The system will wait for all GPU workloads on the node to be undeployed before upgrading the driver.
- To trigger a manual upgrade: go to the fptcloud-gpu-operator namespace and delete the nvidia-driver-* pods on the corresponding node.
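The manual trigger described above could be scripted roughly as follows. This is a minimal sketch that assumes kubectl is installed and already points at the cluster. The helper names are hypothetical, and matching pods by the nvidia-driver- name prefix is taken from the note above rather than from a documented label.

```python
import subprocess

NAMESPACE = "fptcloud-gpu-operator"
DRIVER_POD_PREFIX = "pod/nvidia-driver-"  # `kubectl get -o name` prints "pod/<name>"


def list_pods_cmd(node: str) -> list[str]:
    """Build the kubectl command that lists pods scheduled on one node."""
    return ["kubectl", "get", "pods", "-n", NAMESPACE,
            "--field-selector", f"spec.nodeName={node}", "-o", "name"]


def select_driver_pods(kubectl_output: str) -> list[str]:
    """Keep only the driver pods from `kubectl get pods -o name` output."""
    names = (line.strip() for line in kubectl_output.splitlines())
    return [name for name in names if name.startswith(DRIVER_POD_PREFIX)]


def delete_pods_cmd(pods: list[str]) -> list[str]:
    """Build the kubectl command that deletes the given pods."""
    return ["kubectl", "delete", "-n", NAMESPACE, *pods]


def trigger_driver_upgrade(node: str, dry_run: bool = True) -> list[str]:
    """Find the driver pods on a node and delete them unless dry_run is set."""
    result = subprocess.run(list_pods_cmd(node), check=True,
                            capture_output=True, text=True)
    pods = select_driver_pods(result.stdout)
    if pods and not dry_run:
        subprocess.run(delete_pods_cmd(pods), check=True)
    return pods
```

Calling `trigger_driver_upgrade("<node-name>")` with the default `dry_run=True` only reports which pods would be deleted, which is a reasonable first step before removing anything.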
3. Relationship between Base Worker Group and Non-base Worker Group
Base Worker Group constraints
- The base worker group does not support H100/H200 GPUs.
- Because Managed-install is only compatible with H100/H200, the base worker group cannot select Managed-install. Valid options for the base worker group are Pre-install or User-install.
How Driver Type is synchronized between Worker Groups
The driver type of the base worker group governs non-base worker groups according to the following rules:
| Base driver | GPU of non-base WG | Assigned driver |
|---|---|---|
| Pre-install | Any | Pre-install |
| User-install | Any | User-install |
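The two rows above amount to "non-base worker groups inherit the base driver type". A minimal sketch of that rule, using hypothetical names and covering only the rows listed in the table:

```python
# Driver types the base worker group may select (Managed-install is excluded
# because the base worker group does not support H100/H200).
BASE_DRIVER_TYPES = ("Pre-install", "User-install")


def assigned_driver(base_driver: str, non_base_gpu: str) -> str:
    """Return the driver type assigned to a non-base worker group."""
    if base_driver not in BASE_DRIVER_TYPES:
        raise ValueError(f"invalid base driver type: {base_driver!r}")
    # Both table rows list the non-base GPU as "Any", so the GPU model
    # does not change the result in this sketch.
    return base_driver
```

The `non_base_gpu` parameter is kept only to mirror the table columns. It has no effect on the rows shown here.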
4. DRA on Kubernetes v1.34
DRA (Dynamic Resource Allocation) is a Kubernetes feature (GA since v1.34) that lets workloads request GPU resources more flexibly than the traditional device plugin mechanism (nvidia.com/gpu). The Kubernetes DRA API enables dynamic GPU allocation between pods and fine-grained resource control, which can improve GPU utilization and reduce costs. Examples: requesting GPUs by attribute (driver version, memory, etc.), sharing a GPU across multiple containers, or allocating GPUs dynamically on demand.
Prerequisites for using DRA on Managed Kubernetes with GPU:
- Cluster must run Kubernetes v1.34 or later.
- Use driver installation type Managed-install with major version >= 570. See the driver version table above for current versions.
- kubectl installed and configured to access the cluster.
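The version-related prerequisites can be checked with a short helper. This is a sketch under assumptions: the function names are hypothetical, version strings are assumed to look like "v1.34" and "570.158.01", and the kubectl requirement is not checked.

```python
def k8s_major_minor(version: str) -> tuple[int, int]:
    """Parse 'v1.34' or '1.34.2' into (1, 34)."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def driver_major(version: str) -> int:
    """Parse '570.158.01' into 570."""
    return int(version.split(".")[0])


def meets_dra_prerequisites(k8s_version: str, driver_type: str,
                            driver_version: str) -> bool:
    """Check the cluster-side DRA prerequisites listed above."""
    return (
        k8s_major_minor(k8s_version) >= (1, 34)   # Kubernetes v1.34 or later
        and driver_type == "Managed-install"      # required installation type
        and driver_major(driver_version) >= 570   # driver major version >= 570
    )
```

With the driver versions from the table above, 570.158.01 and 580.82.07 pass this check on a v1.34 Managed-install cluster, while 535.247.01 and 550.163.01 do not.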
For detailed installation and usage instructions, refer to: Using DRA for GPU