
GPU Software on Kubernetes for NVIDIA H100/H200

Overview

Kubernetes v1.34 (beta) is now available on FPT Cloud Managed Kubernetes with GPU. This version introduces a new GPU Software option (GPU Operator), a new GPU Driver Installation Type (Managed-install), and a GPU Sharing Strategy (MIG).

1. GPU Software: GPU Operator

To use GPU Operator, you must select Kubernetes version 1.34. When creating a v1.34 cluster, the GPU Software section will include a GPU Operator option (in addition to the existing options).

Warning

Managed-install Driver requires GPU Operator to function. If another GPU Software option is selected, the Managed-install option will not appear.

2. GPU Driver Installation Type

| Type | Description | Allow version update after cluster creation? | Notes |
|---|---|---|---|
| Pre-install | Driver is pre-installed at the selected version | No | If the version is deprecated, the corresponding worker loses support |
| User-install | User installs the driver manually | No | Incompatibilities may occur; FCI will not support such cases |
| Managed-install | Driver is installed automatically and can be updated after cluster creation | Yes | Requires GPU Operator. Applies to H100 and H200 GPUs only |

Driver versions supported for Managed-install

| Version | CUDA |
|---|---|
| 535.247.01 | 12.2 |
| 550.163.01 | 12.4 |
| 570.158.01 | 12.8 |
| 580.82.07 | 13.0 |

Notes on upgrading Managed-install drivers

  • The system will wait for all GPU workloads on the node to be undeployed before upgrading the driver.
  • To trigger a manual upgrade: go to the fptcloud-gpu-operator namespace and delete the nvidia-driver-* pods on the corresponding node.
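
The manual trigger above can be scripted. The following is a minimal sketch, assuming `kubectl` is on the PATH and already configured for the cluster. The function names are hypothetical and not part of any FPT Cloud tooling; only the namespace and the `nvidia-driver-*` pod prefix come from this guide.

```python
import subprocess

NAMESPACE = "fptcloud-gpu-operator"       # namespace named in this guide
DRIVER_POD_PREFIX = "pod/nvidia-driver-"  # nvidia-driver-* as printed by `-o name`


def list_pods_cmd(node: str) -> list[str]:
    """Build the kubectl command that lists pod names scheduled on one node."""
    return [
        "kubectl", "get", "pods",
        "-n", NAMESPACE,
        "--field-selector", f"spec.nodeName={node}",
        "-o", "name",
    ]


def select_driver_pods(kubectl_output: str) -> list[str]:
    """Keep only the nvidia-driver-* entries from `kubectl get pods -o name` output."""
    return [
        line.strip()
        for line in kubectl_output.splitlines()
        if line.strip().startswith(DRIVER_POD_PREFIX)
    ]


def delete_pods_cmd(pods: list[str]) -> list[str]:
    """Build the kubectl command that deletes the given pods."""
    return ["kubectl", "delete", "-n", NAMESPACE, *pods]


def trigger_driver_upgrade(node: str, dry_run: bool = True) -> list[str]:
    """Delete the driver pods on `node`; with dry_run=True only report them."""
    listing = subprocess.run(
        list_pods_cmd(node), check=True, capture_output=True, text=True
    )
    pods = select_driver_pods(listing.stdout)
    if pods and not dry_run:
        subprocess.run(delete_pods_cmd(pods), check=True)
    return pods
```

Calling `trigger_driver_upgrade("<node-name>")` with the default `dry_run=True` only returns the matching pod names, so the selection can be reviewed before anything is deleted. Pass `dry_run=False` to delete them.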

3. Relationship between Base Worker Group and Non-base Worker Group

Base Worker Group constraints

  • The base worker group does not support H100/H200 GPUs.
  • Because Managed-install is only compatible with H100/H200, the base worker group cannot select Managed-install. Valid options for the base worker group are Pre-install or User-install.

How Driver Type is synchronized between Worker Groups

The driver type of the base worker group governs non-base worker groups according to the following rules:

| Base driver | GPU of non-base WG | Assigned driver |
|---|---|---|
| Pre-install | Any | Pre-install |
| User-install | Any | User-install |
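
The constraints and synchronization rules in this section can be written as two small checks. This is an illustrative sketch only: the function names and the GPU model strings are assumptions, and it encodes nothing beyond what the list and table above state.

```python
H100_H200 = {"H100", "H200"}
BASE_DRIVER_TYPES = {"Pre-install", "User-install"}


def validate_base_worker_group(gpu: str, driver_type: str) -> None:
    """Raise ValueError if the base worker group breaks a documented constraint."""
    if gpu in H100_H200:
        raise ValueError("base worker group does not support H100/H200")
    if driver_type not in BASE_DRIVER_TYPES:
        raise ValueError("base worker group must use Pre-install or User-install")


def assigned_driver(base_driver: str, non_base_gpu: str) -> str:
    """Return the driver type a non-base worker group receives."""
    if base_driver not in BASE_DRIVER_TYPES:
        raise ValueError(f"unsupported base driver type: {base_driver}")
    # Both table rows list "Any" GPU, so the GPU model does not change the result.
    return base_driver
```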

4. DRA on Kubernetes v1.34

DRA (Dynamic Resource Allocation) is a Kubernetes feature, generally available (GA) from v1.34, that lets workloads request GPU resources more flexibly than the traditional device plugin mechanism (nvidia.com/gpu). The Kubernetes DRA API supports dynamic GPU allocation between pods and fine-grained resource control, which can improve GPU utilization and reduce costs. Examples include requesting GPUs by attribute (driver version, memory, etc.), sharing a GPU across multiple containers, and allocating GPUs dynamically on demand.

Prerequisites for using DRA on Managed Kubernetes with GPU:

  • Cluster must run Kubernetes v1.34 or later.
  • The driver installation type must be Managed-install, with a driver major version of 570 or later. See the driver version table above for current versions.
  • kubectl must be installed and configured to access the cluster.
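
The version-related prerequisites can be checked before enabling DRA. The sketch below is a hypothetical helper, not an FPT Cloud API. It takes the cluster version, driver installation type, and driver version as plain strings, and reuses the Managed-install version table from this page.

```python
# Managed-install driver versions and their CUDA versions, copied from the table above.
MANAGED_INSTALL_DRIVERS = {
    "535.247.01": "12.2",
    "550.163.01": "12.4",
    "570.158.01": "12.8",
    "580.82.07": "13.0",
}
MIN_K8S = (1, 34)
MIN_DRIVER_MAJOR = 570


def parse_k8s_version(version: str) -> tuple[int, int]:
    """Turn 'v1.34.1' or '1.34' into (1, 34)."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def dra_blockers(k8s_version: str, driver_type: str, driver_version: str) -> list[str]:
    """Return the unmet prerequisites; an empty list means all checks pass."""
    blockers = []
    if parse_k8s_version(k8s_version) < MIN_K8S:
        blockers.append("Kubernetes must be v1.34 or later")
    if driver_type != "Managed-install":
        blockers.append("driver installation type must be Managed-install")
    elif driver_version not in MANAGED_INSTALL_DRIVERS:
        blockers.append(f"driver {driver_version} is not in the supported table")
    elif int(driver_version.split(".")[0]) < MIN_DRIVER_MAJOR:
        blockers.append("driver major version must be >= 570")
    return blockers
```

For example, `dra_blockers("v1.34.0", "Managed-install", "570.158.01")` returns an empty list, while passing driver `550.163.01` reports that the major version is too low.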

For detailed installation and usage instructions, refer to: Using DRA for GPU