Monitoring

The monitoring feature is bundled with the AI Infrastructure – Metal Cloud service.

Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.

Metrics, reported for a cluster (in the same VPC) or for a single server:
Total number of nodes and down nodes
GPU model, Driver & CUDA version
Power state
Uptime
Total number of GPUs and down GPUs
GPU Utilization
GPU Memory
CPU Utilization
System Memory
Root Storage Usage
Local Disk Usage
Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage
Network Bandwidth Inbound/Outbound
Network Packets Sent/Received
Network Error Rate Receive/Transmit
Network InfiniBand Bandwidth/Packet/Error
System Fan Speed
System Voltage
Common Alerts

*For custom or advanced metrics, a Cloud Monitoring (FMON) service is available on request for an additional charge.


Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the ask query parameter:

GET https://ai-docs.fptcloud.com/fpt-gpu-cloud/metal-cloud/tutorials/monitoring.md?ask=<question>

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
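
The GET request described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper names `build_ask_url` and `ask` are hypothetical, and the response is treated as plain text because its exact format is not specified here beyond "a direct answer and relevant excerpts and sources". The main point is that the question must be URL-encoded before it is placed in the `ask` parameter.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Page URL exactly as given in this documentation.
PAGE_URL = "https://ai-docs.fptcloud.com/fpt-gpu-cloud/metal-cloud/tutorials/monitoring.md"


def build_ask_url(question: str, page_url: str = PAGE_URL) -> str:
    """Return the page URL with the question URL-encoded in the `ask` parameter."""
    question = question.strip()
    if not question:
        raise ValueError("question must be a non-empty string")
    # urlencode escapes spaces, '?', '&' and other reserved characters.
    return f"{page_url}?{urlencode({'ask': question})}"


def ask(question: str, timeout: float = 30.0) -> str:
    """Send the GET request and return the response body as text.

    Assumption: the body is text; its structure is not documented on this page.
    """
    with urlopen(build_ask_url(question), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # Only builds and prints the URL; call ask(...) to actually send the request.
    print(build_ask_url("Which network metrics are shown for a single server?"))
```

Keeping URL construction separate from the network call makes the encoding step easy to check without sending a request.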