Monitoring

The monitoring feature is bundled with the AI Infrastructure – Metal Cloud service.

Collecting and visualizing metrics, logs, and events helps you identify potential issues and optimize future workloads. You may select the observability solution that best fits your needs.

Metrics	A Cluster (in the same VPC)	A single Server
Total number of nodes and down nodes	✔
GPU model, Driver & CUDA version		✔
Power state	✔
Uptime		✔
Total number of GPUs and down GPUs	✔	✔
GPU Utilization	✔	✔
GPU Memory	✔	✔
CPU Utilization	✔	✔
System Memory	✔	✔
Root Storage Usage	✔	✔
Local Disk Usage	✔	✔
Per-GPU details: power consumption, temperature, GPU utilization, VRAM usage		✔
Network Bandwidth Inbound/Outbound	✔	✔
Network Packets Sent/Received	✔	✔
Network Error rate Receive/Transmit		✔
Network InfiniBand Bandwidth/Packet/Error		✔
System Fan Speed		✔
System Voltage		✔
Common Alerts	✔

*For custom or advanced metrics, we offer the Cloud Monitoring (FMON) service on request, at an additional charge.
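
The coverage in the table above can also be written as a small lookup structure, which may be handy when scripting a check for whether a metric is reported at cluster or server scope. The following is a minimal Python sketch: the names `COVERAGE`, `metrics_for`, and `server_only` are hypothetical and not part of any FPT Cloud API, and the data only mirrors the checkmarks in the table.

```python
# Hypothetical encoding of the metrics table above; NOT an FPT Cloud API.
# Each metric maps to the set of scopes where the table shows a checkmark.
CLUSTER, SERVER = "cluster", "server"

COVERAGE = {
    "Total number of nodes and down nodes": {CLUSTER},
    "GPU model, Driver & CUDA version": {SERVER},
    "Power state": {CLUSTER},
    "Uptime": {SERVER},
    "Total number of GPUs and down GPUs": {CLUSTER, SERVER},
    "GPU Utilization": {CLUSTER, SERVER},
    "GPU Memory": {CLUSTER, SERVER},
    "CPU Utilization": {CLUSTER, SERVER},
    "System Memory": {CLUSTER, SERVER},
    "Root Storage Usage": {CLUSTER, SERVER},
    "Local Disk Usage": {CLUSTER, SERVER},
    "Per-GPU details (power, temperature, utilization, VRAM)": {SERVER},
    "Network Bandwidth Inbound/Outbound": {CLUSTER, SERVER},
    "Network Packets Sent/Received": {CLUSTER, SERVER},
    "Network Error rate Receive/Transmit": {SERVER},
    "Network InfiniBand Bandwidth/Packet/Error": {SERVER},
    "System Fan Speed": {SERVER},
    "System Voltage": {SERVER},
    "Common Alerts": {CLUSTER},
}


def metrics_for(scope):
    """Return the metric names the table marks as available for `scope`."""
    if scope not in (CLUSTER, SERVER):
        raise ValueError(f"unknown scope: {scope!r}")
    return [name for name, scopes in COVERAGE.items() if scope in scopes]


def server_only():
    """Return metrics the table lists for a single server but not a cluster."""
    return [name for name, scopes in COVERAGE.items() if scopes == {SERVER}]


if __name__ == "__main__":
    for name in server_only():
        print(name)
```

Sets are used for the scopes so that membership checks and the "server only" comparison stay simple; anything beyond the table (custom or advanced metrics) is deliberately left out of the sketch.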
