Monitoring HPE GreenLake Servers Running GPUs Using Grafana Cloud

Overview

HPE GreenLake Compute Ops Management provides a cloud-native platform for managing and monitoring compute infrastructure with built-in tools and dashboards. While Compute Ops Management offers comprehensive native monitoring capabilities, organizations can also leverage the Compute Ops Management REST API to integrate with popular open-source tools like Grafana and Prometheus. This approach enables teams to consolidate monitoring data across hybrid environments, utilize existing observability workflows, and create customized dashboards tailored to specific operational needs—particularly for GPU-accelerated workloads running on HPE servers.

Grafana Cloud Dashboard

The following screenshot shows the Grafana Cloud dashboard monitoring HPE ProLiant servers, leveraging both HPE GreenLake Compute Ops Management APIs and NVIDIA Data Center GPU Manager (DCGM) exporter metrics.

Grafana Cloud Dashboard

Benefits of Grafana Cloud Integration

Grafana Cloud provides a centralized monitoring platform that bridges both public cloud APIs and private infrastructure metrics, offering several key advantages:

  • HPE GreenLake API Integration: Using the Infinity data source plugin, Grafana Cloud can directly query HPE GreenLake Compute Ops Management REST APIs to retrieve server health, firmware status, and hardware telemetry data from your public cloud-managed infrastructure

  • Internal Metrics Aggregation: Simultaneously receives GPU utilization, Kubernetes cluster metrics, and application performance data from your on-premises Prometheus instances via remote write protocol

  • Single Pane of Glass: Consolidates metrics from both HPE GreenLake-managed servers and local GPU workloads into unified dashboards

Accessing the HPE GreenLake Compute Ops Management API

In order for Grafana Cloud to access HPE GreenLake Compute Ops Management, we first need to create an API client and obtain its client ID and secret. For details, please see the HPE GreenLake access token instructions.

The following screen shows the creation of a personal API client in the workspace dashboard.

Create new personal API Client

Note down the client ID and secret; you will need them to configure Grafana later.

Personal API Client ID and Secret

Infinity Data Source

The Grafana Infinity data source enables direct integration with HPE GreenLake REST APIs, allowing users to query and visualize data from any HTTP/REST endpoint without requiring custom backend development. Originally a community-driven plugin, Infinity was officially adopted by Grafana Labs and is now maintained as part of the core Grafana ecosystem. This transition ensures long-term support, regular updates, and better integration with Grafana's authentication frameworks.

Key capabilities include:

  • REST API integration - Query HPE GreenLake Compute Ops Management APIs directly from Grafana

  • OAuth2 authentication - Native support for token-based authentication flows

  • Data transformation - Parse JSON responses and transform them into visualization-ready formats

This Infinity plugin supports automatic token refresh when accessing HPE GreenLake Compute Ops Management. The implementation leverages Go's built-in golang.org/x/oauth2 library, which automatically handles token refresh logic when the access token expires. The OAuth2 token source automatically refreshes the access token using the refresh token before making API requests, ensuring uninterrupted authentication without manual intervention.

The configuration of the Infinity data source is shown below. Users need to provide the HPE GreenLake API client ID and secret, and add the Compute Ops Management API endpoints to the data source's list of allowed hosts.

Grafana Cloud Infinity Datasource
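For teams that manage Grafana as code, the same data source can also be defined with Grafana's provisioning mechanism. The sketch below follows the Infinity plugin's provisioning format; field names can vary slightly between plugin versions, and the token URL and API host shown here are examples that must be replaced with the values for your HPE GreenLake region and workspace. In Grafana Cloud you would normally enter the same values in the data source settings UI instead.

# Sketch of an Infinity data source provisioned as code (illustrative only).
apiVersion: 1
datasources:
  - name: HPE GreenLake COM
    type: yesoreyeram-infinity-datasource
    jsonData:
      auth_method: oauth2
      oauth2:
        oauth2_type: client_credentials
        client_id: <your client ID>
        token_url: https://sso.common.cloud.hpe.com/as/token.oauth2
      allowedHosts:
        - https://us-west2-api.compute.cloud.hpe.com
    secureJsonData:
      oauth2ClientSecret: <your client secret>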

Grafana Cloud Dashboard

With the Infinity data source successfully configured, users can build dashboard panels that query server information by following the HPE Compute Ops Management API Guide.

Grafana Cloud Dashboard
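As an example, a table panel listing the managed servers can be built from a single Infinity query against the servers collection. The sketch below shows the key query settings in YAML form; the endpoint path, region, root selector, and field selectors are illustrative assumptions, so take the authoritative paths and property names from the API guide.

# Illustrative Infinity query settings for a server inventory table (not a definitive schema).
type: json
source: url
parser: backend
url: https://us-west2-api.compute.cloud.hpe.com/compute-ops-mgmt/v1beta2/servers
url_options:
  method: GET
root_selector: items            # collection responses are typically wrapped in an items array
columns:
  - selector: name
    text: Server
    type: string
  - selector: hardware.serialNumber
    text: Serial Number
    type: string
  - selector: hardware.powerState
    text: Power State
    type: string
  - selector: hardware.health.summary
    text: Health
    type: string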

Kubernetes and Helm Setup

Kubernetes cluster setup

This demonstration environment utilizes a high-availability Kubernetes cluster consisting of three control plane nodes and two worker nodes.

$ k get node -o wide
NAME                                STATUS   ROLES           AGE   VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION       CONTAINER-RUNTIME
c2-cp-01.hst.enablement.local       Ready    control-plane   80d   v1.32.5   10.16.160.51   <none>        Ubuntu 22.04.5 LTS   5.15.0-144-generic   containerd://2.0.5
c2-cp-02.hst.enablement.local       Ready    control-plane   80d   v1.32.5   10.16.160.52   <none>        Ubuntu 22.04.5 LTS   5.15.0-144-generic   containerd://2.0.5
c2-cp-03.hst.enablement.local       Ready    control-plane   80d   v1.32.5   10.16.160.53   <none>        Ubuntu 22.04.5 LTS   5.15.0-144-generic   containerd://2.0.5
c2-worker-01.hst.enablement.local   Ready    <none>          80d   v1.32.5   10.16.160.54   <none>        Ubuntu 22.04.5 LTS   5.15.0-144-generic   containerd://2.0.5
c2-worker-02.hst.enablement.local   Ready    <none>          80d   v1.32.5   10.16.160.55   <none>        Ubuntu 22.04.5 LTS   5.15.0-144-generic   containerd://2.0.5

Kubernetes namespace setup

The cluster is equipped with the gpu-operator namespace for NVIDIA GPU management and the monitoring namespace hosting the Prometheus stack, with external access enabled via NodePort services.

$ kubectl get ns | grep -vE '^(kube-|default)'
NAME              STATUS   AGE
gpu-operator      Active   80d
monitoring        Active   56d

$ k get svc -n gpu-operator 
NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)    AGE
gpu-operator           ClusterIP   10.233.44.80   <none>        8080/TCP   78d
nvidia-dcgm-exporter   ClusterIP   10.233.15.59   <none>        9400/TCP   78d

$ k get svc --field-selector spec.type=NodePort -n monitoring
NAME                               TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                         AGE
kube-prometheus-stack-grafana      NodePort   10.233.22.241   <none>        80:30080/TCP                    56d
kube-prometheus-stack-prometheus   NodePort   10.233.8.106    <none>        9090:30090/TCP,8080:30398/TCP   56d

Helm Charts

The environment uses Helm to manage two key components: the NVIDIA GPU Operator for GPU resource management and the Kube Prometheus Stack for monitoring and observability.

$ helm list -A
NAME                    NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                           APP VERSION
gpu-operator-1753140595 gpu-operator    4               2025-08-14 19:20:42.329819669 -0700 MST deployed        gpu-operator-v25.3.2            v25.3.2    
kube-prometheus-stack   monitoring      5               2025-08-15 13:06:31.169338089 -0700 MST deployed        kube-prometheus-stack-76.3.0    v0.84.1

GPU Operator chart customization

The NVIDIA GPU Operator Helm chart deploys a DCGM (Data Center GPU Manager) exporter by default, but there are important nuances:

  • The DCGM exporter Pod will be created automatically when the operator detects a node with an NVIDIA GPU and the dcgm-exporter component is enabled in its values.

    $ k -n gpu-operator get pods -o wide | grep dcgm
    nvidia-dcgm-exporter-gkg6d                                        1/1     Running     0          56d   10.233.117.209   c2-worker-01.hst.enablement.local   <none>           <none>
    nvidia-dcgm-exporter-r2np6                                        1/1     Running     0          56d   10.233.114.15    c2-worker-02.hst.enablement.local   <none>           <none>
    

  • In the stock gpu-operator Helm chart from NVIDIA's repository, the DCGM exporter is enabled by default (dcgmExporter.enabled: true), but the ServiceMonitor is disabled by default (serviceMonitor.enabled: false). See the NVIDIA GPU Operator Documentation.

GPU Operator Configuration

You can also verify these default built-in values using the helm show values command.

$ helm show values nvidia/gpu-operator | grep -A 15 dcgmExporter
dcgmExporter:
  enabled: true
  repository: nvcr.io/nvidia/k8s
  image: dcgm-exporter
  version: 4.3.1-4.4.0-ubuntu22.04
  imagePullPolicy: IfNotPresent
  env: []
  resources: {}
  service:
    internalTrafficPolicy: Cluster
  serviceMonitor:
    enabled: false
    interval: 15s
    honorLabels: false
    additionalLabels: {}
    relabelings: []

We need to enable the ServiceMonitor (dcgmExporter.serviceMonitor.enabled: true) in order for Prometheus to automatically scrape the DCGM exporter.
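A ServiceMonitor is a Prometheus Operator custom resource that tells Prometheus which Kubernetes Services to scrape. When the flag is enabled, the GPU Operator creates and manages this object itself; the generic sketch below is for illustration only, and the label selector and port name are placeholders rather than the operator's exact values.

# Generic ServiceMonitor sketch (illustrative; the GPU Operator generates the real object).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter
  namespace: gpu-operator
spec:
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter    # placeholder label
  endpoints:
    - port: gpu-metrics            # placeholder port name
      interval: 15s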

The gpu-operator is configured with custom values to enable Prometheus integration. The DCGM exporter runs as a ClusterIP service with ServiceMonitor enabled for automatic metrics discovery by Prometheus.
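One way to arrive at this configuration is a small values override applied with helm upgrade --reuse-values, sketched below under the assumption that the chart was installed from NVIDIA's nvidia Helm repository using the release name shown earlier.

# dcgm-servicemonitor-values.yaml -- applied with something like:
#   helm upgrade gpu-operator-1753140595 nvidia/gpu-operator \
#     -n gpu-operator -f dcgm-servicemonitor-values.yaml --reuse-values
dcgmExporter:
  service:
    type: ClusterIP
  serviceMonitor:
    enabled: true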

$ helm get values gpu-operator-1753140595 -n gpu-operator
USER-SUPPLIED VALUES:
dcgmExporter:
  service:
    type: ClusterIP
  serviceMonitor:
    enabled: true

GPU utilization simulation

To simulate GPU load and verify monitoring functionality, we deployed a test pod running the gpu-burn utility. This tool performs intensive GPU computations, allowing us to observe GPU utilization metrics in our monitoring dashboards.

The following YAML manifest creates a pod that clones the gpu-burn repository, compiles it, and runs continuous GPU stress testing:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-burn
spec:
  containers:
    - name: gpu-burn
      image: nvidia/cuda:12.2.0-devel-ubuntu22.04
      command: ["/bin/bash", "-c"]
      args:
        - |
          apt update && apt install -y git build-essential && \
          git clone https://github.com/wilicc/gpu-burn.git && \
          cd gpu-burn && make && ./gpu_burn 999999
      resources:
        limits:
          nvidia.com/gpu: 1
  restartPolicy: Never

Key configuration details:

  • Base image: nvidia/cuda:12.2.0-devel-ubuntu22.04 provides the CUDA development environment
  • GPU allocation: nvidia.com/gpu: 1 requests a single GPU from the cluster
  • Runtime: gpu_burn 999999 runs for approximately 277 hours (effectively continuous)
  • Restart policy: Never ensures the pod completes its run without automatic restarts

Deploy the pod using:

kubectl apply -f gpu-burn.yaml

Grafana Cloud Integration

While the local Grafana deployment provides comprehensive monitoring capabilities, organizations often need to share dashboards with team members who cannot directly access the internal infrastructure. Grafana Cloud offers an ideal solution by enabling metrics to be pushed from the local Prometheus instance to a cloud-hosted environment, making dashboards accessible to remote teams without requiring VPN or direct network access.

Prometheus Remote Write

Grafana Cloud supports Prometheus remote write protocol, allowing local Prometheus to continuously push metrics to the cloud. This approach offers several advantages:

  • No inbound firewall rules required - Metrics are pushed outbound from the lab
  • Real-time data synchronization - Metrics appear in Grafana Cloud within seconds
  • Selective metric filtering - Control which metrics are sent to manage costs
  • Multi-cluster aggregation - Consolidate metrics from multiple environments

The integration involves two main steps:

  1. Configure Grafana Cloud credentials - Create a Prometheus remote write endpoint and API key in Grafana Cloud
  2. Update Prometheus configuration - Add remote write settings to the kube-prometheus-stack Helm values

Prometheus Remote Write Diagram

The following diagram from Grafana Cloud documentation shows the architecture of Prometheus remote write.

Prometheus Remote Write

How it works:

  1. Prometheus scrapes metrics from all targets (DCGM exporter, kube-state-metrics, etc.)
  2. Stores metrics locally in TSDB (Time Series Database)
  3. Simultaneously pushes metrics to Grafana Cloud via remote write
  4. Local Grafana and Prometheus UI can query local data
  5. Grafana Cloud receives a copy of all metrics

Configure Grafana Cloud credentials

Users should log in to grafana.com to access their Grafana stack and retrieve the Prometheus cloud instance API endpoint by selecting "Details".

Grafana Cloud Portal

Select "Prometheus Details"

Stack Dashboard

In the Prometheus Instance Details page, note the "Remote Write Endpoint" and "Instance ID", and generate an "API Token".

Prometheus Instance Details

Updating the Helm release with the configured values

  1. Create a Kubernetes secret to store the username and password to access the Prometheus cloud instance

    $ GRAFANA_CLOUD_USER="<your userid here>"
    $ GRAFANA_CLOUD_PASSWORD="<your token here>"
    
    $ kubectl create secret generic grafana-cloud-credentials \
      --from-literal=username="${GRAFANA_CLOUD_USER}" \
      --from-literal=password="${GRAFANA_CLOUD_PASSWORD}" \
      -n monitoring
    secret/grafana-cloud-credentials created
    

  2. Create a Helm custom values YAML file

    $ cat prometheus-remote-write-values.yaml 
    prometheus:
      prometheusSpec:
        remoteWrite:
          - url: https://<your-prometheus-id-here>.grafana.net/api/prom/push
            basicAuth:
              username:
                name: grafana-cloud-credentials
                key: username
              password:
                name: grafana-cloud-credentials
                key: password
    

  3. Upgrade the Helm release

    $ helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack \
        -n monitoring \
        -f prometheus-remote-write-values.yaml \
        --reuse-values
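
To take advantage of the selective metric filtering mentioned earlier, the same remote write entry can also carry writeRelabelConfigs so that only selected series leave the cluster. The sketch below extends the values file from step 2; the regex is only an example and should be adjusted to the metrics you actually need in Grafana Cloud.

prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://<your-prometheus-id-here>.grafana.net/api/prom/push
        basicAuth:
          username:
            name: grafana-cloud-credentials
            key: username
          password:
            name: grafana-cloud-credentials
            key: password
        # Keep only DCGM, node, and kube-state-metrics series; everything else stays local.
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "DCGM_.*|node_.*|kube_.*"
            action: keep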
    

ProLiant server NVIDIA GPU utilization dashboard

The following screen shows the Grafana Cloud dashboard panel for HPE ProLiant server NVIDIA GPU utilization.

GPU Utilization Panel
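The panel is driven by the DCGM exporter's utilization gauge. The sketch below shows a typical panel target in YAML form (assumed here, not taken from the dashboard itself); DCGM_FI_DEV_GPU_UTIL reports utilization as a 0-100 gauge, and label names such as Hostname and gpu come from the exporter's defaults and may differ in your setup.

# Hypothetical Grafana panel target: average per-GPU utilization per node.
expr: avg by (Hostname, gpu) (DCGM_FI_DEV_GPU_UTIL)
legendFormat: "{{Hostname}} / GPU {{gpu}}"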

The corresponding server GPU utilization from nvidia-smi is shown below.

$ ssh user01@c2-worker-02.hst.enablement.local nvidia-smi
Mon Oct 20 13:55:11 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S-4C                 On  |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0            N/A  /  N/A  |    3002MiB /   4096MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A         3712514      C   ./gpu_burn                             2997MiB |
+-----------------------------------------------------------------------------------------+

Acknowledgements

Special thanks to HPE's Lionel Jullien for his excellent work on HPE GreenLake Compute Ops Management. The Grafana dashboard in this article is based on his previous work and the articles listed below.