Enabling metrics collection for Prometurbo
Create the necessary Custom Resources (CRs) to enable the collection of performance metrics. With CRs, you can quickly update the configurations used for metric collection without needing to restart Prometurbo.
Task overview
To enable metrics collection, perform the following tasks:
- Create the Prometurbo CRDs.
- Create the Prometurbo CRs.
  This topic describes the CRs that expose metrics for the NVIDIA DCGM, TGI, and Istio exporters.
- Verify metrics collection.
Creating the Prometurbo CRDs
Create the required Custom Resource Definitions (CRDs). These CRDs ensure the validity of the Prometurbo custom resources (CRs) that you will create in the next task.
kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusquerymappings.yaml
kubectl apply -f https://raw.githubusercontent.com/IBM/turbonomic-container-platform/refs/heads/main/turbo-metrics/crd/metrics.turbonomic.io_prometheusserverconfigs.yaml
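Before creating the CRs, it can help to confirm that both CRDs are registered. The following is a minimal sketch, not part of the product documentation. It assumes the manifest file names follow the usual <group>_<plural>.yaml convention, so the CRD name is <plural>.<group>. The helper names crd_name_from_url and wait_for_crd are hypothetical.

```shell
# Derive the CRD name from a manifest URL whose file name is assumed to
# follow the <group>_<plural>.yaml convention (result: <plural>.<group>).
crd_name_from_url() {
  file=$(basename "$1" .yaml)   # e.g. metrics.turbonomic.io_prometheusquerymappings
  group=${file%%_*}             # text before the first underscore
  plural=${file#*_}             # text after the first underscore
  printf '%s.%s\n' "$plural" "$group"
}

# Block until the CRD derived from the manifest URL reports the
# Established condition, or fail after 60 seconds.
wait_for_crd() {
  kubectl wait --for=condition=Established \
    "crd/$(crd_name_from_url "$1")" --timeout=60s
}
```

For example, calling wait_for_crd with each of the two URLs above should return once the API server accepts resources of that kind.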
Overview of Prometurbo CRs
Prometurbo requires the following custom resources (CRs):
- PrometheusQueryMapping: This CR specifies how Prometurbo maps Prometheus queries to Turbonomic entities.
- PrometheusServerConfig: This CR specifies configuration options for the Prometheus server.
Creating CRs to export GPU and LLM metrics
Create the required CRs to export GPU and LLM metrics. These metrics are required to enable the scaling of LLM inference workloads.
- Create a secret in the namespace where Prometurbo is deployed.
  oc -n turbo create secret generic ocp-thanos-authorization-token --from-literal=authorizationToken=$(oc -n openshift-monitoring create token prometheus-k8s --duration=87600h)
- Create the following PrometheusQueryMapping CRs in the namespace of your choice, preferably in the namespace where Prometurbo is deployed.
- Create the PrometheusServerConfig CR in the namespace where PrometheusQueryMapping is created.
  apiVersion: metrics.turbonomic.io/v1alpha1
  kind: PrometheusServerConfig
  metadata:
    name: {Prometheus_server_name}
  spec:
    address: {Prometheus_server_address}
    bearerToken:
      secretKeyRef:
        key: authorizationToken
        name: ocp-thanos-authorization-token
    clusters:
      - identifier:
          clusterLabels: {}
          id: {cluster_ID}
        queryMappingSelector:
          matchExpressions:
            - key: mapping
              operator: In
              values:
                - nvidia-dcgm-exporter
                - text-generation-inference
  Update the following information:
  - metadata.name: {Prometheus_server_name}
    Specify the name of your Prometheus server.
  - spec.address: {Prometheus_server_address}
    Specify the address of your Prometheus server, such as https://prometheus.us-east.containers.appdomain.cloud.
  - spec.clusters[].identifier.id: {cluster_ID}
    Specify the cluster ID. To obtain the ID, run the following command:
    kubectl -n default get svc kubernetes -ojsonpath='{.metadata.uid}'
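The placeholder substitution for the PrometheusServerConfig CR can also be scripted. The following is a rough sketch under stated assumptions, not an official procedure. The function names render_server_config and apply_server_config are hypothetical, and the CR fields are copied from the example in this topic.

```shell
# Print a PrometheusServerConfig manifest with the three placeholders filled in.
# $1 = CR name, $2 = Prometheus server address, $3 = cluster ID
render_server_config() {
  cat <<EOF
apiVersion: metrics.turbonomic.io/v1alpha1
kind: PrometheusServerConfig
metadata:
  name: $1
spec:
  address: $2
  bearerToken:
    secretKeyRef:
      key: authorizationToken
      name: ocp-thanos-authorization-token
  clusters:
    - identifier:
        clusterLabels: {}
        id: $3
      queryMappingSelector:
        matchExpressions:
          - key: mapping
            operator: In
            values:
              - nvidia-dcgm-exporter
              - text-generation-inference
EOF
}

# Look up the cluster ID (UID of the default/kubernetes service) and apply
# the rendered manifest. $1 = CR name, $2 = address, $3 = target namespace
apply_server_config() {
  cluster_id=$(kubectl -n default get svc kubernetes -o jsonpath='{.metadata.uid}')
  render_server_config "$1" "$2" "$cluster_id" | kubectl -n "$3" apply -f -
}
```

Running render_server_config on its own prints the manifest so that you can review it before applying it. The target namespace passed to apply_server_config should be the one that contains the PrometheusQueryMapping CRs.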
Creating CRs for the Istio exporter
Create the following CRs to collect metrics exposed by the Istio exporter.
- This CR specifies the location of the Prometheus server and a label selector to exclude jmx-tomcat from the Prometheus query mapping resource.
Verifying metrics collection
Check the Prometurbo logs to verify that Prometurbo started collecting metrics from the Prometheus server.
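The following example log output shows that Prometurbo discovered the PrometheusQueryMapping and PrometheusServerConfig resources and created clients for the Prometheus server.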
I0328 18:42:04.003329 1 provider.go:60] Discovered 4 PrometheusQueryMapping resources.
I0328 18:42:04.007689 1 provider.go:71] Discovered 2 PrometheusServerConfig resources.
I0328 18:42:04.007903 1 serverconfig.go:19] Loading PrometheusServerConfig turbo-community/prometheusserverconfig-emptycluster.
I0328 18:42:04.007927 1 client.go:68] Creating client for Prometheus server: http://prometheus.istio-system:9090
I0328 18:42:04.007935 1 serverconfig.go:36] There are 1 PrometheusQueryMapping resources in namespace turbo-community
I0328 18:42:04.007943 1 serverconfig.go:19] Loading PrometheusServerConfig turbo/prometheusserverconfig-singlecluster.
I0328 18:42:04.007947 1 client.go:68] Creating client for Prometheus server: http://prometheus.istio-system:9090
I0328 18:42:04.007950 1 serverconfig.go:36] There are 3 PrometheusQueryMapping resources in namespace turbo
I0328 18:42:04.008048 1 clusterconfig.go:39] Excluding turbo/jmx-tomcat.
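The log check can be automated by looking for the two "Discovered" lines shown in the example. The following is a minimal sketch that assumes the log wording matches the sample output above. The function name metrics_discovered is hypothetical.

```shell
# Read Prometurbo log text on stdin. Succeed only if the log reports at least
# one PrometheusQueryMapping and at least one PrometheusServerConfig resource.
metrics_discovered() {
  awk '
    /Discovered [0-9]+ PrometheusQueryMapping resources/ {
      # The count is the field that follows the word "Discovered".
      for (i = 1; i < NF; i++) if ($i == "Discovered") mappings = $(i + 1)
    }
    /Discovered [0-9]+ PrometheusServerConfig resources/ {
      for (i = 1; i < NF; i++) if ($i == "Discovered") servers = $(i + 1)
    }
    END { exit !(mappings > 0 && servers > 0) }
  '
}
```

One possible use is to pipe the output of kubectl logs for the Prometurbo pod into metrics_discovered. The pod or deployment name and its namespace depend on your installation and are not specified here.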