Enabling metrics collection for Prometurbo

Create the necessary Custom Resources (CRs) to enable the collection of performance metrics. With CRs, you can quickly update the configurations used for metric collection without needing to restart Prometurbo.

Task overview

To enable metrics collection, perform the following tasks:

  1. Create the Prometurbo CRDs.

  2. Create the Prometurbo CRs.

    This topic describes the CRs for collecting metrics from the NVIDIA DCGM, Text Generation Inference (TGI), and Istio exporters.

  3. Verify metrics collection.

Creating the Prometurbo CRDs

Create the required Custom Resource Definitions (CRDs). The cluster uses these CRDs to validate the Prometurbo custom resources (CRs) that you create in the next task.

oc create -f https://raw.githubusercontent.com/turbonomic/turbo-metrics/main/config/crd/bases/metrics.turbonomic.io_prometheusquerymappings.yaml
oc create -f https://raw.githubusercontent.com/turbonomic/turbo-metrics/main/config/crd/bases/metrics.turbonomic.io_prometheusserverconfigs.yaml
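
To confirm that both CRDs were registered, a check along the following lines may help. This is a minimal sketch: the helper name check_prometurbo_crds is hypothetical, and the CRD names are assumed from the file names above, following the usual plural.group naming convention.

```shell
# Hypothetical helper: report whether each Prometurbo CRD is registered.
# The CRD names are an assumption derived from the file names (<plural>.<group>).
check_prometurbo_crds() {
  missing=0
  for crd in \
    prometheusquerymappings.metrics.turbonomic.io \
    prometheusserverconfigs.metrics.turbonomic.io
  do
    # "oc get crd NAME" exits non-zero when the CRD does not exist.
    if oc get crd "$crd" >/dev/null 2>&1; then
      echo "found: $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  # Return 0 only when every CRD was found.
  return "$missing"
}
```

Running check_prometurbo_crds after the two oc create commands prints one line per CRD and returns a non-zero status if either CRD is missing.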

Overview of Prometurbo CRs

Prometurbo requires the following custom resources (CRs):

  • PrometheusQueryMapping

    This CR specifies how Prometurbo maps Prometheus queries to Turbonomic entities.

  • PrometheusServerConfig

    This CR specifies how Prometurbo connects to a Prometheus server (address and bearer token) and which PrometheusQueryMapping CRs apply to each cluster.

Creating CRs to export GPU and TGI metrics

Create the required CRs to export GPU and TGI metrics. These metrics are required to enable the scaling of LLM inference workloads.

  1. Deploy a Kubernetes secret that specifies the token that is used to access the Prometheus server.

    apiVersion: v1
    data:
      authorizationToken: {authorization_token}
    kind: Secret
    metadata:
      name: ocp-thanos-authorization-token
    type: Opaque
    

    Replace {authorization_token} with the token from the prometheus-k8s-token-##### secret (type service-account-token) in the openshift-monitoring namespace, where ##### is a generated suffix. Values under data must be base64-encoded.

  2. Create the following PrometheusQueryMapping CRs in the namespace of your choice, preferably in the namespace where Prometurbo is deployed.

  3. Create the PrometheusServerConfig CR in the same namespace as the PrometheusQueryMapping CRs.

    apiVersion: metrics.turbonomic.io/v1alpha1
    kind: PrometheusServerConfig
    metadata:
      name: {Prometheus_server_name}
    spec:
      address: {Prometheus_server_address}
      bearerToken:
        secretKeyRef:
          key: authorizationToken
          name: ocp-thanos-authorization-token
      clusters:
      - identifier:
          clusterLabels: {}
          id: {cluster_ID}
        queryMappingSelector:
          matchExpressions:
          - key: mapping
            operator: In
            values:
            - nvidia-dcgm-exporter
            - text-generation-inference
    

    Update the following information:

    • 
      metadata:
        name: {Prometheus_server_name}

      Specify the name of your Prometheus server.

    • 
      spec:
        address: {Prometheus_server_address}

      Specify the address of your Prometheus server, such as https://prometheus.us-east.containers.appdomain.cloud.

    • 
        clusters:
        - identifier:
            clusterLabels: {}
            id: {cluster_ID}

      Specify the cluster ID, which is the UID of the kubernetes service in the default namespace. To obtain the ID, run the following command:

      kubectl -n default get svc kubernetes -ojsonpath='{.metadata.uid}'
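
The placeholder values in steps 1 and 3 could be prepared along the lines of the following sketch. The helper names encode_secret_value and render_server_config are hypothetical, as are the CR name, server address, and namespace in the commented usage. The sketch assumes a POSIX shell with the base64 utility available.

```shell
# Hypothetical helper: base64-encode a plain-text value for a Secret's
# data field. tr strips the line wrapping that some base64 tools add.
encode_secret_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Hypothetical helper: print the PrometheusServerConfig from step 3 with
# the three placeholders filled in.
# Arguments: $1 = CR name, $2 = Prometheus server address, $3 = cluster ID.
render_server_config() {
  cat <<EOF
apiVersion: metrics.turbonomic.io/v1alpha1
kind: PrometheusServerConfig
metadata:
  name: $1
spec:
  address: $2
  bearerToken:
    secretKeyRef:
      key: authorizationToken
      name: ocp-thanos-authorization-token
  clusters:
  - identifier:
      clusterLabels: {}
      id: $3
    queryMappingSelector:
      matchExpressions:
      - key: mapping
        operator: In
        values:
        - nvidia-dcgm-exporter
        - text-generation-inference
EOF
}

# Example usage (name, address, and namespace are placeholders):
#   encode_secret_value "$TOKEN"
#   CLUSTER_ID="$(kubectl -n default get svc kubernetes -ojsonpath='{.metadata.uid}')"
#   render_server_config my-prometheus https://prometheus.example.com "$CLUSTER_ID" \
#     | oc apply -n my-namespace -f -
```

A token value copied from the data field of an existing secret is already base64-encoded and should not be encoded again.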

Creating CRs for the Istio exporter

Create the following CRs to collect metrics exposed by the Istio exporter.

Verifying metrics collection

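Check the Prometurbo logs to verify that Prometurbo has started collecting metrics from the Prometheus server.

The following example log output shows Prometurbo discovering the PrometheusQueryMapping and PrometheusServerConfig resources and creating a client for each Prometheus server.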

I0328 18:42:04.003329 1 provider.go:60] Discovered 4 PrometheusQueryMapping resources.
I0328 18:42:04.007689 1 provider.go:71] Discovered 2 PrometheusServerConfig resources.
I0328 18:42:04.007903 1 serverconfig.go:19] Loading PrometheusServerConfig turbo-community/prometheusserverconfig-emptycluster.
I0328 18:42:04.007927 1 client.go:68] Creating client for Prometheus server: http://prometheus.istio-system:9090
I0328 18:42:04.007935 1 serverconfig.go:36] There are 1 PrometheusQueryMapping resources in namespace turbo-community
I0328 18:42:04.007943 1 serverconfig.go:19] Loading PrometheusServerConfig turbo/prometheusserverconfig-singlecluster.
I0328 18:42:04.007947 1 client.go:68] Creating client for Prometheus server: http://prometheus.istio-system:9090
I0328 18:42:04.007950 1 serverconfig.go:36] There are 3 PrometheusQueryMapping resources in namespace turbo
I0328 18:42:04.008048 1 clusterconfig.go:39] Excluding turbo/jmx-tomcat.
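
In a long log, the discovery lines can be picked out with a small filter such as the following sketch. The helper name summarize_discovery is hypothetical, and it assumes the log lines keep the "Discovered N Kind resources." wording shown in the example above. The deployment name and namespace in the commented usage are also assumptions.

```shell
# Hypothetical helper: read Prometurbo log lines on stdin and print each
# discovery line reduced to "<count> <kind>".
summarize_discovery() {
  sed -n 's/.*Discovered \([0-9][0-9]*\) \(Prometheus[A-Za-z]*\) resources\..*/\1 \2/p'
}

# Example usage (deployment name and namespace are placeholders):
#   oc logs deployment/prometurbo -n my-namespace | summarize_discovery
```

For the example log above, this prints 4 PrometheusQueryMapping and 2 PrometheusServerConfig on separate lines. No output suggests that Prometurbo has not yet discovered any of the CRs.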