Container platform service

A service in container platform environments is a logical set of pods that represents a given application. The service exposes a single entry point to the application process. While the pods that comprise the service are ephemeral, the service itself is persistent. The service entity also provides historical tracking of the number of replicas that run to support the service.
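
To make this concrete, the following minimal sketch uses the Kubernetes Python client to define a service that selects a set of pods by label and exposes them through one stable entry point. The `my-app` name, label, and ports are hypothetical examples, not values from this product:

```python
from kubernetes import client, config

# Load cluster credentials from the local kubeconfig. Code running inside
# a cluster would call config.load_incluster_config() instead.
config.load_kube_config()

# The Service is persistent: it selects whichever pods currently carry
# the "app: my-app" label and exposes them behind one stable port, even
# as the individual pods come and go.
service = client.V1Service(
    api_version="v1",
    kind="Service",
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1ServiceSpec(
        selector={"app": "my-app"},  # the logical set of (ephemeral) pods
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)

client.CoreV1Api().create_namespaced_service(namespace="default", body=service)
```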

Note:

For details about services discovered through APM targets, see this topic.

Synopsis

Figure: Container platform service in the supply chain

Provides: Response time and transactions to Business Transactions and Business Applications
Consumes from: Container pods and the underlying nodes
Discovered through: The Kubeturbo agent that you deployed to your cluster

Monitored resources

Turbonomic monitors the following resources:

  • Response time

    Response time is the elapsed time between a request and the response to that request. Response time is typically measured in seconds (s) or milliseconds (ms).

    For LLM inference workloads, response time is the turnaround time for each request, including both queueing time and service time (see the sketch after this list). When there is no request, response time is unavailable.

  • Transaction

    Transaction represents the per-second utilization of the transaction capacity that is allocated to a given entity.

    For LLM inference workloads, Transaction is the total number of tokens per second, which includes both input tokens and generated tokens. When there is no request, Transaction is zero.

  • Number of replicas

    Number of replicas is the number of Application Component replicas running over a given time period.

  • Concurrent queries

    For LLM inference workloads, concurrent queries is the number of queries that a workload processes at the same time. When there is no request, concurrent queries is zero.

  • Queueing time

    For LLM inference workloads, queueing time is the amount of time that a request spends in a queue before it is processed. When there is no request, queueing time is zero.

  • Service time

    For LLM inference workloads, service time is the amount of processing time needed to generate the next token. This metric is relatively stable for a given model and GPU resource. When there is no request, service time is unavailable.

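To make the relationships among the LLM inference metrics concrete, the following minimal sketch shows how the per-request measurements combine. The `InferenceRequest` dataclass and its field names are hypothetical illustrations, not a Turbonomic API:

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    """Hypothetical per-request measurements for one LLM inference call."""
    queueing_time_s: float   # time the request waited in the queue
    service_time_s: float    # time spent generating tokens
    input_tokens: int
    generated_tokens: int

def response_time(req: InferenceRequest) -> float:
    # Response time is the full turnaround: queueing time plus service time.
    return req.queueing_time_s + req.service_time_s

def transactions_per_second(requests: list[InferenceRequest], window_s: float) -> float:
    # Transaction counts all tokens, input and generated, per second over
    # the measurement window; with no requests the value is zero.
    total_tokens = sum(r.input_tokens + r.generated_tokens for r in requests)
    return total_tokens / window_s if window_s > 0 else 0.0

def service_time_per_token(req: InferenceRequest) -> float:
    # Per-token service time, which stays relatively stable for a given
    # model and GPU resource.
    return req.service_time_s / max(req.generated_tokens, 1)

# Example: a request that waits 0.2 s and takes 0.8 s to generate 40 tokens.
req = InferenceRequest(0.2, 0.8, input_tokens=100, generated_tokens=40)
print(response_time(req))                   # 1.0 s turnaround
print(transactions_per_second([req], 1.0))  # 140 tokens/s over a 1 s window
print(service_time_per_token(req))          # 0.02 s per generated token
```
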
Actions

Scale (through workload controllers)

Turbonomic recommends actions to scale the replicas that back a container platform service. To view and execute these actions, set the scope to the associated Workload Controller.

For details, see Workload Controller Scale Actions.
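
As a rough, purely illustrative sketch (this is not Turbonomic's decision logic), replica scaling can be framed as sizing the replica count against a response-time SLO, in the spirit of the Kubernetes HPA formula desired = ceil(current × measured / target). The function name and bounds below are hypothetical:

```python
import math

def recommend_replicas(current_replicas: int,
                       measured_response_time_s: float,
                       slo_response_time_s: float,
                       min_replicas: int = 1,
                       max_replicas: int = 10) -> int:
    """Naive proportional sizing: scale the replica count with the ratio of
    measured response time to the SLO, clamped to configured bounds."""
    ratio = measured_response_time_s / slo_response_time_s
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# Example: 3 replicas serving 0.9 s responses against a 0.5 s SLO -> 6 replicas.
print(recommend_replicas(3, measured_response_time_s=0.9, slo_response_time_s=0.5))
```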