Container platform service
A service in container platform environments is a logical set of pods that represents a given application. The service exposes a single entry point for the application process. While the pods that comprise the service are ephemeral, the service is persistent. The service entity also provides historical tracking of the number of replicas that run to support the service.
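As a concrete illustration, a Kubernetes Service selects its backing pods by label and exposes one stable entry point in front of them. The names and ports below are hypothetical, not part of Turbonomic or Kubeturbo:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-app          # hypothetical service name
spec:
  selector:
    app: example-app         # matches the labels on the ephemeral backing pods
  ports:
    - port: 80               # stable entry point that clients connect to
      targetPort: 8080       # container port on the pods behind the service
```

Pods that match the selector come and go, but the Service name and port remain constant, which is why the service entity is the persistent object that metrics and replica history attach to.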
For details about services discovered through APM targets, see this topic.
Synopsis
| Synopsis | |
| --- | --- |
| Provides: | Response time and transactions to Business Transactions and Business Applications |
| Consumes from: | Container pods and the underlying nodes |
| Discovered through: | Kubeturbo agent that you deployed to your cluster |
Monitored resources
Turbonomic monitors the following resources:
- Response time
Response time is the elapsed time between a request and the response to that request. Response time is typically measured in seconds (s) or milliseconds (ms).
For LLM inference workloads, response time is the turnaround time for each request, including both queuing time and service time. When there is no request, response time is unavailable.
- Transaction
Transaction is a value that represents the per-second utilization of the transactions that are allocated to a given entity.
For LLM inference workloads, Transaction is the total number of tokens per second, which includes both input tokens and generated tokens. When there is no request, Transaction is zero.
- Number of replicas
Number of replicas is the number of Application Component replicas running over a given time period.
- Concurrent queries
For LLM inference workloads, concurrent queries is the number of concurrent queries to a workload. When there is no request, concurrent queries is zero.
- Queueing time
For LLM inference workloads, queueing time is the amount of time that a request spends in a queue before it is processed. When there is no request, queueing time is zero.
- Service time
For LLM inference workloads, service time is the amount of processing time needed to generate the next token. This metric is relatively stable for a given model and GPU resource. When there is no request, service time is unavailable.
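The relationships among the LLM inference metrics above can be sketched as follows. This is an illustrative model only; the class and function names are hypothetical and not part of Turbonomic or Kubeturbo:

```python
from dataclasses import dataclass

@dataclass
class InferenceRequest:
    """One LLM inference request, with illustrative timing and token counts."""
    queueing_time_s: float   # time the request waits in the queue before processing
    service_time_s: float    # processing time spent generating the response
    input_tokens: int        # tokens in the prompt
    generated_tokens: int    # tokens produced in the response

def response_time_s(req: InferenceRequest) -> float:
    # Response time is the turnaround time for a request:
    # queuing time plus service time.
    return req.queueing_time_s + req.service_time_s

def transactions_per_s(requests: list[InferenceRequest], window_s: float) -> float:
    # Transaction is the total number of tokens per second,
    # counting both input tokens and generated tokens.
    total_tokens = sum(r.input_tokens + r.generated_tokens for r in requests)
    return total_tokens / window_s
```

For example, a request that waits 0.2 s in the queue and takes 0.8 s to process has a response time of 1.0 s; if it carried 50 input tokens and produced 150 generated tokens within a 1-second window, the Transaction value for that window is 200 tokens per second.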
Actions
Scale (through workload controllers)
Turbonomic recommends actions to scale the replicas that back the container platform services. Set the scope to the associated Workload Controller to view and execute the actions.
For details, see Workload Controller Scale Actions.