watsonx Assistant Prometheus queries
- Who needs to complete this task?
- Cluster administrator: A cluster administrator must perform this task.
- How frequently should you perform this task?
- Repeat as needed: Run Prometheus queries as often as necessary to monitor your Cloud Pak for Data deployments. Perform this task at least once per day or once per shift.
Ensure that you enable monitoring for user-defined projects and configure the OpenShift Monitoring stack. For more information about how to complete these tasks, see the documentation in the following table:
| OpenShift Version | Resources |
|---|---|
| Version 4.12 | |
| Version 4.14 | |
| Version 4.15 | |
| Version 4.16 | |
Run the following Prometheus queries in the OpenShift Console.
Resource usage
- CPU remaining for a container
- Displays the CPU remaining for a container over a 5-minute interval.
kube_pod_container_resource_limits{pod=~".*wa.*",resource="cpu"} - on (pod,container) rate(container_cpu_usage_seconds_total{pod=~".*wa.*",container!="POD"}[5m])
- CPU usage for a container
- Displays the total CPU that a container is using over a 5-minute interval.
rate(container_cpu_usage_seconds_total{pod=~".*wa.*",container!="POD"}[5m])
- CPU usage for a pod
- Displays the total CPU that a pod is using over a 5-minute interval.
pod:container_cpu_usage:sum{pod=~".*wa.*"}
- Memory remaining for a container
- Displays the memory remaining for a container. The query returns the value in bytes.
container_spec_memory_limit_bytes{pod=~".*wa.*",container!="POD"} - container_memory_working_set_bytes{pod=~".*wa.*",container!="POD"}
- Memory usage for a container
- Displays the total memory that a container is using. The query returns the value in bytes.
container_memory_working_set_bytes{pod=~".*wa.*",container!="POD"}
- Memory usage for a pod
- Displays the total memory that a pod is using. The query returns the value in bytes.
pod:container_memory_usage_bytes:sum{pod=~".*wa.*"}
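The memory metrics in this section are recorded in bytes. As a rough sketch, the following variations convert the remaining memory to gibibytes and express usage as a fraction of the limit. The `> 0` filter is an assumption added here, not part of the original queries: it skips containers that have no memory limit set, which report a limit of 0. Run each expression separately.

```promql
# Memory remaining per container, converted from bytes to GiB
(container_spec_memory_limit_bytes{pod=~".*wa.*",container!="POD"} - container_memory_working_set_bytes{pod=~".*wa.*",container!="POD"}) / 1024^3

# Fraction of the memory limit in use (1.0 means the container is at its limit);
# the "> 0" filter is an illustrative addition that drops containers without a limit
container_memory_working_set_bytes{pod=~".*wa.*",container!="POD"} / (container_spec_memory_limit_bytes{pod=~".*wa.*",container!="POD"} > 0)
```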
Store
The queries in this section apply to 5.0.1 or later.
- Number of HTTP requests
- Displays the total number of HTTP requests that occurred since the pod started.
assistant_http_request_duration_seconds_count
- HTTP requests for observation buckets
- The Value column displays the total number of HTTP requests for each observation bucket since the pod started. The observation buckets are indicated in seconds in the le column. For example, if the number indicated in the le column is 10.0, then the Value column indicates the total number of HTTP requests that took 10 seconds or less.
assistant_http_request_duration_seconds_bucket
- Duration of HTTP requests
- Displays the total duration, in seconds, of HTTP requests since the pod started.
assistant_http_request_duration_seconds_sum
- Number of store sessions
- Displays the total number of stateful v2 sessions that were handled by the pod since the pod started.
assistant_store_session_size_kilobytes_count
- Store sessions for observation buckets
- The Value column displays the total number of stateful v2 sessions for each observation bucket since the pod started. The observation buckets are indicated in kilobytes in the le column. For example, if the number indicated in the le column is 10.0, then the Value column indicates the total number of sessions with a size of 10 KB or less.
assistant_store_session_size_kilobytes_bucket
- Size of store session
- Displays the combined size, in kilobytes, of the store sessions that were handled since the pod started.
assistant_store_session_size_kilobytes_sum
- PostgreSQL pool
- Displays a count of the following types of PostgreSQL clients and requests:
- The total type, which is the number of clients that exist in the pool.
- The waiting type, which is the number of queued requests that are waiting on a client when all clients are checked out. It can be helpful to monitor this number to see whether you need to adjust the size of the pool.
- The idle type, which is the number of clients that are not checked out and are idle in the pool.
assistant_store_postgres_pool_counts
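Because the HTTP request counters accumulate from the time the pod started, it can be more useful to look at recent behavior. The following sketch combines the metrics above with the standard PromQL `rate` and `histogram_quantile` functions. The 5-minute window and the 0.95 quantile are illustrative choices, not values from this documentation. Run each expression separately.

```promql
# Average HTTP request duration, in seconds, over the last 5 minutes
rate(assistant_http_request_duration_seconds_sum[5m]) / rate(assistant_http_request_duration_seconds_count[5m])

# Estimated 95th-percentile HTTP request duration, in seconds, over the last 5 minutes
histogram_quantile(0.95, sum by (le) (rate(assistant_http_request_duration_seconds_bucket[5m])))
```

The percentile is an estimate that is interpolated from the observation buckets, so its precision depends on how the bucket boundaries are spaced.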
etcd
- Disk latency for etcd
- Displays the average disk latency (WAL fsync duration) for etcd with watsonx Assistant over a 5-minute interval. This value should stay under 0.01 seconds or errors can occur.
rate(etcd_disk_wal_fsync_duration_seconds_sum{pod=~".*wa-etcd-.*"}[5m])/rate(etcd_disk_wal_fsync_duration_seconds_count{pod=~".*wa-etcd-.*"}[5m])
- Failed proposals for etcd
- Displays the total number of failed etcd proposals that occurred. Proposals can include leadership election or sync notices. Failures typically indicate that a cluster is not healthy.
etcd_server_proposals_failed_total{pod=~".*wa-etcd-.*"}
- Peer latency for etcd
- Displays the average peer round-trip latency for etcd with watsonx Assistant over a 5-minute interval. This value should stay under 0.01 seconds or errors can occur.
rate(etcd_network_peer_round_trip_time_seconds_sum{pod=~".*wa-etcd-.*"}[5m])/rate(etcd_network_peer_round_trip_time_seconds_count{pod=~".*wa-etcd-.*"}[5m])
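One way to check the 0.01 threshold without reading every value is to append a comparison to the latency query, so that only the pods that exceed the threshold are returned. The sketch below also shows a variation that counts failed proposals over a recent window instead of since the pod started. The 1-hour window is an illustrative assumption. Run each expression separately.

```promql
# Return only the etcd pods whose average WAL fsync latency exceeds 0.01 seconds;
# an empty result means that every pod is under the threshold
rate(etcd_disk_wal_fsync_duration_seconds_sum{pod=~".*wa-etcd-.*"}[5m]) / rate(etcd_disk_wal_fsync_duration_seconds_count{pod=~".*wa-etcd-.*"}[5m]) > 0.01

# Failed proposals during the last hour (the window length is an illustrative choice)
increase(etcd_server_proposals_failed_total{pod=~".*wa-etcd-.*"}[1h])
```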
EDB Postgres
- Number of EDB Postgres WAL files
- Displays the total number of EDB Postgres WAL (Write-Ahead Log) files that are in use for watsonx Assistant.
cnp_collector_pg_wal{value="count"}
- Size of EDB Postgres WAL files
- Displays the total size of all EDB Postgres WAL (Write-Ahead Log) files that are in use for watsonx Assistant.
cnp_collector_pg_wal{value="size"}
gRPC
- Concurrent gRPC requests
- Displays the total number of gRPC requests that are currently being processed. This query returns information for Dragonfly and the CLU Embedding service.
assistant_grpc_server_concurrency_requests
- gRPC requests for observation buckets
- The assistant_grpc_server_request_duration_seconds metric is a histogram of every gRPC request for the server. It includes information such as response codes, methods that were started, and gRPC method types. When you run the assistant_grpc_server_request_duration_seconds_bucket query, the Value column displays the total number of gRPC requests for each observation bucket since the pod started. The observation buckets are indicated in seconds in the le column. For example, if the number indicated in the le column is 2.5, then the Value column indicates the total number of requests that took 2.5 seconds or less.
assistant_grpc_server_request_duration_seconds_bucket
- Number of gRPC requests
- The assistant_grpc_server_request_duration_seconds metric is a histogram of every gRPC request for the server. It includes information such as response codes, methods that were started, and gRPC method types. When you run the assistant_grpc_server_request_duration_seconds_count query, the Value column displays the total number of requests that occurred since the pod started.
assistant_grpc_server_request_duration_seconds_count
- Duration of gRPC requests
- The assistant_grpc_server_request_duration_seconds metric is a histogram of every gRPC request for the server. It includes information such as response codes, methods that were started, and gRPC method types. When you run the assistant_grpc_server_request_duration_seconds_sum query, the Value column displays the total duration, in seconds, of requests since the pod started.
assistant_grpc_server_request_duration_seconds_sum
ModelMesh
- ModelMesh requests for observation buckets
- The modelmesh_age_at_eviction_milliseconds metric is a histogram of every ModelMesh model that was evicted from the least recently used cache. A model is considered evicted when it is removed from the cache. Because this metric describes a least recently used cache, you can expect the oldest models to be evicted. If the model eviction age becomes low, then it might mean that too many evictions are occurring. Generally, a model eviction age of less than 4 to 7 days is significant. When you run the modelmesh_age_at_eviction_milliseconds_bucket query, the Value column displays the total number of evicted models for each observation bucket since the pod started. The observation buckets are indicated in milliseconds in the le column. For example, if the number indicated in the le column is 300000, then the Value column indicates the number of evicted models that were last used 300,000 milliseconds ago or less.
modelmesh_age_at_eviction_milliseconds_bucket
- Number of ModelMesh requests
- The modelmesh_age_at_eviction_milliseconds metric is a histogram of every ModelMesh model that was evicted from the least recently used cache. When you run the modelmesh_age_at_eviction_milliseconds_count query, the Value column displays the total number of models that were evicted since the pod started.
modelmesh_age_at_eviction_milliseconds_count
- Duration of ModelMesh requests
- The modelmesh_age_at_eviction_milliseconds metric is a histogram of every ModelMesh model that was evicted from the least recently used cache. It includes information such as response codes, methods that were started, and ModelMesh method types. When you run the modelmesh_age_at_eviction_milliseconds_sum query, the Value column displays the total age, in milliseconds, of evicted models since the pod started.
modelmesh_age_at_eviction_milliseconds_sum
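To compare the eviction age against the guidance of 4 to 7 days, the sum and count values can be combined into an average and converted from milliseconds to days. This is a minimal sketch that assumes the sum and count series carry the same labels, which is the usual case for a Prometheus histogram.

```promql
# Average model age at eviction, in days, since the pod started
# (1000 * 60 * 60 * 24 is the number of milliseconds in one day)
modelmesh_age_at_eviction_milliseconds_sum / modelmesh_age_at_eviction_milliseconds_count / (1000 * 60 * 60 * 24)
```

If no models were evicted yet, the count is 0 and the query returns NaN instead of a number.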
Algorithm training duration
- Algorithm training duration for observation buckets
- The assistant_algorithm_training_time_seconds metric is a histogram of every model training that occurs. It measures the time that the training algorithm took to train the model. It includes information such as status, service, model language, and an estimate of the workspace size. When you run the assistant_algorithm_training_time_seconds_bucket query, the Value column displays the total number of trainings for each observation bucket since the pod started. The observation buckets are indicated in seconds in the le column. For example, if the number indicated in the le column is 10.0, then the Value column indicates the total number of model trainings that took 10 seconds or less.
assistant_algorithm_training_time_seconds_bucket
- Number of algorithm trainings
- The assistant_algorithm_training_time_seconds metric is a histogram of every model training that occurs. It measures the time that the training algorithm took to train the model. It includes information such as status, service, model language, and an estimate of the workspace size. When you run the assistant_algorithm_training_time_seconds_count query, the Value column displays the total number of trainings that occurred since the pod started.
assistant_algorithm_training_time_seconds_count
- Duration of algorithm trainings
- The assistant_algorithm_training_time_seconds metric is a histogram of every model training that occurs. It measures the time that the training algorithm took to train the model. It includes information such as status, service, model language, and an estimate of the workspace size. When you run the assistant_algorithm_training_time_seconds_sum query, the Value column displays the total duration, in seconds, of model trainings since the pod started.
assistant_algorithm_training_time_seconds_sum
End-to-end model training duration
- Model training duration for observation buckets
- The assistant_total_training_time_seconds metric is a histogram of every model training that occurs. It measures how long the training took in seconds and includes information about the status and service. When you run the assistant_total_training_time_seconds_bucket query, the Value column displays the total number of trainings for each observation bucket since the pod started. The observation buckets are indicated in seconds in the le column. For example, if the number indicated in the le column is 10.0, then the Value column indicates the total number of model trainings that took 10 seconds or less.
assistant_total_training_time_seconds_bucket
- Number of model trainings
- The assistant_total_training_time_seconds metric is a histogram of every model training that occurs. It measures how long the training took in seconds and includes information about the status and service. When you run the assistant_total_training_time_seconds_count query, the Value column displays the total number of trainings that occurred since the pod started.
assistant_total_training_time_seconds_count
- Duration of model trainings
- The assistant_total_training_time_seconds metric is a histogram of every model training that occurs. It measures how long the training took in seconds and includes information about the status and service. When you run the assistant_total_training_time_seconds_sum query, the Value column displays the total duration, in seconds, of model trainings since the pod started.
assistant_total_training_time_seconds_sum
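The bucket, count, and sum values become easier to interpret when they are combined. The following sketch derives the average end-to-end training time and the share of trainings that completed within 10 seconds. It assumes that a bucket with the le value 10.0 exists, as in the example above. If the le column shows different boundaries in your deployment, substitute one of those values. Run each expression separately.

```promql
# Average end-to-end training time, in seconds, since the pod started
assistant_total_training_time_seconds_sum / assistant_total_training_time_seconds_count

# Fraction of trainings that took 10 seconds or less;
# ignoring(le) is needed because only the bucket series carries the le label
assistant_total_training_time_seconds_bucket{le="10.0"} / ignoring(le) assistant_total_training_time_seconds_count
```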
Volume
In the following queries, replace <VOLUME> with the regular expression for the data store type that you want to monitor. Use the following regular expressions for the data stores:
| Data store | Regular expression |
|---|---|
| EDB Postgres | .*wa-postgres-.* |
| Elasticsearch | data-.*wa-es-.*-.* |
| etcd | data-.*wa-etcd-.* |
| MinIO | export-.*wa-minio-.* |
| All data stores (EDB Postgres, Elasticsearch, etcd, and MinIO) | export-.*wa-minio-.*\|data-.*wa-es-.*-.*\|.*wa-postgres-.*\|data-.*wa-etcd-.* |
- Data remaining for volumes
- Displays the amount of space, in bytes, that remains available on the specified volumes.
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"<VOLUME>"}
- Rate of change for volumes
- Displays the rate of change in available space for the specified volumes over a period of 5 minutes.
deriv(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"<VOLUME>"}[5m])
- Size of volumes after current rate of change
- Displays the predicted available space on a volume after 24 hours at the current rate of change. This query helps determine whether a persistent volume will run out of space if the current ingestion or growth continues.
predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"<VOLUME>"}[5m], 24 * 3600)
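As an illustration of the substitution, the following sketch replaces <VOLUME> with the etcd expression from the table. The `< 0` comparison in the second expression is an addition that is not part of the original query: it returns only the volumes whose predicted available space drops below zero, that is, the volumes that are expected to fill up within 24 hours. Run each expression separately.

```promql
# Available space, in bytes, on the etcd volumes
kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-.*wa-etcd-.*"}

# etcd volumes that are predicted to run out of space within 24 hours;
# an empty result means that no volume is predicted to fill up
predict_linear(kubelet_volume_stats_available_bytes{persistentvolumeclaim=~"data-.*wa-etcd-.*"}[5m], 24 * 3600) < 0
```

A 5-minute window reacts quickly to short bursts of growth, so a longer range might give a steadier prediction for volumes that grow unevenly.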