Multi-Cluster Diagnostics
Multi-Cluster Diagnostics offers a single view into the health of all the clusters you currently monitor with Kubecost.
Health checks include, but are not limited to:
- Whether Kubecost is correctly emitting metrics
- Whether Kubecost is being scraped by Prometheus
- Whether Prometheus has scraped the required metrics
- Whether Kubecost's ETL files are healthy
Configuration
# This is an abridged example. Full example in link below.
diagnostics:
  enabled: true
  isDiagnosticsPrimary:
    enabled: true # Only enable this on your primary cluster

# Ensure you have configured a unique CLUSTER_ID.
prometheus:
  server:
    global:
      external_labels:
        cluster_id: YOUR_CLUSTER_ID

# Ensure you have configured a storage config secret. Using `.Values.thanos.storeSecretName` would also work here.
kubecostModel:
  federatedStorageConfigSecret: federated-store
Additional configuration options can be found in the values.yaml under diagnostics:.
Architecture
The multi-cluster diagnostics feature runs as an independent deployment (i.e. deployment/kubecost-diagnostics). Each diagnostics deployment monitors the health of Kubecost and sends that health data to the central object store at the /diagnostics filepath.
The diagram below depicts these interactions. It covers only the requests required for diagnostics. For additional diagrams, see our multi-cluster guide.
API usage
The diagnostics API can be accessed through /model/multi-cluster-diagnostics?window=2d (or /model/mcd for short).
The window query parameter is required. The API returns all diagnostics within the specified time window.
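As a rough illustration of calling the endpoint, the Python sketch below builds the request URL and fetches the diagnostics using only the standard library. The helper names `build_mcd_url` and `fetch_diagnostics`, the example address, and the timeout are assumptions made for illustration; only the path and the `window` parameter come from the description above.

```python
import json
import urllib.parse
import urllib.request


def build_mcd_url(base_url: str, window: str) -> str:
    """Return the multi-cluster diagnostics URL for the given window."""
    query = urllib.parse.urlencode({"window": window})
    return f"{base_url.rstrip('/')}/model/multi-cluster-diagnostics?{query}"


def fetch_diagnostics(base_url: str, window: str, timeout: float = 30.0) -> dict:
    """GET the diagnostics endpoint and decode the JSON body."""
    url = build_mcd_url(base_url, window)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical address; replace it with your own Kubecost address.
    print(build_mcd_url("http://localhost:9090", "2d"))
```

Keeping URL construction separate from the network call makes the request easy to inspect before it is sent.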
Multi-Cluster Diagnostics API
GET http://<your-kubecost-address>/model/multi-cluster-diagnostics
The Multi-cluster Diagnostics API provides a single view into the health of all the clusters you currently monitor with Kubecost.
Name | Required | Type | Description |
---|---|---|---|
window | true | string | Duration of time over which to query. Accepts words like today, week, month, yesterday, lastweek, lastmonth; durations like 30m, 12h, 7d; comma-separated RFC3339 date pairs like 2021-01-02T15:04:05Z,2021-02-02T15:04:05Z; comma-separated Unix timestamp (seconds) pairs like 1578002645,1580681045. |
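The date-pair forms of window can be produced programmatically. The following Python sketch formats a start and end time as either an RFC3339 pair or a Unix timestamp pair. The function names `rfc3339_window` and `unix_window` are hypothetical helpers, and the sketch assumes both datetimes are timezone-aware.

```python
from datetime import datetime, timezone

# RFC3339 layout in UTC, matching the example pair in the table above.
_RFC3339 = "%Y-%m-%dT%H:%M:%SZ"


def rfc3339_window(start: datetime, end: datetime) -> str:
    """Format two timezone-aware datetimes as a comma-separated RFC3339 pair."""
    start_utc = start.astimezone(timezone.utc).strftime(_RFC3339)
    end_utc = end.astimezone(timezone.utc).strftime(_RFC3339)
    return f"{start_utc},{end_utc}"


def unix_window(start: datetime, end: datetime) -> str:
    """Format two timezone-aware datetimes as a comma-separated Unix-seconds pair."""
    return f"{int(start.timestamp())},{int(end.timestamp())}"


if __name__ == "__main__":
    start = datetime(2021, 1, 2, 15, 4, 5, tzinfo=timezone.utc)
    end = datetime(2021, 2, 2, 15, 4, 5, tzinfo=timezone.utc)
    print(rfc3339_window(start, end))
    print(unix_window(start, end))
```

Either string can be passed as the window value. URL-encode it when placing it in a query string, since it contains commas and colons.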
200: OK
{
    "code": 200,
    "data": {
        "overview": {
            "kubecostEmittingMetricDiagnosticPassed": true,
            "prometheusHasKubecostMetricDiagnosticPassed": true,
            "prometheusHasCadvisorMetricDiagnosticPassed": true,
            "prometheusHasKSMMetricDiagnosticPassed": true,
            "dailyAllocationEtlHealthyDiagnosticPassed": true,
            "dailyAssetEtlHealthyDiagnosticPassed": true,
            "kubecostPodsNotOOMKilledDiagnosticPassed": true,
            "kubecostPodsNotPendingDiagnosticPassed": false
        },
        "clusters": {
            "cluster_one": {
                "latestRun": "2023-12-12T22:42:32Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                }
            },
            "cluster_two": {
                "latestRun": "2023-12-12T22:40:17Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": false,
                    "numFailures": 52,
                    "firstFailureDate": "2023-12-12T18:25:09Z",
                    "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
                }
            },
            "cluster_three": {
                "latestRun": "2023-12-12T22:40:15Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": false,
                    "numFailures": 52,
                    "firstFailureDate": "2023-12-12T18:24:42Z",
                    "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
                }
            }
        }
    }
}
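
A response shaped like the example above can be reduced to just the failing checks per cluster. The Python sketch below is one possible way to do that. The function name `failing_diagnostics` is a hypothetical helper, and it assumes the field names shown in the sample response (`data`, `clusters`, `diagnosticPassed`).

```python
def failing_diagnostics(response: dict) -> dict:
    """Map each cluster name to the list of diagnostics that did not pass."""
    failures = {}
    clusters = response.get("data", {}).get("clusters", {})
    for cluster, checks in clusters.items():
        failed = [
            name
            for name, result in checks.items()
            # "latestRun" is a timestamp string, not a diagnostic result, so skip non-dicts.
            if isinstance(result, dict) and result.get("diagnosticPassed") is False
        ]
        if failed:
            failures[cluster] = failed
    return failures


if __name__ == "__main__":
    # Trimmed-down sample in the same shape as the response above.
    sample = {
        "code": 200,
        "data": {
            "clusters": {
                "cluster_one": {
                    "latestRun": "2023-12-12T22:42:32Z",
                    "kubecostPodsNotPending": {"diagnosticPassed": True, "numFailures": 0},
                },
                "cluster_two": {
                    "latestRun": "2023-12-12T22:40:17Z",
                    "kubecostPodsNotPending": {"diagnosticPassed": False, "numFailures": 52},
                },
            }
        },
    }
    print(failing_diagnostics(sample))
```

Clusters with no failing checks are omitted from the result, so an empty dictionary means every reported diagnostic passed.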