Guidelines for configuring custom libraries
Custom libraries define how Concert evaluates operational data during resilience assessments and generates resilience scores, remediation recommendations, and resilience actions for supported runtime and infrastructure components.
- How operational data is evaluated
- How resilience scores are calculated
- How resilience actions and remediation recommendations are generated during assessments
- Define requirements that clearly represent operational risks
- Configure metrics that accurately evaluate runtime or infrastructure conditions
- Use aggregation and conversion functions that correctly normalize operational data
- Configure thresholds that support accurate resilience score calculations and resilience actions
The following sections describe the primary configuration areas that affect resilience evaluations and resilience action generation in custom libraries.
Resilience evaluation process
During a resilience assessment, Concert processes operational data in multiple stages before resilience scores and resilience actions are generated.
- Operational facts are collected from configured runtime or infrastructure sources.
- Fact aggregation functions evaluate multiple received values for the same operational fact field within the assessment period.
- Facts-to-metric conversion functions transform aggregated operational facts into metric values.
- Metrics are evaluated against configured thresholds.
- Requirement scores are calculated.
- Resilience scores, resilience actions, and remediation recommendations are generated for supported components.
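The middle stages of this process can be pictured as a small pipeline. The following is a minimal Python sketch, not Concert's implementation: the function name `run_assessment_stages`, its parameters, and the threshold values are all assumptions made for illustration.

```python
def run_assessment_stages(received_values, aggregate, convert, thresholds):
    """Pass one fact field through the aggregate, convert, and threshold stages."""
    aggregated = aggregate(received_values)   # fact aggregation
    metric_value = convert(aggregated)        # facts-to-metric conversion
    # Threshold evaluation: count how many ascending thresholds were reached.
    level = sum(1 for t in thresholds if metric_value >= t)
    return metric_value, level


# Three error counts received for the same deployment in one assessment period.
metric_value, level = run_assessment_stages(
    received_values=[2, 0, 5],
    aggregate=sum,
    convert=float,          # placeholder conversion for this sketch
    thresholds=[1, 5, 10],  # made-up values
)
print(metric_value, level)  # 7.0 2
```

Keeping each stage as a separate callable mirrors the way aggregation, conversion, and thresholds are configured independently in a custom library.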
Configuring requirement definitions
Requirement definitions identify the operational capability or runtime condition that Concert evaluates during resilience assessments.
- The operational condition being evaluated
- The associated resilience risk
- The expected runtime or infrastructure behavior
- The operational impact of failures or degraded states
Requirement descriptions should provide sufficient operational context to support accurate resilience evaluations and resilience action generation.
For example:
- Vague requirement descriptions can result in incomplete remediation recommendations.
- Incomplete operational context can reduce action accuracy for critical operational conditions.
Requirement categories affect how Concert groups operational risks and prioritizes resilience actions during evaluations.
- Security requirements can generate security-focused remediation actions.
- Availability requirements can prioritize runtime recovery actions.
- Misconfiguration-related requirements can generate configuration remediation recommendations.
Requirement evaluation behavior is controlled by the criteria_type configuration.
Supported evaluation methods can include:
- snippet_ref
- pre_computed
- Default evaluation functions
The selected evaluation method determines how metric values are evaluated during resilience assessments.
For example:
- A runtime availability requirement can evaluate whether application instances remain operational during deployment failures.
- A Kubernetes configuration requirement can evaluate whether CPU and memory limits are configured for containers.
- A Java runtime requirement can evaluate unsupported runtime versions or insecure runtime configurations.
The following example shows a requirement definition in a custom library configuration:
{
  "name": "message_throughput",
  "category": "Performance",
  "criteria_type": "snippet_ref",
  "expected_score": "85",
  "rating_thresholds": "50,70,85,95",
  "input_data_keys": "avg_messages_per_second, peak_message_rate"
}
This configuration defines:
- The operational capability being evaluated
- The expected resilience score
- Metric values used during evaluation
- The operational metrics required for scoring and resilience action generation
- The evaluation method used during assessments
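Several fields in the requirement definition are string-encoded: the score, the comma-separated thresholds, and the comma-separated metric keys. The following Python sketch shows one way to turn them into typed values when reviewing a definition before import. The helper `parse_requirement` is hypothetical and is not part of any Concert API.

```python
import json

REQUIREMENT_JSON = """
{
  "name": "message_throughput",
  "category": "Performance",
  "criteria_type": "snippet_ref",
  "expected_score": "85",
  "rating_thresholds": "50,70,85,95",
  "input_data_keys": "avg_messages_per_second, peak_message_rate"
}
"""


def parse_requirement(raw):
    """Parse the string-encoded fields of a requirement into typed values."""
    req = json.loads(raw)
    return {
        "name": req["name"],
        "category": req["category"],
        "criteria_type": req["criteria_type"],
        "expected_score": int(req["expected_score"]),
        "rating_thresholds": [int(t) for t in req["rating_thresholds"].split(",")],
        # Strip the whitespace that follows each comma in the key list.
        "input_data_keys": [k.strip() for k in req["input_data_keys"].split(",")],
    }


parsed = parse_requirement(REQUIREMENT_JSON)
print(parsed["rating_thresholds"])  # [50, 70, 85, 95]
print(parsed["input_data_keys"])    # ['avg_messages_per_second', 'peak_message_rate']
```

Parsing the definition this way makes typos such as a non-numeric threshold fail early with a `ValueError`, before the library is imported.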
Configuring metric definitions
Metric definitions specify the operational data that Concert evaluates for each requirement.
Metrics are configured in the input_data_keys.json file and define:
- Metric identifiers
- Aggregation behavior
- Supported data types
- Operational data sources
- Facts-to-metric conversion behavior
The input_data_keys.json file is used when importing or configuring custom libraries through APIs. Metric definitions should include enough operational context to support accurate evaluations and remediation actions.
For example:
- A container compliance metric can evaluate the percentage of containers missing CPU limits.
- A metric can aggregate NanoCpus values by using the latest value received during assessments.
- A metric can convert missing container counts into percentage-based evaluation values.
The following example shows how a metric can be defined in the input_data_keys.json configuration file for a custom library:
[
  {
    "name": "pct_containers_missing_cpu_limit",
    "data_type": "double",
    "unit": "i18n:pct_unit",
    "description": "i18n:pct_containers_missing_cpu_limit_desc_key",
    "label": "i18n:pct_containers_missing_cpu_limit_label",
    "origin": "i18n:container_scanner_origin_label",
    "metadata": {
      "sources": [
        {
          "source": "docker_inspect",
          "type": "Container",
          "search_key": "HostConfig",
          "fact_field": "NanoCpus",
          "fact_aggregation": "latest"
        }
      ]
    },
    "facts_to_metric_conversion": "missing_count_percentage"
  }
]
- The operational metric evaluated during assessments
- Metric levels used during metric evaluation
- Operational ranges that can affect resilience scores and resilience actions
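A metric definition like the one above can be checked for obvious gaps before it is imported. The following Python sketch is illustrative only: the helper `check_metric_definition` is hypothetical, and the set of fields it treats as required is an assumption, not a documented Concert rule. The aggregation names come from the aggregation functions described in this topic.

```python
KNOWN_AGGREGATIONS = {"sum", "average", "min", "max", "latest"}


def check_metric_definition(metric):
    """Return a list of problems found in one metric definition (empty if none)."""
    problems = []
    # Assumed-required top-level fields for this sketch.
    for field in ("name", "data_type", "facts_to_metric_conversion"):
        if not metric.get(field):
            problems.append(f"missing field: {field}")
    sources = metric.get("metadata", {}).get("sources", [])
    if not sources:
        problems.append("no sources defined")
    for i, source in enumerate(sources):
        if not source.get("fact_field"):
            problems.append(f"sources[{i}]: missing fact_field")
        if source.get("fact_aggregation") not in KNOWN_AGGREGATIONS:
            problems.append(f"sources[{i}]: unknown fact_aggregation")
    return problems


metric = {
    "name": "pct_containers_missing_cpu_limit",
    "data_type": "double",
    "metadata": {
        "sources": [
            {
                "source": "docker_inspect",
                "type": "Container",
                "search_key": "HostConfig",
                "fact_field": "NanoCpus",
                "fact_aggregation": "latest",
            }
        ]
    },
    "facts_to_metric_conversion": "missing_count_percentage",
}
print(check_metric_definition(metric))  # []
```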
Configuring fact aggregation functions
Fact aggregation functions evaluate multiple received values for the same operational fact field within an assessment period. Aggregation determines whether Concert uses the sum, average, latest value, minimum value, or maximum value before metric evaluation occurs. Aggregation does not define how data is collected. It defines how multiple received values for the same operational fact field are processed during evaluation.
For example:
- Error count values received multiple times for the same deployment during an assessment period can be summed.
- Response time values received for the same runtime instance during an assessment period can be averaged.
- CPU utilization values received for the same container during an assessment period can use the latest reported value.
- Availability status values received for the same application instance during an assessment period can use the minimum reported value.
| Aggregate function | Description | Example |
|---|---|---|
| sum | Sums multiple values received for the same metric within an assessment period | Total error count for the same deployment within an assessment period |
| average | Calculates the mean value across runtime instances | Average CPU utilization |
| min | Returns the lowest operational value | Lowest availability score across components |
| max | Returns the highest operational value | Peak memory utilization |
| latest | Uses the most recent operational value | Latest runtime configuration state |
Aggregation functions should be selected carefully because incorrect aggregation behavior can result in inaccurate resilience evaluations or incomplete resilience actions.
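To see how much the choice matters, the following Python sketch applies each of the five aggregation functions to the same set of received values. The `(timestamp, value)` shape of the readings and the `aggregate` helper are assumptions made for illustration, not Concert's data model.

```python
from statistics import mean


def aggregate(readings, function):
    """Reduce (timestamp, value) readings for one fact field to a single value."""
    values = [value for _, value in readings]
    if function == "sum":
        return sum(values)
    if function == "average":
        return mean(values)
    if function == "min":
        return min(values)
    if function == "max":
        return max(values)
    if function == "latest":
        # Pick the value with the greatest timestamp, regardless of arrival order.
        return max(readings, key=lambda r: r[0])[1]
    raise ValueError(f"unsupported aggregation: {function}")


# CPU utilization readings for one container; they arrived out of order.
readings = [(1, 40.0), (3, 60.0), (2, 95.0)]
for name in ("sum", "average", "min", "max", "latest"):
    print(name, aggregate(readings, name))
```

The same three readings produce 195.0, 65.0, 40.0, 95.0, and 60.0 respectively, so a threshold that is correct for one aggregation can be badly wrong for another.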
Configuring facts-to-metric conversion functions
Facts-to-metric conversion functions transform aggregated operational facts into metric values that Concert evaluates during resilience assessments. Conversion configurations are defined by using the facts_to_metric_conversion field in metric definitions.
- Boolean operational states can be converted into percentage-based metrics.
- Runtime health conditions can be converted into numerical resilience scores.
- Operational status values can be converted into threshold-based evaluation metrics.
- Misconfiguration compliance results can be converted into percentage-based compliance metrics.
Incorrect conversion configurations can result in inaccurate resilience scores, unsupported metric evaluations, or incomplete remediation actions.
The following table describes commonly used facts_to_metric_conversion functions and how they transform operational facts into metric values during resilience evaluations.
| No. | Function | Description | Use case | Example |
|---|---|---|---|---|
| 1. | | % of workloads with CPU usage < 90% of limits | CPU resource management | Identifies over-utilized containers |
| 2. | | % of workloads with memory usage < 90% of limits | Memory resource management | Identifies memory-constrained containers |
| 3. | | % of images using 'latest' tag | Image versioning compliance | Identifies containers with mutable tags |
| 4. | | % of workloads running as root | Security compliance | Identifies insecure containers |
| 5. | pct_configmaps_with_secrets_conversion | % of ConfigMaps with secrets (KSV109) | Security misconfiguration | Identifies ConfigMaps needing remediation |
| 6. | pct_configmap_with_sensitive_content_conversion | % of ConfigMaps with sensitive content (KSV1010) | Security compliance | Identifies sensitive data exposure |
| 7. | pct_services_with_external_ip_conversion | % of services with external IPs (KSV108) | Network security | Identifies exposed services |
| 8. | pct_workloads_with_privilege_escalation_conversion | % of workloads allowing privilege escalation (KSV001) | Security hardening | Identifies privilege escalation risks |
| 9. | percentage_workloads_with_all_capabilities_drop | % of workloads with all capabilities dropped | Security best practices | Identifies containers with excessive capabilities |
| 10. | percentage_workloads_with_only_net_bind_service_drop | % of workloads with only NET_BIND_SERVICE (KSV106) | Capability management | Identifies minimal capability compliance |
| 11. | percentage_workloads_with_host_ipc_namespace_access | % of workloads accessing host IPC (KSV008) | Namespace isolation | Identifies namespace boundary violations |
| 12. | percentage_workloads_with_host_network_access | % of workloads accessing host network (KSV009) | Network isolation | Identifies network boundary violations |
| 13. | percentage_workloads_with_host_pid_access | % of workloads accessing host PID (KSV010) | Process isolation | Identifies PID namespace violations |
| 14. | percentage_workloads_with_rootfs_not_read_only | % of workloads with writable root filesystem (KSV014) | Immutability compliance | Identifies mutable containers |
| 15. | percentage_workloads_with_extra_capabilities_added | % of workloads with extra capabilities | Least privilege compliance | Identifies over-privileged containers |
| 16. | percentage_workloads_with_hostpath_volumes_mounted | % of workloads using HostPath volumes | Volume security | Identifies risky volume mounts |
| 17. | percentage_workloads_with_access_to_host_ports | % of workloads accessing host ports | Port security | Identifies host port exposure |
| 18. | percentage_workloads_with_non_default_proc_masks | % of workloads with custom proc masks | Process security | Identifies host port exposure |
| 19. | percentage_workloads_with_non_core_volume_types | % of workloads using non-core volumes | Volume type compliance | Identifies non-standard volumes |
| 20. | percentage_workloads_without_runtime_profile | % of workloads without seccomp/AppArmor | Runtime security | Identifies unprotected workloads |
| 21. | percentage_workloads_with_default_seccomp_policies | % of workloads with default seccomp | Security profile compliance | Identifies workloads needing custom profiles |
| 22. | percentage_workloads_with_root_primary_supplementary_gid | % of workloads running as root GID | GID security | Identifies root group usage |
| 23. | percentage_workloads_with_binding_to_privileged_ports | % of workloads binding to privileged ports | Port privilege compliance | Identifies privileged port bindings |
| 24. | percentage_clusterroles_managing_secrets | % of ClusterRoles managing secrets | Secret access control | Identifies roles with secret access |
| 25. | percentage_roles_with_delete_log_permissions | % of Roles with log deletion permissions | Audit log protection | Identifies roles that can delete logs |
| 26. | percentage_roles_with_wildcard_verbs | % of Roles with wildcard verbs | Least privilege RBAC | Identifies overly permissive roles |
| 27. | percentage_cluster_roles_with_manage_all_resources | % of ClusterRoles managing all resources | Cluster-wide permissions | Identifies cluster-admin-like roles |
| 28. | percentage_roles_allowing_privileges_escalation | % of Roles allowing privilege escalation | RBAC security | Identifies escalation paths |
| 29. | percentage_roles_managing_configmaps | % of Roles managing ConfigMaps | ConfigMap access control | Identifies ConfigMap management roles |
| 30. | percentage_roles_allowing_exec_into_pods | % of Roles allowing pod exec | Pod access control | Identifies exec permissions |
| 31. | percentage_roles_managing_kubernetes_networking | % of Roles managing networking | Network policy control | Identifies network management roles |
| 32. | percentage_roles_with_manage_rbac_roles | % of Roles managing RBAC | RBAC management control | Identifies RBAC admin roles |
| 33. | percentage_rolebindings_with_admin_access | % of RoleBindings with admin access | Admin access control | Identifies admin bindings |
| 34. | percentage_clusterroles_managing_webhooks | % of ClusterRoles managing webhooks | Webhook security | Identifies webhook management roles |
| 35. | percentage_roles_with_manage_all_resources | % of Roles managing all resources | Namespace-wide permissions | Identifies namespace-admin roles |
| 36. | pct_roles_managing_secrets | % of Roles managing secrets | Secret access control | Identifies secret management roles |
| 37. | count_expired | Count expired certificates | Certificate management | Identifies expired certificates |
| 38. | last_hour_avg | Average value in last hour | Recent performance trends | Identifies current performance issues |
| 39. | last_hour_max_count | Maximum count in last hour | Error spike detection | Identifies error bursts |
| 40. | missing_count | Count empty/missing values | Identify gaps in data | Containers without memory limits |
| 41. | missing_count_percentage | Percentage of missing values | Compliance metrics | % of workloads missing CPU requests |
| 42. | percentage | Percentage of non-empty values | Compliance metrics | % of containers with limits defined |
| 43. | sum | Sum of values | Total counts, cumulative metrics | Total error count for a deployment |
| 44. | average | Average of values | Average utilization, typical values | Average CPU usage |
| 45. | latest | Latest value | Current state, latest reading | Current configuration value |
| 46. | max | Maximum value | Peak usage, worst case | Maximum image size |
| 47. | count_date_in_range | Count dates within a range | Lifecycle window tracking | Certificates expiring within a target window |
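The generic functions missing_count, missing_count_percentage, and percentage from the table can be approximated in a few lines of Python. This is a sketch based only on the one-line descriptions in the table. In particular, treating `None` and the empty string as "missing" and returning 0.0 for an empty input are assumptions, and Concert's actual behavior may differ.

```python
def is_missing(value):
    """Assumed definition of an empty or missing fact value."""
    return value is None or value == ""


def missing_count(values):
    """Count empty or missing values."""
    return sum(1 for v in values if is_missing(v))


def missing_count_percentage(values):
    """Percentage of values that are empty or missing."""
    if not values:
        return 0.0  # assumption: no data yields 0 rather than an error
    return 100.0 * missing_count(values) / len(values)


def percentage(values):
    """Percentage of values that are non-empty."""
    if not values:
        return 0.0
    return 100.0 - missing_count_percentage(values)


# One fact value per container; None marks a container with no value reported.
cpu_limits = [500000000, None, 250000000, 1000000000]
print(missing_count(cpu_limits))             # 1
print(missing_count_percentage(cpu_limits))  # 25.0
print(percentage(cpu_limits))                # 75.0
```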
Configuring resilience thresholds
- Healthy operational ranges
- Warning conditions
- Critical operational states
- Remediation trigger conditions
- High error rates can trigger critical resilience actions.
- Low availability values can generate operational risk alerts.
- Failed runtime health checks can generate corrective remediation actions.
- Unsupported runtime configurations can generate security remediation actions.
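Mapping a metric value to one of these operational ranges is a lookup against ordered boundaries. The following Python sketch assumes a metric where higher values are worse, such as an error rate. The level names, the boundary values, and the `classify` helper are hypothetical and do not reflect Concert's configuration format.

```python
from bisect import bisect_right

LEVELS = ["healthy", "warning", "critical"]  # hypothetical level names


def classify(metric_value, boundaries):
    """Map a metric value to a level using ascending lower bounds.

    boundaries[i] is the value at which level i + 1 begins, so there must be
    exactly len(LEVELS) - 1 boundaries.
    """
    if len(boundaries) != len(LEVELS) - 1:
        raise ValueError("expected one boundary between each pair of levels")
    return LEVELS[bisect_right(boundaries, metric_value)]


error_rate_boundaries = [10.0, 30.0]  # made-up values, in percent
print(classify(5.0, error_rate_boundaries))   # healthy
print(classify(10.0, error_rate_boundaries))  # warning
print(classify(45.0, error_rate_boundaries))  # critical
```

For a metric where lower values are worse, such as availability, the level list would be reversed rather than the comparison logic changed.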
Validating resilience evaluations and resilience actions
- Metrics are evaluated correctly
- Aggregation behavior produces expected results
- Threshold configurations generate expected severity levels
- Resilience actions are generated correctly for supported components
- Remediation recommendations align with operational conditions
- Component-level operational data
- Runtime-specific evaluations
- Threshold violation scenarios
- Resilience action behavior for critical operational conditions
- Remediation recommendations for infrastructure failures or configuration risks
Regular validation helps maintain accurate resilience evaluations, resilience scores, and operational remediation actions across environments.
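One lightweight way to make validation repeatable is to keep a table of scenarios with their expected outcomes and compare them against actual results. The following Python sketch shows the idea. The `find_mismatches` helper and the `toy_evaluate` stand-in are hypothetical, and in practice the evaluation step would be an assessment run against a test environment.

```python
def find_mismatches(evaluate, scenarios):
    """Return (name, expected, actual) for each scenario whose result differs."""
    mismatches = []
    for name, inputs, expected in scenarios:
        actual = evaluate(**inputs)
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches


# Stand-in for a real assessment run, using a made-up 5% critical boundary.
def toy_evaluate(error_rate_pct):
    return "critical" if error_rate_pct >= 5.0 else "healthy"


scenarios = [
    ("no errors", {"error_rate_pct": 0.0}, "healthy"),
    ("threshold violation", {"error_rate_pct": 7.5}, "critical"),
    ("boundary value", {"error_rate_pct": 5.0}, "critical"),
]
print(find_mismatches(toy_evaluate, scenarios))  # []
```

Including a boundary-value scenario is a deliberate choice, because off-by-one threshold comparisons are easy to miss with typical values alone.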