Guidelines for configuring custom libraries

Custom libraries define how Concert evaluates operational data during resilience assessments and generates resilience scores, remediation recommendations, and resilience actions for supported runtime and infrastructure components.

Library configurations affect:

How operational data is evaluated
How resilience scores are calculated
How resilience actions and remediation recommendations are generated during assessments

Incorrect or incomplete library configurations can result in inaccurate resilience evaluations, unsupported operational recommendations, or incomplete resilience actions during assessments.

When you configure custom libraries:

Define requirements that clearly represent operational risks
Configure metrics that accurately evaluate runtime or infrastructure conditions
Use aggregation and conversion functions that correctly normalize operational data
Configure thresholds that support accurate resilience score calculations and resilience actions

The following sections describe the primary configuration areas that affect resilience evaluations and resilience action generation in custom libraries.

Resilience evaluation process

During a resilience assessment, Concert processes operational data in multiple stages before resilience scores and resilience actions are generated.

The evaluation flow typically includes the following stages:

Operational facts are collected from configured runtime or infrastructure sources.
Fact aggregation functions evaluate multiple received values for the same operational fact field within the assessment period.
Facts-to-metric conversion functions transform aggregated operational facts into metric values.
Metrics are evaluated against configured thresholds.
Requirement scores are calculated.
Resilience scores, resilience actions, and remediation recommendations are generated for supported components.
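
The stages above can be pictured as a small pipeline. The following Python sketch is illustrative only and is not Concert's implementation: the evaluate_metric function, the success-percentage conversion, and the rule of counting how many ascending thresholds a value meets are all assumptions made for this example.

```python
from statistics import mean


def evaluate_metric(values, aggregate, convert, thresholds):
    """Run one fact field through the aggregate -> convert -> threshold stages.

    All three stages are passed in as plain Python values so the flow is
    visible; they are hypothetical stand-ins, not Concert functions.
    """
    aggregated = aggregate(values)        # fact aggregation
    metric_value = convert(aggregated)    # facts-to-metric conversion
    # Assumed rule: count how many ascending thresholds the value meets.
    level = sum(1 for t in thresholds if metric_value >= t)
    return metric_value, level


# Three error-rate readings (as fractions) received in one assessment period.
readings = [0.02, 0.04, 0.03]
metric_value, level = evaluate_metric(
    readings,
    aggregate=mean,
    convert=lambda error_rate: (1 - error_rate) * 100,  # success percentage
    thresholds=[50, 70, 85, 95],
)
print(f"metric value: {metric_value:.1f}, thresholds met: {level}")
```

Keeping the stages separate makes it easier to see which configuration area (aggregation, conversion, or thresholds) is responsible when an assessment produces an unexpected result.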

Configuring requirement definitions

Requirement definitions identify the operational capability or runtime condition that Concert evaluates during resilience assessments.

Requirement definitions should clearly identify:

The operational condition being evaluated
The associated resilience risk
The expected runtime or infrastructure behavior
The operational impact of failures or degraded states

Requirement descriptions should provide sufficient operational context to support accurate resilience evaluations and resilience action generation.

For example:

Vague requirement descriptions can result in incomplete remediation recommendations
Incomplete operational context can reduce action accuracy for critical operational conditions

Requirement categories affect how Concert groups operational risks and prioritizes resilience actions during evaluations.

For example:

Security requirements can generate security-focused remediation actions.
Availability requirements can prioritize runtime recovery actions.
Misconfiguration-related requirements can generate configuration remediation recommendations.

Requirement evaluation behavior is controlled by the criteria_type configuration.

Supported evaluation methods can include:

snippet_ref
pre_computed
Default evaluation functions

The selected evaluation method determines how metric values are evaluated during resilience assessments.

For example:

A runtime availability requirement can evaluate whether application instances remain operational during deployment failures.
A Kubernetes configuration requirement can evaluate whether CPU and memory limits are configured for containers.
A Java runtime requirement can evaluate unsupported runtime versions or insecure runtime configurations.

Note: Critical or high-severity requirements should include configurations that support accurate resilience actions and remediation generation during resilience assessments.

The following example shows a requirement definition in a custom library configuration:

{
  "name": "message_throughput",
  "category": "Performance",
  "criteria_type": "snippet_ref",
  "expected_score": "85",
  "rating_thresholds": "50,70,85,95",
  "input_data_keys": "avg_messages_per_second, peak_message_rate"
}

This configuration defines:

The operational capability being evaluated (name) and its category (category)
The evaluation method used during assessments (criteria_type)
The expected resilience score (expected_score)
The rating threshold values used during evaluation (rating_thresholds)
The operational metrics required for scoring and resilience action generation (input_data_keys)
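
Before importing a library, it can help to run a quick consistency check on each requirement definition. The following Python sketch parses the example requirement and applies a few checks. The check_requirement function is hypothetical, and the rules it applies (ascending thresholds, a score between 0 and 100, no empty or duplicate keys) are assumptions for illustration, not validation rules documented for Concert.

```python
import json

REQUIREMENT_JSON = """
{
  "name": "message_throughput",
  "category": "Performance",
  "criteria_type": "snippet_ref",
  "expected_score": "85",
  "rating_thresholds": "50,70,85,95",
  "input_data_keys": "avg_messages_per_second, peak_message_rate"
}
"""


def check_requirement(req):
    """Return a list of problems found in one requirement definition.

    The rules below are assumptions for this sketch, not Concert's own checks.
    """
    problems = []

    # rating_thresholds is a comma-separated string in the example.
    thresholds = [float(t) for t in req["rating_thresholds"].split(",")]
    if thresholds != sorted(thresholds):
        problems.append("rating_thresholds are not in ascending order")

    expected = float(req["expected_score"])
    if not 0 <= expected <= 100:
        problems.append("expected_score is outside the range 0-100")

    # input_data_keys is also a comma-separated string.
    keys = [k.strip() for k in req["input_data_keys"].split(",")]
    if any(not k for k in keys):
        problems.append("input_data_keys contains an empty entry")
    if len(keys) != len(set(keys)):
        problems.append("input_data_keys contains duplicates")

    return problems


requirement = json.loads(REQUIREMENT_JSON)
print(check_requirement(requirement) or "no problems found")
```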

Configuring metric definitions

Metric definitions specify the operational data that Concert evaluates for each requirement.

Metrics are configured in the input_data_keys.json file and define:

Metric identifiers
Aggregation behavior
Supported data types
Operational data sources
Facts-to-metric conversion behavior

Note: The input_data_keys.json file is used when importing or configuring custom libraries through APIs.

Metric definitions should include enough operational context to support accurate evaluations and remediation actions.

For example:

A container compliance metric can evaluate the percentage of containers missing CPU limits.
A metric can aggregate NanoCpus values by using the latest value received during assessments.
A metric can convert missing container counts into percentage-based evaluation values.

The following example shows a metric definition in the input_data_keys.json configuration file for a custom library:

[
  {
    "name": "pct_containers_missing_cpu_limit",
    "data_type": "double",
    "unit": "i18n:pct_unit",
    "description": "i18n:pct_containers_missing_cpu_limit_desc_key",
    "label": "i18n:pct_containers_missing_cpu_limit_label",
    "origin": "i18n:container_scanner_origin_label",
    "metadata": {
      "sources": [
        {
          "source": "docker_inspect",
          "type": "Container",
          "search_key": "HostConfig",
          "fact_field": "NanoCpus",
          "fact_aggregation": "latest"
        }
      ]
    },
    "facts_to_metric_conversion": "missing_count_percentage"
  }
]

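This configuration defines:

The metric identifier, data type, unit, and display labels
The operational data source and fact field (docker_inspect, NanoCpus) that supply the metric
The fact aggregation function (latest) that is applied to values received during the assessment period
The facts-to-metric conversion function (missing_count_percentage) that produces the metric value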
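
One way to catch incomplete configurations early is to confirm that every metric a requirement lists has a matching metric definition. The following Python sketch assumes that the entries in a requirement's input_data_keys field refer to the name values in input_data_keys.json. That link is inferred from the examples in this topic, and the unresolved_keys and describe helpers are hypothetical.

```python
import json

# Trimmed copy of the metric definition shown above (labels omitted).
METRICS_JSON = """
[
  {
    "name": "pct_containers_missing_cpu_limit",
    "data_type": "double",
    "metadata": {
      "sources": [
        {
          "source": "docker_inspect",
          "type": "Container",
          "search_key": "HostConfig",
          "fact_field": "NanoCpus",
          "fact_aggregation": "latest"
        }
      ]
    },
    "facts_to_metric_conversion": "missing_count_percentage"
  }
]
"""


def unresolved_keys(input_data_keys, metrics):
    """Return requirement keys that have no metric definition (assumed link)."""
    defined = {metric["name"] for metric in metrics}
    requested = [key.strip() for key in input_data_keys.split(",")]
    return [key for key in requested if key not in defined]


def describe(metric):
    """Summarize where a metric comes from and how it is processed."""
    parts = [
        f'{s["source"]}.{s["fact_field"]} ({s["fact_aggregation"]})'
        for s in metric["metadata"]["sources"]
    ]
    return f'{metric["name"]}: {", ".join(parts)} -> {metric["facts_to_metric_conversion"]}'


metrics = json.loads(METRICS_JSON)
for metric in metrics:
    print(describe(metric))
print(unresolved_keys("pct_containers_missing_cpu_limit, peak_message_rate", metrics))
```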

Configuring fact aggregation functions

Fact aggregation functions evaluate multiple received values for the same operational fact field within an assessment period. Aggregation determines whether Concert uses the sum, average, latest value, minimum value, or maximum value before metric evaluation occurs. Aggregation does not define how data is collected. It defines how multiple received values for the same operational fact field are processed during evaluation.

For example:

Error count values received multiple times for the same deployment during an assessment period can be summed.
Response time values received for the same runtime instance during an assessment period can be averaged.
CPU utilization values received for the same container during an assessment period can use the latest reported value.
Availability status values received for the same application instance during an assessment period can use the minimum reported value.

The following aggregation functions are commonly used:

Table 1. Aggregate functions
Aggregate functions	Description	Example
`sum`	Sums multiple values received for the same metric within an assessment period	Total error count for the same deployment within an assessment period
`average`	Calculates the mean of multiple values received for the same metric within an assessment period	Average CPU utilization
`min`	Returns the lowest operational value	Lowest availability score across components
`max`	Returns the highest operational value	Peak memory utilization
`latest`	Uses the most recent operational value	Latest runtime configuration state

Aggregation functions should be selected carefully because incorrect aggregation behavior can result in inaccurate resilience evaluations or incomplete resilience actions.
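
To see why the choice matters, the following Python sketch applies each of the five functions to the same set of received values. It illustrates the general behavior of these aggregations and is not Concert's implementation: the aggregate function and the (timestamp, value) sample format are assumptions for this example.

```python
from statistics import mean


def aggregate(samples, function):
    """Aggregate (timestamp, value) samples received for one fact field.

    Hypothetical helper: Concert's internal sample format is not shown here.
    """
    if not samples:
        raise ValueError("no values were received in the assessment period")
    values = [value for _, value in samples]
    if function == "sum":
        return sum(values)
    if function == "average":
        return mean(values)
    if function == "min":
        return min(values)
    if function == "max":
        return max(values)
    if function == "latest":
        # Pick by timestamp, not by arrival order, in case values arrive late.
        return max(samples, key=lambda sample: sample[0])[1]
    raise ValueError(f"unknown aggregation function: {function}")


# Error counts for one deployment; the third sample arrived out of order.
samples = [(1, 3), (3, 0), (2, 7)]
for name in ("sum", "average", "min", "max", "latest"):
    print(name, aggregate(samples, name))
```

With these samples, latest reports a deployment with no current errors while sum and max still reflect the earlier spike, which is the kind of difference to consider when you match an aggregation function to a requirement.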

Configuring facts-to-metric conversion functions

Facts-to-metric conversion functions transform aggregated operational facts into metric values that Concert evaluates during resilience assessments. Conversion configurations are defined by using the facts_to_metric_conversion field in metric definitions.

For example:

Boolean operational states can be converted into percentage-based metrics.
Runtime health conditions can be converted into numerical resilience scores.
Operational status values can be converted into threshold-based evaluation metrics.
Misconfiguration compliance results can be converted into percentage-based compliance metrics.

Incorrect conversion configurations can result in inaccurate resilience scores, unsupported metric evaluations, or incomplete remediation actions.

The following table describes commonly used facts_to_metric_conversion functions and how they transform operational facts into metric values during resilience evaluations.

Table 2. facts_to_metric_conversion functions
S.No	Function	Description	Use-case	Example
1.	`pct_workloads_within_90pct_cpu_limits_calc`	% of workloads with CPU usage < 90% of limits	CPU resource management	Identifies over-utilized containers
2.	`pct_workloads_within_90pct_memory_limits_calc`	% of workloads with memory usage < 90% of limits	Memory resource management	Identifies memory-constrained containers
3.	`latest_tag_percentage`	% of images using 'latest' tag	Image versioning compliance	Identifies containers with mutable tags
4.	`percentage_workload_run_as_root`	% of workloads running as root	Security compliance	Identifies insecure containers
5.	`pct_configmaps_with_secrets_conversion`	% of ConfigMaps with secrets (KSV109)	Security misconfiguration	Identifies ConfigMaps needing remediation
6.	`pct_configmap_with_sensitive_content_conversion`	% of ConfigMaps with sensitive content (KSV1010)	Security compliance	Identifies sensitive data exposure
7.	`pct_services_with_external_ip_conversion`	% of services with external IPs (KSV108)	Network security	Identifies exposed services
8.	`pct_workloads_with_privilege_escalation_conversion`	% of workloads allowing privilege escalation (KSV001)	Security hardening	Identifies privilege escalation risks
9.	`percentage_workloads_with_all_capabilities_drop`	% of workloads with all capabilities dropped	Security best practices	Identifies containers with excessive capabilities
10.	`percentage_workloads_with_only_net_bind_service_drop`	% of workloads with only NET_BIND_SERVICE (KSV106)	Capability management	Identifies minimal capability compliance
11.	`percentage_workloads_with_host_ipc_namespace_access`	% of workloads accessing host IPC (KSV008)	Namespace isolation	Identifies namespace boundary violations
12.	`percentage_workloads_with_host_network_access`	% of workloads accessing host network (KSV009)	Network isolation	Identifies network boundary violations
13.	`percentage_workloads_with_host_pid_access`	% of workloads accessing host PID (KSV010)	Process isolation	Identifies PID namespace violations
14.	`percentage_workloads_with_rootfs_not_read_only`	% of workloads with writable root filesystem (KSV014)	Immutability compliance	Identifies mutable containers
15.	`percentage_workloads_with_extra_capabilities_added`	% of workloads with extra capabilities	Least privilege compliance	Identifies over-privileged containers
16.	`percentage_workloads_with_hostpath_volumes_mounted`	% of workloads using HostPath volumes	Volume security	Identifies risky volume mounts
17.	`percentage_workloads_with_access_to_host_ports`	% of workloads accessing host ports	Port security	Identifies host port exposure
18.	`percentage_workloads_with_non_default_proc_masks`	% of workloads with custom proc masks	Process security	Identifies workloads with custom proc masks
19.	`percentage_workloads_with_non_core_volume_types`	% of workloads using non-core volumes	Volume type compliance	Identifies non-standard volumes
20.	`percentage_workloads_without_runtime_profile`	% of workloads without seccomp/AppArmor	Runtime security	Identifies unprotected workloads
21.	`percentage_workloads_with_default_seccomp_policies`	% of workloads with default seccomp	Security profile compliance	Identifies workloads needing custom profiles
22.	`percentage_workloads_with_root_primary_supplementary_gid`	% of workloads running as root GID	GID security	Identifies root group usage
23.	`percentage_workloads_with_binding_to_privileged_ports`	% of workloads binding to privileged ports	Port privilege compliance	Identifies privileged port bindings
24.	`percentage_clusterroles_managing_secrets`	% of ClusterRoles managing secrets	Secret access control	Identifies roles with secret access
25.	`percentage_roles_with_delete_log_permissions`	% of Roles with log deletion permissions	Audit log protection	Identifies roles that can delete logs
26.	`percentage_roles_with_wildcard_verbs`	% of Roles with wildcard verbs	Least privilege RBAC	Identifies overly permissive roles
27.	`percentage_cluster_roles_with_manage_all_resources`	% of ClusterRoles managing all resources	Cluster-wide permissions	Identifies cluster-admin-like roles
28.	`percentage_roles_allowing_privileges_escalation`	% of Roles allowing privilege escalation	RBAC security	Identifies escalation paths
29.	`percentage_roles_managing_configmaps`	% of Roles managing ConfigMaps	ConfigMap access control	Identifies ConfigMap management roles
30.	`percentage_roles_allowing_exec_into_pods`	% of Roles allowing pod exec	Pod access control	Identifies exec permissions
31.	`percentage_roles_managing_kubernetes_networking`	% of Roles managing networking	Network policy control	Identifies network management roles
32.	`percentage_roles_with_manage_rbac_roles`	% of Roles managing RBAC	RBAC management control	Identifies RBAC admin roles
33.	`percentage_rolebindings_with_admin_access`	% of RoleBindings with admin access	Admin access control	Identifies admin bindings
34.	`percentage_clusterroles_managing_webhooks`	% of ClusterRoles managing webhooks	Webhook security	Identifies webhook management roles
35.	`percentage_roles_with_manage_all_resources`	% of Roles managing all resources	Namespace-wide permissions	Identifies namespace-admin roles
36.	`pct_roles_managing_secrets`	% of Roles managing secrets	Secret access control	Identifies secret management roles
37.	`count_expired`	Count expired certificates	Certificate management	Identifies expired certificates
38.	`last_hour_avg`	Average value in last hour	Recent performance trends	Identifies current performance issues
39.	`last_hour_max_count`	Maximum count in last hour	Error spike detection	Identifies error bursts
40.	`missing_count`	Count empty/missing values	Identify gaps in data	Containers without memory limits
41.	`missing_count_percentage`	Percentage of missing values	Compliance metrics	% of workloads missing CPU requests
42.	`percentage`	Percentage of non-empty values	Compliance metrics	% of containers with limits defined
43.	`sum`	Sum of values	Total counts, cumulative metrics	Total error count for a deployment
44.	`average`	Average of values	Average utilization, typical values	Average CPU usage
45.	`latest`	Latest value	Current state, latest reading	Current configuration value
46.	`max`	Maximum value	Peak usage, worst case	Maximum image size
47.	`count_date_in_range`	Count dates within a range	Lifecycle window tracking	Certificates expiring within a target window
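
The generic conversions near the end of the table (missing_count, missing_count_percentage, and percentage) are closely related. The following Python sketch shows one plausible reading of them, based only on the descriptions in Table 2. It assumes that a value is missing when it is None or an empty string. Whether Concert also treats other values, such as 0, as missing is not stated in this topic.

```python
def is_missing(value):
    """Assumed definition of a missing fact value for this sketch."""
    return value is None or value == ""


def missing_count(values):
    """Count empty or missing values."""
    return sum(1 for value in values if is_missing(value))


def missing_count_percentage(values):
    """Percentage of values that are missing (0.0 when nothing was received)."""
    if not values:
        return 0.0
    return 100.0 * missing_count(values) / len(values)


def percentage(values):
    """Percentage of values that are non-empty (0.0 when nothing was received)."""
    if not values:
        return 0.0
    return 100.0 - missing_count_percentage(values)


# One aggregated CPU limit value per container; two containers report none.
cpu_limits = [2_000_000_000, None, 500_000_000, ""]
print(missing_count(cpu_limits))
print(missing_count_percentage(cpu_limits))
print(percentage(cpu_limits))
```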

Configuring resilience thresholds

Threshold configurations determine how Concert evaluates metric severity during resilience assessments.

Threshold configurations can define:

Healthy operational ranges
Warning conditions
Critical operational states
Remediation trigger conditions

For example:

High error rates can trigger critical resilience actions.
Low availability values can generate operational risk alerts.
Failed runtime health checks can generate corrective remediation actions.
Unsupported runtime configurations can generate security remediation actions.

Note: Critical and high-severity thresholds should be configured carefully to support accurate resilience actions and remediation generation for supported runtime and infrastructure components.
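
The direction of a threshold is easy to get wrong: for some metrics a higher value is worse (error rate), and for others a lower value is worse (availability). The following Python sketch shows the idea with healthy, warning, and critical ranges. The Band class, the classify function, and the numeric cut-offs are hypothetical and do not represent Concert's threshold format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """Hypothetical threshold pair for one metric."""
    warning: float
    critical: float
    higher_is_worse: bool = True


def classify(value, band):
    """Map a metric value to healthy, warning, or critical."""
    if band.higher_is_worse:
        if value >= band.critical:
            return "critical"
        if value >= band.warning:
            return "warning"
        return "healthy"
    # Lower values are worse, so the comparisons are reversed.
    if value <= band.critical:
        return "critical"
    if value <= band.warning:
        return "warning"
    return "healthy"


# Invented cut-offs for illustration only.
error_rate_pct = Band(warning=1.0, critical=5.0)
availability_pct = Band(warning=99.5, critical=99.0, higher_is_worse=False)

print(classify(6.2, error_rate_pct))
print(classify(99.2, availability_pct))
print(classify(0.3, error_rate_pct))
```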

Validating resilience evaluations and resilience actions

After configuring a custom library, validate the library by running resilience assessments against representative runtime or infrastructure data. Validation helps confirm that:

Metrics are evaluated correctly
Aggregation behavior produces expected results
Threshold configurations generate expected severity levels
Resilience actions are generated correctly for supported components
Remediation recommendations align with operational conditions

When possible, validate:

Component-level operational data
Runtime-specific evaluations
Threshold violation scenarios
Resilience action behavior for critical operational conditions
Remediation recommendations for infrastructure failures or configuration risks

Regular validation helps maintain accurate resilience evaluations, resilience scores, and operational remediation actions across environments.
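
Threshold violation scenarios can be written down as a small table of inputs and expected severity levels, including values on each side of a boundary. The following Python sketch shows that pattern. The severity function is a stand-in for whatever produces the level in your environment, and its cut-offs of 10 and 30 are invented for this example.

```python
def severity(pct_containers_missing_cpu_limit):
    """Stand-in evaluator; the 10 and 30 cut-offs are invented for this sketch."""
    if pct_containers_missing_cpu_limit >= 30:
        return "critical"
    if pct_containers_missing_cpu_limit >= 10:
        return "warning"
    return "healthy"


# (scenario name, metric value, expected severity level)
CASES = [
    ("all limits set", 0.0, "healthy"),
    ("just below warning", 9.9, "healthy"),
    ("at warning boundary", 10.0, "warning"),
    ("at critical boundary", 30.0, "critical"),
]


def failed_cases(evaluate, cases):
    """Return (name, expected, actual) for every scenario that does not match."""
    failures = []
    for name, value, expected in cases:
        actual = evaluate(value)
        if actual != expected:
            failures.append((name, expected, actual))
    return failures


print(failed_cases(severity, CASES) or "all scenarios passed")
```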