Technical Concepts and Limitations

CEX configuration ConfigMap updates

A cluster administrator might need to change the CEX configuration in the cluster-wide crypto ConfigMap, for example to add or remove CEX resources within a config set, or to add or remove whole crypto config sets.

This can be done during regular cluster uptime, but requires some care. Every CRYPTOCONFIG_CHECK_INTERVAL (default is 120s), all CEX device plug-in instances reread the crypto ConfigMap. The new ConfigMap is verified and, if valid, activated as the new current ConfigMap. After a successful ConfigMap reread, the plug-in logs a message:

CryptoConfig: updated configuration

If the verification of the new CEX ConfigMap fails, the CEX device plug-in logs an error message. One reason for a verification failure is that the ConfigMap cannot be read or parsed, which results in error logs like:

CryptoConfig: Can't open config file ...

or

CryptoConfig: Error parsing config file ...

If the verification step fails, the following message is displayed:

Config Watcher: failed to verify new configuration!

After any of these failures, the plug-in instances run without any configuration map.

The log messages appear periodically until yet another update of the ConfigMap is finally accepted as valid.
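When triaging plug-in logs, the messages quoted above can be matched mechanically. The following is a minimal sketch and not part of the plug-in: the function name classify_log_line and the category labels are hypothetical, and only the message texts are taken from this section.

```python
from typing import Optional

# Message texts quoted in this section, mapped to hypothetical labels.
_MESSAGES = [
    ("CryptoConfig: updated configuration", "updated"),
    ("CryptoConfig: Can't open config file", "read_error"),
    ("CryptoConfig: Error parsing config file", "parse_error"),
    ("Config Watcher: failed to verify new configuration!", "verify_failed"),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return a label if the line contains a known ConfigMap message."""
    for text, label in _MESSAGES:
        if text in line:
            return label
    return None  # unrelated log line
```

A line labeled verify_failed that keeps recurring indicates that the ConfigMap still needs a corrected update.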

Note: After an update of a configuration map, the cluster needs some time (typically up to 2 minutes) to propagate the changes to all nodes. Another, potentially faster, way to update the configuration map for the plug-in is to restart the plug-in daemonset:
kubectl rollout restart daemonset <name-of-the-cex-plug-in-daemonset> -n cex-device-plugin

Kubernetes then restarts each instance of the daemonset in a coordinated way.

Overcommitment of CEX resources

By default, the CEX device plug-in maps each available CEX resource (an APQN) to exactly one Kubernetes plug-in device. The plug-in device is the administration unit that is known to Kubernetes, and it is what a container requests. As a result, one APQN is assigned to a container that requests a CEX resource.

The CEX device plug-in can provide more than one plug-in device per APQN, which allows some overcommitment of the available CEX resources.

Setting the environment variable APQN_OVERCOMMIT_LIMIT to a value greater than 1 (default is 1) controls how many plug-in devices are announced to the Kubernetes system for each APQN. For example, with three APQNs available within a config set and an overcommit value of 10, 30 CEX plug-in devices are allocatable and up to 30 containers can successfully request a CEX resource. The environment variable is specified in the DaemonSet YAML file via the env parameter.

You can specify the optional ConfigSet parameter "overcommit" to control the overcommit limit at the config set level. If this parameter is omitted, the value defaults to the environment variable.
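The arithmetic behind the overcommit limit can be sketched as follows. This is an illustration only, not plug-in code: the function and parameter names are hypothetical, and the precedence of the config set parameter over the environment variable follows the description above.

```python
from typing import Optional


def allocatable_devices(num_apqns: int, env_limit: int = 1,
                        set_overcommit: Optional[int] = None) -> int:
    """Number of plug-in devices announced for one config set."""
    # The config set's "overcommit" parameter, when present, is used;
    # otherwise the value of APQN_OVERCOMMIT_LIMIT (env_limit) applies.
    limit = env_limit if set_overcommit is None else set_overcommit
    return num_apqns * limit


# Example from the text: three APQNs with an overcommit value of 10.
print(allocatable_devices(3, env_limit=10))  # 30
```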

With overcommitment enabled, more than one container might share one APQN. This exposes no security weakness, but might result in lower performance for the crypto operations within each container.

Note: Dynamically changing the overcommit value, either by changing the environment variable or by changing the overcommit parameter of a config set, changes the number of available CEX resources. If the number of available resources increases, containers waiting for resources might be able to run. If the number decreases, containers that are already running continue to run, even if a resource they use is no longer available. However, due to the lack of resources, those containers cannot be restarted.

The device node z90crypt

On a compute node, the device node /dev/z90crypt offers access to all zcrypt devices known to the compute node running as a KVM guest. An application in a container that requests a CEX resource also sees and uses the device node /dev/z90crypt. However, what is visible inside the container is in fact a newly constructed z90crypt device with access limited to the assigned APQN.

On the compute node, these constructed z90crypt devices are visible in the /dev directory as device nodes zcrypt-apqn-<card>-<domain>-<overcommitnr>. With the start of the container the associated device node on the compute node is mapped to the /dev/z90crypt device inside the container.
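The naming pattern of the constructed device nodes can be illustrated with a small sketch. It assumes that card, domain, and overcommit number are rendered as plain decimal numbers, which this section does not state; the helper names node_name and parse_node_name are hypothetical.

```python
import re
from typing import Tuple

# Assumption: all three fields are plain decimal numbers.
_NODE_RE = re.compile(r"^zcrypt-apqn-(\d+)-(\d+)-(\d+)$")


def node_name(card: int, domain: int, overcommit_nr: int) -> str:
    """Build a name of the form zcrypt-apqn-<card>-<domain>-<overcommitnr>."""
    return f"zcrypt-apqn-{card}-{domain}-{overcommit_nr}"


def parse_node_name(name: str) -> Tuple[int, int, int]:
    """Split a constructed node name into (card, domain, overcommit number)."""
    match = _NODE_RE.match(name)
    if match is None:
        raise ValueError(f"not a constructed zcrypt node name: {name!r}")
    card, domain, overcommit_nr = (int(group) for group in match.groups())
    return card, domain, overcommit_nr
```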

These constructed z90crypt devices are created dynamically with the CEX allocation request that is triggered with the container start and deleted automatically when the container terminates.

With version 1 of the CEX device plug-in, the constructed zcrypt device nodes limit access to exactly one APQN (adapter, usage domain, no control domain), allowing all ioctls.

Note: These settings allow both usage and control actions through the /dev/z90crypt device that is visible inside the container. These actions are restricted to the underlying APQN, even with overcommitted plug-in devices.

The shadow sysfs

The CEX device plug-in manipulates the AP part of the sysfs that a container can explore. The sysfs tree within a container contains two directories that are related to the AP/zcrypt functionality: /sys/bus/ap and /sys/devices/ap.

Tools working with zcrypt devices, like lszcrypt or ivp.e, need to see the restricted world, which is accessible via the /dev/z90crypt device node within the container.

The CEX device plug-in creates a shadow sysfs directory tree for each of these paths on the compute node at /var/tmp/shadowsysfs/<plug-in-device>. With the start of the container, both directories /sys/bus/ap and /sys/devices/ap are overlaid (overmounted) with the corresponding shadow directory on the compute node.

These shadow directory trees are simple static files that are created from the original sysfs entries on the compute node. They lose their sysfs functionality and show a static view of a limited AP/zcrypt world. For example, /sys/bus/ap/ap_adapter_mask is a 256-bit field listing all available adapters (crypto cards). The manipulated file that appears inside the container shows only the adapter that belongs to the assigned APQN. All load and counter values in the corresponding sysfs attributes, for example /sys/devices/ap/card<xx>/<xx>.<yyyy>/request_count, show up as 0 and don't get updates when a crypto load is running.
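How a 256-bit adapter mask that shows only a single adapter could look is sketched below. This is an illustration under assumptions, not the plug-in's code: it assumes that the mask is written as 0x followed by 64 hexadecimal digits and that adapter 0 corresponds to the most significant bit. The function name single_adapter_mask is hypothetical.

```python
def single_adapter_mask(adapter: int) -> str:
    """Return a 256-bit mask string with only the given adapter's bit set."""
    if not 0 <= adapter <= 255:
        raise ValueError("adapter number must be in 0..255")
    # Assumption: adapter 0 is the most significant (leftmost) bit.
    mask = 1 << (255 - adapter)
    # 256 bits are 64 hexadecimal digits, zero-padded on the left.
    return "0x" + format(mask, "064x")
```

Under these assumptions, a container that was assigned an APQN on adapter 4 would see a mask that starts with 0x08 followed by zeros.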

This restricted sysfs within a container should be sufficient to satisfy the discovery tasks of most applications (lszcrypt, ivp.e, opencryptoki with CCA or EP11 token) but has limits. For example, chzcrypt fails to change sysfs attributes, offline switch of a queue will not work, and applications inspecting counter values might get confused.

An administrator who is logged in to a Kubernetes compute node might figure out which CEX resource is assigned to which requesting container, for example by reading the log messages from the plug-ins. Without overcommitment, the counters of an APQN on the compute node reflect the crypto load of the associated container and lszcrypt can be used.

Hot plug and hot unplug of APQNs

The CEX device plug-in monitors the APQNs available on the compute node, by default every 30 seconds. This monitoring covers the existence of APQNs and their online state. When the compute node runs as a KVM guest, the devices section of the guest's XML definition can be modified live at the KVM host, which results in APQNs appearing or disappearing. The AP bus and zcrypt device driver inside the Linux system recognize this as hot plug or unplug of crypto cards and domains.

It is also possible to directly change the online state of a card or APQN within a compute node. For example, an APQN might be available but intentionally switched offline by a system administrator.

A dialog on the HMC offers the possibility to configure off and configure on CEX cards that are assigned to an LPAR. A CEX card in config off state is still visible in the LPAR, and thus in the compute node, but, similar to the offline state, is no longer usable.

All this might cause the CEX device plug-in to deal with varying CEX resources. The plug-in code can handle hot plug, hot unplug, and online state changes of CEX resources, and reports changes in the config set to the Kubernetes system. Because of this handling, CEX config sets can include APQNs that do not yet exist when the CEX configuration map is first deployed. When the card is later hot plugged and assigned to the running LPAR, the cluster spots this and makes the appearing APQNs, which are already members of a config set, available for allocation requests.

The handling of the online state is done by reporting the relevant plug-in devices as healthy (online) or unhealthy (offline). An unhealthy plug-in device is not considered when a CEX resource allocation takes place.
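The mapping from online state to health, and its effect on allocation, can be sketched as follows. This is a simplified model, not the plug-in's implementation: the function names and the device identifiers used as dictionary keys are hypothetical.

```python
from typing import Dict, List


def device_health(online: Dict[str, bool]) -> Dict[str, str]:
    """Report each plug-in device as healthy (online) or unhealthy (offline)."""
    return {device: ("healthy" if is_online else "unhealthy")
            for device, is_online in online.items()}


def allocatable(online: Dict[str, bool]) -> List[str]:
    """Devices that are considered when a CEX resource is allocated."""
    # Unhealthy (offline) devices are skipped.
    return sorted(device for device, is_online in online.items() if is_online)


# Hypothetical scan result: one APQN online, one switched offline.
scan = {"device-a": True, "device-b": False}
print(device_health(scan))
print(allocatable(scan))
```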

Note: A CEX resource might become unusable (hot unplug or offline state) while it is assigned to a running container. The plug-in recognizes the state change, updates the bookkeeping, and reports this to the Kubernetes system but does not stop or kill the running container. It is assumed that the container workload fails anyway because the AP bus or zcrypt device driver on the compute node reacts with failures on any attempt to use such a CEX resource. A well-designed cluster application terminates with an error return code, which causes Kubernetes to start a new container. The new container claims a CEX resource, and the situation recovers automatically.

SELinux and the Init Container

The CEX device plug-in prepares various files and directories that are mounted into the pod on an allocation request. Among those mounts are the directories described under The shadow sysfs. These folders are generated on the compute node and mounted into the new pod. Sometimes, special actions are needed for such a mount to be accessible inside the newly created pod. For example, with SELinux, the folder or one of its parent folders must have the appropriate SELinux label. Other security mechanisms might have different requirements.

Because the security mechanisms and their configuration depend on the cluster instance, the CEX device plug-in does not provide any support for such mechanisms. Instead, in the SELinux case, an Init Container can be used to set the correct label on the shadow sysfs root folder /var/tmp/shadowsysfs that contains all the subfolders that are mapped into pods. See Sample CEX device plug-in daemonset yaml for an example of a daemonset deployment of the CEX device plug-in that contains an init container to set up /var/tmp/shadowsysfs for use in a SELinux-enabled environment.