Technical Concepts and Limitations

CEX configuration ConfigMap updates

From a cluster administration point of view it is desirable to change the CEX configuration in the cluster-wide crypto ConfigMap. For example, to add or remove CEX resources within a config set or even add or remove whole crypto config sets.

This can be done during regular cluster uptime but with some carefulness. Every CRYPTOCONFIG_CHECK_INTERVAL (default is 120s) the crypto ConfigMap is re-read by all the CEX device plug-in instances. The new ConfigMap is verified and if valid, activated as the new current ConfigMap. On successful ConfigMap re-read the plug-in logs a message:

    CryptoConfig: updated configuration

If the verification of the new CEX ConfigMap fails, the CEX device plug-in logs an error message. One reason for the verification failure might be the failure to read or parse the ConfigMap resulting in error logs like:

    CryptoConfig: Can't open config file ...

    CryptoConfig: Error parsing config file ...

If the verification step fails, the following message is displayed:

    Config Watcher: failed to verify new configuration!

These failures result in running the plug-in instances without any configuration map.

The log messages appear periodically until yet another update of the ConfigMap is finally accepted as valid.

Note: After an update of a configuration map, the cluster needs some time (typically up to 2 minutes) to propagate the changes to all nodes. Another, potentially faster, way to update the configuration map for the plug-in is to restart the rollout of the deployment via:

    kubectl rollout restart daemonset <name-of-the-cex-plug-in-daemonset> -n cex-device-plugin

This triggers a restart of each instance of the daemonset in a coordinated way by Kubernetes.

Overcommitment of CEX resources

Edit online

By default, a CEX resource (an APQN) maps to exactly one Kubernetes plug-in-device. This is the administration unit known by Kubernetes and in fact a container requests such a plug-in device.

By default, the CEX device plug-in maps each available APQN to one plug-in device and as a result one APQN is assigned to a container requesting a CEX resource.

The CEX device plug-in can provide more than one plug-in-device per APQN, which allows some overcommitment of the available CEX resources.

Setting the environment variable APQN_OVERCOMMIT_LIMIT to a value greater than 1 (default is 1) allows to control how many plug-in devices are announced to the Kubernetes system for each APQN. For example, with three APQNs available within a config set and an overcommit value of 10, 30 CEX plug-in devices are allocatable and up to 30 containers could successfully request a CEX resource. The environment variable is specified in the DaemonSet YAML file via the env parameter.

You can specify the optional ConfigSet parameter "overcommit" to control the overcommit limit at config set level. If this parameter is omitted, the value defaults to the environment variable.

Eventually, more than one container will share one APQN with overcommitment enabled. This exposes no security weakness, but might result in lower performance for the crypto operations within each container.

Note: Dynamically changing the overcommit value, either by changing the environment variable, or by changing the overcommit parameter of a config set, changes the number of available CEX resources. If the number of available resources increases, containers waiting for resources might be able to run. Whereas already running containers continue to run, even if a used resource is no more available because of the decreased number of available resources. Due to lack of resources, those containers cannot be restarted.

The device node z90crypt

Edit online

On a compute node, the device node /dev/z90crypt offers access to all zcrypt devices known to the compute running as a KVM guest. The application of a container, which requests a CEX resource will also see and use the device node /dev/z90crypt. However, what is visible inside the container is in fact a newly constructed z90crypt device with limited access to only the APQN assigned.

On the compute node, these constructed z90crypt devices are visible in the /dev directory as device nodes zcrypt-apqn-<card>-<domain>-<overcommitnr>. With the start of the container the associated device node on the compute node is mapped to the /dev/z90crypt device inside the container.

These constructed z90crypt devices are created on the fly with the CEX allocation request triggered with the container start and deleted automatically when the container terminates.

With version 1 of the CEX device plug-in, the constructed zcrypt device nodes limit access to exact one APQN (adapter, usage domain, no control domain), allowing all ioctls.

Note: These settings allow both usage and control actions, which are restricted to the underlying APQN with the /dev/z90crypt device that is visible inside the container, even with overcommited plug-in devices.

The shadow sysfs

Edit online

The CEX device plug-in manipulates the AP part of the sysfs that a container can explore. The sysfs tree within a container contains two directories related to the AP/zcrypt functionality: /sys/bus/ap and /sys/devices/ap.

Tools working with zcrypt devices, like lszcrypt or ivp.e, need to see the restricted world, which is accessible via the /dev/z90crypt device node within the container.

The CEX device plug-in creates a shadow sysfs directory tree for each of these paths on the compute node at /var/tmp/shadowsysfs/<plug-in-device>. With the start of the container, both directories /sys/bus/ap and /sys/devices/ap are overlayed (overmounted) with the corresponding shadow directory on the compute node.

These shadow directory trees are simple static files that are created from the original sysfs entries on the compute node. They loose their sysfs functionality and show a static view of a limited AP/zcrypt world. For example, /sys/bus/ap/ap_adapter_mask is a 256 bit field listing all available adapters (crypto cards). The manipulated file that appears inside the container only shows the adapter that belongs to the assigned APQN. All load and counter values in the corresponding sysfs attributes, for example /sys/devices/ap/card<xx>/<xx>.<yyyy>/request_count, show up as 0 and don't get updates when a crypto load is running.

This restricted sysfs within a container should be sufficient to satisfy the discovery tasks of most applications (lszcrypt, ivp.e, opencryptoki with CCA or EP11 token) but has limits. For example, chzcrypt will fail to change sysfs attributes, offline switch of a queue will not work, and applications inspecting counter values might get confused.

An administrator logged into a Kubernetes compute node could figure out the assignment of a CEX resource and a requesting container. For example, by reading the log messages from the plug-ins. Without overcommitment the counters of an APQN on the compute node reflect the crypto load of the associated container and lszcrypt can be used.

Hot plug and hot unplug of APQNs

Edit online

The CEX device plug-in monitors the APQNs available on the compute node by default every 30 seconds. This comprises the existence of APQNs and their online state. When the compute node runs as a KVM guest it is possible to live modify the devices section of the guest's xml definition at the KVM host, which results in APQNs appearing or disappearing. The AP bus and zcrypt device driver inside the Linux system recognizes this as hot plug or unplug of crypto cards and/or domains.

It is also possible to directly change the online state of a card or APQN within a compute node. For example, an APQN might be available but switched to offline by intention by an system administrator.

A dialog on the HMC offers the possibility to configure off and configure on CEX cards assigned to an LPAR. A CEX card in config off state is still visible in the LPAR and thus in the compute node but similar to the offline state no longer usable.

All this might cause the CEX device plug-in to deal with varying CEX resources. The plug-in code is capable of handling hot plug, hot unplug, the online state changes of CEX resources, and reports changes in the config set to the Kubernetes system. Because of this handling, APQNs can be included into the CEX config sets, which might not exist at the time of first deployment of the CEX configuration map. At a later time the card is hot plugged and assigned to the running LPAR. The cluster will spot this and make the appearing APQNs, which are already a member in a config set, available for allocation requests.

The handling of the online state is done by reporting the relevant plug-in devices as healthy (online) or unhealthy (offline). An unhealthy plug-in device is not considered when a CEX resource allocation takes place.

Note: It might happen that a CEX resource becomes unusable (hot unplug or offline state) but is assigned to a running container. The plug-in recognizes the state change, updates the bookkeeping, and reports this to the Kubernetes system but does not stop or kill the running container. It is assumed that the container load fails anyway because the AP bus or zcrypt device driver on the compute node reacts with failures on the attempt to use such a CEX resource device. A well designed cluster application terminates with a bad return code causing Kubernetes to re-establish a new container, which will claim a CEX resource and the situation recovers automatically.

SELinux and the Init Container

Edit online

The CEX device plug-in prepares various files and directories that become mounted to the pod at an allocation request. Among those mounts are the directories descibed under The shadow sysfs. These folders are generated on the compute node and mounted into the new pod. In some cases, special actions are needed for such a mount to be accessible inside the newly created pod. For example, SELinux where the folder, or one of its parent folders, must have the appropriate SELinux label. Other security mechanisms might have different requirements.

Because the security mechanisms and their configuration depend on the cluster instance, the CEX device plug-in does not provide any support for such mechanisms. Instead, in the SELinux case, an Init Container can be used to set the correct label on the shadow sysfs root folder /var/tmp/shadowsysfs that contains all the sub-folders that are mapped into pods. See Sample CEX device plug-in daemonset yaml for an example of a daemonset deployment of the CEX device plug-in that contains an init container to set up /var/tmp/shadowsysfs for use in a SELinux-enabled environment.

Limitations

Edit online

Namespaces and the project field

Edit online

The project field of a CEX config set should match the namespace of the container requesting a member of this set. This results in only blue applications being able to allocate blue APQNs from the blue config set.

Unfortunately, the allocation request forwarded from the Kubernetes system to the CEX device plug-in does not provide any namespace information. Therefore, the plug-in is not able to check the namespace affiliation.

When the container runs, the surveillance loop of the CEX device plug-in detects this mismatch and displays a log entry:

PodLister: Container <aaa> in namespace <bbb> uses a CEX resource <ccc> marked for project <ddd>!.

This behavior can be a security risk as this opens the possibility to use the HSM of another group of applications. However, to really exploit this, more is needed. For example, a secure key from the target to attack or the possibility to insert a self made secure key into the target application.

As a workaround, you can set quotas for all namespaces except for the one that is allowed to use the resource. See the following example:

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: cex-blue-quota-no-red
      namespace: blue
    spec:
      hard:
        requests.cex.s390.ibm.com/red: 0
        limits.cex.s390.ibm.com/red: 0

This yaml snippet restricts the namespace blue to allocate zero CEX resources from the crypto config set cex.s390.ibm.com/red. The result is that all containers, which belong to the blue namespace, are not able to allocate red CEX resources any more.

Sample CEX quota restriction script in the appendix shows a bash script that produces a yaml file, which establishes these quota restrictions.