Technical Concepts and Limitations
CEX configuration ConfigMap updates
From a cluster administration point of view, it is desirable to change the CEX configuration in the cluster-wide crypto ConfigMap, for example, to add or remove CEX resources within a config set, or even to add or remove whole crypto config sets.
This can be done during regular cluster uptime, but it requires some care. Every
CRYPTOCONFIG_CHECK_INTERVAL (default is 120s) the crypto ConfigMap is reread by all
the CEX device plug-in instances. The new ConfigMap is verified and, if valid, activated as the new
current ConfigMap. On a successful ConfigMap reread, the plug-in logs a message:
CryptoConfig: updated configuration
If the verification of the new CEX ConfigMap fails, the CEX device plug-in logs an error message. One reason for the verification failure might be the failure to read or parse the ConfigMap resulting in error logs like:
CryptoConfig: Can't open config file ...
or
CryptoConfig: Error parsing config file ...
If the verification step fails, the following message is displayed:
Config Watcher: failed to verify new configuration!
After such a failure, the plug-in instances run without any configuration map.
The log messages appear periodically until a later update of the ConfigMap is accepted as valid.
To restart the plug-in instances, run:
kubectl rollout restart daemonset <name-of-the-cex-plug-in-daemonset> -n cex-device-plugin
This makes Kubernetes restart each instance of the daemonset in a coordinated way.
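The log messages quoted above can be used to check the outcome of the most recent ConfigMap reread. The following is a minimal sketch, not part of the plug-in: the function names and category labels are hypothetical, and it assumes only that the quoted message texts appear somewhere in each log line (for example, in the output of kubectl logs for a plug-in pod).

```python
from typing import Iterable, Optional

# Message texts quoted in this section; the category labels are illustrative.
LOG_PATTERNS = [
    ("CryptoConfig: updated configuration", "updated"),
    ("CryptoConfig: Can't open config file", "read_error"),
    ("CryptoConfig: Error parsing config file", "parse_error"),
    ("Config Watcher: failed to verify new configuration!", "verify_failed"),
]


def classify_log_line(line: str) -> Optional[str]:
    """Return a category for a plug-in log line, or None if it is unrelated."""
    for text, category in LOG_PATTERNS:
        if text in line:
            return category
    return None


def last_reread_succeeded(lines: Iterable[str]) -> Optional[bool]:
    """Report the outcome of the most recent ConfigMap reread seen in the log.

    True: the last relevant message was a successful update.
    False: the last relevant message was a read, parse, or verify failure.
    None: no relevant message was found.
    """
    state = None
    for line in lines:
        category = classify_log_line(line)
        if category == "updated":
            state = True
        elif category is not None:
            state = False
    return state
```

A result of False suggests that the ConfigMap needs another correction before the plug-in instances accept it.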
Overcommitment of CEX resources
By default, the CEX device plug-in maps each available CEX resource (an APQN) to exactly one Kubernetes plug-in device. The plug-in device is the administration unit that is known by Kubernetes, and it is what a container actually requests. As a result, one APQN is assigned to each container that requests a CEX resource.
The CEX device plug-in can also provide more than one plug-in device per APQN, which allows some overcommitment of the available CEX resources.
The environment variable APQN_OVERCOMMIT_LIMIT (default is 1), when set to a value greater than 1,
controls how many plug-in devices are announced to the Kubernetes system for
each APQN. For example, with three APQNs available within a config set and an overcommit value of
10, 30 CEX plug-in devices are allocatable and up to 30 containers could successfully request a CEX
resource. The environment variable is specified in the DaemonSet YAML file via the
env parameter.
You can specify the optional ConfigSet parameter "overcommit" to control the overcommit limit at the config set level. If this parameter is omitted, the value defaults to the environment variable.
With overcommitment enabled, more than one container might share one APQN. This exposes no security weakness, but might result in lower performance for the crypto operations within each container.
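The arithmetic behind the overcommit limit can be sketched as follows. This is an illustration only, not plug-in code: the function and parameter names are hypothetical, and rejecting values below 1 is an assumption of the sketch.

```python
from typing import Optional


def effective_overcommit(env_limit: int = 1,
                         configset_overcommit: Optional[int] = None) -> int:
    """Pick the overcommit limit for one config set.

    The config set value wins if it is given; otherwise the value of the
    environment variable (default 1) applies.
    """
    limit = configset_overcommit if configset_overcommit is not None else env_limit
    if limit < 1:
        # Assumption of this sketch: a limit below 1 makes no sense.
        raise ValueError("overcommit limit must be at least 1")
    return limit


def plugin_device_count(apqn_count: int,
                        env_limit: int = 1,
                        configset_overcommit: Optional[int] = None) -> int:
    """Number of plug-in devices announced for a config set."""
    return apqn_count * effective_overcommit(env_limit, configset_overcommit)


# Three APQNs with an overcommit value of 10 give 30 plug-in devices.
print(plugin_device_count(3, env_limit=10))
```

With the default limit of 1, the same three APQNs yield three plug-in devices, one per APQN.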
The device node z90crypt
On a compute node, the device node /dev/z90crypt offers access to all zcrypt
devices known to the compute node running as a KVM guest. An application in a container that requests
a CEX resource also sees and uses the device node /dev/z90crypt. However, what
is visible inside the container is in fact a newly constructed z90crypt device with access limited
to the assigned APQN.
On the compute node, these constructed z90crypt devices are visible in the /dev
directory as device nodes zcrypt-apqn-<card>-<domain>-<overcommitnr>. With
the start of the container the associated device node on the compute node is mapped to the
/dev/z90crypt device inside the container.
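When inspecting /dev on a compute node, it can help to group the constructed device nodes by APQN. The sketch below parses the naming scheme described above. It assumes that card, domain, and overcommit number appear as decimal numbers in the node name, which the text does not state; the type and function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class ConstructedNode(NamedTuple):
    card: int
    domain: int
    overcommit_nr: int


# Assumption: all three fields are plain decimal numbers.
_NODE_RE = re.compile(r"^zcrypt-apqn-(\d+)-(\d+)-(\d+)$")


def parse_node_name(name: str) -> Optional[ConstructedNode]:
    """Parse a constructed device node name, or return None if it does not match."""
    match = _NODE_RE.match(name)
    if match is None:
        return None
    return ConstructedNode(*(int(group) for group in match.groups()))


def group_by_apqn(names: Iterable[str]) -> Dict[Tuple[int, int], List[int]]:
    """Map (card, domain) to the overcommit numbers currently in use."""
    result: Dict[Tuple[int, int], List[int]] = {}
    for name in names:
        node = parse_node_name(name)
        if node is not None:
            result.setdefault((node.card, node.domain), []).append(node.overcommit_nr)
    return result
```

An APQN that maps to more than one overcommit number in the result is currently shared by several containers.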
These constructed z90crypt devices are created dynamically with the CEX allocation request that is triggered with the container start and deleted automatically when the container terminates.
With version 1 of the CEX device plug-in, the constructed zcrypt device nodes limit access to exactly one APQN (adapter, usage domain, no control domain), allowing all ioctls.
Only the assigned APQN is accessible through the /dev/z90crypt device that is visible inside the container, even with
overcommitted plug-in devices.
The shadow sysfs
The CEX device plug-in manipulates the AP part of the sysfs that a container can explore. The
sysfs tree within a container contains two directories that are related to the AP/zcrypt
functionality: /sys/bus/ap and /sys/devices/ap.
Tools working with zcrypt devices, like lszcrypt or ivp.e, need
to see the restricted world, which is accessible via the /dev/z90crypt device node
within the container.
The CEX device plug-in creates a shadow sysfs directory tree for each of
these paths on the compute node at /var/tmp/shadowsysfs/<plug-in-device>. With
the start of the container, both directories /sys/bus/ap and
/sys/devices/ap are overlaid (overmounted) with the corresponding shadow directory
on the compute node.
These shadow directory trees consist of simple static files that are created from the original sysfs
entries on the compute node. They lose their sysfs functionality and show a static view of a limited
AP/zcrypt world. For example, /sys/bus/ap/ap_adapter_mask is a 256-bit field
listing all available adapters (crypto cards). The manipulated file that appears inside the
container shows only the adapter that belongs to the assigned APQN. All load and counter values in
the corresponding sysfs attributes, for example
/sys/devices/ap/card<xx>/<xx>.<yyyy>/request_count, are shown as 0 and are not
updated while a crypto load is running.
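The effect on a mask file such as ap_adapter_mask can be illustrated with a short sketch. It assumes that the file content is a hexadecimal string with a 0x prefix and 64 digits, and that the leftmost bit stands for adapter 0. Neither detail is stated in this text, and the function names are hypothetical.

```python
from typing import Iterable, List

MASK_BITS = 256  # the mask is a 256-bit field


def adapters_in_mask(mask_hex: str) -> List[int]:
    """List the adapter numbers whose bit is set in the mask.

    Assumption: bit 0 (adapter 0) is the leftmost, most significant bit.
    """
    value = int(mask_hex.strip(), 16)
    return [n for n in range(MASK_BITS) if value & (1 << (MASK_BITS - 1 - n))]


def mask_for_adapters(adapters: Iterable[int]) -> str:
    """Build a mask string in which only the given adapters are set."""
    value = 0
    for n in adapters:
        if not 0 <= n < MASK_BITS:
            raise ValueError(f"adapter number out of range: {n}")
        value |= 1 << (MASK_BITS - 1 - n)
    return "0x" + format(value, "064x")


# A node-wide mask with adapters 3, 4, and 10, reduced to the adapter of
# the assigned APQN (adapter 4 in this made-up example).
node_mask = mask_for_adapters([3, 4, 10])
container_mask = mask_for_adapters([4])
print(adapters_in_mask(node_mask), adapters_in_mask(container_mask))
```

The restricted mask is what a tool inside the container would read, so its discovery logic finds only the one assigned adapter.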
This restricted sysfs within a container should be sufficient to satisfy the discovery tasks of
most applications (lszcrypt, ivp.e, opencryptoki with CCA or EP11
token) but has limits. For example, chzcrypt fails to change sysfs attributes,
switching a queue offline does not work, and applications that inspect counter values might get
confused.
An administrator who is logged in to a Kubernetes compute node can determine which CEX resource
is assigned to which requesting container, for example, by reading the log messages from the
plug-ins. Without overcommitment, the counters of an APQN on the compute node reflect the crypto
load of the associated container, and lszcrypt can be used to inspect them.
Hot plug and hot unplug of APQNs
The CEX device plug-in monitors the APQNs available on the compute node, by default every 30 seconds. This monitoring covers the existence of APQNs and their online state. When the compute node runs as a KVM guest, it is possible to modify the devices section of the guest's XML definition at the KVM host while the guest is running, which results in APQNs appearing or disappearing. The AP bus and the zcrypt device driver inside the Linux system recognize this as hot plug or hot unplug of crypto cards and domains.
It is also possible to directly change the online state of a card or APQN within a compute node. For example, an APQN might be available but intentionally switched offline by a system administrator.
A dialog on the HMC offers the possibility to configure off and configure on CEX cards that are assigned to an LPAR. A CEX card in config off state is still visible in the LPAR, and thus in the compute node, but, similar to the offline state, is no longer usable.
All this might cause the CEX device plug-in to deal with varying CEX resources. The plug-in code can handle hot plug, hot unplug, and online state changes of CEX resources, and reports changes in the config set to the Kubernetes system. Because of this handling, the CEX config sets can include APQNs that do not yet exist when the CEX configuration map is first deployed. When the card is later hot plugged and assigned to the running LPAR, the cluster detects this and makes the appearing APQNs, which are already members of a config set, available for allocation requests.
The handling of the online state is done by reporting the relevant plug-in devices as healthy (online) or unhealthy (offline). An unhealthy plug-in device is not considered when a CEX resource allocation takes place.
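The interplay of config set membership, hot plug, and online state can be sketched as follows. This is a simplified model, not the plug-in's implementation: the names are hypothetical, an APQN is represented as a (card, domain) pair, and the sketch assumes that APQNs that are not present on the node are simply not reported.

```python
from typing import Dict, Iterable, List, Tuple

Apqn = Tuple[int, int]  # (card, domain)


def health_report(config_set: Iterable[Apqn],
                  scanned: Dict[Apqn, bool]) -> Dict[Apqn, str]:
    """Report config set members found on the node as healthy or unhealthy.

    scanned maps each APQN that currently exists on the node to its online
    state. Config set members that do not exist (yet) are left out.
    """
    report: Dict[Apqn, str] = {}
    for apqn in config_set:
        if apqn not in scanned:
            continue  # not hot plugged yet; nothing to report
        report[apqn] = "healthy" if scanned[apqn] else "unhealthy"
    return report


def allocatable(report: Dict[Apqn, str]) -> List[Apqn]:
    """Only healthy plug-in devices are considered for allocation."""
    return sorted(apqn for apqn, state in report.items() if state == "healthy")


config_set = [(4, 51), (5, 51), (6, 51)]
# First scan: card 6 is not plugged yet, card 5 is offline.
before = health_report(config_set, {(4, 51): True, (5, 51): False})
# Later scan: card 6 was hot plugged and card 5 is online again.
after = health_report(config_set, {(4, 51): True, (5, 51): True, (6, 51): True})
print(allocatable(before), allocatable(after))
```

Because the config set already lists the APQN on card 6, no ConfigMap change is needed for it to become allocatable after the hot plug.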
SELinux and the Init Container
The CEX device plug-in prepares various files and directories that are mounted into the pod at an allocation request. Among those mounts are the directories described under The shadow sysfs. These folders are generated on the compute node and mounted into the new pod. Sometimes, special actions are needed for such a mount to be accessible inside the newly created pod. For example, with SELinux, the folder or one of its parent folders must have the appropriate SELinux label. Other security mechanisms might have different requirements.
Because the security mechanisms and their configuration depend on the cluster instance, the CEX
device plug-in does not provide any support for such mechanisms. Instead, in the SELinux case, an
Init Container can be used to set the correct label on the shadow sysfs root folder
/var/tmp/shadowsysfs that contains all the subfolders that are mapped into pods.
See Sample CEX device plug-in daemonset yaml
for an example of a daemonset deployment of the CEX device plug-in that contains an init container
to set up /var/tmp/shadowsysfs for use in a SELinux-enabled environment.