Preserve Kubernetes API Objects While They Are in Use

By Mike Spreitzer

You can use the Kubernetes API machinery to build a reliable distributed system that has the same good properties demonstrated by the Kubernetes control plane, but doing so presents some challenges.

The Kubernetes API machinery does not support an ACID transaction involving multiple Kubernetes API objects, so you cannot simply maintain referential integrity among those objects. In other words, you cannot completely prevent the system from getting into a state where some object D refers to an object P, and object P does not exist. You can use admission plugins and registry strategies to avoid creating that state in many scenarios, but not all. However, you can do the next best thing: forbid states in which API object D has an implementation that is using object P and object P does not exist. The Kubernetes project has an example of how to do this for Pods and PVCs. That example is an application of a broadly applicable design pattern, and in this blog post, I explain that pattern.

The problem

Suppose you have two Kinds of Kubernetes API objects: Providers and Dependents. Suppose a Dependent's Spec has a truly immutable string-valued field that is intended to hold the name of a Provider object. An admission plugin, for example, could reject any attempt to create a Dependent that refers to a non-existent Provider. Another admission plugin could reject any attempt to delete a Provider that is referenced by an existing Dependent. But both of those plugins run to completion before the actual storage is modified.

It is possible to start in a state in which a Provider exists and is not referenced by any existing Dependent, and one client could request deletion of that Provider while another client concurrently requests creation of a Dependent that refers to that Provider. Both admission plugins could approve before either change is made. For native types, you can narrow the window of vulnerability by doing the checking in the registry strategies (for example, see how validation of CREATE PVC is done), but you cannot close that window completely. Figure 0 below shows what can go wrong. This drawing makes use of the ability of Kubernetes to store different Kinds of objects in different etcd clusters, to emphasize how difficult the problem can be.

[Figure 0: the race between deletion of a Provider and creation of a Dependent that references it, with the two Kinds stored in different etcd clusters]

Background: finalizers

The Kubernetes API machinery supports things called finalizers, which cause API object deletion to take an extended amount of time and require assent from controllers. Such an extended deletion process starts when a client requests deletion of an object that has one or more finalizers. In this case, the object is not immediately removed from storage. Rather, a certain piece of the object's metadata called its "deletion timestamp" (which is initially unset) is set to a time in the future. The object is not actually removed from storage until all those finalizers are removed. For each finalizer, there should be exactly one controller responsible for removing that finalizer.
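To make the mechanics concrete, here is a minimal Go sketch, using the apimachinery metav1 types, of how a controller can recognize this extended deletion state. The finalizer name example.com/provider-protection is an assumption invented for the sketches in this post.

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// providerFinalizer is a hypothetical finalizer name used throughout the
// sketches in this post.
const providerFinalizer = "example.com/provider-protection"

// underExtendedDeletion reports whether deletion of the object has been
// requested (its deletion timestamp is set) but is still being held back by
// our finalizer.
func underExtendedDeletion(meta *metav1.ObjectMeta) bool {
	if meta.DeletionTimestamp == nil {
		return false // deletion has not been requested
	}
	for _, f := range meta.Finalizers {
		if f == providerFinalizer {
			return true // removal from storage waits until this finalizer is removed
		}
	}
	return false
}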

Preventing removal of a Provider API object in use by a Dependent implementation

The technique for interlocking between Provider deletion and Dependent implementation involves the use of a finalizer on the Provider. The story starts during creation of the Provider, during which an admission plugin or registry strategy ensures that the Provider object is born with this finalizer. The finalizer is eventually removed by a controller created for this purpose; let us call it the Provider Finalization Controller.
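Here is a minimal sketch of that creation-time step, reusing the metav1 import and the providerFinalizer constant from the previous sketch; the same logic applies whether it lives in an admission plugin, a registry strategy, or a mutating webhook.

// ensureFinalizerOnCreate makes sure a Provider is born with the finalizer.
func ensureFinalizerOnCreate(meta *metav1.ObjectMeta) {
	for _, f := range meta.Finalizers {
		if f == providerFinalizer {
			return // already present
		}
	}
	meta.Finalizers = append(meta.Finalizers, providerFinalizer)
}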

Provider Finalization Controller

The Provider Finalization Controller has informers on both Provider and Dependent objects. Whenever notified about a Provider object change that is relevant to the deletion process, this controller puts a reference to the Provider object into its work queue. Whenever notified about a Dependent object change that has implementation implications, this controller enqueues a reference to the Dependent's Provider.
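The following sketch shows one way to do that wiring with client-go informers and a work queue. It treats both Kinds as cluster-scoped and handled as unstructured objects, and it assumes a Dependent names its Provider in a spec.providerName field; none of those details come from the pattern itself.

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// setUpEventHandlers wires both informers to one work queue whose items are
// Provider names.
func setUpEventHandlers(
	providerInformer, dependentInformer cache.SharedIndexInformer,
	queue workqueue.RateLimitingInterface,
) {
	providerInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueueProvider(queue, obj) },
		UpdateFunc: func(_, obj interface{}) { enqueueProvider(queue, obj) },
		DeleteFunc: func(obj interface{}) { enqueueProvider(queue, obj) },
	})
	dependentInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		// A Dependent change with implementation implications enqueues its Provider.
		AddFunc:    func(obj interface{}) { enqueueReferencedProvider(queue, obj) },
		UpdateFunc: func(_, obj interface{}) { enqueueReferencedProvider(queue, obj) },
		DeleteFunc: func(obj interface{}) { enqueueReferencedProvider(queue, obj) },
	})
}

// enqueueProvider adds the Provider's own name to the queue.
func enqueueProvider(queue workqueue.RateLimitingInterface, obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		queue.Add(key)
	}
}

// enqueueReferencedProvider adds the name of the Provider that a Dependent references.
func enqueueReferencedProvider(queue workqueue.RateLimitingInterface, obj interface{}) {
	if name := providerNameOf(obj); name != "" {
		queue.Add(name)
	}
}

// providerNameOf extracts the referenced Provider's name from a Dependent,
// handling the tombstones that informers deliver for deletions.
func providerNameOf(obj interface{}) string {
	if tomb, ok := obj.(cache.DeletedFinalStateUnknown); ok {
		obj = tomb.Obj
	}
	u, ok := obj.(*unstructured.Unstructured)
	if !ok {
		return ""
	}
	name, _, _ := unstructured.NestedString(u.Object, "spec", "providerName")
	return name
}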

The Provider Finalization Controller has worker threads that dequeue references to Provider objects and work on them. If a worker finds an existing Provider object that has its deletion timestamp set and the finalizer present, then the worker proceeds to remove the finalizer if appropriate. To determine whether finalizer removal is appropriate, the worker first queries an apiserver (and thus, indirectly, the etcd cluster) for Dependents. The worker is looking for Dependent objects that reference the Provider at hand and whose implementation might now depend on the Provider. You can improve efficiency by increasing the amount of Dependent filtering done by the apiserver (e.g., using a field selector); the worker has to do the remainder of the filtering. If, and only if, the worker finds zero such Dependents, the worker requests removal of the finalizer. If any of the worker's requests fail, the worker requeues the Provider reference for later retry.
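Here is a hedged sketch of one worker iteration, using the dynamic client and the same assumptions as above (cluster-scoped Kinds, a spec.providerName field, and a field selector on it; if such a selector is not available, the worker must list and filter client-side). The group/version/resource coordinates are made up for illustration.

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// Hypothetical group/version/resource coordinates for the two Kinds.
var (
	providerGVR  = schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "providers"}
	dependentGVR = schema.GroupVersionResource{Group: "example.com", Version: "v1", Resource: "dependents"}
)

// reconcileProvider is one iteration of a Provider Finalization Controller worker.
func reconcileProvider(ctx context.Context, dyn dynamic.Interface, name string) error {
	p, err := dyn.Resource(providerGVR).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil // already gone; nothing to do
	}
	if err != nil {
		return err // transient error; requeue for retry
	}
	if p.GetDeletionTimestamp() == nil {
		return nil // deletion has not been requested
	}
	ours := false
	for _, f := range p.GetFinalizers() {
		ours = ours || f == providerFinalizer
	}
	if !ours {
		return nil // our finalizer is already gone
	}

	// Query the apiserver (not the local cache) for Dependents that reference
	// this Provider. Any further filtering for "implementation might now
	// depend on the Provider" is elided here.
	list, err := dyn.Resource(dependentGVR).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.providerName=" + name, // assumed to be supported
	})
	if err != nil {
		return err
	}
	if len(list.Items) > 0 {
		return nil // keep the finalizer; later Dependent events will requeue this Provider
	}

	// Zero relevant Dependents: remove the finalizer so the deletion can finish.
	var remaining []string
	for _, f := range p.GetFinalizers() {
		if f != providerFinalizer {
			remaining = append(remaining, f)
		}
	}
	p.SetFinalizers(remaining)
	_, err = dyn.Resource(providerGVR).Update(ctx, p, metav1.UpdateOptions{})
	return err
}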

Figure 1 shows a scenario in which the finalizer is kept because that query finds a Dependent object. This drawing omits the apiservers for simplicity, focusing on the controllers and the etcd clusters.

[Figure 1: the finalizer is kept because the query finds a Dependent; apiservers omitted, showing only the controllers and the etcd clusters]

Dependent implementation controller

A controller implementing Dependent objects has to query an apiserver (and thus, the etcd cluster) for Providers before proceeding to implement a Dependent's dependency on its Provider. If the Provider is not found, then the controller cannot proceed to implement that dependency. If the Provider is found but its deletion timestamp is set, the controller likewise cannot proceed. Only if the Provider is found and its deletion timestamp is unset can the controller implement the dependency.
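A sketch of that pre-implementation check, reusing the dynamic-client setup and the providerGVR coordinates from the sketch above:

// canImplementDependency reports whether a Dependent controller may proceed to
// implement a dependency on the named Provider.
func canImplementDependency(ctx context.Context, dyn dynamic.Interface, providerName string) (bool, error) {
	p, err := dyn.Resource(providerGVR).Get(ctx, providerName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return false, nil // Provider absent: do not implement
	}
	if err != nil {
		return false, err // transient error: retry later
	}
	if p.GetDeletionTimestamp() != nil {
		return false, nil // Provider in deletion: do not implement
	}
	return true, nil // Provider exists and is not being deleted
}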

Figure 2 shows a scenario in which the finalizer is removed and the implementation never started.

[Figure 2: the finalizer is removed and the implementation never started]

This is conservative

The Provider object is kept around as long as there are Dependent objects whose implementation might be using that Provider. This includes some cases where the implementation eschews using the Provider because it is absent or being deleted. Figure 3 shows a scenario in which the finalizer is kept and the implementation is avoided.

[Figure 3: the finalizer is kept and the implementation is avoided]

In this scenario, the Dependent is permanently broken. Ideally, the Status of Dependent objects can represent this fact in a way that is easy for clients to identify, and clients will eventually delete permanently broken Dependents. Once the Dependent is gone, the deletion of the Provider can proceed.

Correctness

The correctness of this design relies on an important property of an etcd cluster—the changes that it makes are totally ordered. In other words, they occur in a strict sequence. (The story is more complicated if clients do non-quorum reads, but Kubernetes apiservers normally do only quorum reads.)

In order for the finalizer to be removed, the Provider Finalization Controller has to find zero Dependents whose implementation might be using the Provider in the results of a query to the Dependent etcd cluster. A Dependent object could have a relevant state change after that query, in the etcd cluster's ordering. The controller implementing that Dependent will necessarily learn of this change causally after that change. This controller's query to the Provider etcd cluster causally follows the Dependent's relevant state change, which causally follows the Provider Finalization Controller's query, which causally follows the initiation of the Provider's deletion. Thus, the query to the Provider etcd cluster is necessarily answered based on a state that is later than the initiation of deletion of the Provider. The reply to that query will necessarily show that the Provider is absent or in deletion, and in either case, the Dependent's dependency on that Provider will not be implemented.

Runtime costs

This pattern involves two kinds of queries to apiservers. One is made by controllers implementing Dependents. These queries cost O(1) time and space and may be expected to happen O(1) times per Dependent object. The latter factor is a matter of how often a controller worker thread works on starting the implementation of a given Dependent object; more precisely, how often such a worker thread gets to the point of querying for the Provider before recording the result in a way that obviates repeat queries. The details of this are specific to your particular object types.

The other apiserver query is a LIST operation performed whenever the Provider Finalization Controller is considering removing the finalizer from a Provider. For a Provider with N Dependents, such a LIST operation can take O(N) or more time and space on the client (depending on how selective the query is), more on the apiserver and etcd cluster, and can happen O(N) times (e.g., once per Dependent, as each is deleted). If your N can get large, this can get quite expensive.

An incorrect optimization

When the Provider Finalization Controller is considering removing the finalizer from a given Provider, the controller could look in its local cache, instead of querying an apiserver, for Dependents whose implementation may be using the Provider. Recall that an informer is an object of type SharedIndexInformer. These support custom local indices. The controller can index Dependents on the Provider (if any) that they may be using. That would make it quick and easy to search the local cache for relevant Dependents.
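A sketch of such an index, assuming (as above) that Dependents are handled as unstructured objects and name their Provider in spec.providerName. The index itself is harmless and is reused by the correct optimization below; what goes wrong, as explained next, is trusting a negative answer from it.

// byProviderIndex is the name of a custom index over Dependents, keyed by the
// name of the Provider each Dependent references.
const byProviderIndex = "byProvider"

func addProviderIndex(dependentInformer cache.SharedIndexInformer) error {
	return dependentInformer.AddIndexers(cache.Indexers{
		byProviderIndex: func(obj interface{}) ([]string, error) {
			// providerNameOf is the helper from the event-handler sketch above.
			if name := providerNameOf(obj); name != "" {
				return []string{name}, nil
			}
			return nil, nil
		},
	})
}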

This optimization can lead to premature removal of the finalizer, and thus, premature removal of the Provider object. To see why, look again at Figure 1. At the point where the Provider Finalization Controller queries the apiserver for Dependents, is it guaranteed to have already been notified of the creation of Dependent D1? No. It would also be unsafe to consult any other cache. Correctness hinges on the fact that changes to the source of truth are totally ordered and the relevant controllers cannot miss an impactful change.

A correct optimization

When the Provider Finalization Controller is considering removing the finalizer from a given Provider, the controller can first look in its local cache for relevant Dependents and proceed to query an apiserver only if no relevant Dependents are found locally. I showed above why the controller cannot rely on a negative answer from its local cache, but a positive answer is safe to use.

It is safe to take this early out because when a Dependent appears in the local cache to be in a state where its implementation may be using the Provider, it is guaranteed that the Dependent informer will deliver a relevant notification subsequently if that Dependent ever does exit that state. Sometime after that notification, a worker thread will reconsider that Provider and sense a sufficiently up-to-date Dependent state.
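A sketch of the safe early-out, combining the index above with the live LIST from the worker sketch: a positive local answer is used as-is, while a negative one is confirmed against the apiserver.

// relevantDependentsExist reports whether the Provider still has Dependents
// whose implementation might be using it.
func relevantDependentsExist(ctx context.Context, dyn dynamic.Interface,
	dependentInformer cache.SharedIndexInformer, providerName string) (bool, error) {

	cached, err := dependentInformer.GetIndexer().ByIndex(byProviderIndex, providerName)
	if err == nil && len(cached) > 0 {
		return true, nil // positive local answer: keep the finalizer, no apiserver call
	}

	// A negative (or failed) local answer is not trustworthy; ask the apiserver.
	list, err := dyn.Resource(dependentGVR).List(ctx, metav1.ListOptions{
		FieldSelector: "spec.providerName=" + providerName, // assumed to be supported
	})
	if err != nil {
		return false, err
	}
	return len(list.Items) > 0, nil
}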

This optimization can greatly reduce the number of apiserver queries for relevant Dependents of a given Provider. Starting from a state in which there are N such Dependents and proceeding through deletion of all of them, this optimization changes the number of apiserver queries from O(N) to O(1).

Another correct optimization

Consider a controller implementing Dependent D1 that refers to Provider P1. When that controller considers starting to implement this dependency, the baseline design requires the controller to query an apiserver for the state of P1. Suppose that, at that point in time, that controller is currently holding a successful implementation of another Dependent D2 that also refers to P1. In this case, it is not necessary for the controller to make a fresh apiserver query. The controller is guaranteed that the finalizer is on P1 and will not be removed until causally after D1 reaches a state where its implementation is certainly not using P1.

Note a subtlety of this optimization. It breaks a simplicity property of the controller pattern, which is mutual exclusion of work on a given object. Normally an object is processed only by a worker thread that has dequeued a reference to that object, and the work queue ensures that there is at most one such worker at any given time. With this optimization, a worker that has dequeued a reference to D1 also does some processing of D2. Doing that correctly requires adding a layer of synchronization.
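One way to provide that synchronization is sketched below: an in-memory record, shared by the worker threads, of which Dependents currently hold a successful implementation against each Provider. The structure and its method names are assumptions, not part of any library.

import "sync"

// implementedIndex records, per Provider, which Dependents currently hold a
// successful implementation that uses it.
type implementedIndex struct {
	mu         sync.Mutex
	byProvider map[string]map[string]bool // Provider name -> set of Dependent names
}

// MarkImplemented records that dependent now holds an implementation using provider.
func (ix *implementedIndex) MarkImplemented(provider, dependent string) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	if ix.byProvider == nil {
		ix.byProvider = map[string]map[string]bool{}
	}
	if ix.byProvider[provider] == nil {
		ix.byProvider[provider] = map[string]bool{}
	}
	ix.byProvider[provider][dependent] = true
}

// MarkReleased records that dependent no longer uses provider; call this before
// the implementation stops protecting the Provider.
func (ix *implementedIndex) MarkReleased(provider, dependent string) {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	delete(ix.byProvider[provider], dependent)
}

// SomeoneElseHolds reports whether some Dependent other than self currently
// holds an implementation using provider. When it returns true, the worker for
// self can skip the fresh GET of the Provider.
func (ix *implementedIndex) SomeoneElseHolds(provider, self string) bool {
	ix.mu.Lock()
	defer ix.mu.Unlock()
	for d := range ix.byProvider[provider] {
		if d != self {
			return true
		}
	}
	return false
}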

This optimization can reduce apiserver traffic, although likely not as much as the previous optimization. When a controller implements a series of K Dependents of one Provider, this optimization reduces the number of apiserver GETs of that Provider from O(K) to O(1).

Generalizing

The pattern above involves a relation from Dependent to Provider that is many-to-one. Generalizing to a many-to-many relation requires no deep conceptual change, but it may involve changing the way the Provider Finalization Controller searches for relevant Dependents. In the simple setting above, a field selector will often be very selective of Dependents. When a Dependent can depend on an arbitrary number of Providers, there is no field selector that can test this relationship. Given the design rules for Kubernetes API objects, such a relationship will be stated in a list/slice/array, and there is no field selector that can match if any member of a list/slice/array satisfies the necessary predicate.

If a Dependent can depend on only a fairly limited number of Providers, this may be a good motivation for allowing field selectors against computed fields. That is, in the server-side mechanism for implementing field selectors against a given Dependent, an additional field is computed for each dependency on a Provider. The field's pathname includes the Provider's name. The Provider Finalization Controller can then compose a LIST query that selects on the computed field that uses the Provider's name.

If a Dependent can depend on a large number of Providers, then this is a motivation for a more powerful query mechanism in apiservers.

If you are interested in starting to build with Kubernetes, check out the IBM Cloud Kubernetes Service.
