Why and how we built an admission controller to make node drains safer when running stateful applications in Kubernetes.
Running stateful applications in Kubernetes is increasingly common, and these applications are often managed using custom resources and operators. However, the dynamic nature of Kubernetes means pods, especially those in stateful workloads, aren’t guaranteed to be long-lived. Events like maintenance or resource pressure can trigger pod evictions, which risk disrupting services if not handled carefully.
The Eviction Reschedule Hook is an open source project that aims to address this issue by using Kubernetes admission controllers to intercept and reject eviction requests for operator-managed pods, while at the same time notifying the operator that a pod needs to be moved. Its goal is to help preserve service availability and reduce disruption during events like node maintenance.
Read on as I cover:
- The problem we’re tackling
- Why the existing options aren’t enough
- An introduction to the Eviction Reschedule Hook project
Some familiarity with Kubernetes is assumed. The Couchbase Autonomous Operator will be referred to throughout as the Operator for brevity, and most of the examples and demos will involve Couchbase resources of some sort.
Though originally developed with the Couchbase data platform in mind, the reschedule hook project is intended to be flexible and can be configured to protect other operator-managed stateful applications.
Part 1 – Why evictions are challenging for operator-managed stateful applications
Understanding pod evictions
In Kubernetes, evicting a pod involves triggering the pod’s PreStop hook, then sending a SIGTERM after its completion. If the pod hasn’t exited after a grace period, this is followed up by a SIGKILL. Evictions happen both voluntarily and involuntarily for a multitude of reasons, from node pressure to autoscaling, but this project focuses on voluntary evictions triggered when draining nodes. Draining is the process whereby pods are removed from a node using the Kubernetes Eviction API, and is often required to clear the way for operations like maintenance or upgrades.
Running kubectl drain is a common way to prepare a node for maintenance in Kubernetes. It marks the node as unschedulable and sends eviction requests concurrently for all the pods running on that node.
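For reference, a typical drain command looks something like the following; the node name and flags are illustrative and should be adjusted for your environment:

```sh
# Cordon the node and request eviction of its pods ahead of maintenance.
# The flags shown are common choices, not requirements of anything described in this post.
kubectl drain worker-1 --ignore-daemonsets --delete-emptydir-data --timeout=10m
```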
Why stateful applications are challenging
Stateful workloads add complexity to eviction handling:
- Unpredictable Shutdown Times: These workloads may involve long-running processes like in-flight queries, which need to finish before a pod can be safely terminated
- Application Coordination: The application itself often needs to be notified before pod removal to ensure data consistency and proper rebalancing
- Operator Management: Operators managing the application need time to coordinate pod removal with the operator’s internal state
- Volume Movement: The migration of Kubernetes volumes from one node to another can take considerable time that exceeds the pod’s terminationGracePeriodSeconds
An Operator automates the lifecycle of complex applications by continuously reconciling their actual state with the desired state defined in custom resources.
Take Couchbase as an example. A Couchbase cluster is composed of multiple pods, each running an instance of Couchbase Server, typically spread across different nodes and availability zones. The Operator ensures the cluster state matches the state defined in a CouchbaseCluster resource. When a pod needs to be removed, the Operator must notify the cluster to rebalance data and handle any running processes gracefully.
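For context, here is a heavily trimmed sketch of the kind of CouchbaseCluster resource the Operator reconciles; the names, image tag, and server configuration are illustrative and not a complete production spec:

```yaml
# Illustrative only: a minimal CouchbaseCluster definition of the kind the Operator reconciles.
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-cluster
spec:
  image: couchbase/server:7.6.2   # example image tag
  security:
    adminSecret: cb-cluster-auth  # Secret holding the admin credentials
  servers:
    - name: default
      size: 3                     # three Couchbase Server pods, typically spread across nodes
      services:
        - data
        - index
        - query
```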
Organizational challenges
Another common challenge is the separation of responsibilities in larger organizations:
- Kubernetes administrators manage infrastructure
- Application teams manage stateful applications running on that infrastructure
This means Kubernetes administrators need ways to safely drain nodes for maintenance without requiring constant coordination with application owners or impacting the availability or health of the underlying application.
Existing Kubernetes protections – why they’re not enough
Kubernetes provides tools like PreStop hooks and Pod Disruption Budgets to help with graceful shutdowns and application availability during evictions.
PreStop Hooks run scripts inside the container before the SIGTERM signal, giving the pod a chance to gracefully shut down. For stateless applications, this is often enough. But for stateful apps, Operators often need to coordinate pod removal before termination to avoid the issues outlined above, which is tricky to do inside preStop hooks.
One ingenious approach we’ve seen involves adding a preStop script to the pod templates used by the Operator. This script has the pod add the reschedule annotation (discussed later) to itself in order to notify the Operator it should be moved, then loop until it has been safely ejected from the cluster.
A basic outline of the script (not exact):
```yaml
preStop:
  exec:
    command:
      - /bin/sh
      - -c
      - |
        /mnt/kubectl annotate pod $SELF_POD_NAME cao.couchbase.com/reschedule=true
        while /opt/couchbase/bin/couchbase-cli server-list -c couchbases://localhost
        do
          sleep 1;
        done
```
However, this requires mounting kubectl and couchbase-cli into each pod, increasing complexity and image size. Networking or other issues could also interrupt the script and result in the pod being prematurely terminated.
Pod Disruption Budgets (PDBs) ensure a minimum number of pods remain available during voluntary disruptions. But they only limit how many pods can be evicted simultaneously, meaning we’re still dealing with the scenario where an operator isn’t able to gracefully handle pod removal.
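For reference, a PDB that permits only one pod of a hypothetical Couchbase cluster to be disrupted at a time might look like the sketch below; the label selector and names are illustrative rather than the exact PDB the Operator generates:

```yaml
# Illustrative PDB: at most one matching pod may be evicted voluntarily at any time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cb-cluster-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      couchbase_cluster: cb-cluster   # example label; match whatever labels your pods carry
```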
What happens if pods are evicted today?
The Operator creates PDBs for Couchbase to limit how many pods can be evicted at once. While this ensures a minimal level of availability for the application, pods that do get evicted will still be unsafely removed from the Couchbase cluster by means of failover.
We can also see how the Operator handles this by watching what happens in the Couchbase administrator UI.
Even when evictions happen one pod at a time, once a pod restarts on a new node, Kubernetes may allow evictions on the next pod before the new pod has been safely added to the stateful application. At Couchbase, we prevent this with a readiness gate on the pods.
To help with these issues, we introduced the cao.couchbase.com/reschedule annotation in CAO v2.7.0. When added to a pod, it tells the Operator that the pod should be recreated. However, manually cordoning nodes and adding this annotation is tedious, doesn’t scale well, and only affects Couchbase pods; other pods still require normal draining.
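The manual workflow that the annotation enables looks roughly like this, repeated for every Couchbase pod on the node; the node, pod, and namespace names are illustrative:

```sh
# Mark the node unschedulable so replacement pods land elsewhere.
kubectl cordon worker-1

# Ask the Operator to safely recreate this pod during its next reconciliation.
kubectl annotate pod cb-cluster-0002 cao.couchbase.com/reschedule=true --namespace couchbase
```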
Why not handle this entirely inside the Operator?
We considered embedding eviction validation logic inside the Operator itself, but:
- It’s an anti-pattern in Kubernetes operator design. Operators work via reconciliation loops, not admission controllers
- Adding admission control endpoints to operators complicates deployment and maintenance
- Validating webhooks are cluster scoped resources and therefore require cluster scoped permissions. The Operator is namespace scoped and would require a large expansion of its required permissions
- Building this logic into its own project allows open sourcing and invites community contributions for similar use cases
Part 2 – Leveraging Kubernetes webhooks for intelligent pod evictions
The Eviction Reschedule Hook is an open source Kubernetes admission controller designed to improve how evictions are handled during node drains to avoid the problems outlined in part 1. Working in tandem with an Operator, it automates pod rescheduling by intercepting eviction requests before they reach the pod. These requests are selectively rejected in a way that still allows the standard kubectl drain command to function as expected.
Instead of immediately terminating pods or testing Pod Disruption Budget (PDB) limits, the admission controller notifies the Operator that a pod needs to be moved using the cao.couchbase.com/reschedule annotation.
Understanding admission control
Admission control is a key stage in the Kubernetes API request lifecycle that occurs after a request has been authenticated and authorized, but before it’s persisted to the cluster.
Admission controllers can validate or mutate these requests based on custom logic.
When a node is drained in Kubernetes, each attempt at a pod eviction triggers a CREATE request for the pods/eviction subresource. This is where our admission controller comes into play. It intercepts these eviction requests before they reach the pods and determines whether they should be allowed. In our case, the admission logic involves signalling to the Operator that a pod should be rescheduled safely by adding the reschedule annotation.
We’re using a validating webhook rather than a mutating webhook because our primary goal is to evaluate whether the eviction request is allowed. While we may annotate the pod as part of the decision process, the webhook itself is performing validation, not mutation, on the eviction request.
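As a rough sketch, a ValidatingWebhookConfiguration that forwards these eviction requests to the webhook service looks something like this; the names, namespace, path, and failure policy are illustrative rather than the project’s exact manifest:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: eviction-reschedule-hook            # illustrative name
webhooks:
  - name: eviction.reschedule.hook.example.com   # illustrative; must be a fully qualified name
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun                # the hook annotates pods, so side effects are skipped on dry runs
    failurePolicy: Ignore                    # illustrative; fall back to normal evictions if the hook is down
    rules:
      - apiGroups: [""]                      # pods/eviction lives in the core API group
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods/eviction"]
    clientConfig:
      # caBundle omitted from this sketch; the API server needs it to trust the webhook's TLS certificate.
      service:
        name: eviction-reschedule-hook
        namespace: eviction-reschedule-hook
        path: /validate
```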
Handling node drains
As the title of this article makes clear, the primary goal of this project is to enable graceful handling of node drains. With the Eviction Reschedule Hook in place, the standard node drain flow is modified to support safer pod migration.
Eviction requests are intercepted before the PreStop hook is executed. Instead of the pod being terminated, the eviction request is rejected and the pod annotated. The Operator checks for the reschedule annotation during each reconciliation loop. If the annotation appears, it will handle safely recreating the pod.
Draining a node also taints it with node.kubernetes.io/unschedulable, so when the Operator creates the replacement pod it will be scheduled onto another node, assuming the Kubernetes cluster has one that matches the pod’s scheduling requirements.
Importantly, the admission controller is designed to selectively filter only relevant pods. All other workloads on the node continue to follow the default eviction process.
Working with Kubectl
To integrate smoothly with kubectl, it’s important to understand what happens under the hood when you run kubectl drain <node>. Looking at the drain.go source, after the node is cordoned, a separate goroutine is spawned for each pod on that node to handle its eviction.
Whether kubectl continues attempting to evict a pod depends on the response from the admission controller. If the eviction request succeeds (i.e., no error is returned), the loop in which eviction requests are sent is exited. There is then a follow-up check where kubectl waits for the pod’s deletion. However, for Operator managed pods, we want kubectl to keep retrying until they have been safely rescheduled, at which point we should end the goroutine before any further checks.
By design, kubectl treats a 429 TooManyRequests response to the eviction request in the goroutine as a signal to pause for 5 seconds before retrying. We can leverage this behavior in our admission controller: after the pod selection logic, we return a 429 status if the pod is either being annotated for rescheduling or is already annotated and waiting to be moved.
After the pod has been successfully rescheduled, it will have a different name and therefore the admission controller will no longer be able to locate it by name and namespace. At that point, we can return a 404 NotFound response to exit the goroutine.
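A minimal sketch of those two denial responses, assuming they are returned through the standard AdmissionReview response status; the messages and exact wording are illustrative:

```yaml
# While the pod is annotated and waiting to be rescheduled: deny with 429 so kubectl retries.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
response:
  uid: "<uid of the incoming AdmissionReview request>"
  allowed: false
  status:
    code: 429
    reason: TooManyRequests
    message: Pod is waiting to be rescheduled by the Operator
---
# Once no pod exists under that name: deny with 404 so the kubectl drain goroutine exits.
apiVersion: admission.k8s.io/v1
kind: AdmissionReview
response:
  uid: "<uid of the incoming AdmissionReview request>"
  allowed: false
  status:
    code: 404
    reason: NotFound
    message: Pod has been rescheduled
```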
In-place pod rescheduling
In some cases, the Operator may delete and recreate a pod using the same name.
At Couchbase, the upgradeProcess field on a CouchbaseCluster can be changed to control how the Operator replaces pods. This defaults to SwapRebalance, which tells the Operator to create a new pod with a different name, rebalance it into the cluster, and then delete the old pod. InPlaceUpgrade is a faster but less flexible alternative, whereby the Operator will delete the pod and recreate it with the same name, reusing the existing volumes. This creates a challenge for the admission controller.
The eviction requests sent by the goroutine in kubectl include only the pod’s name and namespace. Because of this limited context, the admission controller can’t assume that a pod lacking a reschedule annotation needs to be moved – it may simply be a newly recreated instance of a pod that was already rescheduled.
To handle this, the admission controller maintains a lightweight tracking mechanism which is used when pods will be rescheduled with the same name. When a pod is annotated for rescheduling, the webhook also annotates another resource, known as the tracking resource, with the reschedule.hook/<podNamespace>.<podName> key.
If a subsequent eviction request is intercepted for a pod without the reschedule annotation, the presence of this annotation on the tracking resource indicates that the pod has already been rescheduled. When this occurs, the webhook cleans up the tracking annotation for that pod before returning a 404 NotFound response to exit the goroutine.
While the tracking resource type defaults to CouchbaseCluster, the webhook also supports using a pod’s Namespace. If pods will never be rescheduled with the same name, this feature can be disabled.
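As an illustration, after a pod named cb-cluster-0002 in the couchbase namespace is annotated for an in-place reschedule, the tracking resource would carry an annotation along these lines; the resource name, pod name, and annotation value shown here are illustrative:

```yaml
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-cluster
  annotations:
    # Key follows reschedule.hook/<podNamespace>.<podName>; the value shown is illustrative.
    reschedule.hook/couchbase.cb-cluster-0002: "true"
```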
Running the Eviction Reschedule Hook
Deploying the Eviction Reschedule Hook and its supporting components is straightforward. The README provides detailed instructions for building the image and deploying the full stack.
The main components that make up the reschedule hook are:
- ValidatingWebhookConfiguration: Tells the Kubernetes API server to forward CREATE requests for the pods/eviction subresource to our webhook service
- Service: Routes incoming webhook requests to the admission controller pod
- Admission Controller: Runs the reschedule hook inside a container, typically deployed as a Kubernetes Deployment
- ServiceAccount, ClusterRole, and ClusterRoleBinding: These grant the admission controller the necessary permissions. At minimum, get and patch permissions for pods are needed. If using a tracking resource, get and patch permissions for the tracking resource should also be added (see the sketch below)
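A minimal sketch of such a ClusterRole, assuming CouchbaseCluster is used as the tracking resource; the role name is illustrative:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: eviction-reschedule-hook   # illustrative name
rules:
  # Read pods and add the reschedule annotation to them.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "patch"]
  # Read and annotate the tracking resource (only needed if tracking is enabled).
  - apiGroups: ["couchbase.com"]
    resources: ["couchbaseclusters"]
    verbs: ["get", "patch"]
```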
Once deployed, draining a node on which Couchbase pods reside demonstrates how the Reschedule Hook intercepts and rejects eviction requests until the Operator safely recreates and balances pods back into the cluster.
The graceful handling of eviction requests can also be seen in the Couchbase administrator UI. Pods are no longer failed over and are instead replaced with zero downtime.
Try it out, get involved
The Eviction Reschedule Hook makes node drains safer and more predictable for stateful applications running on Kubernetes. By coordinating eviction handling with an Operator, it enables graceful rescheduling without disrupting the underlying application.
The project is open source. Check out the repository on GitHub to get started and take a look at the contributing guide if you’re interested in helping out. Feedback, issues and pull requests are all welcome!