Kubernetes v1.36: Staleness Mitigation and Observability for Controllers
Source: Kubernetes Blog
Staleness in Kubernetes Controllers
Staleness is a problem that affects many Kubernetes controllers and can change their behavior in subtle ways.
It is usually not discovered until it is too late—when a controller in production has already taken an incorrect action—because an underlying assumption made by the controller author turned out to be wrong.
Typical issues caused by staleness include:
- Controllers taking incorrect actions.
- Controllers not taking action when they should.
- Controllers taking too long to act.
I am excited to announce that Kubernetes v1.36 includes new features that help mitigate staleness in controllers and provide better observability into controller behavior.
What is Staleness?
Staleness in controllers comes from an outdated view of the world inside the controller’s cache.
To provide a fast user experience, controllers typically maintain a local cache of the cluster state. This cache is populated by watching the Kubernetes API server for changes to objects that the controller cares about. When the controller needs to take action, it reads the current state from this cache instead of querying the API server directly, compares it with the desired state, and acts on the difference. This process is known as reconciliation. Because the cache only advances as watch events arrive, it can lag behind the API server.
When can the cache become outdated?
- Controller restart – the cache must be rebuilt from scratch, leaving a window where it is stale.
- API‑server outage – the cache cannot be refreshed, so the controller operates on stale data.
- Other edge cases – e.g., network partitions, informer lag, or large bursts of events.
During any of these periods the controller may be unable to act correctly.
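To make the failure mode concrete, here is a minimal sketch of a controller that reads only from its cache. The apiServer and cache types and the reconcile function are hypothetical stand-ins written for this illustration, not client-go types. Because the cache has not yet observed the controller's own writes, a second reconcile pass repeats the work.

```go
package main

import "fmt"

// apiServer is a stand-in for the source of truth.
type apiServer struct{ pods []string }

// cache is a local, possibly outdated copy of the pod list.
type cache struct{ pods []string }

// sync copies the current API server state into the cache,
// which is what a watch would eventually achieve.
func (c *cache) sync(s *apiServer) {
	c.pods = append([]string(nil), s.pods...)
}

// reconcile creates pods until the cache shows the desired count.
// It reads only from the cache and returns how many pods it created.
func reconcile(s *apiServer, c *cache, desired int) int {
	created := 0
	for i := len(c.pods); i < desired; i++ {
		s.pods = append(s.pods, fmt.Sprintf("pod-%d", len(s.pods)))
		created++
	}
	return created
}

func main() {
	s, c := &apiServer{}, &cache{}

	// Two passes without a cache update in between: the second pass
	// cannot see the pods created by the first one.
	reconcile(s, c, 3)
	reconcile(s, c, 3)
	fmt.Println("pods after two stale passes:", len(s.pods))

	// Once the cache catches up, the controller stops creating pods.
	c.sync(s)
	fmt.Println("created after sync:", reconcile(s, c, 3))
}
```

The sketch ends up with six pods where three were wanted, which is the "incorrect action" category from the list at the top of this post.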
Improvements in v1.36
Kubernetes v1.36 brings enhancements in both client‑go and in the implementations of highly contended controllers in kube‑controller‑manager, leveraging the new client‑go capabilities.
client‑go Improvements
- Atomic FIFO processing (feature gate AtomicFIFO) – builds on the existing FIFO queue implementation.
  - The queue now atomically handles operations received in batches (e.g., the initial list of objects an informer uses to populate its cache).
  - This guarantees that the queue remains in a consistent state even when events arrive out of order.
- Cache introspection – a new method on the Store interface:

      // LastStoreSyncResourceVersion returns the latest resource version that the store has observed.
      func (s *Store) LastStoreSyncResourceVersion() string

  - Controllers can use this to determine the most recent resource version the cache has seen, forming the basis for the staleness‑mitigation features in kube‑controller‑manager.
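The idea behind atomic batch handling can be sketched with a small queue. This is an illustration only: batchQueue, event, Add, AddBatch and Drain are hypothetical names, and the code does not reflect how client-go implements AtomicFIFO. The sketch assumes that "atomic" means no other producer can interleave an event in the middle of a batch.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a simplified change notification.
type event struct {
	Key string
	RV  uint64
}

// batchQueue is a hypothetical FIFO that accepts single events and batches.
type batchQueue struct {
	mu    sync.Mutex
	items []event
}

// Add enqueues a single event.
func (q *batchQueue) Add(e event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, e)
}

// AddBatch enqueues all events under one lock acquisition, so the batch
// stays contiguous even when other producers are active.
func (q *batchQueue) AddBatch(batch []event) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, batch...)
}

// Drain removes and returns everything queued so far, in FIFO order.
func (q *batchQueue) Drain() []event {
	q.mu.Lock()
	defer q.mu.Unlock()
	out := q.items
	q.items = nil
	return out
}

func main() {
	q := &batchQueue{}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // a competing producer adding single events
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			q.Add(event{Key: "single", RV: uint64(i)})
		}
	}()

	// The initial list arrives as one batch.
	q.AddBatch([]event{{"list", 1}, {"list", 2}, {"list", 3}})
	wg.Wait()

	// The three list events are adjacent however the producers interleaved.
	items := q.Drain()
	for i, e := range items {
		if e.Key == "list" {
			fmt.Println("batch contiguous:",
				items[i+1].Key == "list" && items[i+2].Key == "list")
			break
		}
	}
}
```

Adding the same three events with separate Add calls would give no such guarantee, because the competing producer could slip an event between them.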
kube‑controller‑manager Improvements
The v1.36 release enables four controllers to use the new capability by default:
| Controller | Feature gate |
|---|---|
| DaemonSet | StaleControllerConsistencyDaemonSet |
| StatefulSet | StaleControllerConsistencyStatefulSet |
| ReplicaSet | StaleControllerConsistencyReplicaSet |
| Job | StaleControllerConsistencyJob |
- The feature can be disabled by setting the corresponding gate to false.
- When the gate is enabled, a controller first checks the latest resource version of its cache before taking any action.
- If the cache’s version is lower than the version the controller has already written to the API server for the object it is reconciling, the controller does not act—its view is stale.
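The check described above can be sketched as a version comparison followed by a skip-or-run decision. cacheIsStale and syncIfFresh are hypothetical helpers, not functions from kube-controller-manager. The sketch assumes that resource versions can be compared as unsigned decimal integers; Kubernetes has historically documented them as opaque strings, so treat that as a simplification.

```go
package main

import (
	"fmt"
	"strconv"
)

// cacheIsStale reports whether the cache's last observed resource version
// is older than the version the controller last wrote.
// Assumption: both values are unsigned decimal integers.
func cacheIsStale(cacheRV, writtenRV string) (bool, error) {
	c, err := strconv.ParseUint(cacheRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse cache resource version %q: %w", cacheRV, err)
	}
	w, err := strconv.ParseUint(writtenRV, 10, 64)
	if err != nil {
		return false, fmt.Errorf("parse written resource version %q: %w", writtenRV, err)
	}
	return c < w, nil
}

// syncIfFresh runs reconcile only when the cache has caught up with the
// controller's own last write. It returns whether reconcile ran.
func syncIfFresh(cacheRV, writtenRV string, reconcile func()) (bool, error) {
	stale, err := cacheIsStale(cacheRV, writtenRV)
	if err != nil || stale {
		return false, err
	}
	reconcile()
	return true, nil
}

func main() {
	for _, tc := range [][2]string{{"100", "105"}, {"105", "105"}, {"110", "105"}} {
		ran, err := syncIfFresh(tc[0], tc[1], func() {})
		fmt.Printf("cache=%s written=%s ran=%v err=%v\n", tc[0], tc[1], ran, err)
	}
}
```

A skipped sync is not lost work in this model: the caller would typically retry once the cache has advanced.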
Use for Informer Authors
Informer authors can immediately benefit from these improvements. Below is an example of how the ReplicaSet controller uses the new feature.
    type ConsistencyStore interface {
        // WroteAt records that the given object was written at the given resource version.
        WroteAt(owningObj runtime.Object, uid types.UID,
            groupResource schema.GroupResource, resourceVersion string)

        // EnsureReady returns true if the cache is up‑to‑date for the given object.
        // It is used prior to taking any reconciliation action.
        EnsureReady(owningObj runtime.Object, uid types.UID,
            groupResource schema.GroupResource) bool
    }
- The ReplicaSet controller tracks both the ReplicaSet’s own resource version and the resource versions of the Pods it manages.
- For a specific ReplicaSet, it records the latest written resource version of its Pods and any writes to the ReplicaSet itself.
- If the cache’s latest version is lower than what the controller has already written, the controller refrains from acting because its view is stale.
An informer author can employ ConsistencyStore to:
- Record writes (WroteAt) as they happen.
- Check freshness (EnsureReady) before processing events.
This pattern ensures that controllers act only on up‑to‑date information, reducing the risk of stale‑state bugs.
Consistency Store Interface
    type ConsistencyStore interface {
        // WroteAt records the latest resource version that the controller has written
        // to the API server for a given object.
        //   * `owningObj` – the object being reconciled.
        //   * `uid` – UID of the owning object.
        //   * `resourceVersion` – the resource version just written.
        //   * `groupResource` – the GroupResource of the object.
        WroteAt(owningObj client.Object, uid types.UID,
            resourceVersion string, groupResource schema.GroupResource)

        // EnsureReady checks whether the cache is up‑to‑date for the given object.
        // It is called before reconciliation to decide whether to proceed.
        // Returns `true` if the cache is current, otherwise `false`.
        EnsureReady(namespacedName types.NamespacedName) bool

        // Clear removes the given object from the consistency store.
        // It is used when an object is deleted.
        Clear(namespacedName types.NamespacedName, uid types.UID)
    }
Function Descriptions
| Function | Purpose | Details |
|---|---|---|
| WroteAt | Record the latest resource version written by the controller. | Called after the controller writes an object to the API server. Tracks owningObj, its uid, the resourceVersion that was written, and the object’s GroupResource. Only version information is stored; the object itself is not retained. |
| EnsureReady | Verify that the cache is up‑to‑date before reconciling. | Invoked prior to reconciliation. Returns true if the cache reflects the latest version (as recorded by WroteAt), otherwise false. |
| Clear | Remove an entry from the store when an object is deleted. | Prevents unbounded growth of the store. Uses the object’s UID to differentiate between a deleted object and a newly created one with the same name. Not required for EnsureReady, which only cares about the latest version. |
With these three functions, an informer author can implement staleness mitigation in their controller.
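As a rough illustration of how the three functions fit together, here is a simplified in-memory sketch. It is not the kube-controller-manager implementation. The memStore, key and entry types are hypothetical; the sketch drops the owningObj and groupResource parameters, tracks a single cache, and assumes resource versions are already parsed into integers.

```go
package main

import (
	"fmt"
	"sync"
)

// key identifies an object by namespace and name.
type key struct{ Namespace, Name string }

// entry holds only version information, never the object itself.
type entry struct {
	uid string
	rv  uint64
}

// memStore is a hypothetical, simplified consistency store.
type memStore struct {
	mu      sync.Mutex
	written map[key]entry
	// cacheRV returns the latest resource version the cache has observed.
	cacheRV func() uint64
}

func newMemStore(cacheRV func() uint64) *memStore {
	return &memStore{written: map[key]entry{}, cacheRV: cacheRV}
}

// WroteAt records the newest version written for k. An older version for
// the same UID never overwrites a newer one.
func (s *memStore) WroteAt(k key, uid string, rv uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.written[k]; ok && e.uid == uid && e.rv >= rv {
		return
	}
	s.written[k] = entry{uid: uid, rv: rv}
}

// EnsureReady reports whether the cache has caught up with the last
// recorded write for k. Objects with no recorded write are always ready.
func (s *memStore) EnsureReady(k key) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.written[k]
	return !ok || s.cacheRV() >= e.rv
}

// Clear removes the entry for k, but only if the UID matches, so a
// recreated object with the same name keeps its own entry.
func (s *memStore) Clear(k key, uid string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.written[k]; ok && e.uid == uid {
		delete(s.written, k)
	}
}

func main() {
	var observed uint64 = 10 // what the cache has seen so far
	store := newMemStore(func() uint64 { return observed })
	rs := key{Namespace: "default", Name: "web"}

	store.WroteAt(rs, "uid-1", 15)
	fmt.Println("ready right after write:", store.EnsureReady(rs))

	observed = 15 // the watch delivers the controller's own write
	fmt.Println("ready after cache catches up:", store.EnsureReady(rs))

	store.Clear(rs, "uid-1")
}
```

In a real controller, the cacheRV callback would presumably be backed by something like LastStoreSyncResourceVersion on the informer's store; the closure over a local variable here only keeps the sketch self-contained.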
Observability
Kubernetes added related instrumentation to kube‑controller‑manager in v1.36.
These metrics are enabled by default and controlled via the same feature gates.
Metrics
| Metric | Description |
|---|---|
| stale_sync_skips_total | Number of times a controller skipped a sync because its cache was stale. Exposed per controller that uses the staleness‑mitigation feature, under the controller’s subsystem. |
| store_resource_version | Exposes the latest resource version of every shared informer. Labels: group, version, resource. Allows you to compare informer cache versions against the API server’s version to detect staleness. |
These metrics are available via the kube‑controller‑manager metrics endpoint.
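As a sketch of the comparison that store_resource_version enables, the following code reads a scrape in Prometheus text format and reports how far each informer trails a reference version. The sample scrape is made up for illustration, the exported metric name may carry a prefix, and parseSample and lagBehind are hypothetical helpers. It also assumes each sample line ends with its value (no trailing timestamp) and that versions are small enough to be exact as float64.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSample splits a line of the form `name{labels} value` into the
// series (name plus labels) and the numeric value.
func parseSample(line string) (string, float64, error) {
	line = strings.TrimSpace(line)
	i := strings.LastIndex(line, " ")
	if i < 0 {
		return "", 0, fmt.Errorf("malformed sample %q", line)
	}
	v, err := strconv.ParseFloat(line[i+1:], 64)
	if err != nil {
		return "", 0, fmt.Errorf("parse value in %q: %w", line, err)
	}
	return line[:i], v, nil
}

// lagBehind returns, for every store_resource_version series in the
// scrape, the difference between the reference version and the cache.
func lagBehind(scrape string, apiServerRV float64) (map[string]float64, error) {
	lags := map[string]float64{}
	for _, line := range strings.Split(scrape, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		if !strings.Contains(line, "store_resource_version") {
			continue
		}
		series, v, err := parseSample(line)
		if err != nil {
			return nil, err
		}
		lags[series] = apiServerRV - v
	}
	return lags, nil
}

func main() {
	// Hypothetical scrape output, not captured from a real cluster.
	scrape := `
# TYPE store_resource_version gauge
store_resource_version{group="apps",resource="replicasets",version="v1"} 1042
store_resource_version{group="",resource="pods",version="v1"} 1050
`
	lags, err := lagBehind(scrape, 1050)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for series, lag := range lags {
		fmt.Printf("%s lags by %.0f\n", series, lag)
	}
}
```

A lag that stays above zero across several scrapes suggests the informer is not keeping up, which is a signal worth correlating with stale_sync_skips_total.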
What’s Next?
- SIG API Machinery will continue to evolve this feature and expand its adoption across more controllers.
- Feedback is welcomed – please comment below or open an issue on the Kubernetes GitHub repository.
- Collaboration with controller‑runtime is underway to expose these semantics to all controllers built with it, giving every controller “read‑your‑own‑writes” capabilities without extra implementation effort.