Kubernetes v1.36: Staleness Mitigation and Observability for Controllers

Published: (April 28, 2026 at 02:35 PM EDT)

Source: Kubernetes Blog

Staleness in Kubernetes Controllers

Staleness affects many Kubernetes controllers and can change their behavior in subtle ways.
It often goes unnoticed until a controller in production has already taken an incorrect action, because an underlying assumption made by the controller author turned out to be wrong.

Typical issues caused by staleness include:

  • Controllers taking incorrect actions.
  • Controllers not taking action when they should.
  • Controllers taking too long to act.

I am excited to announce that Kubernetes v1.36 includes new features that help mitigate staleness in controllers and provide better observability into controller behavior.


What is Staleness?

Staleness in controllers comes from an outdated view of the world inside the controller’s cache.
To provide a fast user experience, controllers typically maintain a local cache of the cluster state. This cache is populated by watching the Kubernetes API server for changes to objects that the controller cares about. When the controller needs to take action, it first checks its cache for the latest information; if the cache is out‑of‑date, the controller updates it by watching the API server. This process is known as reconciliation.

When can the cache become outdated?

  • Controller restart – the cache must be rebuilt from scratch, leaving a window where it is stale.
  • API‑server outage – the cache cannot be refreshed, so the controller operates on stale data.
  • Other edge cases – e.g., network partitions, informer lag, or large bursts of events.

During any of these periods the controller may be unable to act correctly.
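
To make the failure mode concrete, here is a toy simulation of a controller that reconciles twice against a cache that may or may not have caught up in between. It is only a sketch: the `cluster` and `cache` types, `reconcile`, and `simulate` are hypothetical stand-ins invented for this example, not client-go types.

```go
package main

import "fmt"

// cluster is a stand-in for the API server: the source of truth.
type cluster struct {
	pods int
}

// cache is a stand-in for an informer cache. It only learns about
// changes when deliver is called, which simulates watch-event delivery.
type cache struct {
	pods int
}

func (c *cache) deliver(from *cluster) { c.pods = from.pods }

// reconcile creates enough pods to reach desired, based only on what
// the cache reports. It returns the number of pods it created.
func reconcile(desired int, view *cache, api *cluster) int {
	missing := desired - view.pods
	if missing <= 0 {
		return 0
	}
	api.pods += missing
	return missing
}

// simulate runs two reconciles for a desired count of 3 and returns the
// number of pods that exist afterwards. When deliverBetween is false,
// the second reconcile runs against a stale cache.
func simulate(deliverBetween bool) int {
	api := &cluster{}
	view := &cache{}
	reconcile(3, view, api)
	if deliverBetween {
		view.deliver(api)
	}
	reconcile(3, view, api)
	return api.pods
}

func main() {
	fmt.Println("fresh cache:", simulate(true))  // fresh cache: 3
	fmt.Println("stale cache:", simulate(false)) // stale cache: 6
}
```

With a stale cache the controller does not see the pods it just created, so it creates them again. This is the "incorrect action" case from the list of typical issues.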


Improvements in v1.36

Kubernetes v1.36 brings enhancements in both client‑go and in the implementations of highly contended controllers in kube‑controller‑manager, leveraging the new client‑go capabilities.

client‑go Improvements

  • Atomic FIFO processing (feature gate AtomicFIFO) – builds on the existing FIFO queue implementation.
    • The queue now atomically handles operations received in batches (e.g., the initial list of objects an informer uses to populate its cache).
    • This guarantees that the queue remains in a consistent state even when events arrive out of order.
  • Cache introspection – a new method on the Store interface:

// LastStoreSyncResourceVersion returns the latest resource version that the store has observed.
LastStoreSyncResourceVersion() string

    • Controllers can use this to determine the most recent resource version the cache has seen, forming the basis for the staleness‑mitigation features in kube‑controller‑manager.
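
As a rough illustration of how that value might be used, the sketch below checks whether a store has caught up with a controller's last write. Everything except the method name `LastStoreSyncResourceVersion` is an assumption made for the example: `versionedStore` is a hypothetical trimmed-down interface, `fakeStore` is a test double, and `compareRV` assumes that resource versions are decimal integers that can be compared numerically.

```go
package main

import (
	"fmt"
	"math/big"
)

// versionedStore is a hypothetical, trimmed-down view of a store that
// exposes only the method discussed above.
type versionedStore interface {
	LastStoreSyncResourceVersion() string
}

// fakeStore is a test double for this sketch, not a client-go type.
type fakeStore struct{ rv string }

func (f fakeStore) LastStoreSyncResourceVersion() string { return f.rv }

// compareRV compares two resource versions numerically and returns
// -1, 0, or +1. Assumption: both values are decimal integers.
func compareRV(a, b string) (int, error) {
	x, ok := new(big.Int).SetString(a, 10)
	if !ok {
		return 0, fmt.Errorf("resource version %q is not a decimal integer", a)
	}
	y, ok := new(big.Int).SetString(b, 10)
	if !ok {
		return 0, fmt.Errorf("resource version %q is not a decimal integer", b)
	}
	return x.Cmp(y), nil
}

// caughtUp reports whether the store has observed at least the
// resource version returned by the controller's last write.
func caughtUp(s versionedStore, lastWritten string) (bool, error) {
	cmp, err := compareRV(s.LastStoreSyncResourceVersion(), lastWritten)
	if err != nil {
		return false, err
	}
	return cmp >= 0, nil
}

func main() {
	behind, _ := caughtUp(fakeStore{rv: "9"}, "10")
	ahead, _ := caughtUp(fakeStore{rv: "10"}, "9")
	fmt.Println(behind, ahead) // false true
}
```

The comparison is numeric on purpose: a plain string comparison would order "9" after "10".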

kube‑controller‑manager Improvements

The v1.36 release enables four controllers to use the new capability by default:

Controller     Feature gate
DaemonSet      StaleControllerConsistencyDaemonSet
StatefulSet    StaleControllerConsistencyStatefulSet
ReplicaSet     StaleControllerConsistencyReplicaSet
Job            StaleControllerConsistencyJob

  • The feature can be disabled by setting the corresponding gate to false.
  • When the gate is enabled, a controller first checks the latest resource version of its cache before taking any action.
    • If the cache’s version is lower than the version the controller has already written to the API server for the object it is reconciling, the controller does not act—its view is stale.

Use for Informer Authors

Informer authors can immediately benefit from these improvements. Below is an example of how the ReplicaSet controller uses the new feature.

type ConsistencyStore interface {
    // WroteAt records that the given object was written at the given resource version.
    WroteAt(owningObj runtime.Object, uid types.UID,
            groupResource schema.GroupResource, resourceVersion string)

    // EnsureReady returns true if the cache is up‑to‑date for the given object.
    // It is used prior to taking any reconciliation action.
    EnsureReady(owningObj runtime.Object, uid types.UID,
                groupResource schema.GroupResource) bool
}
  • The ReplicaSet controller tracks both the ReplicaSet’s own resource version and the resource versions of the Pods it manages.
  • For a specific ReplicaSet, it records the latest written resource version of its Pods and any writes to the ReplicaSet itself.
  • If the cache’s latest version is lower than what the controller has already written, the controller refrains from acting because its view is stale.

An informer author can employ ConsistencyStore to:

  1. Record writes (WroteAt) as they happen.
  2. Check freshness (EnsureReady) before processing events.

This pattern ensures that controllers act only on up‑to‑date information, reducing the risk of stale‑state bugs.
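
Seen from the controller's side, the two steps above might look like the following sketch. It deliberately avoids Kubernetes types: `tracker` is a hypothetical, simplified stand-in for ConsistencyStore keyed by a plain string, `memTracker` is a toy implementation whose cache version is advanced by hand, and `syncOnce` and its `write` callback are names invented for this example.

```go
package main

import "fmt"

// tracker is a hypothetical, simplified stand-in for ConsistencyStore.
type tracker interface {
	WroteAt(key string, resourceVersion uint64)
	EnsureReady(key string) bool
}

// memTracker compares recorded writes with a cache version that this
// example advances by hand.
type memTracker struct {
	cacheRV uint64
	written map[string]uint64
}

func (m *memTracker) WroteAt(key string, rv uint64) {
	if rv > m.written[key] {
		m.written[key] = rv
	}
}

func (m *memTracker) EnsureReady(key string) bool {
	return m.cacheRV >= m.written[key]
}

// syncOnce checks freshness first and records the write afterwards.
// write is a hypothetical callback that performs the API call and
// returns the resource version of the written object.
func syncOnce(t tracker, key string, write func() uint64) bool {
	if !t.EnsureReady(key) {
		return false // a real controller would requeue the key and retry later
	}
	t.WroteAt(key, write())
	return true
}

func main() {
	m := &memTracker{written: map[string]uint64{}}
	write := func() uint64 { return 5 }

	fmt.Println(syncOnce(m, "default/web", write)) // true: nothing written yet
	fmt.Println(syncOnce(m, "default/web", write)) // false: cache has not seen version 5
	m.cacheRV = 5                                  // the watch event arrives
	fmt.Println(syncOnce(m, "default/web", write)) // true: cache has caught up
}
```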


Consistency Store Interface

type ConsistencyStore interface {
    // WroteAt records the latest resource version that the controller has written
    // to the API server for a given object.
    // * `owningObj` – the object being reconciled.
    // * `uid` – UID of the owning object.
    // * `resourceVersion` – the resource version just written.
    // * `groupResource` – the GroupResource of the object.
    WroteAt(owningObj client.Object, uid types.UID,
        resourceVersion string, groupResource schema.GroupResource)

    // EnsureReady checks whether the cache is up‑to‑date for the given object.
    // It is called before reconciliation to decide whether to proceed.
    // Returns `true` if the cache is current, otherwise `false`.
    EnsureReady(namespacedName types.NamespacedName) bool

    // Clear removes the given object from the consistency store.
    // It is used when an object is deleted.
    Clear(namespacedName types.NamespacedName, uid types.UID)
}

Function Descriptions

WroteAt
  Purpose: Record the latest resource version written by the controller.
  Details:
    • Called after the controller writes an object to the API server.
    • Tracks owningObj, its uid, the resourceVersion that was written, and the object’s GroupResource.
    • Only version information is stored; the object itself is not retained.

EnsureReady
  Purpose: Verify that the cache is up‑to‑date before reconciling.
  Details:
    • Invoked prior to reconciliation.
    • Returns true if the cache reflects the latest version (as recorded by WroteAt), otherwise false.

Clear
  Purpose: Remove an entry from the store when an object is deleted.
  Details:
    • Prevents unbounded growth of the store.
    • Uses the object’s UID to differentiate between a deleted object and a newly created one with the same name.
    • EnsureReady does not need the UID, because it only cares about the latest version.

With these three functions, an informer author can implement staleness mitigation in their controller.
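
For a feel of what sits behind the three functions, here is a minimal in-memory sketch. It is not the kube‑controller‑manager implementation: `memStore`, `entry`, and `newMemStore` are hypothetical, keys and UIDs are plain strings, resource versions are modeled as `uint64`, and the cache's latest version is supplied through a callback.

```go
package main

import (
	"fmt"
	"sync"
)

// entry holds only version information, never the object itself.
type entry struct {
	uid string
	rv  uint64
}

// memStore is a hypothetical in-memory store keyed by "namespace/name".
type memStore struct {
	mu      sync.Mutex
	entries map[string]entry
	cacheRV func() uint64 // returns the cache's latest observed version
}

func newMemStore(cacheRV func() uint64) *memStore {
	return &memStore{entries: map[string]entry{}, cacheRV: cacheRV}
}

// WroteAt records the highest version written for key. A different UID
// means the object was recreated, so the old entry is replaced.
func (s *memStore) WroteAt(key, uid string, rv uint64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[key]
	if !ok || e.uid != uid || rv > e.rv {
		s.entries[key] = entry{uid: uid, rv: rv}
	}
}

// EnsureReady reports whether the cache has seen the last recorded write.
func (s *memStore) EnsureReady(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	e, ok := s.entries[key]
	return !ok || s.cacheRV() >= e.rv
}

// Clear removes the entry only if the UID matches, so a late delete for
// an old object cannot wipe the entry of its same-named replacement.
func (s *memStore) Clear(key, uid string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if e, ok := s.entries[key]; ok && e.uid == uid {
		delete(s.entries, key)
	}
}

func main() {
	cache := uint64(5)
	s := newMemStore(func() uint64 { return cache })

	s.WroteAt("default/web", "uid-1", 7)
	fmt.Println(s.EnsureReady("default/web")) // false
	cache = 7
	fmt.Println(s.EnsureReady("default/web")) // true

	s.Clear("default/web", "uid-2") // UID mismatch: entry is kept
	s.Clear("default/web", "uid-1") // UID match: entry is removed
	fmt.Println(len(s.entries))     // 0
}
```

Keys without an entry are treated as ready, which is one possible design choice: a controller that has never written for an object has nothing of its own to wait for.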


Observability

Kubernetes added related instrumentation to kube‑controller‑manager in v1.36.
These metrics are enabled by default and controlled via the same feature gates.

Metrics

Metric: stale_sync_skips_total
Description: Number of times a controller skipped a sync because its cache was stale. Exposed per controller that uses the staleness‑mitigation feature, under the controller’s subsystem.

Metric: store_resource_version
Description: Exposes the latest resource version of every shared informer. Allows you to compare informer cache versions against the API server’s version to detect staleness.
Labels: group, version, resource.

These metrics are available via the kube‑controller‑manager metrics endpoint.
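
One way to use store_resource_version is to scrape the endpoint and flag informers whose value is behind a reference version. The sketch below does this over text in the Prometheus exposition format. The sample lines and values are made up, the exposed metric name may carry a prefix (the code matches on the substring for that reason), and `laggingSeries` is a hypothetical helper. It assumes samples carry no trailing timestamp and that versions fit a `float64` without losing precision.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// laggingSeries returns every store_resource_version series in the
// exposition text whose value is lower than apiServerRV.
func laggingSeries(exposition string, apiServerRV float64) []string {
	var lagging []string
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blank lines and HELP/TYPE comments
		}
		idx := strings.LastIndex(line, " ")
		if idx < 0 {
			continue
		}
		series, raw := line[:idx], line[idx+1:]
		if !strings.Contains(series, "store_resource_version") {
			continue
		}
		value, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			continue // not a sample line this sketch understands
		}
		if value < apiServerRV {
			lagging = append(lagging, series)
		}
	}
	return lagging
}

func main() {
	// Made-up sample; real output comes from the metrics endpoint.
	sample := `# TYPE store_resource_version gauge
store_resource_version{group="apps",resource="replicasets",version="v1"} 90
store_resource_version{group="",resource="pods",version="v1"} 120
`
	for _, s := range laggingSeries(sample, 100) {
		fmt.Println("behind:", s)
	}
}
```

The reference version (100 here) has to come from somewhere else, for example a recent list call against the API server; how to obtain it is outside this sketch.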


What’s Next?

  • SIG API Machinery will continue to evolve this feature and expand its adoption across more controllers.
  • Feedback is welcomed – please comment below or open an issue on the Kubernetes GitHub repository.
  • Collaboration with controller‑runtime is underway to expose these semantics to all controllers built with it, giving every controller “read‑your‑own‑writes” capabilities without extra implementation effort.