Kubernetes v1.35: New level of efficiency with in-place Pod restart

Published: January 2, 2026 at 01:30 PM EST

Source: Kubernetes Blog

Enabling the Feature

The functionality is available by enabling the RestartAllContainersOnContainerExits feature gate. This alpha feature extends the Container Restart Rules feature, which graduated to beta in Kubernetes 1.35.

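# Example feature-gate configuration (kubelet configuration file)
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  RestartAllContainersOnContainerExits: true

# On the API server, pass the equivalent command-line flag:
# kube-apiserver --feature-gates=RestartAllContainersOnContainerExits=true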

The Problem

When a single‑container restart isn’t enough and recreating Pods is too costly, the existing restart mechanisms fall short.

Kubernetes has long supported restartPolicy at the Pod level and, more recently, at the individual container level.
These policies work well for isolated crashes, but many modern applications have complex inter‑container dependencies.

Typical Scenarios

- An init container prepares the environment (e.g., mounts a volume or generates a config file). If the main application container corrupts this environment, restarting that one container is not enough: the entire initialization process must run again.
- A watcher sidecar monitors system health. If it detects an unrecoverable but retriable error state, it must trigger a restart of the main application container from a clean slate.
- A resource-managing sidecar fails. Even if the sidecar restarts on its own, the main container may be stuck trying to access an outdated or broken connection.

In all these cases, the desired action is not to restart a single container, but all of them. Previously, the only way to achieve this was to delete the Pod and let a controller (e.g., a Job or ReplicaSet) create a new one—a slow, expensive process involving the scheduler, node‑resource allocation, and re‑initialization of networking and storage.

Impact on Large‑Scale AI/ML Workloads

Scale: ≥ 1,000 Nodes with one Pod per Node.
Requirement: When a failure occurs (e.g., a Node crash), all Pods in the fleet must be recreated to reset state before training can resume, even if most Pods were not directly affected.
Cost: Deleting, creating, and scheduling thousands of Pods simultaneously creates a massive bottleneck. The estimated overhead can exceed $100,000 per month in wasted resources.

Handling these failures traditionally requires a complex integration between the training framework and Kubernetes—often fragile and toilsome. The new feature provides a Kubernetes‑native solution, improving robustness and letting developers focus on core training logic.

Additional Benefit

Keeping Pods on their assigned Nodes enables further optimizations, such as node‑level caching tied to a specific Pod identity—something impossible when Pods are unnecessarily recreated on different Nodes.

Introducing the `RestartAllContainers` Action

Kubernetes v1.35 adds a new action to the container restart rules: RestartAllContainers.
When a container exits in a way that matches a rule with this action, the kubelet initiates a fast, in‑place restart of the Pod.
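
To make the matching step more concrete, here is a small Python model of how an exit code could be checked against restart rules. It is an illustrative sketch only, not the kubelet's implementation: `RestartRule` and `action_for_exit` are hypothetical names, the `In` operator and the value 88 come from the rule shown in this post, `NotIn` is assumed to be its counterpart, and first-match-wins ordering is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class RestartRule:
    """Hypothetical, simplified stand-in for one container restart rule."""
    action: str            # e.g. "RestartAllContainers"
    operator: str          # "In" or "NotIn"
    values: Sequence[int]  # exit codes the operator is applied to

    def matches(self, exit_code: int) -> bool:
        if self.operator == "In":
            return exit_code in self.values
        if self.operator == "NotIn":
            return exit_code not in self.values
        raise ValueError(f"unknown operator: {self.operator}")


def action_for_exit(rules: Sequence[RestartRule], exit_code: int) -> Optional[str]:
    """Return the action of the first matching rule, or None if none match."""
    for rule in rules:
        if rule.matches(exit_code):
            return rule.action
    return None


rules = [RestartRule("RestartAllContainers", "In", [88])]
print(action_for_exit(rules, 88))  # RestartAllContainers
print(action_for_exit(rules, 1))   # None
```

When no rule matches, the sketch returns `None`, standing in for "fall back to the container's ordinary restart policy".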

What Is Preserved During an In‑Place Restart?

Pod UID, IP address, and network namespace
Pod sandbox and any attached devices
All volumes, including emptyDir and mounted volumes from PVCs

After terminating all running containers, the Pod’s startup sequence is re‑executed from the very beginning:

All init containers run again in order.
Sidecars and regular containers start thereafter, ensuring a completely fresh start in a known‑good environment.

Note: Ephemeral containers are terminated. All other containers—including those that previously succeeded or failed—are restarted, regardless of their individual restart policies.
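
The ordering described above can be sketched as a tiny Python function. This is a deliberately simplified model: the dictionary keys (`init`, `sidecars`, `regular`, `ephemeral`) and the function name are hypothetical and do not mirror the real Pod API, and the real kubelet waits for each init container to finish before starting the next.

```python
from typing import Dict, List


def in_place_restart_order(pod: Dict[str, List[str]]) -> List[str]:
    """Return container names in the order they start after an in-place restart."""
    order: List[str] = []
    # All init containers run again, in their declared order.
    order.extend(pod.get("init", []))
    # Sidecars and regular containers start thereafter.
    order.extend(pod.get("sidecars", []))
    order.extend(pod.get("regular", []))
    # Ephemeral containers are terminated and not started again,
    # so pod["ephemeral"] is deliberately ignored.
    return order


pod = {
    "init": ["setup-environment"],
    "sidecars": ["watcher-sidecar"],
    "regular": ["main-application"],
    "ephemeral": ["debugger"],
}
print(in_place_restart_order(pod))
# ['setup-environment', 'watcher-sidecar', 'main-application']
```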

Use Cases

1. Efficient Restarts for ML/Batch Jobs

Problem: Rescheduling a worker Pod on failure is costly; on a 1,000-node training cluster, the overhead can waste more than $100,000 in compute resources per month.
Solution: Use RestartAllContainers to enable a fast, hybrid recovery strategy:
- Recreate only the “bad” Pods (e.g., those on unhealthy Nodes).
- Trigger RestartAllContainers for the remaining healthy Pods.
Result: Benchmarks show recovery overhead drops from minutes to a few seconds.
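
The hybrid strategy boils down to splitting the fleet into two groups. The Python sketch below shows that split under simple assumptions: `plan_recovery` is a hypothetical helper, the Pod-to-Node mapping and the set of unhealthy Nodes are assumed to come from whatever orchestration layer drives the recovery, and the Pod and Node names are made up.

```python
from typing import Dict, Iterable, List, Tuple


def plan_recovery(
    pod_to_node: Dict[str, str],
    unhealthy_nodes: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Split Pods into (recreate, restart_in_place) based on Node health."""
    unhealthy = set(unhealthy_nodes)
    # Pods on unhealthy Nodes cannot be reset in place and must be recreated.
    recreate = sorted(p for p, n in pod_to_node.items() if n in unhealthy)
    # Every other Pod only needs its containers restarted.
    restart_in_place = sorted(p for p, n in pod_to_node.items() if n not in unhealthy)
    return recreate, restart_in_place


fleet = {"worker-0": "node-a", "worker-1": "node-b", "worker-2": "node-c"}
recreate, restart = plan_recovery(fleet, unhealthy_nodes=["node-b"])
print(recreate)  # ['worker-1']
print(restart)   # ['worker-0', 'worker-2']
```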

2. Watcher‑Sidecar‑Driven Reset

A watcher sidecar can monitor the main training process. If it encounters a specific, retriable error, the watcher exits with a designated code that triggers a fast reset of the worker Pod, allowing it to restart from the last checkpoint without involving the Job controller. This capability is now natively supported by Kubernetes.
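
A watcher of this kind could look roughly like the Python sketch below. Everything here is hypothetical: the `run_watcher` function, the `"ok"` / `"retriable"` status strings, and the health check itself are placeholders, and the exit code 88 is chosen only to mirror the `values: [88]` rule that appears in this post.

```python
import time
from typing import Callable

# Assumed to match the exit code listed in the Pod's restart rule.
RESET_EXIT_CODE = 88


def run_watcher(
    check: Callable[[], str],
    interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `check` until it reports a problem, then return an exit code.

    `check` is a hypothetical probe returning "ok", "retriable",
    or any other string for a non-retriable failure.
    """
    while True:
        status = check()
        if status == "retriable":
            # Designated code: asks for a fast, in-place reset of the Pod.
            return RESET_EXIT_CODE
        if status != "ok":
            # Any other failure: exit with a code no restart rule matches.
            return 1
        sleep(interval_seconds)
```

The process entry point would then end with something like `sys.exit(run_watcher(probe))`, where `probe` is the application-specific health check.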

Read more: Future development and JobSet features are described in KEP‑467 – JobSet in‑place restart.

Example Pod Specification

apiVersion: v1
kind: Pod
metadata:
  name: ml-worker-pod
spec:
  restartPolicy: Never
  initContainers:
    # This init container will re-run on every in-place restart
    - name: setup-environment
      image: my-repo/setup-worker:1.0
    # The watcher runs as a sidecar: an init container with restartPolicy: Always
    - name: watcher-sidecar
      image: my-repo/watcher:1.0
      restartPolicy: Always
      restartPolicyRules:
        - action: RestartAllContainers
          onExit:
            exitCodes:
              operator: In
              # A specific exit code from the watcher triggers a full pod restart
              values: [88]
  containers:
    - name: main-application
      image: my-repo/training-app:1.0

The snippet above declares a Pod that restarts all of its containers when the watcher-sidecar exits with exit code 88.

Bottom Line

RestartAllContainers gives Kubernetes users a lightweight, in-place Pod reset mechanism that:

- Saves compute resources and money at scale.
- Reduces recovery time from minutes to seconds.
- Preserves critical Pod identity (UID, IP, volumes, etc.).
- Enables richer sidecar-driven recovery patterns without extra controller logic.

This feature marks a significant step forward for building robust, efficient AI/ML platforms on Kubernetes.

1. Re‑running Init Containers for a Clean State

Imagine a scenario where an init container is responsible for fetching credentials or setting up a shared volume.
If the main application fails in a way that corrupts this shared state, you need the init container to rerun.

By configuring the main application to exit with a specific code upon detecting such a corruption, you can trigger the RestartAllContainers action, guaranteeing that the init container provides a clean setup before the application restarts.
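
One way the application might detect corruption and signal it is sketched below in Python. This is only an illustration under stated assumptions: it assumes the init container writes a SHA-256 digest next to the file it prepares, and the function names, file layout, and the exit code 88 are all hypothetical choices for the example.

```python
import hashlib
from pathlib import Path

# Hypothetical code; it must match the exit code in the container's restart rule.
CORRUPTION_EXIT_CODE = 88


def state_is_intact(data_file: Path, checksum_file: Path) -> bool:
    """Check the shared file against the digest assumed to be written at init time."""
    if not data_file.is_file() or not checksum_file.is_file():
        return False
    expected = checksum_file.read_text().strip()
    actual = hashlib.sha256(data_file.read_bytes()).hexdigest()
    return actual == expected


def exit_code_for_state(data_file: Path, checksum_file: Path) -> int:
    """Return 0 when the shared state is usable, else the designated reset code."""
    if state_is_intact(data_file, checksum_file):
        return 0
    return CORRUPTION_EXIT_CODE
```

The application would call `sys.exit(exit_code_for_state(...))` at the point where it gives up on the current state, so that the init container rebuilds it before the next start.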

2. Handling a High Rate of Similar Task Executions

Some workloads are best modeled as one Pod execution per task, where each task requires a clean environment (e.g., game-session backends or queue-item processors).
When the task rate is high, the full cycle of Pod creation, scheduling, and initialization becomes too expensive—especially for short‑lived tasks.

The ability to restart all containers from scratch gives a Kubernetes‑native way to handle this scenario without custom solutions or frameworks.

How to Use It

1. Enable the feature gate
- Set RestartAllContainersOnContainerExits on your Kubernetes cluster components (API server and kubelet).
- Requires Kubernetes v1.35+.
- This alpha feature extends the ContainerRestartRules feature, which graduated to beta in v1.35 and is enabled by default.
2. Add restartPolicyRules to any container (init, sidecar, or regular) and use the RestartAllContainers action.
3. Best-practice checklist
- Ensure all containers are re-entrant, so they can safely run more than once in the same Pod.
- Verify that external tooling can handle init containers re-running.
- Remember that preStop hooks are NOT executed when all containers are restarted; containers must tolerate abrupt termination.
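
Because volumes survive an in-place restart and containers can be stopped without warning, setup code benefits from being idempotent and from never leaving half-written files behind. The Python sketch below shows one common pattern (write to a temporary file, then rename atomically); `write_config_atomically` is a hypothetical helper and not part of any Kubernetes tooling.

```python
import os
import tempfile
from pathlib import Path


def write_config_atomically(target: Path, content: str) -> None:
    """Write `content` to `target` so that reruns and abrupt kills are safe."""
    # Creating the directory is safe to repeat on every run.
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), prefix=target.name + ".")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(content)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Atomic on the same filesystem: readers see the old or the new file,
        # never a partially written one.
        os.replace(tmp_name, str(target))
    except BaseException:
        # Remove the partial temporary file if anything went wrong.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Running the function twice with the same content leaves exactly one file with that content, which is the property an init container needs when it re-runs against a preserved volume.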

Observing the Restart

A new Pod condition AllContainersRestarting is added to the Pod’s status.
- Becomes True when a restart is triggered.
- Reverts to False once all containers have terminated and the Pod is ready to start its lifecycle anew.
All containers restarted by this action will have their restart count incremented in the container status.
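
For tooling that wants to react to these signals, the Python sketch below reads them from a Pod object that has already been decoded from JSON (for example, the output of `kubectl get pod <name> -o json`). The `summarize_restart` helper is hypothetical; it relies only on the `AllContainersRestarting` condition described above and on the standard `conditions`, `initContainerStatuses`, and `containerStatuses` fields of the Pod status.

```python
from typing import Any, Dict


def summarize_restart(pod: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the in-place restart condition and per-container restart counts."""
    status = pod.get("status", {})
    # Condition statuses are the strings "True", "False" or "Unknown".
    restarting = any(
        cond.get("type") == "AllContainersRestarting" and cond.get("status") == "True"
        for cond in status.get("conditions", [])
    )
    counts: Dict[str, int] = {}
    for key in ("initContainerStatuses", "containerStatuses"):
        for cs in status.get(key, []):
            counts[cs["name"]] = cs.get("restartCount", 0)
    return {"restarting": restarting, "restartCounts": counts}
```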

Learn More

Pod Lifecycle – official documentation.
KEP‑5532 – Restart All Containers on Container Exits (detailed proposal).
JobSet in‑place restart – discussion in JobSet issue #467.

We Want Your Feedback!

As an alpha feature, RestartAllContainers is ready for experimentation.
Your use‑cases and feedback are welcome. This feature is driven by the SIG Node community.

Get involved:

Slack: #sig-node
Mailing list: (link to SIG Node mailing list)
