Kubernetes v1.35: Introducing Workload Aware Scheduling

Published: December 29, 2025

Source: Kubernetes Blog

Workload‑Centric Scheduling

Scheduling large workloads is far more complex and fragile than scheduling a single Pod. Unlike per‑Pod scheduling, workload scheduling must consider all Pods together, taking into account their relationships and placement constraints.

Why Workload Scheduling Matters

  • Strategic placement – For a machine‑learning batch job, workers often need to be co‑located (e.g., on the same rack) to minimize latency and maximize throughput.
  • Identical Pods – Pods that belong to the same workload are usually identical from a scheduling perspective, which changes the scheduling algorithm’s assumptions.
  • Growing demand – As AI and data‑intensive workloads proliferate, efficient workload scheduling becomes a critical requirement for Kubernetes users.

Current Landscape

Many custom schedulers exist to handle workload‑level decisions, but they are external to the core kube-scheduler. This fragmentation leads to:

  1. Inconsistent user experience – Operators must learn and maintain separate scheduling components.
  2. Limited integration – Custom schedulers cannot easily leverage the built‑in features of the default scheduler (e.g., priority, preemption, or extender APIs).
  3. Operational overhead – Deploying, monitoring, and upgrading additional schedulers adds complexity.

Call to Action

Given the importance of workload scheduling—especially in the AI era—it is time to:

  • Elevate workloads to first‑class citizens in the native kube-scheduler.
  • Expose native APIs that allow users to declare workload‑level constraints (rack affinity, spread policies, etc.).
  • Leverage existing scheduler extensions (e.g., framework plugins) to implement workload‑aware logic without abandoning the core scheduler.

By integrating workload‑centric capabilities directly into kube-scheduler, Kubernetes can provide a unified, reliable, and extensible scheduling experience for both traditional and emerging AI workloads.

Workload‑Aware Scheduling

Kubernetes v1.35 introduces the first tranche of workload‑aware scheduling improvements. These changes are part of a broader, multi‑SIG effort that will evolve across several releases, aiming toward the north‑star goal of seamless workload scheduling and management in Kubernetes—including, but not limited to, preemption and autoscaling.

Key Highlights

  • Workload API – A new API that lets you describe both the desired shape and the scheduling‑oriented requirements of a workload.
  • Gang Scheduling (initial implementation) – Instructs the kube-scheduler to schedule a set of Pods all‑or‑nothing, ensuring that a gang either runs completely or not at all.
  • Opportunistic Batching – Improves the scheduling speed of identical Pods (typically forming a gang) by batching them opportunistically.

What This Means for Users

  • Predictable deployments – Define the exact number and characteristics of Pods a workload needs, and the scheduler will honor those constraints.
  • Reduced scheduling latency – Identical Pods are grouped and scheduled together, cutting down the time to reach a ready state.
  • Foundation for future features – This release lays the groundwork for more advanced capabilities such as sophisticated preemption, autoscaling, and cross‑cluster workload placement.

Note: The functionality will continue to expand in upcoming releases and across additional SIGs, so keep an eye on the Kubernetes release notes for the latest enhancements.

Workload API

The new Workload API resource belongs to the scheduling.k8s.io/v1alpha1 API group.

It provides a structured, machine‑readable definition of the scheduling requirements for a multi‑Pod application. While user‑facing workloads such as Jobs describe what to run, a Workload describes how a group of Pods should be scheduled and how its placement is managed throughout its lifecycle.

What a Workload Lets You Do

  • Define a group of Pods (a podGroup).
  • Apply a scheduling policy to that group (e.g., gang scheduling, priority, etc.).

Example: Gang‑Scheduling Configuration

The snippet below creates a Workload that defines a pod group named workers and applies a gang policy requiring at least four Pods to be scheduled together.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: some-ns
spec:
  podGroups:
    - name: workers
      policy:
        gang:
          # The gang is schedulable only if 4 pods can run at once
          minCount: 4

Linking Pods to the Workload

When you create the individual Pods, reference the Workload (and the specific pod group) via the workloadRef field:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: some-ns
spec:
  workloadRef:
    name: training-job-workload   # Workload name
    podGroup: workers            # Pod group defined in the Workload
  # ... other pod spec fields ...

With this setup, the scheduler will place the Pods only when the gang‑scheduling constraints are satisfied, ensuring that the entire group can run together.
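
In practice, the gang members are usually created by a controller rather than one by one. Below is a minimal sketch of a Job that does this, assuming workloadRef is accepted in pod templates like any other pod spec field (the Job name and image are hypothetical):

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
  namespace: some-ns
spec:
  completions: 4
  parallelism: 4   # matches the gang's minCount, so all members are created together
  template:
    spec:
      workloadRef:
        name: training-job-workload   # the Workload defined earlier
        podGroup: workers
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/trainer:latest   # hypothetical image

With parallelism equal to minCount, the scheduler sees the whole gang as pending at once and can admit it in a single step.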

How Gang Scheduling Works

The gang policy enforces all‑or‑nothing placement. Without gang scheduling, a Job can be partially scheduled, consuming resources without being able to run. This leads to resource waste and potential deadlocks.

Lifecycle of a Gang‑Scheduled Pod Group

  1. Pod creation – Pods (or a controller that creates them) are marked as part of a gang‑scheduled pod group.
  2. Scheduler handling – The scheduler’s GangScheduling plugin manages the lifecycle for each pod group (or replica key).
  3. Blocking phase – Pods are blocked from being scheduled until all of the following conditions are met:
    • The referenced Workload object exists.
    • The referenced PodGroup exists inside that Workload.
    • The number of pending Pods in the group reaches the configured minCount.

Permit Gate

Once the minCount is satisfied, the scheduler attempts to place the Pods, but they do not bind to nodes immediately. Instead, they wait at a Permit gate:

  • The scheduler checks whether it has found valid assignments for the entire group (at least minCount).
  • If the group fits: the gate opens and all Pods are bound to their nodes in a single step.
  • If only a subset fits when the timeout expires (default: 5 minutes): the scheduler rejects all Pods in the group. The rejected Pods return to the queue, freeing the reserved resources for other workloads.

Future Directions

This is the first implementation of gang scheduling in Kubernetes. Upcoming releases aim to:

  • Provide a single‑cycle scheduling phase for the whole gang.
  • Add workload‑level preemption.
  • Introduce additional features that move toward the long‑term “north‑star” goal for coordinated scheduling.

Key take‑away: gang scheduling guarantees that a set of Pods either runs together or not at all, preventing partial allocation and improving overall cluster efficiency.

Opportunistic Batching

Kubernetes v1.35 introduces opportunistic batching (a Beta feature), which improves scheduling latency for identical Pods. Unlike gang scheduling, it requires neither an explicit opt‑in nor the Workload API. The scheduler opportunistically reuses feasibility calculations for Pods that share the same scheduling requirements (container images, resource requests, affinities, etc.), significantly speeding up scheduling.
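
For illustration, the replicas of a single Deployment share one pod template, so they have identical scheduling requirements and are natural candidates for batching. A minimal sketch (the names and image are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: identical-workers
spec:
  replicas: 50
  selector:
    matchLabels:
      app: identical-workers
  template:
    metadata:
      labels:
        app: identical-workers
    spec:
      containers:
        - name: worker
          image: registry.example.com/worker:latest   # hypothetical image
          resources:
            requests:
              # All 50 Pods share the same image, requests, and (absent)
              # affinities, so feasibility results can be reused across them.
              cpu: "1"
              memory: 1Gi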

Note: Most users will benefit automatically, provided their Pods meet the criteria listed below.

Restrictions

Opportunistic batching works only when all fields used by the kube-scheduler to find a placement are identical between Pods. Certain scheduler features disable batching for correctness.

  • Review your kube-scheduler configuration to ensure it isn’t implicitly disabling batching.
  • See the official Kubernetes documentation on opportunistic batching for a complete list of restrictions.

The North Star Vision

The project has a broad ambition to deliver workload‑aware scheduling. These new APIs and scheduling enhancements are just the first steps.

Near‑Future Goals

  • Introduce a workload scheduling phase
  • Improve support for multi‑node Dynamic Resource Allocation (DRA) and topology‑aware scheduling
  • Add workload‑level preemption
  • Integrate scheduling and autoscaling more tightly
  • Improve interaction with external workload schedulers
  • Manage placement of workloads throughout their entire lifecycle
  • Support multi‑workload scheduling simulations

Note: The priority and implementation order of these focus areas are subject to change. Stay tuned for further updates.

Getting Started

To try the workload‑aware scheduling improvements, enable the required feature gates and API groups on your cluster components.

1. Workload API

  • Feature gate: GenericWorkload
  • Components: kube-apiserver and kube-scheduler
  • API group: scheduling.k8s.io/v1alpha1 (must be enabled)
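
For example, on a kubeadm‑managed cluster both requirements can be set through extraArgs. A minimal sketch using the kubeadm v1beta3 configuration (adapt it to however you manage component flags):

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "GenericWorkload=true"
    runtime-config: "scheduling.k8s.io/v1alpha1=true"   # enables the v1alpha1 API group
scheduler:
  extraArgs:
    feature-gates: "GenericWorkload=true"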

2. Gang Scheduling

  • Feature gate: GangScheduling
  • Component: kube-scheduler
  • Prerequisite: The Workload API (step 1) must already be enabled.

3. Opportunistic Batching (Beta)

  • Enabled by default in v1.35.
  • To disable, turn off the OpportunisticBatching feature gate on kube-scheduler.
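
For local experimentation, a kind cluster can enable everything from the three steps above in one file. A minimal sketch of a kind configuration (OpportunisticBatching is listed only to make the default explicit):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
featureGates:
  GenericWorkload: true        # step 1: Workload API (kube-apiserver and kube-scheduler)
  GangScheduling: true         # step 2: gang scheduling (kube-scheduler)
  OpportunisticBatching: true  # step 3: Beta, already on by default in v1.35
runtimeConfig:
  "scheduling.k8s.io/v1alpha1": "true"

Create the cluster with kind create cluster --config pointing at this file, then apply the Workload and Pod manifests from the examples above.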

We encourage you to experiment with workload‑aware scheduling in your test clusters and share your experiences. Your feedback helps shape the future of Kubernetes scheduling.
