Scaling ML Training on Kubernetes with JobSet
- Abhimanyu Saharan
As machine learning models and high-performance computing (HPC) jobs scale across thousands of nodes and accelerators, Kubernetes has increasingly become the platform of choice for orchestrating these workloads. However, running distributed training jobs using only core Kubernetes APIs still presents challenges around pod orchestration, startup sequencing, failure handling, and topology-aware scheduling.
To address these challenges, the Kubernetes community has introduced JobSet—a new open-source API designed to model and manage distributed batch workloads natively within Kubernetes.
Why JobSet?
Large-scale ML workloads, such as training LLMs across thousands of GPUs or TPUs, often require coordinated orchestration of different types of pods (drivers, workers, parameter servers), intelligent placement across network topologies, and robust lifecycle management. While Kubernetes Jobs and the batch APIs have made significant progress (e.g., indexed jobs, pod failure policies), they still fall short for complex distributed training use cases.
The existing ecosystem is fragmented. Framework-specific operators like TFJob, PyTorchJob, and MPIJob exist under projects such as Kubeflow, but they lack a unified API surface and consistent semantics across workloads. JobSet addresses this fragmentation by providing a generic, extensible, and framework-agnostic model for distributed jobs.
What is JobSet?
JobSet defines a higher-level abstraction over Kubernetes Jobs, grouping them into ReplicatedJobs with unified configuration and coordination logic. A JobSet is essentially a collection of jobs that are part of a single distributed workload, allowing each component (e.g., a driver or worker group) to have its own pod template, restart policy, and scheduling constraints.
Key features include:
- ReplicatedJobs: Declarative replication of Jobs across accelerator islands, enabling workload sharding across racks, slices, or zones.
- Startup Sequencing: Supports patterns like “driver-first” (e.g., Ray) or “worker-first” (e.g., MPI) startup ordering; a driver-first layout is sketched after this list.
- Automatic Inter-Pod Communication: Built-in headless service creation with lifecycle management.
- Success & Failure Policies: Granular control over when a JobSet is marked complete or retried.
- Exclusive Topology Placement: Enforces that replicated jobs are scheduled exclusively to separate network or compute domains (e.g., one per rack or TPU slice).
- Integration with Kueue: JobSet is compatible with Kueue for job queuing, oversubscription, and multi-tenant workload prioritization.
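To make these features concrete, here is a minimal sketch of a JobSet that starts a driver before its workers, declares success and failure policies, and is submitted to a Kueue queue. The queue name ml-queue, the images, and the commands are placeholders, and the startupPolicy and successPolicy fields shown reflect the v1alpha2 API as I understand it; check the JobSet documentation for the exact schema in your version.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-workers
  labels:
    kueue.x-k8s.io/queue-name: ml-queue    # hypothetical Kueue LocalQueue
spec:
  startupPolicy:
    startupPolicyOrder: InOrder    # create replicated jobs in the order listed (driver first)
  successPolicy:
    operator: All                  # the JobSet completes once every target job succeeds
    targetReplicatedJobs:
    - workers
  failurePolicy:
    maxRestarts: 2                 # recreate child Jobs up to twice before failing the JobSet
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox
              command: ["sh", "-c", "echo coordinating && sleep 300"]
  - name: workers
    replicas: 2                    # two worker groups, each managed as its own Job
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox
              command: ["sh", "-c", "echo training && sleep 300"]

With this layout, the driver Job is created first, the workers follow, and the JobSet is only marked complete once the workers replicated job finishes.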
Example: Running JAX on TPUs with JobSet
Here’s a simplified JobSet spec to run a distributed training job across 4 TPU slices using JAX:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Schedule each child Job exclusively onto its own GKE node pool (one per TPU slice)
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3          # recreate the child Jobs up to 3 times on failure
  replicatedJobs:
  - name: workers
    replicas: 4             # one Job per TPU slice
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            restartPolicy: Never   # Job pod templates must use Never or OnFailure
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
This example highlights how JobSet can easily manage workloads requiring:
- Multiple pod templates
- Cluster-level exclusive placement policies
- Built-in communication and retries
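Assuming the manifest is saved as multislice.yaml, applying it with kubectl apply -f multislice.yaml should create four child Jobs, one per TPU slice, each pinned to a separate GKE node pool by the exclusive-topology annotation, and overall progress can be watched with kubectl get jobsets. JobSet also creates a headless service spanning the pods (named after the JobSet by default), so workers can discover one another over DNS without any extra Service objects; the exact naming conventions may vary between JobSet versions, so treat these details as illustrative.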
Conclusion
JobSet represents a meaningful step forward in Kubernetes' ability to support modern, large-scale distributed workloads—especially in the domains of machine learning, scientific computing, and data-intensive training pipelines. By building on top of native Kubernetes primitives and integrating with tools like Kueue, JobSet provides both flexibility and consistency without locking users into framework-specific abstractions.
If you're managing complex multi-pod training jobs, coordinating compute across GPU or TPU clusters, or simply looking for a clean way to express distributed job topologies, JobSet offers a production-ready foundation. As the community iterates on the API and ecosystem, this project is well worth evaluating for your next-generation ML or HPC workload architecture.