Scaling ML Training on Kubernetes with JobSet
- Abhimanyu Saharan
As machine learning models and high-performance computing (HPC) jobs scale across thousands of nodes and accelerators, Kubernetes has increasingly become the platform of choice for orchestrating these workloads. However, running distributed training jobs using only core Kubernetes APIs still presents challenges around pod orchestration, startup sequencing, failure handling, and topology-aware scheduling.
To address these challenges, the Kubernetes community has introduced JobSet—a new open-source API designed to model and manage distributed batch workloads natively within Kubernetes.
Why JobSet?
Large-scale ML workloads, such as training LLMs across thousands of GPUs or TPUs, often require coordinated orchestration of different types of pods (drivers, workers, parameter servers), intelligent placement across network topologies, and robust lifecycle management. While Kubernetes Jobs and the batch APIs have made significant progress (e.g., indexed jobs, pod failure policies), they still fall short for complex distributed training use cases.
The existing ecosystem is fragmented. Framework-specific operators like TFJob, PyTorchJob, and MPIJob exist under projects such as Kubeflow, but they lack a unified API surface and consistent semantics across workloads. JobSet addresses this fragmentation by providing a generic, extensible, and framework-agnostic model for distributed jobs.
What is JobSet?
JobSet defines a higher-level abstraction over Kubernetes Jobs, grouping them into ReplicatedJobs with unified configuration and coordination logic. A JobSet is essentially a collection of jobs that are part of a single distributed workload, allowing each component (e.g., a driver or worker group) to have its own pod template, restart policy, and scheduling constraints.
Key features include:
- ReplicatedJobs: Declarative replication of Jobs across accelerator islands, enabling workload sharding across racks, slices, or zones.
- Startup Sequencing: Supports patterns like “driver-first” (e.g., Ray) or “worker-first” (e.g., MPI) startup ordering; a driver-first layout is sketched after this list.
- Automatic Inter-Pod Communication: Built-in headless service creation with lifecycle management.
- Success & Failure Policies: Granular control over when a JobSet is marked complete or retried.
- Exclusive Topology Placement: Enforces that replicated jobs are scheduled exclusively to separate network or compute domains (e.g., one per rack or TPU slice).
- Integration with Kueue: JobSet is compatible with Kueue for job queuing, oversubscription, and multi-tenant workload prioritization.
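To make these features concrete, here is a minimal sketch of a JobSet that starts a driver before its workers, declares success and failure policies, and is submitted to a Kueue queue. The queue name ml-queue, the images, and the commands are placeholders, and the startupPolicy and successPolicy fields shown reflect the v1alpha2 API as I understand it; check the JobSet documentation for the exact schema in your version.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: driver-workers
  labels:
    kueue.x-k8s.io/queue-name: ml-queue    # hypothetical Kueue LocalQueue
spec:
  startupPolicy:
    startupPolicyOrder: InOrder    # create replicated jobs in the order listed (driver first)
  successPolicy:
    operator: All                  # the JobSet completes once every target job succeeds
    targetReplicatedJobs:
    - workers
  failurePolicy:
    maxRestarts: 2                 # recreate child Jobs up to twice before failing the JobSet
  replicatedJobs:
  - name: driver
    replicas: 1
    template:
      spec:
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: driver
              image: busybox
              command: ["sh", "-c", "echo coordinating && sleep 300"]
  - name: workers
    replicas: 2                    # two worker groups, each managed as its own Job
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            restartPolicy: Never
            containers:
            - name: worker
              image: busybox
              command: ["sh", "-c", "echo training && sleep 300"]

With this layout, the driver Job is created first, the workers follow, and the JobSet is only marked complete once the workers replicated job finishes.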
Example: Running JAX on TPUs with JobSet
Here’s a simplified JobSet spec to run a distributed training job across 4 TPU slices using JAX:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice
  annotations:
    # Schedule each child Job exclusively onto its own GKE node pool (one per TPU slice)
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 3          # recreate the child Jobs up to 3 times on failure
  replicatedJobs:
  - name: workers
    replicas: 4             # one Job per TPU slice
    template:
      spec:
        parallelism: 2
        completions: 2
        template:
          spec:
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            restartPolicy: Never   # Job pod templates must use Never or OnFailure
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
              cloud.google.com/gke-tpu-topology: 2x4
            containers:
            - name: jax-tpu
              image: python:3.8
              command:
              - bash
              - -c
              - |
                pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                python -c 'import jax; print("Global device count:", jax.device_count())'
                sleep 60
              resources:
                limits:
                  google.com/tpu: 4
This example highlights how JobSet can easily manage workloads requiring:
- Multiple pod templates
- Cluster-level exclusive placement policies
- Built-in communication and retries
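Assuming the manifest is saved as multislice.yaml, applying it with kubectl apply -f multislice.yaml should create four child Jobs, one per TPU slice, each pinned to a separate GKE node pool by the exclusive-topology annotation, and overall progress can be watched with kubectl get jobsets. JobSet also creates a headless service spanning the pods (named after the JobSet by default), so workers can discover one another over DNS without any extra Service objects; the exact naming conventions may vary between JobSet versions, so treat these details as illustrative.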
Conclusion
JobSet represents a meaningful step forward in Kubernetes' ability to support modern, large-scale distributed workloads—especially in the domains of machine learning, scientific computing, and data-intensive training pipelines. By building on top of native Kubernetes primitives and integrating with tools like Kueue, JobSet provides both flexibility and consistency without locking users into framework-specific abstractions.
If you're managing complex multi-pod training jobs, coordinating compute across GPU or TPU clusters, or simply looking for a clean way to express distributed job topologies, JobSet offers a production-ready foundation. As the community iterates on the API and ecosystem, this project is well worth evaluating for your next-generation ML or HPC workload architecture.