Scaling ML Training on Kubernetes with JobSet

Read time: 3 minutes
Abhimanyu Saharan

As machine learning models and high-performance computing (HPC) jobs scale across thousands of nodes and accelerators, Kubernetes is increasingly becoming the platform of choice for orchestrating these workloads. However, running distributed training jobs using only core Kubernetes APIs still presents challenges around pod orchestration, startup sequencing, failure handling, and topology-aware scheduling.

To address these challenges, the Kubernetes community has introduced JobSet—a new open-source API designed to model and manage distributed batch workloads natively within Kubernetes.

Why JobSet?

Large-scale ML workloads, such as training LLMs across thousands of GPUs or TPUs, often require coordinated orchestration of different types of pods (drivers, workers, parameter servers), intelligent placement across network topologies, and robust lifecycle management. While Kubernetes Jobs and the batch APIs have made significant progress (e.g., indexed jobs, pod failure policies), they still fall short for complex distributed training use cases.

The existing ecosystem is fragmented. Framework-specific operators like TFJob, PyTorchJob, and MPIJob exist under projects such as Kubeflow, but they lack a unified API surface and consistent semantics across workloads. JobSet solves this fragmentation by providing a generic, extensible, and framework-agnostic model for distributed jobs.

What is JobSet?

JobSet defines a higher-level abstraction over Kubernetes Jobs, grouping them into ReplicatedJobs with unified configuration and coordination logic. A JobSet is essentially a collection of jobs that are part of a single distributed workload, allowing each component (e.g., a driver or worker group) to have its own pod template, restart policy, and scheduling constraints.
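
For instance, a driver-plus-workers workload can be expressed as two ReplicatedJobs inside a single JobSet. The sketch below is a minimal illustration (the names, images, and commands are placeholders, not taken from the JobSet docs):

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: train-sketch
    spec:
      replicatedJobs:
      - name: driver              # one coordinator Job with its own pod template
        replicas: 1
        template:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: driver
                  image: python:3.11   # placeholder image
                  command: ["python", "-c", "print('driver up')"]
      - name: workers             # independently replicated worker group
        replicas: 4
        template:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: worker
                  image: python:3.11
                  command: ["python", "-c", "print('worker up')"]

Each ReplicatedJob stamps out child Jobs from its own template, so the driver and workers can differ in image, resources, and scheduling constraints while still being managed, and garbage-collected, as one unit.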

Key features include:

  • ReplicatedJobs: Declarative replication of Jobs across accelerator islands, enabling workload sharding across racks, slices, or zones.
  • Startup Sequencing: Supports startup ordering patterns such as “driver-first” (e.g., Ray) or “worker-first” (e.g., MPI).
  • Automatic Inter-Pod Communication: Built-in headless service creation with lifecycle management.
  • Success & Failure Policies: Granular control over when a JobSet is marked complete or retried (see the sketch after this list).
  • Exclusive Topology Placement: Enforces that replicated jobs are scheduled exclusively to separate network or compute domains (e.g., one per rack or TPU slice).
  • Integration with Kueue: JobSet is compatible with Kueue for job queuing, oversubscription, and multi-tenant workload prioritization.
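
Several of these controls live directly on the JobSet spec. The following is a minimal sketch (the job names, images, and the Kueue queue name team-queue are placeholders) combining driver-first startup ordering, a success policy keyed to the workers, and a Kueue queue label:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: policies-sketch
      labels:
        kueue.x-k8s.io/queue-name: team-queue   # submit through a Kueue LocalQueue (assumed to exist)
    spec:
      startupPolicy:
        startupPolicyOrder: InOrder   # start replicated jobs in list order: driver before workers
      successPolicy:
        operator: All                 # the JobSet succeeds once all targeted jobs succeed
        targetReplicatedJobs:
        - workers
      failurePolicy:
        maxRestarts: 2                # recreate failed child jobs up to 2 times
      replicatedJobs:
      - name: driver
        replicas: 1
        template:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: driver
                  image: busybox
                  command: ["sh", "-c", "echo driver ready && sleep 30"]
      - name: workers
        replicas: 2
        template:
          spec:
            template:
              spec:
                restartPolicy: Never
                containers:
                - name: worker
                  image: busybox
                  command: ["sh", "-c", "echo worker done"]

For inter-pod communication, JobSet creates the headless Service automatically; per the project docs, each pod gets a stable DNS name of the form <jobset>-<replicatedjob>-<job-index>-<pod-index>.<subdomain>, with the subdomain defaulting to the JobSet name.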

Example: Running JAX on TPUs with JobSet

Here’s a simplified JobSet spec to run a distributed training job across 4 TPU slices using JAX:

    apiVersion: jobset.x-k8s.io/v1alpha2
    kind: JobSet
    metadata:
      name: multislice
      annotations:
        # schedule each child Job exclusively onto its own node pool (one per TPU slice)
        alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
    spec:
      failurePolicy:
        maxRestarts: 3
      replicatedJobs:
      - name: workers
        replicas: 4                # one child Job per TPU slice
        template:
          spec:
            parallelism: 2         # one pod per TPU VM host in a 2x4 v5e slice
            completions: 2
            template:
              spec:
                restartPolicy: Never   # Jobs require Never or OnFailure; retries are driven by the failurePolicy above
                hostNetwork: true
                dnsPolicy: ClusterFirstWithHostNet
                nodeSelector:
                  cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
                  cloud.google.com/gke-tpu-topology: 2x4
                containers:
                - name: jax-tpu
                  image: python:3.8
                  command:
                  - bash
                  - -c
                  - |
                    pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
                    python -c 'import jax; print("Global device count:", jax.device_count())'
                    sleep 60
                  resources:
                    limits:
                      google.com/tpu: 4   # TPU chips per host

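Assuming the manifest above is saved as multislice.yaml (the filename is illustrative), applying it and inspecting the child resources looks roughly like the following; per the project docs, JobSet labels child objects with jobset.sigs.k8s.io/jobset-name, which makes them easy to select:

    kubectl apply -f multislice.yaml
    kubectl get jobset multislice
    # expect 4 child Jobs (one per slice) and 8 pods (2 per slice)
    kubectl get jobs -l jobset.sigs.k8s.io/jobset-name=multislice
    kubectl get pods -l jobset.sigs.k8s.io/jobset-name=multislice
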
This example highlights how JobSet can easily manage workloads requiring:

  • Multiple pod templates
  • Cluster-level exclusive placement policies
  • Built-in communication and retries

Conclusion

JobSet represents a meaningful step forward in Kubernetes' ability to support modern, large-scale distributed workloads, especially in the domains of machine learning, scientific computing, and data-intensive training pipelines. By building on top of native Kubernetes primitives and integrating with tools like Kueue, JobSet provides both flexibility and consistency without locking users into framework-specific abstractions.

If you're managing complex multi-pod training jobs, coordinating compute across GPU or TPU clusters, or simply looking for a clean way to express distributed job topologies, JobSet offers a production-ready foundation. As the community iterates on the API and ecosystem, this project is well worth evaluating for your next-generation ML or HPC workload architecture.
