Nebius, a leading AI infrastructure company, is excited to announce the open-source release of Soperator, the world’s first fully featured Kubernetes operator for Slurm, designed to optimize workload management and orchestration in modern machine-learning (ML) and high-performance computing (HPC) environments.
Nebius developed Soperator to merge the power of Slurm, a job orchestrator designed to manage large-scale HPC clusters, with Kubernetes’ flexible and scalable container orchestration. It simplifies operations and delivers efficient job scheduling in compute-intensive environments, particularly for GPU-heavy workloads, making it well suited to ML training and distributed computing tasks.
Narek Tatevosyan, Director of Product Management for the Nebius Cloud Platform, said:
“Nebius is rebuilding cloud for the AI age by responding to the challenges that we know AI and ML professionals are facing. Currently there is no workload orchestration product on the market that is specialized for GPU-heavy workloads. By releasing Soperator as an open-source solution, we aim to put a powerful new tool into the hands of the ML and HPC communities.
“We are strong believers in community-driven innovation, and our team has a strong track record of open-sourcing innovative products. We’re excited to see how this technology will continue to evolve and enable AI professionals to focus on enhancing their models and building new products.”
Danila Shtan, Chief Technology Officer at Nebius, added:
“By open-sourcing Soperator, we’re not just releasing a tool – we’re standing by our commitment to open-source innovation in an industry where many keep their solutions proprietary. We’re pushing for a cloud-native approach to traditionally conservative HPC workloads, modernizing workload orchestration for GPU-intensive tasks. This strategic initiative reflects our dedication to fostering community collaboration and advancing AI and HPC technologies globally.”
Key features of Soperator include:
- Enhanced scheduling and orchestration: Soperator provides precise workload distribution across large compute clusters, optimizing GPU resource usage and enabling parallel job execution. This minimizes idle GPU capacity, optimizes costs, and facilitates more efficient collaboration, making it a crucial tool for teams working on large-scale ML projects.
- Fault-tolerant training: Soperator includes a hardware health check mechanism that monitors GPU status and automatically reallocates resources when hardware issues occur. This improves training stability even in highly distributed environments and reduces the GPU hours required to complete a training job.
- Simplified cluster management: By sharing a root file system across all cluster nodes, Soperator eliminates the challenge of maintaining identical states across multi-node installations. Together with the Terraform operator, this simplifies the user experience, allowing ML teams to focus on their core tasks without the need for extensive DevOps expertise.
Planned enhancements include improvements to security, stability, scalability and node management, as well as updates to keep pace with emerging software and hardware.
The first public release of Soperator is available from today as an open-source solution to all ML and HPC professionals on the Nebius GitHub, along with relevant deployment tools and packages. Nebius also invites anyone running ML training or HPC calculations on multi-node GPU installations to try out the solution; the company’s solution architects are ready to provide assistance and guidance through the installation and deployment process in the Nebius environment.