Now that we have successfully released the new Aion supercomputer, and before leaving my current position at the University, I wanted to share the lessons learned from these long HPC developments (and to make sure this work is later credited to the people at its origin).

In complement to the ACM PEARC’22 article, the paper “Optimizing the Resource and Job Management System of an Academic HPC & Research Computing Facility” is dedicated to the design, validation and analysis of the new Slurm configuration put in place on the ULHPC facilities upon the integration of the Aion supercomputer [1]. It is the second in this series of accepted publications; more information on the third one will follow.

The paper was presented at the IEEE ISPDC’22 conference (21st IEEE International Symposium on Parallel and Distributed Computing), held in Basel, Switzerland, on July 11-13, 2022. As I was in Boston attending the PEARC conference at that time, Emmanuel presented our paper on my behalf.

  1. S. Varrette, E. Kieffer, and F. Pinel, “Optimizing the Resource and Job Management System of an Academic HPC and Research Computing Facility,” in 21st IEEE Intl. Symp. on Parallel and Distributed Computing (ISPDC’22), Basel, Switzerland, 2022.

Here are the slides I prepared for the conference:

   Optimizing the RJMS of an Academic HPC & Research Computing Facility

Abstract:

High Performance Computing (HPC) is nowadays a strategic asset required to sustain the surging demands for massive processing and data-analytic capabilities. In practice, the effective management of such large-scale and distributed computing infrastructures is left to a Resource and Job Management System (RJMS). This essential middleware component is responsible for managing the computing resources, handling user requests to allocate resources while providing an optimized framework for starting, executing and monitoring jobs on the allocated resources. The University of Luxembourg has been operating a large academic HPC facility for 15 years, which since 2017 has relied on the Slurm RJMS introduced on top of the flagship cluster Iris. The acquisition of a new liquid-cooled supercomputer named Aion, released in 2021, was the occasion to deeply review and optimize the seminal Slurm configuration, the defined resource limits and the sustaining fair-share algorithm. This paper presents the outcomes of this study and details the implemented RJMS policy. The impact of the decisions made on the supercomputers' workloads is also described. In particular, the performance evaluation conducted highlights that, when compared to the seminal configuration, the described and implemented environment brought concrete and measurable improvements with regard to the platform utilization (+12.64%), the job efficiency (as measured by the average Wall-time Request Accuracy, improved by 110.81%) or the management and funding (increased by 10%). The systems demonstrated sustainable and scalable HPC performance, and this effort has led to a negligible penalty on the average slowdown metric (response time normalized by runtime), which increased by 0.59% for job workloads covering a complete year of operation. Overall, this new setup has been in production for 18 months on both supercomputers, and the updated model proves to bring a fairer and more satisfying experience to the end users. The proposed configurations and policies may help other HPC centres when designing or improving the RJMS sustaining their job scheduling strategy ahead of computing capacity expansions.
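
For readers unfamiliar with the two job-level indicators mentioned in the abstract, the short Python sketch below shows one plausible way to compute the Wall-time Request Accuracy and the slowdown from basic accounting records. The Job fields and the sample values are hypothetical illustrations, not data or code from the paper.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Job:
    # Hypothetical job record; field names are illustrative, not from the paper.
    requested_walltime: float  # seconds requested at submission
    runtime: float             # seconds actually used
    wait_time: float           # seconds spent waiting in the queue

def walltime_request_accuracy(job: Job) -> float:
    """Fraction of the requested wall-time actually used (1.0 = perfect estimate)."""
    return job.runtime / job.requested_walltime

def slowdown(job: Job) -> float:
    """Response time (wait + run) normalized by the runtime."""
    return (job.wait_time + job.runtime) / job.runtime

# Toy workload of two jobs, purely for illustration.
jobs = [
    Job(requested_walltime=7200,  runtime=5400, wait_time=600),
    Job(requested_walltime=86400, runtime=4300, wait_time=120),
]

print(f"Average wall-time request accuracy: {mean(map(walltime_request_accuracy, jobs)):.2%}")
print(f"Average slowdown: {mean(map(slowdown, jobs)):.2f}")
```

Intuitively, a higher average request accuracy means users ask for wall-times closer to what their jobs really need (which helps the backfill scheduler), while a slowdown close to 1 means jobs spend little time waiting relative to their runtime.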