Now that we have successfully released the new Aion supercomputer, and before leaving my current position at the University, I wanted to share the experience gained from these long HPC developments (and to ensure that they are later credited to the people at the origin of this work).
The article “Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience”, presented at the ACM PEARC’22 conference (Practice and Experience in Advanced Research Computing) in Boston, USA on July 13, 2022, is the first in this series of accepted publications tied to the Aion deployment. It focuses on the interconnect part (both InfiniBand and Ethernet).
- S. Varrette, H. Cartiaux, T. Valette, and A. Olloh, “Aggregating and Consolidating two High Performant Network Topologies: The ULHPC Experience,” in ACM Practice and Experience in Advanced Research Computing (PEARC’22), Boston, USA, 2022.
Here are the slides I presented during this [excellent] conference in Boston.
High Performance Computing (HPC) encompasses advanced computation over parallel processing. The execution time of a given simulation depends upon many factors, such as the number of CPU/GPU cores, their utilisation factor and, of course, the interconnect performance, efficiency, and scalability. In practice, this last component and the associated topology remain the most significant differentiators between HPC systems and less performant systems. The University of Luxembourg has operated since 2007 a large academic HPC facility, which remains one of the reference implementations within the country and offers a cutting-edge research infrastructure to Luxembourg public research. The main high-bandwidth, low-latency network of the facility relies on the dominant interconnect technology in the HPC market, i.e., InfiniBand (IB), over a Fat-Tree topology. It is complemented by an Ethernet-based network defined for management tasks, external access and interactions with user applications that do not support InfiniBand natively. The recent acquisition of a new cutting-edge supercomputer, Aion, which was federated with the previous flagship cluster, Iris, was the occasion to aggregate and consolidate the two types of networks.
This article depicts the architecture and the solutions designed to expand and consolidate the existing networks beyond their initial capacity limits while preserving their bisection bandwidth as far as possible. At the IB level, and despite moving away from a non-blocking configuration, the proposed approach defines a blocking topology that maintains the previous Fat-Tree height. The leaf connection capacity is more than tripled (moving from 216 to 672 end-points) while exhibiting very marginal penalties, i.e. less than 3% (resp. 0.3%) Read (resp. Write) bandwidth degradation against reference parallel I/O benchmarks, and a stable and sustainable point-to-point bandwidth efficiency among all possible pairs.
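To make the capacity figures above concrete, here is a minimal Python sketch of the arithmetic involved. The end-point counts (216 and 672) come from the abstract. The helper names and the port split used for the blocking-factor example are hypothetical illustrations, not the actual ULHPC switch configuration.

```python
def growth_factor(before: int, after: int) -> float:
    """Ratio between the new and the old leaf connection capacity."""
    return after / before


def blocking_factor(downlinks: int, uplinks: int) -> float:
    """Oversubscription ratio of a leaf switch.

    1.0 means non-blocking (as many uplinks as downlinks);
    anything above 1.0 means a blocking configuration.
    """
    return downlinks / uplinks


# Figures quoted in the abstract: 216 -> 672 end-points.
ratio = growth_factor(216, 672)
print(f"Leaf capacity growth: x{ratio:.2f}")

# Hypothetical port splits, for illustration only (assumed, not from the paper):
# an even split between downlinks and uplinks is non-blocking, while
# dedicating more ports to end-points makes the topology blocking.
print(blocking_factor(18, 18))  # even split
print(blocking_factor(24, 12))  # more downlinks than uplinks
```

The trade-off this illustrates is the one the abstract describes: giving up the non-blocking property frees leaf ports for end-points without adding a level to the Fat-Tree.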