Finally! It has been a very (too!) long journey, largely delayed by the global pandemic and by the difficulties encountered with the solution providers, yet today we reached a major milestone!

This wouldn’t have been possible without the tremendous efforts of all partners (ULHPC team, Atos, DDN and Mellanox experts, SIU infrastructure and network engineers) to mitigate and solve the complex technical issues and logistical delays that were preventing the production release. A special mention goes to Hyacinthe Cartiaux, Teddy Valette and Abatcha Olloh who, like me, worked continuously through nights and weekends during all this time to reach this major milestone. Deep and sincere thanks to all of them!
Of course, former staff members, either within the HPC team (Valentin Plugaru, Clément Parisot) or the procurement team (Magali Piccolo), who contributed to the initial tender writing should not be forgotten.

Aion Key Facts

The acquisition of a new HPC supercomputer ‘Aion’ was done through a European tender process under the reference RFP 190027 (TED document: TED72/2019-608787) with a budget of 3.5M€. It was composed of 3 lots:

  • Lot 1: Direct Liquid Cooled supercomputer Aion (incl. racks, cooling units etc.)
  • Lot 2: ULHPC storage extension and consolidation
  • Lot 3: Infiniband and Ethernet interconnect consolidation between the two supercomputers

The tender was awarded to Atos.

Aion is a cluster of x86-64 AMD-based compute nodes, built on Atos BullSequana X2410 blades with the following characteristics:

Hostname (#Nodes)        #cores   Type           Processors                                   RAM
aion-[0001-0318] (318)   40704    Regular Epyc   2 AMD Epyc ROME 7H12 @ 2.6 GHz [64c/280W]    256 GB
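
As a quick sanity check on the table, the aggregate core count follows directly from the per-node specification, and the same figures give a theoretical peak performance. The following is a minimal Python sketch under stated assumptions: the value of 16 double-precision Flops per cycle per core is an assumption about the Zen 2 (Rome) core, not a value taken from the table, and all names are illustrative only.

```python
# Values read from the table above
NODES = 318
SOCKETS_PER_NODE = 2        # "2 AMD Epyc ROME 7H12"
CORES_PER_SOCKET = 64       # "[64c/280W]"
BASE_CLOCK_GHZ = 2.6        # "@ 2.6 GHz"

# Assumption (not stated in the table): a Zen 2 core retires
# 16 double-precision Flops per cycle (two 256-bit FMA units).
DP_FLOPS_PER_CYCLE = 16


def total_cores(nodes=NODES, sockets=SOCKETS_PER_NODE, cores=CORES_PER_SOCKET):
    """Aggregate number of cores across the cluster."""
    return nodes * sockets * cores


def rpeak_tflops(cores, clock_ghz=BASE_CLOCK_GHZ, flops_per_cycle=DP_FLOPS_PER_CYCLE):
    """Theoretical peak: cores x GHz x Flops/cycle gives GFlops; /1000 gives TFlops."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


if __name__ == "__main__":
    cores = total_cores()
    print(f"cores: {cores}")
    print(f"Rpeak: {rpeak_tflops(cores):.1f} TFlops")
```

Under these assumptions the core count matches the 40704 listed in the table.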

The storage extension (Lot 2) for our IBM Spectrum Scale/GPFS-based storage is based on a DDN GridScaler/GS7K solution that complements the existing system.

Finally, the fast local interconnect network implemented within Aion relies on the Mellanox Infiniband (IB) HDR [1] technology.

For more details:

   ULHPC Technical documentation on Aion

Aion Computing and Storage Performance Evaluation

Here is a summary of the Aion performance evaluation I performed with Atos and DDN benchmarking experts. I focus here only on the main benchmark results we obtained:

  • Bisection Bandwidth (BB) benchmarks, demonstrating a sustainable 96.99% efficiency for both unidirectional and bidirectional point-to-point IB bandwidth across all computing nodes;

  • STREAM sustainable memory bandwidth performance above 90.01% efficiency for 4 highly memory-intensive access patterns across all computing nodes;

  • High Performance Linpack (HPL) performance over 318 nodes to reach $R_{max}$ = 1255.36 TFlops (74.20% efficiency compared to the theoretical peak performance $R_{peak}$ = 1693 TFlops).
    • with this measure, Aion would have entered the Top500 in June 2020 (as initially planned).
    • the corresponding Green500 evaluation for this large-scale run yielded 5.19 GFlops/W (+12.826% compared to the expected threshold), which would rank Aion 56th in the June 2021 Green500 list
  • High Performance Conjugate Gradients (HPCG) performance of 16.842 TFlops for the best full-cluster (318 nodes) run (+15.35% compared to the threshold); a GreenHPCG-oriented, energy-optimized run reached an HPCG energy efficiency of 0.0798 GFlops/W (+59.64% improvement).
    • this would rank Aion #110 in the latest HPCG list
  • Graph500 for the challenging Breadth-First Search (BFS) kernel (scale 36, edge factor 16) to reach 975 GTEPS
    • that would rank Aion #23 in the latest June 2021 Graph500 list
    • we also reached 6.14 MTEPS/W for the best full run eligible for the GreenGraph500 list
  • IOR I/O benchmarks demonstrated a doubling of performance on the extended GPFS/SpectrumScale storage solution delivered with Aion
    • Max Read: 22.58 GB/s (was 11.33 GB/s on the previous configuration)
    • Max Write: 19.02 GB/s (was 9.36 GB/s on the previous configuration)

  • IO500, where I obtained the best results with the isc21 release, compiled against the production 2020b software set (mpi/OpenMPI/4.0.5-GCC-10.2.0) and run over 64 Aion nodes / 128 clients, for a score of 11.345219.
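
Several of the figures above are simple ratios of the quoted measurements, so they can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the power draw it derives from the HPL and Green500 figures is an implied estimate, not a value reported in this post. The HPL efficiency computed this way comes out at roughly 74%, and the IOR ratios at roughly 2x.

```python
# Figures quoted in the benchmark summary above
HPL_RMAX_TFLOPS = 1255.36
HPL_RPEAK_TFLOPS = 1693.0
GREEN500_GFLOPS_PER_W = 5.19
# IOR maxima in GB/s: (previous configuration, extended storage)
IOR_GBPS = {"read": (11.33, 22.58), "write": (9.36, 19.02)}


def efficiency_pct(measured, reference):
    """Measured value as a percentage of a reference value."""
    return 100.0 * measured / reference


def speedup(before, after):
    """Ratio of the new measurement to the previous one."""
    return after / before


def implied_power_kw(rmax_tflops, gflops_per_w):
    """Derived estimate: TFlops -> GFlops, divided by GFlops/W gives W, then W -> kW."""
    watts = rmax_tflops * 1000.0 / gflops_per_w
    return watts / 1000.0


if __name__ == "__main__":
    print(f"HPL efficiency: {efficiency_pct(HPL_RMAX_TFLOPS, HPL_RPEAK_TFLOPS):.2f}%")
    for kind, (before, after) in IOR_GBPS.items():
        print(f"IOR {kind} speedup: x{speedup(before, after):.2f}")
    print(f"Implied HPL power draw: {implied_power_kw(HPL_RMAX_TFLOPS, GREEN500_GFLOPS_PER_W):.0f} kW")
```
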
  1. High Data Rate (HDR) – 200 Gb/s throughput with a very low latency, typically below 0.6 $\mu$s. The HDR100 technology allows one 200 Gb/s HDR port (an aggregation of 4x 50 Gb/s) to be divided into 2 HDR100 ports with 100 Gb/s (2x 50 Gb/s) bandwidth each, using an [optical] "splitter" cable.