When the Aion supercomputer was installed, we had to merge two InfiniBand islands, which came with the deployment of the Mellanox OFED stack.
While the installation itself is quite straightforward, compatibility with GPFS/SpectrumScale and Lustre (based on DDN solutions in our case) requires non-standard compilation options, which motivated this blog post. In short: don’t forget the --add-kernel-support --kmp options.
Below are installation notes for Mellanox OFED (MOFED) 5.1-2.5.8.0 on CentOS 7.9 (kernel 3.10.0-1127.19.1.el7), compatible with the Lustre 2.12.5_ddn18 and GPFS/SpectrumScale 4.2.3.24 clients.
TL;DR;
### MOFED install - successfully tested on RHEL 8.3
# Beware of matching kernel
$ uname -r
$ yum info kernel-devel kernel-headers kernel-rpm-macros
$ yum install kernel-devel kernel-headers kernel-rpm-macros
### Pre-requisite packages
yum install createrepo rpm-build gcc elfutils-libelf-devel gdb-headless python36-devel bison flex tcl tk
# Alternative 1: Head/Compute node
./mlnxofedinstall --add-kernel-support --kmp --without-fw-update --hpc --without-openmpi --without-mpi-selector --without-opensm --without-opensm-libs --without-ibdump
# Alternative 2: Minimal
./mlnxofedinstall --without-fw-update --basic --with-opensm --with-opensm-libs --with-ibutils2
# DO NOT FORGET - Rebuilding the initramfs - see also https://access.redhat.com/solutions/365693
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).bak.$(date +%m-%d-%H%M%S).img
dracut -f -v
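The "beware of matching kernel" step above matters because --add-kernel-support builds against the headers of the running kernel. As a minimal sketch (the helper name kernel_devel_matches is hypothetical and not part of MOFED), the check could be automated like this, assuming an RPM-based system where kernel-devel versions follow the same VERSION-RELEASE.ARCH form as `uname -r`:

```shell
#!/usr/bin/env bash
# Sketch only: kernel_devel_matches is a hypothetical helper, not a MOFED tool.

# Reads installed kernel-devel versions (one VERSION-RELEASE.ARCH per line)
# on stdin and succeeds if one of them equals the kernel release given as $1.
kernel_devel_matches() {
    local running="$1"
    # -F: fixed string, -x: whole-line match, -q: quiet (exit status only)
    grep -Fxq -- "$running"
}

# Usage on a real node (requires rpm):
#   rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel \
#     | kernel_devel_matches "$(uname -r)" \
#     || echo "kernel-devel does not match $(uname -r)" >&2
```

If the check fails, install the kernel-devel package matching the running kernel (or reboot into the kernel you have headers for) before launching mlnxofedinstall.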
Preliminaries
Download the appropriate tarball archive MLNX_OFED_LINUX-<version>-rhel7.9-x86_64.tgz from the MLNX_OFED Download Center (adapt the OS accordingly)
Copy it to a shared NFS directory /path/to/shared/dir (neither GPFS nor Lustre, as you’ll have to unmount those)
Uncompress it at the appropriate location:
mofed_version=5.1-2.5.8.0
cp MLNX_OFED_LINUX-${mofed_version}-rhel7.9-x86_64.tgz /path/to/shared/dir # /!\ ADAPT accordingly
cd /path/to/shared/dir
tar xf MLNX_OFED_LINUX-${mofed_version}-rhel7.9-x86_64.tgz
cd MLNX_OFED_LINUX-${mofed_version}-rhel7.9-x86_64/
./mlnxofedinstall -h
Then stop and unmount all storage depending on your IB network:
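The exact commands depend on your site and are not listed in these notes. As a minimal sketch, assuming a node with both Lustre and GPFS clients and the GPFS binaries under /usr/lpp/mmfs/bin, the sequence might look like the following. The helpers run and stop_ib_storage are hypothetical names; by default the function only prints what it would do.

```shell
#!/usr/bin/env bash
# Sketch only: the command list is an assumption, adapt it to your site.

# Print the command by default; execute it only when RUN=1 is set.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

stop_ib_storage() {
    run umount -a -t lustre              # unmount every Lustre file system
    run lustre_rmmod                     # unload the Lustre/LNet kernel modules
    run /usr/lpp/mmfs/bin/mmumount all   # unmount GPFS file systems on this node
    run /usr/lpp/mmfs/bin/mmshutdown     # stop the GPFS daemon on this node
}
```

Call `stop_ib_storage` to review the plan, then `RUN=1 stop_ib_storage` as root to apply it.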
You then need to manually remove a few kernel modules before you can [re]start openibd, and check the result with the openibd service:
rmmod rpcrdma ib_srpt ib_isert i40iw
/etc/init.d/openibd restart
/etc/init.d/openibd stop
# If OK
systemctl status openibd.service
systemctl start openibd.service
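rmmod reports an error for each listed module that is not currently loaded, which can hide a real failure. A minimal sketch, assuming the standard `lsmod` output format (one header line, module name in the first column), filters the list down to the modules actually present. The helper name loaded_among is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: loaded_among is a hypothetical helper.

# Reads lsmod-style output on stdin and prints, one per line, the module
# names given as arguments that are currently loaded.
loaded_among() {
    awk -v wanted="$*" '
        BEGIN {
            n = split(wanted, w, " ")
            for (i = 1; i <= n; i++) want[w[i]] = 1
        }
        # NR > 1 skips the "Module Size Used by" header line
        NR > 1 && ($1 in want) { print $1 }
    '
}

# Usage on a real node (xargs -r is a GNU extension: skip rmmod if empty):
#   lsmod | loaded_among rpcrdma ib_srpt ib_isert i40iw | xargs -r rmmod
```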
Don’t forget to rebuild the initramfs
# DO NOT FORGET - Rebuilding the initramfs - see also https://access.redhat.com/solutions/365693
cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).bak.$(date +%m-%d-%H%M%S).img
dracut -f -v
Lustre client rebuild
Once MOFED has been installed as above, you must rebuild the Lustre client.
cd /path/to/shared/dir
version=2.12.5_ddn18
sudo mkdir lustre-${version}_MOFED-${mofed_version}_$(uname -r)
# Extract Lustre sources into that directory
tar xvzf lustre-client-${version}.tar.gz -C lustre-${version}_MOFED-${mofed_version}_$(uname -r) --strip-components=1
# Lustre client RPMs build
cd lustre-${version}_MOFED-${mofed_version}_$(uname -r)
./configure --disable-server
make rpms
Now reinstall the generated client RPMs and mount the Lustre shares:
rpm -qa | grep lustre
ls --color=never /path/to/shared/dir/lustre-${version}_MOFED-${mofed_version}_$(uname -r)/*lustre*${version}* | grep -vE 'tests|src|debug'| sort
ls --color=never /path/to/shared/dir/lustre-${version}_MOFED-${mofed_version}_$(uname -r)/*lustre*${version}* | grep -vE 'tests|src|debug'| sort | xargs -n1 rpm -ivh --reinstall
systemctl start lnet-peers.service
systemctl status lnet-peers.service
mount -v /mnt/lscratch
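The `grep -vE 'tests|src|debug'` filter used above keeps only the packages needed at runtime. As a small sketch of what it selects, the helper below (select_client_rpms, a hypothetical name) applies the same filter to a list of file names. The RPM names in the usage comment are illustrative, not copied from an actual build.

```shell
#!/usr/bin/env bash
# Sketch only: select_client_rpms is a hypothetical helper.

# Reads RPM file names on stdin, drops test, source and debug packages,
# and prints the remaining names sorted.
select_client_rpms() {
    grep -vE 'tests|src|debug' | sort
}

# Illustrative usage:
#   printf '%s\n' \
#     lustre-client-2.12.5_ddn18-1.el7.x86_64.rpm \
#     lustre-client-tests-2.12.5_ddn18-1.el7.x86_64.rpm \
#     lustre-client-debuginfo-2.12.5_ddn18-1.el7.x86_64.rpm \
#     | select_client_rpms
```

Running the listing first (without `xargs`) before the reinstall is a cheap way to confirm that only the expected packages are selected.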
GPFS Portability layer RPM rebuild and reinstall
Once MOFED has been installed as above, you must rebuild the portability layer of the GPFS/SpectrumScale client.
$ cd /usr/lpp/mmfs/bin
$ version=4.2.3.24
$ mmbuildgpl --build-package
--------------------------------------------------------
mmbuildgpl: Building GPL (4.2.3.24) module begins at Fri May 7 17:39:11 CEST 2021.
--------------------------------------------------------
Verifying Kernel Header...
kernel version=31000999(310001127019001, 3.10.0-1127.19.1.el7.x86_64, 3.10.0-1127.19.1)
module include dir= /lib/modules/3.10.0-1127.19.1.el7.x86_64/build/include
module build dir= /lib/modules/3.10.0-1127.19.1.el7.x86_64/build
kernel source dir= /usr/src/linux-3.10.0-1127.19.1.el7.x86_64/include
Found valid kernel header file under /usr/src/kernels/3.10.0-1127.19.1.el7.x86_64/include
Verifying Compiler...
make is present at /bin/make
cpp is present at /bin/cpp
gcc is present at /bin/gcc
g++ is present at /bin/g++
ld is present at /bin/ld
Verifying rpmbuild...
Verifying Additional System Headers...
Verifying kernel-headers is installed ...
Command: /bin/rpm -q kernel-headers
The required package kernel-headers is installed
make World ...
make InstallImages ...
make rpm ...
Wrote: /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-3.10.0-1127.19.1.el7.x86_64-4.2.3-24.x86_64.rpm
--------------------------------------------------------
mmbuildgpl: Building GPL module completed successfully at Fri May 7 17:39:29 CEST 2021.
--------------------------------------------------------
$ cd /path/to/shared/dir/
$ mkdir ${version}-MOFED-${mofed_version}
$ cp /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-$(uname -r)-*.x86_64.rpm ${version}-MOFED-${mofed_version}/
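Note that the build log above writes the package as gpfs.gplbin-3.10.0-1127.19.1.el7.x86_64-4.2.3-24.x86_64.rpm: the kernel release already carries the .el7.x86_64 suffix, and the GPFS version appears as 4.2.3-24 rather than 4.2.3.24. A minimal sketch for deriving that file name from the two variables follows. The helper name gplbin_rpm_name is hypothetical, and the x86_64 architecture is hard-coded as an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: gplbin_rpm_name is a hypothetical helper; x86_64 is assumed.

# $1: kernel release as printed by `uname -r`
# $2: GPFS version in dotted form, e.g. 4.2.3.24
gplbin_rpm_name() {
    local kernel="$1" version="$2"
    # 4.2.3.24 -> 4.2.3-24: the last dot-separated field becomes the RPM release
    local rpm_version="${version%.*}-${version##*.}"
    printf 'gpfs.gplbin-%s-%s.x86_64.rpm\n' "$kernel" "$rpm_version"
}

# Usage on the build node:
#   ls -l "/root/rpmbuild/RPMS/x86_64/$(gplbin_rpm_name "$(uname -r)" "$version")"
```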
Now reinstall the generated gpfs.gplbin RPMs and start the GPFS service: