Automatic Updates and XFS - Sebastien Varrette, PhD.

Today we suffered from a very weird issue on two adminfront servers (running Red Hat 7). The systems went through an unexpected reboot and then showed the following symptoms:

the systemd-logind service failed to start on boot

  May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to connect to system bus: No such file or directory
  May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to fully start up daemon: No such file or directory

thus many other dependent services (DBus, NetworkManager etc.) failed to start
KVM guests were no longer present, as the libvirtd service was also in a failed state.

Trying to start the failed services led to strange errors:

 $> systemctl start libvirtd.service
 Error getting authority: Error initializing authority: Could not connect: No such file or directory (g-io-error-quark, 1)
 Failed to start libvirtd.service: Unit is masked.

It happens that:

The first message (Error getting authority ...) comes from the fact that polkit is not able to establish a connection with D-Bus, as indicated in this ticket.
The second (Unit is masked) was actually caused by the corresponding service file /lib/systemd/system/libvirtd.service being… 0 bytes in size!!!
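systemd treats an empty unit file the same way as one symlinked to /dev/null, which explains the "masked" message. Other truncated unit files can be spotted with a quick scan. The following is only a minimal sketch, not part of the original investigation: the function name find_empty_files is hypothetical, and the default directory is simply the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print every zero-byte regular file below a directory.
# Defaults to the unit directory mentioned above.
find_empty_files() {
    dir="${1:-/lib/systemd/system}"
    # -size 0 only matches files of exactly 0 bytes
    # (find rounds sizes up to 512-byte blocks, so 1 byte already counts as 1)
    find "$dir" -type f -size 0
}
```

Calling find_empty_files /lib/systemd/system lists the affected unit files, one per line. Symlinks are deliberately skipped by -type f, so units masked on purpose are not reported.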

After further investigations with the UL HPC Team, we understood the problem as follows:

a set of automatic security updates were applied by yum-cron, in particular to update dbus, dbus-libs, libvirt, libvirt-libs, qemu-img…
whether because of the update, the known Meltdown/Spectre CPU Microcode “unexpected reboot” issue or anything else, the server crashed and rebooted
because of an old XFS bug, a couple of files (written upon update?) were truncated (0 bytes for small files), leaving the system in what looked a priori like an unrecoverable state.

Yet here are a few notes on what allowed us to restore the servers without restoring from a backup or reinstalling the root system:

# List files with issues from the installed RPMs with 'rpm -V [...]'
$> rpm -Va | tee rpm_Va.txt

# Clean the list of files only in 'list_files.txt'
$> cp rpm_Va.txt list_files.txt
$> vim list_files.txt     # CTRL-V for rectangular selection -- d to delete selection

# Extract the list of packages in 'list_packages.txt'
# -> 'rpm -qf <file>' tells you which package provides <file>
$> cat list_files.txt | xargs rpm -qf | sort | uniq | tee list_packages.txt

# Back up the DBus, libvirt and polkit-1 configuration directories
$> for d in dbus-1 libvirt polkit-1; do mv /etc/${d} /etc/${d}.old; done

# Reinstall the packages having lost files
$> cat list_packages.txt | xargs yum reinstall -y

# Reboot
$> shutdown -r now
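The manual cleanup of list_files.txt in vim can also be scripted. This is a rough sketch and not what was actually done during the incident: the function name rpm_verify_paths is hypothetical, and it assumes that on each line of the 'rpm -Va' output the file path is everything from the first '/' to the end of the line.

```shell
#!/bin/sh
# Hypothetical helper: read 'rpm -Va' output on stdin and print only the
# file paths, one per line, without duplicates.
# Assumption: the path is everything from the first '/' to the end of the line.
rpm_verify_paths() {
    # Drop the leading attribute flags (or 'missing') and the optional file
    # type marker; lines that contain no '/' at all are not printed.
    sed -n 's|^[^/]*\(/.*\)$|\1|p' | sort -u
}
```

It would be used as rpm_verify_paths < rpm_Va.txt > list_files.txt. Any other line that happens to contain a slash would still need to be filtered out by hand, so the result is worth a quick review before feeding it to 'rpm -qf'.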

After the reboot, you can restore the previous DBus policies and VM definitions:

$> cd /etc/dbus-1/system.d/
$> cp /etc/dbus-1.old/system.d/* .

$> cd /etc/libvirt
$> rsync -avzu  /etc/libvirt.old/qemu .

Reboot once more and (hopefully) enjoy your system again.

Lessons Learned

This precise type of critical ‘incident’ was a new experience: I could not imagine that a recent Red Hat system, fully configured according to best practices and official recommendations (XFS became the default filesystem in RHEL7), would end up in such a bad state. I can only recommend dropping XFS in favor of ext4 for new installations.
