Today we ran into a very weird issue on two adminfront servers (running Red Hat 7). The systems went through an unexpected reboot and then showed the following symptoms:

  • the systemd-logind service failed to start on boot

      May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to connect to system bus: No such file or directory
      May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to fully start up daemon: No such file or directory
  • thus many other dependent services (DBus, NetworkManager etc.) failed to start
  • no KVM guests were running anymore, as the libvirtd service was also in a failed state.

Trying to start the failed services led to strange errors:

 $> systemctl start libvirtd.service
 Error getting authority: Error initializing authority: Could not connect: No such file or directory (g-io-error-quark, 1)
 Failed to start libvirtd.service: Unit is masked.

It happens that:

  1. The first message (Error getting authority ...) comes from the fact that polkit is not able to establish a connection with D-Bus, as indicated in this ticket.
  2. The second (Unit is masked) was actually linked to the fact that the corresponding service file /lib/systemd/system/libvirtd.service was… 0 bytes in size!!!
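
A quick way to check whether other unit files suffered the same fate is to look for empty regular files. The following is only a minimal sketch: the helper name `list_empty_files` is hypothetical, and the default directory is simply the one mentioned above.

```shell
# Hypothetical helper: list regular files of size 0 below a directory.
# Defaults to the unit directory mentioned above (an assumption; adapt as needed).
list_empty_files() {
    dir="${1:-/lib/systemd/system}"
    # -H: follow the start path if it is a symlink (/lib is one on RHEL 7)
    # -size 0: only files that occupy zero blocks, i.e. empty files
    find -H "$dir" -type f -size 0
}
```

Calling `list_empty_files` without argument scans the default directory; `list_empty_files /etc` would scan another tree. Some files are legitimately empty, so the result is a list of suspects rather than a verdict.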

After further investigations with the UL HPC Team, we understood the problem as follows:

  • a set of automatic security updates was applied by yum-cron, in particular to update dbus, dbus-libs, libvirt, libvirt-libs, qemu-img
  • whether because of the update, the known Meltdown/Spectre CPU Microcode “unexpected reboot” issue or anything else, the server crashed and rebooted
  • because of an old XFS bug, a couple of files (written upon update?) were truncated (0 bytes for small files), leaving the system in an a priori unrecoverable state.

Yet here are a few notes on what allowed us to restore the servers without restoring from a backup or reinstalling the root system:

# List files with issues from the installed RPMs with 'rpm -V [...]'
$> rpm -Va | tee rpm_Va.txt

# Clean the list of files only in 'list_files.txt'
$> cp rpm_Va.txt list_files.txt
$> vim list_files.txt     # CTRL-V for rectangular selection -- d to delete selection

# Extract the list of packages in 'list_packages.txt'
# -> 'rpm -qf <file>' tells you which package provides <file>
$> cat list_files.txt | xargs rpm -qf | sort | uniq | tee list_packages.txt

# Back up the D-Bus, libvirt and polkit-1 configuration directories
$> for d in dbus-1 libvirt polkit-1; do mv /etc/${d} /etc/${d}.old; done

# Reinstall the packages having lost files
$> cat list_packages.txt | xargs yum reinstall -y

# Reboot
$> shutdown -r now
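
The manual cleanup of the file list in vim can also be scripted. This is only a sketch, assuming that every relevant line of the `rpm -Va` output ends with an absolute path and that nothing before that path contains a `/`; the helper name `extract_paths` is hypothetical.

```shell
# Hypothetical helper: keep only the path column of 'rpm -Va' output.
# Reads the files given as arguments, or stdin when none is given.
extract_paths() {
    # Drop everything before the first '/' and print the rest of the line;
    # lines without any '/' are skipped.
    sed -n 's|^[^/]*\(/.*\)$|\1|p' "$@"
}
```

With it, `extract_paths rpm_Va.txt > list_files.txt` would replace the `cp` and `vim` steps. It is still worth reviewing the result by hand before reinstalling anything, since `rpm -Va` also reports files that were modified on purpose.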

You can then restore the previous DBus policies and VM definitions:

$> cd /etc/dbus-1/system.d/
$> cp /etc/dbus-1.old/system.d/* .

$> cd /etc/libvirt
$> rsync -avzu  /etc/libvirt.old/qemu .

Reboot once more and (hopefully) enjoy your system again.

Lessons Learned

This precise type of critical ‘incident’ was a new experience: I could not imagine that a recent Red Hat system, fully configured according to best practices and official recommendations (XFS became the default filesystem in RHEL7), would end up in such a bad state. I can only recommend dropping XFS in favor of ext4 for new installations.