Automatic Updates and XFS
Today we hit a very weird issue on two adminfront servers (running Red Hat 7). Both systems rebooted unexpectedly and then showed the following symptoms:
- the systemd-logind service failed to start on boot:
  May 15 09:44:50 <hostname> systemd-logind: Failed to connect to system bus: No such file or directory
  May 15 09:44:50 <hostname> systemd-logind: Failed to fully start up daemon: No such file or directory
- thus many other dependent services (DBus, NetworkManager etc.) failed to start
- no more KVM guests were present, as the libvirtd service was also in a failed state.
Trying to start the failed services led to strange errors:
$> systemctl start libvirtd.service
Error getting authority: Error initializing authority: Could not connect: No such file or directory (g-io-error-quark, 1)
Failed to start libvirtd.service: Unit is masked.
It turns out that:
- The first message (Error getting authority ...) comes from the fact that polkit is not able to establish a connection with D-Bus, as indicated in this ticket.
- The second (Unit is masked) was actually linked to the fact that the corresponding service file /lib/systemd/system/libvirtd.service was… of 0 byte size!!!
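systemd reports a unit as masked when its unit file is empty (or is a symlink to /dev/null), which is why a truncated file produces this message. A quick way to spot such files is to look for zero-byte regular files in the unit directory. The following is only a minimal sketch: the function name find_empty_units is hypothetical, and the default directory is simply the one mentioned above.

```shell
#!/bin/sh
# Minimal sketch: list zero-byte regular files under a unit directory.
# find_empty_units is a hypothetical helper name, not an existing tool.
find_empty_units() {
    dir="${1:-/lib/systemd/system}"
    # -type f ignores symlinks, so units deliberately masked through a
    # link to /dev/null are not reported; -size 0 matches empty files only.
    find "$dir" -type f -size 0 2>/dev/null
}

# Example:
#   find_empty_units                      # scans /lib/systemd/system
#   find_empty_units /etc/systemd/system  # scans another unit directory
```

Any file it prints is a candidate for having been truncated, although a few files may be empty on purpose.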
After further investigations with the UL HPC Team, we understood the problem as follows:
- a set of automatic security updates were applied by yum-cron, in particular to update
- whether because of the update, the known Meltdown/Spectre CPU microcode “unexpected reboot” issue, or anything else, the server crashed and rebooted
- because of an old XFS bug, a couple of files (written upon update?) were truncated (0 bytes for small files), leading to an a priori unrecoverable state.
Yet here are a few notes on what we used to restore the servers without restoring from a backup or reinstalling the root system:
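As a rough illustration of one way to approach this kind of recovery (an assumption, not necessarily the exact procedure used on these servers): take the list of zero-byte files, ask RPM which package owns each of them, and consider reinstalling those packages. rpm -qf and yum reinstall are standard commands; the helper name owning_packages and the scanned directories are hypothetical choices.

```shell
#!/bin/sh
# Minimal sketch: read file paths on stdin and print the unique names of
# the RPM packages that own them. owning_packages is a hypothetical name.
owning_packages() {
    while IFS= read -r f; do
        # rpm -qf exits non-zero for files owned by no package, so its
        # output is only kept when the query succeeded.
        if pkg=$(rpm -qf --queryformat '%{NAME}\n' "$f" 2>/dev/null); then
            printf '%s\n' "$pkg"
        fi
    done | sort -u
}

# Example (as root; review the list before reinstalling anything, since
# some files are legitimately empty):
#   find /usr /lib /etc -xdev -type f -size 0 | owning_packages
#   yum reinstall <package> ...
```

Running rpm -V on a suspicious package beforehand shows which of its files differ from what was installed, which helps separate truncated files from intentionally empty ones.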
You can then restore the previous DBus policies and VM definitions:
$> cd /etc/dbus-1/system.d/
$> cp /etc/dbus-1.old/system.d/* .
$> cd /etc/libvirt
$> rsync -avzu /etc/libvirt.old/qemu .
Reboot once more and (hopefully) enjoy your system again.
This precise type of critical ‘incident’ was a new experience: I could not imagine that a recent Red Hat system, fully configured according to best practices and official recommendations (XFS became the default filesystem in RHEL7), would end up in such a bad state. I can only recommend dropping XFS in favor of ext4 for new installations.