Automatic Updates and XFS
Today we suffered from a very weird issue on two adminfront servers (running RHEL 7). Both systems rebooted unexpectedly and then showed the following symptoms:
- the systemd-logind service failed to start on boot:
  May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to connect to system bus: No such file or directory
  May 15 09:44:50 <hostname> systemd-logind[3011]: Failed to fully start up daemon: No such file or directory
- thus many other dependent services (DBus, NetworkManager etc.) failed to start
- no more KVM guests were present, as the libvirtd service was also in a failed state.
Trying to start the failed services led to strange errors:
$> systemctl start libvirtd.service
Error getting authority: Error initializing authority: Could not connect: No such file or directory (g-io-error-quark, 1)
Failed to start libvirtd.service: Unit is masked.
It turned out that:
- The first message (Error getting authority ...) comes from the fact that polkit is not able to establish a connection with D-Bus, as indicated in this ticket.
- The second (Unit is masked) was actually linked to the fact that the corresponding service file, /lib/systemd/system/libvirtd.service, was… 0 bytes in size! (systemd treats an empty unit file as masked.)
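Since the "Unit is masked" error came down to an empty unit file, it may be worth scanning for other zero-byte files. The following is a minimal sketch: the helper name find_empty_files is hypothetical, and the directories to scan are a guess based on the symptoms above. Not every empty file is necessarily damaged, so the output is only a list of candidates to inspect.

```shell
# find_empty_files DIR...
# List regular files of size 0 under the given directories, sorted by path.
# -xdev keeps the scan on the filesystem each directory lives on.
find_empty_files() {
    find "$@" -xdev -type f -size 0 2>/dev/null | sort
}
```

For instance, `find_empty_files /lib/systemd/system /etc/dbus-1 /etc/libvirt` would cover the locations involved in this incident.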
After further investigations with the UL HPC Team, we understood the problem as follows:
- a set of automatic security updates was applied by yum-cron, in particular to update dbus, dbus-libs, libvirt, libvirt-libs, qemu-img…
- whether because of the update, the known Meltdown/Spectre CPU microcode “unexpected reboot” issue or anything else, the server crashed and rebooted
- because of an old XFS bug, a couple of files (written upon update?) were truncated (0 bytes for small files), leading to an a priori unrecoverable state.
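To check which packages yum-cron touched on the day of the crash, the yum log can be filtered by date. This is a minimal sketch, assuming the default RHEL 7 log location /var/log/yum.log and its usual line format (month, day, time, action, package). The helper name updates_on is hypothetical.

```shell
# updates_on "MON DD" [LOGFILE]
# Print the packages recorded as Updated or Installed on the given day.
# Assumed line format: "May 15 03:10:44 Updated: dbus-libs-<version>"
updates_on() {
    day="$1"
    log="${2:-/var/log/yum.log}"
    awk -v day="$day" '
        ($1 " " $2) == day && ($4 == "Updated:" || $4 == "Installed:") { print $5 }
    ' "$log"
}
```

Here, `updates_on "May 15"` would list what changed on the date seen in the systemd-logind messages.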
Still, here are a few notes on what made it possible to restore the servers without restoring from a backup or reinstalling the root system:
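One way to approach such a recovery, sketched here under assumptions rather than as the exact procedure used on these servers, is to map each truncated file to the package that owns it and then reinstall those packages. rpm -qf and yum reinstall are standard commands, while plan_reinstall is a hypothetical helper. It only prints the command, so the package list can be reviewed before anything is run.

```shell
# plan_reinstall: read package names on stdin (one per line) and print,
# without executing it, a single "yum reinstall" command for the unique names.
plan_reinstall() {
    pkgs=$(sed '/^$/d' | sort -u | paste -sd ' ' -)
    if [ -n "$pkgs" ]; then
        echo "yum reinstall -y $pkgs"
    fi
}

# Possible use on an RPM-based host (assumes GNU find and xargs):
#   find /lib/systemd/system -type f -size 0 -print0 \
#     | xargs -0 -r rpm -qf --queryformat '%{NAME}\n' \
#     | plan_reinstall
```

Printing rather than running the command is deliberate: files that rpm does not recognise as package-owned produce messages instead of names, so the list deserves a look first.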
You can then restore the previous DBus policies and VM definitions:
$> cd /etc/dbus-1/system.d/
$> cp /etc/dbus-1.old/system.d/* .
$> cd /etc/libvirt
$> rsync -avzu /etc/libvirt.old/qemu .
Reboot once more and (hopefully) enjoy your system again.
Lessons Learned
This precise type of critical ‘incident’ was a new experience: I could not imagine that a recent Red Hat system, fully configured according to best practices and official recommendations (XFS became the default filesystem in RHEL7), would end up in such a bad state. I can only recommend dropping XFS in favor of ext4 for new installations.
Links: