[Beowulf] RAID question
skylar.thompson at gmail.com
Sat Mar 14 14:50:21 PDT 2015
On 3/13/2015 5:52 PM, mathog wrote:
> A bit off topic, but some of you may have run into something similar.
> Today I was called in to try and fix a server which had stopped
> working. Not my machine, the usual sysop is out sick. The
> model is a Dell PowerEdge T320 with a PERC H710P RAID controller.
> The symptoms reported were "it stopped working, could not find 'ls',
> and wouldn't reboot past grub". (Evidently it could find 'reboot'.)
> Got into the BIOS and ran RAID consistency check, which took 3 hours.
> It didn't say if it had passed or failed, or put up any sort of status
> message whatsoever, but there were no failure lights lit on the disks.
> On a reboot it gives:
> grub error 8: kernel must be loaded before booting.
> It is a CentOS 6.5 system, so booted it with an installation disk of
> that flavor, and dropped down into a shell.
> This is where it gets strange.
> /boot is in /dev/sdb1. When mounted that directory is empty but
> when unmounted fsck shows 10 files in it taking up about 12 MB. Pretty
> clear why it wouldn't boot with nothing in /boot. Not sure
> what the 10 files fsck sees are, perhaps part of the filesystem. (ext2
> I think). I had never tried running fsck on an empty file system in a
> partition before.
> /bin is missing entirely, so that's why "ls" stopped working. /usr/bin
> is still there, which is why reboot was OK.
> /var/log/messages shows that the machine was logging what look like
> corrected disk errors (sense errors) for /dev/sdb1 for days before it
> stopped working.
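To see how long those errors had been piling up, something like the following could tally them per day from a copy of /var/log/messages. This is only a rough sketch: it assumes the traditional syslog timestamp layout ("Mon DD HH:MM:SS host ...") and that the relevant lines mention both the device name and the word "sense". The function name and the match strings are my own assumptions, not taken from the actual log.

```python
from collections import Counter

def count_sense_errors(lines, device="sdb", keyword="sense"):
    """Tally log lines mentioning both device and keyword, keyed by (month, day)."""
    per_day = Counter()
    for line in lines:
        lowered = line.lower()
        if device.lower() in lowered and keyword.lower() in lowered:
            fields = line.split()
            if len(fields) >= 2:
                # Assumed syslog layout: the first two fields are month and day.
                per_day[(fields[0], fields[1])] += 1
    return per_day
```

Pass it an open file object (or any iterable of lines). A count that climbs from day to day would point toward a failing disk rather than an errant rm.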
> Tried copying the contents of another machine's /boot (which is
> supposed to be an exact copy of this one) into /boot, and rebooting,
> but grub didn't get any farther than it had before. Probably grub
> needs to be reinstalled, but with /bin missing, and who knows what
> else gone besides, it seems like a full OS reinstall would be in order.
> Off the top of my head, if it weren't for the sense errors on
> /dev/sdb1, I would think that this might have been the result of an
> accidental (or hacker's)
> rm -rf /
> Anybody run into a hardware/software glitch with symptoms like this on
> a similar system???
> Is there some way on these sorts of Dells to run per disk diagnostics
> from BIOS or UEFI even if they are already grouped into a virtual disk
> by the controller? I suspect that the disk which is /dev/sdb may
> really be on its way out, but I couldn't get smartctl to work off the
> DVD or from the copy on disk. (The smartctl commands used were
> tested on the twin machine, and they worked there.) The BIOS showed
> that SMART was disabled on all of the disks. Web searches for
> diagnostics for this controller all referenced software that requires
> a running OS, nothing built into the BIOS/UEFI. (It is set to use BIOS.)
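On the smartctl question: as far as I know the H710P is an LSI MegaRAID-based controller, and smartmontools has a "-d megaraid,N" device type for reaching physical disks behind such a controller, where N is the controller's device ID for the disk. The IDs are not guaranteed to start at 0 or be contiguous, so probing a small range is the usual approach. A sketch that only builds the command lines and does not run them (the function name, default device, and ID range are my assumptions):

```python
def megaraid_smartctl_cmds(block_dev="/dev/sda", max_id=7):
    """Build one smartctl argument list per candidate MegaRAID physical-disk ID."""
    return [
        ["smartctl", "-a", "-d", "megaraid,%d" % n, block_dev]
        for n in range(max_id + 1)
    ]
```

Each list can be handed to subprocess.call() as root from the rescue shell. IDs with no disk behind them should simply produce an error from smartctl.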
I might start looking at non-RAID problems first. Maybe you have some
bad memory or CPU? Errant rm could do it too, as you mentioned.