[Beowulf] Re: how can I know that a hard disk died? (Dimitri Antoniou) (Steve Cousins)
edkarns at firewirestuff.com
Fri Aug 12 13:14:00 PDT 2005
Dimitri & Steve:
" ... some sort of command line interface that allows you to write a
cron script to check for failed drives and email you if something is
wrong."
This should be outlined in the documentation for your RAID array ...
Polling individual drives within a RAID, or any separate drive on your
cluster, should be relatively easy. A simple "batch" command enquiring
as to the existence (or not) of a small text file in a support
subdirectory should do it:
Periodically run a simple program routine or script command file; it
might look something like this:
::REM: drivename(n) is the actual drive descriptor of drive n on the
::REM: cluster or RAID element.
x = number of available drives
::REM: directory and filename are the same on all drives; the file
::REM: contains the ASCII text "OK".
::REM: Exists is a stock command or subroutine on your operating
::REM: system, a keyword in your programming or control language, or
::REM: something you define yourself. There are probably many
::REM: alternatives. Syntax will vary with your operating system.
For n = 1 to x do
If Exists (drivename(n) + directory + filename) Then Write [to screen & logfile] = "Drive " + 'n' + " OK";
Else Write [to screen & logfile] = "?? Bad Drive at " + 'n';
End For;
... or your favorite programming technique to this effect ... add an
emailed log file to taste ...
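The loop above might look like this in Python. It is a minimal sketch under stated assumptions: the mount points in MOUNT_POINTS and the sentinel path support/alive.txt are hypothetical names, and every drive is assumed to be mounted as its own filesystem. It reads the file and checks for "OK" rather than only testing that the file exists, and it assembles the report as an email message.

```python
import os
import socket
from email.message import EmailMessage

# Hypothetical layout: each drive is mounted as its own filesystem at a path
# in MOUNT_POINTS, and each carries the same sentinel file containing "OK".
MOUNT_POINTS = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
SENTINEL = os.path.join("support", "alive.txt")


def check_drive(mount_point, sentinel=SENTINEL):
    """Return True if the sentinel file exists on this drive and reads "OK"."""
    path = os.path.join(mount_point, sentinel)
    try:
        with open(path, encoding="ascii") as f:
            return f.read().strip() == "OK"
    except (OSError, UnicodeDecodeError):
        # A missing file, an unreadable device or an I/O error all count as bad.
        return False


def check_all(mount_points, sentinel=SENTINEL):
    """Return one report line per drive, mirroring the For n = 1 to x loop."""
    lines = []
    for n, mount_point in enumerate(mount_points, start=1):
        if check_drive(mount_point, sentinel):
            lines.append(f"Drive {n} ({mount_point}): OK")
        else:
            lines.append(f"?? Bad Drive at {n} ({mount_point})")
    return lines


def build_report_email(lines, sender, recipient):
    """Wrap the report lines in an email message; sending is left to the caller."""
    bad = any(line.startswith("??") for line in lines)
    status = "FAILURE" if bad else "all OK"
    msg = EmailMessage()
    msg["Subject"] = f"Drive check on {socket.gethostname()}: {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(lines) + "\n")
    return msg


if __name__ == "__main__":
    # Print the report; a scheduler such as cron can run this periodically.
    print("\n".join(check_all(MOUNT_POINTS)))
```

build_report_email() only builds the message; passing it to smtplib.SMTP("localhost").send_message(msg) is one way to send it. One limit of this approach: the OS can only check filesystems it can mount, so behind a hardware RAID controller a single failed member disk stays hidden by the array's redundancy, and that case needs the controller's own status tool, as Steve describes.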
On Friday, August 12, 2005, at 12:00 PM, beowulf-request at beowulf.org
wrote:
> 1. Re: how can I know that a hard disk died? (Dimitri Antoniou)
> (Steve Cousins)
> On Fri, 12 Aug 2005 Dimitri Antoniou wrote:
>> We have a 16-node HP LC1000 cluster, with 3 hard disks
>> managed by hardware RAID.
>> Recently, a hard disk died, and we only found out
>> when we went to the room where the cluster is kept
>> and noticed a failure light on the disk.
>> When the disk died, the system didn't notify us,
>> and we haven't found any message in log files,
>> at least not anything obvious.
> What brand is the controller? What OS? All RAID cards that I have run
> into have some sort of command line interface that allows you to write
> a cron script to check for failed drives and email you if something is
> wrong. For instance, our Dell systems use afacli (Adaptec PERC card)
> and megamgr (AMI PERC card), and our 3Ware systems use tw_cli.
> Good luck,
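Steve's cron suggestion can be sketched in Python as a thin wrapper around the controller's CLI. This is a minimal sketch, not tied to any particular card: STATUS_COMMAND and BAD_WORDS are hypothetical placeholders, because the exact invocation and status strings differ between afacli, megamgr and tw_cli, so take them from your controller's documentation rather than from here.

```python
import subprocess

# Hypothetical placeholders: replace with the real status command for your
# controller (see its CLI documentation) and the words it prints for trouble.
STATUS_COMMAND = ["/usr/local/sbin/raid-status"]
BAD_WORDS = ("DEGRADED", "FAILED", "OFFLINE")


def raid_problems(command=STATUS_COMMAND, bad_words=BAD_WORDS, timeout=60):
    """Run the status command and return the output lines that look like trouble.

    A command that cannot be started, times out or exits non-zero is itself
    reported, so a broken check never passes as a healthy array.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return [f"status command failed: {exc}"]
    problems = [line for line in result.stdout.splitlines()
                if any(word in line.upper() for word in bad_words)]
    if result.returncode != 0:
        problems.append(f"status command exited with code {result.returncode}")
    return problems


if __name__ == "__main__":
    # Print only when something looks wrong: cron normally mails any output a
    # job produces to the crontab's owner (or to MAILTO) and stays quiet otherwise.
    for line in raid_problems():
        print(line)
```

Run from cron every few minutes, the script prints nothing while the array is healthy, so cron's usual mailing of job output becomes the failure notice that was missing.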