[Beowulf] Re: how can I know that a hard disk died? (Dimitri Antoniou) (Steve Cousins)

Ed Karns edkarns at firewirestuff.com
Fri Aug 12 13:14:00 PDT 2005


Dimitri & Steve:

" ... some sort of command line interface that allows you to write a 
cron script to check for failed drives and email you if something is 
wrong. ..."

This should be outlined in the documentation for your RAID array. 
Polling individual drives within a RAID, or any separate drive on your 
cluster, should be relatively easy. A simple "batch" command that checks 
whether a small text file exists in a support subdirectory on each drive 
should do it:

A simple program routine or script command file, run periodically, might 
read:

::REM: drivename is the actual drive descriptor on the cluster or RAID element.
::REM: x = number of available drives.
::REM: directory and filename are the same on all drives; the file
::REM: contains the ASCII text "OK".
::REM: Exists is a stock command or subroutine available on your
::REM: operating system, a defined keyword in your programming control
::REM: language, or something you define yourself. There are probably
::REM: many alternatives. Syntax will vary with your operating system.

For n = 1 to x do
   If Exists ('clustername/drivename(n)/directoryname/filename')
      Then Write [to screen & logfile] = "OK: " + 'clustername/drivename(n)/directoryname/filename';
      Else Write [to screen & logfile] = "?? Bad Drive at " + 'n';
Next n;
Reset n = 1;
EOF EOC
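
For anyone who wants something closer to runnable, here is a minimal 
sketch of the same loop in Python. The mount points in DRIVES and the 
SENTINEL path are hypothetical placeholders, not anything from a real 
cluster, and the function names are my own. Note that this only tests 
whether each mounted filesystem can be read; a hardware RAID that 
presents one healthy logical volume will probably still need the 
controller's own command line tool to show a failed member.

```python
import os
import sys

# Hypothetical mount points; replace with the drives on your own cluster.
DRIVES = ["/mnt/disk1", "/mnt/disk2", "/mnt/disk3"]
# Hypothetical sentinel location, identical on every drive.
SENTINEL = os.path.join("support", "drive_ok.txt")


def check_drive(mount_point, sentinel=SENTINEL):
    """Return True if the sentinel file is readable and contains 'OK'."""
    path = os.path.join(mount_point, sentinel)
    try:
        with open(path, "r", encoding="ascii") as handle:
            return handle.read().strip() == "OK"
    except (OSError, UnicodeDecodeError):
        # Missing file, unmounted drive, I/O error, or garbage content.
        return False


def check_all(drives, sentinel=SENTINEL):
    """Return a list of (mount_point, ok) pairs, one per drive."""
    return [(drive, check_drive(drive, sentinel)) for drive in drives]


def report(results, out=sys.stdout):
    """Write one line per drive and return the number of failures."""
    failures = 0
    for drive, ok in results:
        if ok:
            out.write("OK: %s\n" % drive)
        else:
            out.write("?? Bad Drive at %s\n" % drive)
            failures += 1
    return failures
```

Calling report(check_all(DRIVES)) from a script run by cron writes one 
line per drive and returns the failure count, which the script can use 
to decide whether to raise an alarm.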

...

... or your favorite programming technique to this effect ... add an 
emailed log file to taste ...
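
One possible way to do the emailing part, sketched with Python's 
standard library. The addresses, the subject line, and the assumption 
that a mail server is listening on localhost port 25 are all 
placeholders of mine; adjust them to whatever your site actually uses.

```python
import smtplib
from email.message import EmailMessage


def build_alert(log_text, sender, recipient, subject="Drive check report"):
    """Wrap the log text in a plain-text email message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(log_text)
    return msg


def send_alert(msg, host="localhost", port=25):
    """Hand the message to an SMTP server (assumed to be reachable at host:port)."""
    with smtplib.SMTP(host, port) as server:
        server.send_message(msg)
```

Building the message separately from sending it makes it easy to try 
the script on a machine with no mail server: print the message instead 
of calling send_alert.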

Ed Karns
FireWireStuff.com


On Friday, August 12, 2005, at 12:00  PM, beowulf-request at beowulf.org 
wrote:

>
>    1. Re: how can I know that a hard disk died? (Dimitri Antoniou) 
> (Steve Cousins)
>  On Fri, 12 Aug 2005 Dimitri Antoniou wrote:
>
>>  Hi,
>>
>>  We have a 16-node HP LC1000 cluster, with 3 hard disks
>>  managed by hardware RAID.
>>
>>  Recently, a hard disk died, and we only found out
>>  when we went to the room the cluster stays
>>  and noticed a failure light on the disk.
>>
>> ...
>>  When the disk died, the system didn't notify us,
>>  and we haven't found any message in log files,
>>  at least not anything obvious.
>
>
> What brand is the controller?  What OS? All RAID cards that I have run
> into have some sort of command line interface that allows you to write
> a cron script to check for failed drives and email you if something is
> wrong.  For instance our Dell systems use afacli (Adaptec PERC card)
> and megamgr (AMI PERC card) and our 3Ware systems use tw_cli.
>
> Good luck,
>
> Steve



