
[Beowulf] Repeated Dell SC1435 crash / hang. How to get the vendor to resolve the issue when 20% of the servers fail in first year?

Joe Landman landman at scalableinformatics.com
Mon Apr 6 06:11:50 PDT 2009


Chris Samuel wrote:
> ----- "Rahul Nabar" <rpnabar at gmail.com> wrote:
> 
>> I contacted Dell. Responses range from the clueless to absurd. First,
>> they convinced us it was Fedora. So I shifted to CentOS. They still
>> claim CentOS is "unvalidated" but I refuse to spend a fortune to move
>> over to RHEL like they want me to.
> 
> Not that this helps, but you have my sympathy as I've
> been dealing with the same stuff from IBM over a storage
> server they sold us.
> 
> Turns out I can make 7-12 drives in their external
> enclosures fail in short order (seconds to minutes
> between failures) by telling the software RAID to
> do a check, thus:
> 
> for i in /sys/block/md[0123]; do
>    echo check > "$i/md/sync_action"
> done
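The loop above only starts the check; it does not say when each array finishes or what the check found. A minimal POSIX sh sketch that starts the check, waits for each array's `sync_action` to return to `idle`, and then prints its `mismatch_cnt` could look like this. It assumes the md sysfs attributes described in the kernel's md documentation. `SYSFS_ROOT` and `POLL_SECS` are hypothetical overrides added for illustration and are not from the thread.

```shell
# Minimal sketch: start an md "check" on each named array, wait for each to
# return to idle, then print its mismatch_cnt.
# SYSFS_ROOT and POLL_SECS are hypothetical overrides added for illustration;
# by default the functions act on the real /sys/block tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"
POLL_SECS="${POLL_SECS:-10}"

# Ask the md driver to start a redundancy check (reads every block) on each array.
start_checks() {
    for dev in "$@"; do
        action="$SYSFS_ROOT/$dev/md/sync_action"
        if [ -w "$action" ]; then
            echo check > "$action"
        else
            echo "skipping $dev: $action is not writable" >&2
        fi
    done
}

# Poll sync_action until it reads "idle", then report mismatch_cnt.
wait_idle() {
    for dev in "$@"; do
        action="$SYSFS_ROOT/$dev/md/sync_action"
        if [ ! -r "$action" ]; then
            echo "skipping $dev: $action is not readable" >&2
            continue
        fi
        while [ "$(cat "$action")" != "idle" ]; do
            sleep "$POLL_SECS"
        done
        echo "$dev mismatch_cnt=$(cat "$SYSFS_ROOT/$dev/md/mismatch_cnt")"
    done
}
```

Usage: `start_checks md0 md1 md2 md3; wait_idle md0 md1 md2 md3`. Polling `sync_action` rather than parsing `/proc/mdstat` keeps the logic trivial, since the attribute holds a single word.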

Are these softirq CPU hangs?

Could you tell me what

   cat /sys/block/md[0123]/md/stripe_cache_size

reports?
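A bare `cat` over several files prints the values back to back with no array names, so it is hard to tell which value belongs to which array. A small sketch that labels each value follows. It assumes the `stripe_cache_size` attribute, which only exists for RAID4/5/6 (raid456) arrays, so it prints `n/a` for arrays without it. `SYSFS_ROOT` is a hypothetical override added for illustration.

```shell
# Minimal sketch: print each array's stripe_cache_size next to its name.
# SYSFS_ROOT is a hypothetical override (defaults to the real /sys/block).
SYSFS_ROOT="${SYSFS_ROOT:-/sys/block}"

show_stripe_cache() {
    for dev in "$@"; do
        f="$SYSFS_ROOT/$dev/md/stripe_cache_size"
        if [ -r "$f" ]; then
            printf '%s %s\n' "$dev" "$(cat "$f")"
        else
            # The attribute is only present for RAID4/5/6 arrays.
            printf '%s n/a\n' "$dev"
        fi
    done
}
```

Usage: `show_stripe_cache md0 md1 md2 md3`. A one-line alternative is `grep . /sys/block/md[0123]/md/stripe_cache_size`, which prefixes each value with its file name when given several files.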

> 
> Even though we could reproduce it on 64-bit Debian
> and 32-bit CentOS they wouldn't escalate the issue
> until we could reproduce it on RHEL5 - which we did
> today.
> 
> Sigh..
> 


-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


