<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<HTML>

<HEAD>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7653.38">

<TITLE>RE: [Beowulf] Odd SuperMicro power off issues</TITLE>

</HEAD>

<BODY>


<BR>


<P><FONT SIZE=2>Hi.<BR>

<BR>

Dunno if this is a bright idea, but what about the power supply temperature?<BR>

Temperature is usually not measured in there, and a hot power supply could<BR>

easily have a thermal fuse that trips.<BR>

<BR>

It may be worthwhile trying a different power supply, if possible one with a higher<BR>

power rating.<BR>

<BR>

Cheers,<BR>

-Alan<BR>

<BR>

<BR>

<BR>

-----Original Message-----<BR>

From: beowulf-bounces@beowulf.org on behalf of Chris Samuel<BR>

Sent: Mon 08/12/2008 04:33<BR>

To: Beowulf List<BR>

Cc: David Bannon; Brett Pemberton<BR>

Subject: [Beowulf] Odd SuperMicro power off issues<BR>

<BR>

Hi folks,<BR>

<BR>

We've been tearing our hair out over this for a little<BR>

while and so I'm wondering if anyone else has seen anything<BR>

like this before, or has any thoughts about what could be<BR>

happening?<BR>

<BR>

Very occasionally we find one of our Barcelona nodes with<BR>

a SuperMicro H8DM8-2 motherboard powered off.  IPMI reports<BR>

it as powered down too.<BR>

<BR>

No kernel panic, no crash, nothing in the system logs.<BR>

<BR>

Nothing in the IPMI logs either, it's just sitting there<BR>

as if someone has yanked the power cable (and we're pretty<BR>

sure that's not the cause!).<BR>

<BR>

There has not been any discernible pattern to the nodes<BR>

affected: only a couple of nodes have had it happen<BR>

twice, and the rest have had it happen just once, scattered<BR>

over the 3 racks of the cluster.<BR>

<BR>

For the longest time we had no way to reproduce it, but then<BR>

we noticed that for 3 of the power-offs there was a particular<BR>

user running Fluent on the node. They've provided us with a copy<BR>

of their problem and we can (often) reproduce it now with that<BR>

problem.  Sometimes it'll take 30 minutes or so, sometimes it'll<BR>

take 4-5 hours, sometimes it'll take 3 days or so and sometimes<BR>

it won't do it at all.<BR>

<BR>

It doesn't appear to be a thermal issue, as (a) there's nothing in<BR>

the IPMI logs about such problems and (b) we feed CPU and system<BR>

temperatures into Ganglia and we don't see anything out of the<BR>

ordinary in those logs. :-(<BR>

<BR>

We've tried other codes, including HPL and Advanced Clustering's<BR>

Breakin PXE version, but haven't (yet) managed to get one of the<BR>

nodes to fail with anything except Fluent. :-(<BR>

<BR>

The only oddity about Fluent is that it's the only code on<BR>

the system that uses HP-MPI, but we used the command line<BR>

switches to tell it to use the Intel MPI it ships with and<BR>

it did the same then too!<BR>

<BR>

I just cannot understand what is special about Fluent,<BR>

or even how a user code could cause a node to just turn<BR>

off without a trace in the logs.<BR>

<BR>

Obviously we're pursuing this through the local vendor<BR>

and (through them) SuperMicro, but to be honest we're<BR>

all pretty stumped by this.<BR>

<BR>

Does anyone have any bright ideas?<BR>

<BR>

cheers,<BR>

Chris<BR>

--<BR>

Christopher Samuel - (03) 9925 4751 - Systems Manager<BR>

 The Victorian Partnership for Advanced Computing<BR>

 P.O. Box 201, Carlton South, VIC 3053, Australia<BR>

VPAC is a not-for-profit Registered Research Agency<BR>

_______________________________________________<BR>

Beowulf mailing list, Beowulf@beowulf.org<BR>

To change your subscription (digest mode or unsubscribe) visit <A HREF="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</A><BR>

<BR>

</FONT>

</P>


</BODY>

</HTML>