[Beowulf] Re: Odd AMD quad core SuperMicro power off issues

David Mathog mathog at caltech.edu
Mon Jul 6 10:20:00 PDT 2009


Chris Samuel wrote:
> 
> In April I wrote:
> 
> > Well we've been gradually replacing the Barcelona chips
> > with Shanghai (same clockspeed) and we are yet to see a
> > power off on a Shanghai node!
> 
> Since I wrote that we have seen far fewer with 2.3GHz
> Shanghai (2376, a 75W part), *but* we have some 

"Some" as in: some of the upgraded nodes do this and some do not?

> nodes
> upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W
> part) which do exhibit this issue very regularly! :-(

If some of your upgraded nodes do this, and some do not, then this will
most likely map to one of:

1.  CPU
2.  motherboard (all are identical, including BIOS, right?)
3.  RAM
4.  power supply

Start swapping parts between good and bad nodes and pray that the power
offs follow exactly one type of component from node to node.  Also keep
bugging Supermicro; they should have some idea what is going on.
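
The swap-and-correlate bookkeeping can be sketched roughly as follows.
This is only an illustration: the part tags (cpu-A, mb-1, and so on), the
observation records, and the function name are all made up, and a real
log would need many more test periods before any conclusion is trusted.

```python
from collections import defaultdict

# Hypothetical log: one record per test period. Each record says which
# physical part (made-up tags) was installed and whether the node powered off.
observations = [
    {"cpu": "cpu-A", "board": "mb-1", "ram": "ram-X", "psu": "psu-P", "failed": True},
    {"cpu": "cpu-B", "board": "mb-2", "ram": "ram-Y", "psu": "psu-Q", "failed": False},
    # after swapping the CPUs between the two nodes
    {"cpu": "cpu-B", "board": "mb-1", "ram": "ram-X", "psu": "psu-P", "failed": True},
    {"cpu": "cpu-A", "board": "mb-2", "ram": "ram-Y", "psu": "psu-Q", "failed": False},
    # after also swapping the power supplies
    {"cpu": "cpu-B", "board": "mb-1", "ram": "ram-X", "psu": "psu-Q", "failed": False},
    {"cpu": "cpu-A", "board": "mb-2", "ram": "ram-Y", "psu": "psu-P", "failed": True},
]


def perfectly_correlated(observations):
    """Return the component types where every individual part was either
    always present in a failure or never present in one."""
    suspects = []
    kinds = [key for key in observations[0] if key != "failed"]
    for kind in kinds:
        outcomes = defaultdict(set)  # part tag -> set of outcomes seen
        for obs in observations:
            outcomes[obs[kind]].add(obs["failed"])
        consistent = all(len(seen) == 1 for seen in outcomes.values())
        any_bad = any(seen == {True} for seen in outcomes.values())
        if consistent and any_bad:
            suspects.append(kind)
    return suspects


print(perfectly_correlated(observations))
```

In this invented data set the failures follow psu-P wherever it goes, so
only the power supply comes out as a suspect.  If two component types
have never been separated by a swap, both will show up, which is the cue
to do another swap.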

Refresh our memory on this: are you seeing an orderly power off (as in a
shutdown), or are the nodes just powering down abruptly, like "boom"?  In
the latter case I would tend to suspect that the power supply has issues
and is triggering an emergency power off to prevent damage from
overheating or overload.  Swapping the CPUs could make a difference if the
newer ones use a bit less power than the older ones.  (We had a bunch of
PCs which, due to a monster graphics card, were so close to the power
supply limit that adding a single fan made the difference between
being able to run SpecViewPerf to completion or not.  Using a lower
power CPU would have made the same sort of difference.)
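
The arithmetic behind that anecdote is just a sum against the supply
rating.  A rough sketch, with the caveat that every wattage below except
the 75W and 55W CPU ratings quoted above is an invented placeholder, and
that a rated (TDP-style) figure is not the same thing as measured draw:

```python
def psu_headroom_watts(psu_rating_w, loads_w):
    """Rated supply capacity minus the sum of the component loads."""
    return psu_rating_w - sum(loads_w.values())


# Hypothetical single-CPU PC that sits close to its supply limit.
PSU_RATING_W = 300  # assumed rating, not from the original post
loads = {
    "cpu": 75,        # 75W part, as quoted for the 2376
    "graphics": 150,  # assumed "monster" graphics card
    "everything_else": 70,  # assumed board, RAM, disks, existing fans
}

base = psu_headroom_watts(PSU_RATING_W, loads)

# Adding one assumed 6W fan pushes the total over the limit.
with_fan = psu_headroom_watts(PSU_RATING_W, {**loads, "extra_fan": 6})

# Swapping in a 55W CPU restores some margin even with the extra fan.
with_fan_low_power_cpu = psu_headroom_watts(
    PSU_RATING_W, {**loads, "extra_fan": 6, "cpu": 55}
)

print(base, with_fan, with_fan_low_power_cpu)
```

With these made-up numbers the margin goes from 5W, to 1W over the limit
with the extra fan, to 19W of margin with the lower power CPU.  Real
supplies also have per-rail limits and derate with temperature, which
this sum ignores.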


Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


