[Beowulf] Odd AMD quad core SuperMicro power off issues

Jason Clinton jclinton at advancedclustering.com
Mon Jul 6 08:27:02 PDT 2009


On Fri, Jul 3, 2009 at 12:17 AM, Chris Samuel<csamuel at vpac.org> wrote:
>> Well we've been gradually replacing the Barcelona chips
>> with Shanghai (same clockspeed) and we are yet to see a
>> power off on a Shanghai node!
>
> Since I wrote that we have seen far fewer with 2.3GHz
> Shanghai (2376, a 75W part), *but* we have some nodes
> upgraded to the ULP 2.4 GHz Shanghai (2379 HE, a 55W
> part) which do exhibit this issue very regularly! :-(

We saw a similar power-off issue at one of our customers who upgraded
from 2220s to Barcelonas on a similar board; it was reproducible at
the same failure rate on approximately 160 nodes. After trying just
about everything under the sun, we replaced all of the memory in the
entire cluster wholesale. The power-offs ceased immediately thereafter
and have not returned.

-- 
Jason D. Clinton, 913-643-0306, http://twitter.com/HPCClusterTech
http://www.google.com/profiles/jasondclinton
