[Beowulf] Performance issue - CPU Intel 00/02

Bill Wichser bill at Princeton.EDU
Wed Jul 20 06:48:08 PDT 2005

System: 128 node Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel  2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)

Problem: Performance drops to one third roughly 60 minutes after a 
reload/reboot on a number of nodes, as measured by an xhpl run


On SuperBowl Sunday, the lights went out on this cluster.  Ever since 
that time performance has suffered.  Initially, when running xhpl, there 
was a 3x performance difference between a "good" node and a "bad" node. 
A reboot solved the problem, or so I thought.

This summer, having more time to investigate the problem, I found that 
some nodes exhibit this degradation after a power cycle while others 
do not.  I've used strace, ptrace, watched memory usage statistics, etc., 
but the only difference I ever found was that all of the traced calls 
took 3x longer on a bad node.

At first I thought it might be cooling, knowing that these Intel 
processors throttle down when they reach a set temperature.  But watching 
the temperatures revealed that all nodes were running at effectively the 
same temperature.  And once performance dropped, the bad nodes never 
returned to normal.

By accident I discovered that, of these 128 nodes, 50 show a strange 
value for model name in /proc/cpuinfo.  A good node reports itself as 
"Intel(R) Pentium(R) 4 CPU 2.40GHz" while a "bad" node reports itself 
as "00/02".  Yet in the BIOS the bad nodes identify themselves correctly 
as Intel(R) Pentium(R) 4 CPU 2.40GHz.  I believe all the nodes have the 
same BIOS configuration, although I neglected to record the BIOS level 
this last go round.
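
For anyone wanting to run the same check on their own nodes, here is a 
minimal sketch of the model name comparison.  It assumes the contents of 
each node's /proc/cpuinfo have already been collected as a string (how 
they are gathered, e.g. over rsh or ssh, is not shown), and the function 
names model_names and is_suspect are hypothetical, not part of any tool 
used on this cluster.

```python
# Expected brand string, as reported by a "good" node in this post.
EXPECTED = "Intel(R) Pentium(R) 4 CPU 2.40GHz"


def model_names(cpuinfo_text):
    """Return the 'model name' value of every processor entry."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            # Collapse runs of whitespace so padding differences
            # in the brand string do not cause false mismatches.
            names.append(" ".join(value.split()))
    return names


def is_suspect(cpuinfo_text, expected=EXPECTED):
    """True if any CPU reports an unexpected model name, or none at all."""
    names = model_names(cpuinfo_text)
    return not names or any(name != expected for name in names)
```

On a single node this could be called as 
is_suspect(open("/proc/cpuinfo").read()); a node reporting "00/02" 
would be flagged, as would one with no model name line at all.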

Now I'm stuck.  I don't know how to proceed.  I see the symptom but 
find it hard to believe that 40% of the CPUs have somehow become 
defective.  Yet the software is all the same, and reloading it on a good 
node or a bad node changes nothing whatsoever.  Only a reboot on a bad 
node seems to cure the performance problem, albeit only for a short time.

My next step will be to swap two CPUs, moving one from a known good node 
into a known bad node, and see if anything changes.  But before I go that 
route I wanted to ask the advice of this group, hoping that someone has 
seen this before and can offer a solution.


