[Beowulf] Performance issue - CPU Intel 00/02
bill at Princeton.EDU
Wed Jul 20 06:48:08 PDT 2005
System: 128 node Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
Problem: Performance drops to one third 60 minutes after a reload/reboot
on a number of nodes, as determined by an xhpl run
On SuperBowl Sunday, the lights went out on this cluster. Ever since
that time performance has suffered. Initially, when running xhpl, there
was a 3x performance difference between a "good" node and a "bad" node.
A reboot solved the problem, or so I thought.
This summer, having more time to investigate the problem, I found that
some nodes exhibit this degradation after a power cycle while others
don't. I've used strace and ptrace, watched memory usage statistics, etc.,
but the only difference I ever found was that all of the traced calls
suffered a 3x performance hit on a bad node.
At first I thought it might be cooling, knowing that these Intel
processors throttle down when they reach a set temperature. But watching
the temperatures revealed that all nodes were running at effectively the
same temperature. And once performance dropped, it never returned to normal.
By accident I discovered that 50 of these 128 nodes show a strange
value for model name in /proc/cpuinfo. A good node reports itself as
"Intel(R) Pentium(R) 4 CPU 2.40GHz" while a "bad" node reports itself
as "00/02". Yet in the BIOS the bad nodes identify themselves correctly
as Intel(R) Pentium(R) 4 CPU 2.40GHz. I believe all the nodes have the
same BIOS configuration, although I neglected to gather the BIOS level
this last go round.
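
For anyone who wants to sort nodes the same way, here is a minimal Python
sketch of the /proc/cpuinfo check. It is not the procedure I actually used.
The function names are made up for illustration, and it assumes a healthy
node reports exactly the model name string quoted above (runs of spaces are
collapsed before comparing).

```python
# Hypothetical helper: flag a node whose /proc/cpuinfo "model name" is not
# the expected Pentium 4 string (for example, a node that reports "00/02").

EXPECTED = "Intel(R) Pentium(R) 4 CPU 2.40GHz"  # string reported by good nodes


def model_names(cpuinfo_text):
    """Return every 'model name' value found in /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            # Collapse repeated whitespace so spacing differences do not matter.
            names.append(" ".join(value.split()))
    return names


def is_suspect(cpuinfo_text, expected=EXPECTED):
    """True if no model name is present or any model name differs from expected."""
    names = model_names(cpuinfo_text)
    return not names or any(name != expected for name in names)


def read_local_cpuinfo(path="/proc/cpuinfo"):
    """Read the local node's cpuinfo; only meaningful on Linux."""
    with open(path) as handle:
        return handle.read()
```

On a node, is_suspect(read_local_cpuinfo()) returns True for the "00/02"
case. How the check gets run on all 128 nodes (rsh, ssh, a batch job) is
left open here.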
Now I'm stuck. I don't know how to proceed. I see the symptom but
find it hard to believe that 40% of the CPUs have somehow become
defective. Yet the software is all the same, and a reload on a good node
or a bad node produces no change whatsoever. Only a reboot on a bad
node seems to cure the performance problem, albeit only for a short time.
My next step will be to swap two CPUs, one from a known good into a
known bad and see if anything changes. But before I go that route I
just wanted to ask the advice of this group, hoping that someone might
have seen this before and offer a solution.