[Beowulf] Performance issue - CPU Intel 00/02
Bill Wichser bill at Princeton.EDU
Wed Jul 20 06:48:08 PDT 2005
System: 128 nodes, Intel 2.4GHz P4
MBO: Tyan S2099, i845E
OS: RedHat 8.0, kernel 2.4.18-18.8.0 (but 2.4.20-28.8 changes nothing)
Problem: Performance drops to one third about 60 minutes after a reload/reboot on a number of nodes, as determined by an xhpl run

---

On SuperBowl Sunday, the lights went out on this cluster. Ever since that time performance has suffered. Initially, when running xhpl, there was a 3x performance difference between a "good" node and a "bad" node. A reboot solved the problem, or so I thought.

This summer, having more time to investigate the problem, I found that some nodes exhibit this degradation after a power cycle while others don't. I've used strace and ptrace, watched memory usage statistics, etc., but the only thing that ever changed was that all of these calls suffered a 3x performance hit on a bad node.

At first I thought it might be cooling, knowing that these Intel processors throttle down when they reach a set temperature. But watching the temperatures revealed that all nodes were effectively running the same way. And once performance dropped, the nodes never returned to normal.

By accident I discovered that 50 of these 128 nodes show a strange value for model name in /proc/cpuinfo. On a good node the CPU reports itself as "Intel(R) Pentium(R) 4 CPU 2.40GHz", while on a "bad" node it calls itself "00/02". Yet in the BIOS, the bad nodes report themselves correctly as Intel(R) Pentium(R) 4 CPU 2.40GHz. I believe all the nodes have the same BIOS configuration, although I neglected to record the BIOS level this last go-round.

Now I'm stuck. I don't know how to proceed. I see the symptom but find it hard to believe that 40% of the CPUs have somehow become defective. Yet the software is all the same, and a reload on a good node or a bad node produces no change whatsoever. Only a reboot of a bad node seems to cure the performance problem, and then only for a short time.

My next step will be to swap two CPUs, putting one from a known good node into a known bad node, and see if anything changes. But before I go that route I wanted to ask the advice of this group, hoping that someone has seen this before and can offer a solution.

Thanks,
Bill
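
The /proc/cpuinfo check described above can be automated so that nodes are classified before and after a CPU swap. The following is only a minimal sketch in Python, not something taken from the original post: the function names model_names and is_suspect are hypothetical, the expected string is the one quoted in the message, and it assumes the "model name" field is the only thing being compared.

```python
# Model string reported by a healthy node, as quoted in the message.
EXPECTED_MODEL = "Intel(R) Pentium(R) 4 CPU 2.40GHz"


def model_names(cpuinfo_text):
    """Return every 'model name' value found in /proc/cpuinfo-style text."""
    names = []
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "model name":
            # Collapse runs of whitespace so padded brand strings compare equal.
            names.append(" ".join(value.split()))
    return names


def is_suspect(cpuinfo_text, expected=EXPECTED_MODEL):
    """Flag a node whose model name is missing or differs from the expected one."""
    names = model_names(cpuinfo_text)
    return not names or any(name != expected for name in names)
```

On a node, passing the contents of /proc/cpuinfo to is_suspect would return True for a CPU reporting "00/02". How the check is run across all 128 nodes (rsh, ssh, a batch job) is left open here, since the post does not say what remote execution tooling the cluster uses.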
