[Beowulf] Multisocket mainboard hardware problems
Nifty Tom Mitchell
niftyompi at niftyegg.com
Wed Jan 21 15:46:16 PST 2009
On Fri, Jan 16, 2009 at 12:52:39PM -0500, Thomas Vixel wrote:
> It's somewhat of a stretch since you say it *suddenly* lost the use of
> the bank of memory, but it could be that the processor for that
> particular bank of memory isn't properly seated.
> We've had two such systems in the past couple of years. With the first,
> memtest86 kept reporting errors at consecutive addresses after it
> crossed the memory boundary to where the affected processor's memory
> controller took over. I swapped out memory modules, fiddled with
> memory settings, and re-arranged the cards all to no avail. The final
> thing I did that ended up fixing the problem was taking the processors
> out and seating each in the others' slot.
> At that point I had figured the processor itself was damaged, and that
> I'd surely get errors in the other side of the memory region but to my
> surprise I did not.
> The second system flat out refused to recognize one whole bank of
> memory like yours. After swapping out the memory didn't work, I tried
> the processor swapping trick and it worked perfectly afterward. Even
> swapping them back so they were in their original arrangement worked
> the second time.
> So, if all else fails you may want to try swapping the processors
> around or reseating them. It just might save you some headaches
> dealing with SuperMicro's RMA & tech support departments.
With Supermicro AMD motherboards, also try reducing the HT (HyperTransport)
link speed between the processors and I/O. Depending on the motherboard
model, the HT links may or may not be rock solid at the increased clocks of
modern CPUs.
Reinstalling the processors, including cleaning them and then correctly
reapplying high-quality thermal compound, will also 'touch' things like fans,
fan connectors, CPU contacts, and lots of other subtle physical things
on the MB that might be making the system unstable.
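
Before pulling CPUs, it can help to confirm from a running Linux system
which socket's memory has gone missing. The following is only a rough
sketch, not a tested diagnostic: it assumes a NUMA-enabled kernel that
exposes per-node files under /sys/devices/system/node and the usual
"Node N MemTotal: X kB" line format. The function names and the 0.5
threshold are my own choices for illustration. A socket with no recognized
memory may show up as a zero-sized node or may not appear at all.

```python
import glob
import re

# Assumed line format in each node's meminfo file:
#   "Node 0 MemTotal:       16777216 kB"
MEMTOTAL_RE = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")


def parse_memtotal(text):
    """Return (node_id, memtotal_kb) from one node's meminfo text, or None."""
    m = MEMTOTAL_RE.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def suspicious_nodes(totals, ratio=0.5):
    """Return nodes whose memory is below `ratio` times the largest node's.

    `ratio` is an arbitrary illustrative threshold, not a standard value.
    """
    if not totals:
        return []
    biggest = max(totals.values())
    return sorted(n for n, kb in totals.items() if kb < ratio * biggest)


def read_node_totals(pattern="/sys/devices/system/node/node*/meminfo"):
    """Collect {node_id: memtotal_kb} from the per-node sysfs files."""
    totals = {}
    for path in glob.glob(pattern):
        with open(path) as f:
            parsed = parse_memtotal(f.read())
        if parsed:
            totals[parsed[0]] = parsed[1]
    return totals


if __name__ == "__main__":
    totals = read_node_totals()
    for node, kb in sorted(totals.items()):
        print(f"node {node}: {kb // 1024} MiB")
    print("suspicious nodes:", suspicious_nodes(totals))
```

If one node reports far less memory than its sibling while the DIMMs are
populated evenly, that points at the socket (or its seating) rather than at
the memory modules themselves.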