[Beowulf] Errors on IBM e325
landman at scalableinformatics.com
Mon Jun 28 11:21:34 PDT 2004
On Fri, 2004-06-25 at 11:21, Jeff Layton wrote:
> Good morning,
> We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel: extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel: corrected ecc error
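To see which nodes are affected and how often, log lines like the ones quoted above could be tallied per host. This is only a rough sketch: the regular expression assumes the syslog layout shown in the sample lines, and the name `tally_errors` is made up for illustration.

```python
import re
from collections import Counter

# Assumed syslog layout, taken from the sample lines in this thread:
#   "Jun 21 07:07:42 c3n32.cluster kernel: GART error 3"
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]{8}\s+(\S+)\s+kernel:\s+(.*)$")


def tally_errors(lines):
    """Count kernel messages per (host, message) pair."""
    counts = Counter()
    for line in lines:
        # Drop a leading "> " quote marker in case the lines come from an email.
        match = LINE_RE.match(line.lstrip("> ").rstrip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Jun 21 07:07:42 c3n32.cluster kernel: GART error 3",
        "Jun 21 07:40:52 c1n4.cluster kernel: GART error 3",
        "Jun 21 14:03:50 c1n2.cluster kernel: corrected ecc error",
    ]
    for (host, message), count in sorted(tally_errors(sample).items()):
        print(host, message, count)
```

Grouping by host first makes it easier to tell whether the GART and ECC messages cluster on a few nodes or are spread across the whole machine.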
Does booting with iommu=off help?
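After rebooting a node, it is worth confirming that the option actually reached the kernel. A minimal sketch, assuming a Linux node where `/proc/cmdline` is readable; the function names are hypothetical:

```python
def iommu_setting(cmdline):
    """Return the value of the last iommu= option on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("iommu="):
            value = token.split("=", 1)[1]
    return value


def running_iommu_setting(path="/proc/cmdline"):
    """Read the running kernel's command line and extract its iommu= option."""
    with open(path) as handle:
        return iommu_setting(handle.read())
```

If `running_iommu_setting()` returns `None` on a node that was supposed to boot with the option, the boot loader entry probably was not updated on that node.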
> Does anybody have any ideas what the cause might be?
The e325s have an onboard ATI VGA bit. Last I checked it was PCI based
(I don't have a unit here to check). There was a little discussion of
GART-related issues on the Red Hat amd64-list:
https://www.redhat.com/archives/amd64-list/2004-May/date.html
Which kernel are you running, how much memory is in each node, and how
is it distributed? I have noticed that some vendors do not configure the
memory on Opteron systems correctly, though I would expect the IBM folks
not to have a problem with this.
There are also some BIOS settings on the e325 that directly impact
memory layout, NUMA use, etc. Of course, I don't remember what they are.
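One way to answer the "how is it distributed" question is to read the per-node memory totals the kernel reports. This is a sketch under assumptions: it presumes a NUMA-aware kernel that exposes `/sys/devices/system/node/node*/meminfo` with lines of the form `Node 0 MemTotal: <n> kB`, and the function names are made up for illustration.

```python
import glob
import re

# Assumed line format in each node's meminfo file: "Node 0 MemTotal:  2097152 kB"
MEMTOTAL_RE = re.compile(r"Node\s+(\d+)\s+MemTotal:\s+(\d+)\s+kB")


def parse_node_memtotal(text):
    """Return (node_id, total_kB) from one node's meminfo text, or None if absent."""
    match = MEMTOTAL_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def memory_per_node(pattern="/sys/devices/system/node/node*/meminfo"):
    """Map NUMA node id to total memory in kB for every node file that parses."""
    totals = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            parsed = parse_node_memtotal(handle.read())
        if parsed is not None:
            totals[parsed[0]] = parsed[1]
    return totals
```

On a dual Opteron box, a result with all memory on one node (or only one node listed) would suggest the DIMMs or the BIOS memory settings are not set up the way NUMA expects.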
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://scalableinformatics.com
phone: +1 734 612 4615