[Beowulf] Errors on IBM e325

Joe Landman landman at scalableinformatics.com
Mon Jun 28 11:21:34 PDT 2004


On Fri, 2004-06-25 at 11:21, Jeff Layton wrote:
> Good morning,
> 
>    We've got a shiny new IBM cluster with e325 nodes (Opteron).
> However, we're having some trouble with a number of nodes.
> We keep getting 'GART' errors showing up in the logs. Here is
> an example,
> 
> Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
> Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
> Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
> Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
> Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
> Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error

Does booting with iommu=off help?
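
To compare before and after a change like that, it may help to tally the
kernel messages per node.  Here is a minimal sketch in Python, assuming
the syslog line format shown in the quoted excerpt above.  The function
name tally_kernel_messages and the SAMPLE text are only illustrative, and
the regular expression is a guess at the format, not a general syslog
parser.

```python
import re
from collections import Counter

# Assumed line format, taken from the quoted log excerpt, e.g.
#   Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
# An optional leading "> " (mail quoting) is tolerated.
LINE_RE = re.compile(
    r"^(?:>\s*)?\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+(?P<msg>.*\S)\s*$"
)


def tally_kernel_messages(lines):
    """Count (host, message) pairs for lines that look like kernel log entries."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m:
            counts[(m.group("host"), m.group("msg"))] += 1
    return counts


# The six lines from the quoted message, used here as sample input.
SAMPLE = """\
Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
Jun 21 14:03:49 c1n2.cluster kernel:     extended error chipkill ecc error
Jun 21 14:03:50 c1n2.cluster kernel:     corrected ecc error
"""

if __name__ == "__main__":
    counts = tally_kernel_messages(SAMPLE.splitlines())
    for (host, msg), n in sorted(counts.items()):
        print(f"{host:16} {n:3}  {msg}")
```

Run over the full logs from each boot, this should show whether the GART
and ECC messages stay on the same few nodes or move around.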

> 
> 
>    Does anybody have any ideas what the cause might be?

The e325s have an onboard ATI VGA bit.  Last I checked it was PCI based
(I don't have a unit here to check).  There was a little discussion of
GART-related issues on the Red Hat amd64 list:
https://www.redhat.com/archives/amd64-list/2004-May/date.html .  Which
kernel are you running, how much memory do the nodes have, and how is it
distributed?  I have noticed that some vendors do not configure the
memory on Opteron systems correctly, though I would expect the IBM folks
not to have a problem with this.

There are also some BIOS settings on the e325 that directly impact
memory layout, NUMA use, etc.  Of course, I don't remember what they
are :(.

Joe

> 
> Thanks!
> 
> Jeff
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615
