[Beowulf] Errors on IBM e325
Jeff Layton jeffrey.b.layton at lmco.com
Mon Jun 28 09:50:41 PDT 2004
Michael Will wrote:

>Was this not tested before it was deployed? Or is it a problem that only
>recently developed?
>

Well, supposedly it was tested before deployment. We're seeing these
errors (among others) on a number of nodes at random times. :(

>It sounds similar to http://lists.suse.com/archive/suse-amd64/2003-Sep/0063.html
>suggesting that you should make sure that you run the latest kernel, and if the problem
>persists is a case for your service contract. (i.e. hardware broken)
>

Well, I hate to say it, but it's not SuSE. It's the other guy :)
The kernel is only 2.4.21 but has been patched quite a bit. The
NUMA patches are in there, but not built into the binary kernel.
I'm not sure if we will continue to get support if we rebuild
the kernel with NUMA activated (our IT people require support
at all times).

>also see http://www.cs.caltech.edu/~weixl/research/fast-mon/arch/x86_64/kernel/bluesmoke.c
>

I'll try this code to see what it finds out.

Thanks!

Jeff

>Michael Will
>On Friday 25 June 2004 08:21 am, Jeff Layton wrote:
>
>>Good morning,
>>
>>   We've got a shiny new IBM cluster with e325 nodes (Opteron).
>>However, we're having some trouble with a number of nodes.
>>We keep getting 'GART' errors showing up in the logs. Here is
>>an example,
>>
>>Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error
>>Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error
>>Jun 21 07:07:42 c3n32.cluster kernel: GART error 3
>>Jun 21 07:40:52 c1n4.cluster kernel: GART error 3
>>Jun 21 14:03:49 c1n2.cluster kernel: extended error chipkill ecc error
>>Jun 21 14:03:50 c1n2.cluster kernel: corrected ecc error
>>
>>
>>   Does anybody have any ideas what the cause might be?
>>
>>Thanks!
>>
>>Jeff
>>
>

-- 
Dr. Jeff Layton
Aerodynamics and CFD
Lockheed-Martin Aeronautical Company - Marietta
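Editorial sketch, not part of the original message: because the errors show up on several nodes at random times, it may help to tally them per node before deciding whether the hardware or the kernel is at fault. The Python below is a minimal sketch that assumes the central log uses the "Mon DD HH:MM:SS host kernel: message" layout seen in the excerpt above. The function names (`tally_kernel_errors`, `errors_by_host`) and the regular expression are illustrative assumptions, not part of any existing tool.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpt quoted in the message.
LOG_LINES = [
    "Jun 21 07:07:42 c3n32.cluster kernel: Lost an northbridge error",
    "Jun 21 07:40:52 c1n4.cluster kernel: Lost an northbridge error",
    "Jun 21 07:07:42 c3n32.cluster kernel: GART error 3",
    "Jun 21 07:40:52 c1n4.cluster kernel: GART error 3",
    "Jun 21 14:03:49 c1n2.cluster kernel: extended error chipkill ecc error",
    "Jun 21 14:03:50 c1n2.cluster kernel: corrected ecc error",
]

# Assumed layout: month, day, time, host, the literal "kernel:", then the message.
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+[\d:]+\s+(\S+)\s+kernel:\s+(.*)$")


def tally_kernel_errors(lines):
    """Count occurrences of each (host, message) pair in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:  # silently skip lines that do not fit the assumed layout
            host, message = match.groups()
            counts[(host, message)] += 1
    return counts


def errors_by_host(lines):
    """Count kernel messages per host, regardless of message text."""
    per_host = Counter()
    for (host, _message), n in tally_kernel_errors(lines).items():
        per_host[host] += n
    return per_host


if __name__ == "__main__":
    for host, n in sorted(errors_by_host(LOG_LINES).items()):
        print(host, n)
```

To run it against a real log, read the lines from the file instead of `LOG_LINES`; nodes whose counts stand well above the rest would be the first candidates for a hardware check.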
