(Fwd) Re: HELP! linux cluster with LAM-MPI
Simen Timian Thoresen
simentt at dolphinics.no
Fri Feb 9 07:46:14 PST 2001
From: khocha at icu.ac.kr
To: beowulf at beowulf.org
Date sent: Fri, 9 Feb 2001 20:35:21 +0900
Subject: HELP! linux cluster with LAM-MPI
> Dear All.
> I'm a graduate student at 'Information and Communications Univ.' in Korea. In
> our lab we built a diskless cluster using Intel L440GX+ boards.
> The system runs Linux kernel 2.2.13 and LAM-MPI 6.3.2.
> During testing, however, the system ran into unexpected trouble.
> The MPI test program has only two communication steps (that is, it is 'EP'
> style): 1. distribute the data (at the beginning), 2. collect the result data (at
> the end). It uses only a little memory, but runs many loop iterations.
> With a few iterations it works well, but when we increase the number of loop
> iterations to solve harder problems, a node displays the error message
> below and then goes down.
> == ======
> [root at node11 root]# Unable to handle kernel paging request at virtual address e6 70e602
> current->tss.cr3 = 07591000, %cr3 = 07591000
> *pde = 00000000
> Oops: 0002
> CPU:    1
> EIP:    0010:
> EFLAGS: 00010246
> eax: 00000000   ebx: c7593fb4   ecx: 00000286   edx: 00000000
> esi: 00000000   edi: c7592000   ebp: c7593fbc   esp: c7593fa0
> ds: 0018   es: 0018   ss: 0018
> Process vital (pid: 424, process nr: 20, stackpage=c7593000)
> Stack: bffffe14 00000032 00000005 00000000 c7592000 00000000 1dcd6500 bffffd3c
>        c0109fb8 bffffd34 00000000 40107bec 00000000 bffffe14 bffffd3c 000000a2
>        c010002b 0000002b 000000a2 400a9f51 00000023 00000206 bffffd14 0000002b
> Call Trace:
> Code: 00 b0 02 e6 70 e6 80 e4 71 e6 80 88 c1 31 d2 88 ca 89 54 24
> == ======
> Please give us a hint on how to solve this problem.
> P.S. Our system consists of:
> L440GX+ nodes (dual Pentium III 550MHz; 24 cluster nodes, each without a local
> disk, using the server's RAID), a Compaq Proliant 1600 server (dual Pentium III
> 600MHz), a serial hub (Comtrol Rocketport), a Fast Ethernet hub (3Com), and
> 108 GB of RAID storage.
It looks like you have a kernel fault rather than a LAM fault. Does this occur
on one node only, or on random nodes? If it only happens on one node, I
would assume that node has faulty hardware. Unfortunately, I
am not qualified to analyse your Oops.
You might want to update your kernel, to either 2.2.18 or 2.4.1, and see if the
same oops occurs. If it still does, and on random machines at that, you
might get better help on the linux-kernel list.
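To settle the one-node-versus-random question, it may help to tally the oopses per node. The following is only a minimal sketch in Python, assuming the console output of each node has already been captured as text by some means. The function names, the input format, and the sample data are hypothetical; the only thing taken from the report above is the oops message itself.

```python
from collections import Counter

# First line of the oops as it appears in the report above.
OOPS_MARKER = "Unable to handle kernel paging request"


def count_oopses(logs):
    """Count oopses per node.

    logs: dict mapping a node name to its captured console text
    (a hypothetical input format, not something LAM or the kernel provides).
    Nodes without any oops are left out of the result.
    """
    counts = Counter()
    for node, text in logs.items():
        n = text.count(OOPS_MARKER)
        if n:
            counts[node] = n
    return counts


def diagnose(counts):
    """Turn the per-node counts into the rough rule of thumb from this mail."""
    if not counts:
        return "no oops seen"
    if len(counts) == 1:
        node = next(iter(counts))
        return f"only {node} oopsed: suspect hardware on that node"
    return "several nodes oopsed: suspect the kernel, try upgrading"


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "node11": "... Unable to handle kernel paging request at ...",
        "node12": "login: ",
    }
    print(diagnose(count_oopses(sample)))
```

A single bad node after many runs points towards hardware; oopses spread over several nodes point towards the kernel, which is when the upgrade is worth trying first.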
Simen Thoresen, Beowulf-cleaner and random artist.
Er det ikke rart? (Isn't it strange?)
The gnu RART-project on http://valinor.dolphinics.no:1080/~simentt/rart