HELP! linux cluster with LAM-MPI
Simen Timian Thoresen simentt at dolphinics.no
Fri Feb 9 05:27:53 PST 2001
From: khocha at icu.ac.kr
To: beowulf at beowulf.org
Date sent: Fri, 9 Feb 2001 20:35:21 +0900
Subject: HELP! linux cluster with LAM-MPI

> Dear All.
>
> I'm a graduate student at 'Information and Communications Univ.' in Korea.
> In our lab, we built a diskless cluster system with Intel L440GX+ boards.
>
> Our system uses Linux kernel 2.2.13 and LAM-MPI 6.3.2.
> During testing, the system ran into unexpected trouble.
>
> The MPI test program has only two communication steps (that is, it is
> 'EP' style):
> 1. distribute data (at the beginning),
> 2. collect result data (at the end).
> It uses only a little memory, but has many loop operations.
>
> With a few iterations it works well, but when we increase the number of
> loop operations to solve more difficult problems, a node displays the
> error message below and then goes down.
>
> ======================================================================================
> [root at node11 root]# Unable to handle kernel paging request at virtual address e6 70e602
> current->tss.cr3 = 07591000, %cr3 = 07591000
> *pde = 00000000
> Oops: 0002
> CPU: 1
> EIP: 0010:[]
> EFLAGS: 00010246
> eax: 00000000 ebx: c7593fb4 ecx: 00000286 edx: 00000000
> esi: 00000000 edi: c7592000 ebp: c7593fbc esp: c7593fa0
> ds: 0018 es: 0018 ss: 0018
> Process vital (pid: 424, process nr: 20, stackpage=c7593000)
> Stack: bffffe14 00000032 00000005 00000000 c7592000 00000000 1dcd6500 bffffd3c
>        c0109fb8 bffffd34 00000000 40107bec 00000000 bffffe14 bffffd3c 000000a2
>        c010002b 0000002b 000000a2 400a9f51 00000023 00000206 bffffd14 0000002b
> Call Trace: [] []
> Code: 00 b0 02 e6 70 e6 80 e4 71 e6 80 88 c1 31 d2 88 ca 89 54 24
> ======================================================================================
>
> Please give us a hint for solving this problem.
>
> p.s. Our system consists of
> -------------------------------
> L440GX+ (Dual Pentium III 550MHz, 24 cluster nodes; each node has no
> disk and uses the server's RAID),
> Compaq Proliant 1600 server (Dual Pentium III 600MHz, server),
> Serial HUB (Comtrol Rocketport),
> Fast Ethernet Hub (3com),
> 108 GB RAID
>

Sir,

This looks more like a kernel fault than a LAM fault. Does it occur on one node only, or on random nodes? If it only happens on one node, I would assume that node has some sort of faulty hardware.

Unfortunately, I am not qualified to analyse your Oops. You might want to update your kernel, either to 2.2.18 or 2.4.1, and see if the same oops occurs. If it still does, and on random machines at that, you might get better help if you go to linux-kernel.

Good luck.

Yours,
-Simen

--
Simen Thoresen, Beowulf-cleaner and random artist.
Er det ikke rart? The gnu RART-project on http://valinor.dolphinics.no:1080/~simentt/rart
