[Beowulf] BLACS Errors?

Ashton Peters ape20 at student.canterbury.ac.nz
Wed Aug 4 19:14:19 PDT 2004


I am having trouble with BLACS calls within a very simple Fortran 90
program on a ten-node dual-Opteron Rocks Linux 3.2.0 cluster. We have
the PGI CDK 5.1 installed.

I have written a simple Fortran program to test broadcast sends and
receives using BLACS. The full code of this program is attached to the
end of this message.

I compile this code with:
$ pgf90 -Mscalapack -o simple.opt simple.f

... and run it with:
$ mpirun -np X simple.opt

The code runs fine with 2 or 3 processes, for any vector length (n) I
choose. Below is the screen output from a successful 3-process run:

[ape20 at colossus fwdsolvers]$ pgf90 -Mscalapack -o simple.opt simple.f
[ape20 at colossus fwdsolvers]$ mpirun -np 3 simple.opt
ape20 at compute-0-0's password:
ape20 at compute-0-1's password:
 Process 0 is alive at grid position (0,0)
 For this test n =1000
 Array sent from process    0
 Process 1 is alive at grid position (0,1)
 Array received at process  1
 Process 2 is alive at grid position (0,2)
 Array received at process  2
[ape20 at colossus fwdsolvers]$

However, if I try to run with -np 4 or greater, I get the following
screen output:

[ape20 at colossus fwdsolvers]$ pgf90 -Mscalapack -o simple.opt simple.f
[ape20 at colossus fwdsolvers]$ mpirun -np 4 simple.opt
ape20 at compute-0-0's password:
ape20 at compute-0-1's password:
ape20 at compute-0-2's password:
 Process 0 is alive at grid position (0,0)
 For this test n =1000
 Array sent from process    0
 Process 2 is alive at grid position (0,2)
 Array received at process  2
 Process 1 is alive at grid position (0,1)
 Array received at process  1
bm_list_28551: (7.738281) wakeup_slave: unable to interrupt slave 0 pid 28550
Received disconnect from 10.255.255.252: Command terminated on signal 13.
[ape20 at colossus fwdsolvers]$ rm_l_1_19376: (5.019531) net_send: could not write to fd=6, errno = 9
rm_l_1_19376:  p4_error: net_send write: -1
    p4_error: latest msg from perror: Bad file descriptor
rm_l_2_10837: (2.453125) net_send: could not write to fd=6, errno = 9
rm_l_2_10837:  p4_error: net_send write: -1
    p4_error: latest msg from perror: Bad file descriptor
[ape20 at colossus fwdsolvers]$

Does anyone know what these error messages mean, and how I can fix
them? I am a beginner with BLACS, so it is possible that my Fortran
code has not initialized it correctly, but I have checked it against
many tutorial examples and it seems OK.

Many thanks in advance,

Ashton Peters

Center for Bioengineering
University of Canterbury
Christchurch, New Zealand

----- FORTRAN CODE -----

        program SIMPLE
        
ccccc   VERY SIMPLE BLACS TEST PROGRAM   ccccc

ccccc   Declare variables       
        integer iam,nprocs,nprows,npcols,ctxt,myprow,mypcol
        integer junk(5000)
        
ccccc   Total number of processes
        call BLACS_PINFO(iam,nprocs)
        
ccccc   Define size of process grid (in this case a single row)
        nprows=1
        npcols=nprocs
        
ccccc   Get the system context
        call BLACS_GET(0,0,ctxt)
        
ccccc   Initialise the process grid
        call BLACS_GRIDINIT(ctxt,'Row',nprows,npcols)
        call BLACS_GRIDINFO(ctxt,nprows,npcols,myprow,mypcol)
        
ccccc   Get each process to check in with grid coordinates
10      format(a8,i2,a28,i1,a1,i1,a1)
        print 10,'Process',iam,
     &           'is alive at grid position (',myprow,',',mypcol,')'
        
ccccc   Master generates integer array and broadcasts to all slaves
        if((myprow.eq.0).and.(mypcol.eq.0)) then
        
        n=1000
        call IGEBS2D(ctxt,'All',' ',1,1,n,1)
        
20      format(a18,i4)
        print 20,'For this test n =',n
        
        do i=1,n
          junk(i)=i
        enddo
        
        call IGEBS2D(ctxt,'All',' ',n,1,junk,5000)
        print 30,'Array sent from process   ',iam

ccccc   End master code
        endif
        
ccccc   Slaves receive info and check it is correct
        if((myprow.ne.0).or.(mypcol.ne.0)) then

        call IGEBR2D(ctxt,'All',' ',1,1,n,1,0,0)
        
        call IGEBR2D(ctxt,'All',' ',n,1,junk,2500,0,0)

30      format(a27,i2)
        if((junk(1).eq.1).and.(junk(n).eq.n)) then
          print 30,'Array received at process ',iam
        else
          print 30,'Error receiving at process',iam
        endif

ccccc   End slave code
        endif

ccccc   End program
        end

----- END OF FORTRAN CODE -----
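
For comparison, here is a minimal sketch of the same single-row
broadcast test with two differences from the listing above. Both are
assumptions about what might matter, not confirmed causes of the p4
errors: (1) the send and the receive use the same leading dimension
(5000) for junk, and (2) every process releases the grid with
BLACS_GRIDEXIT and shuts BLACS down with BLACS_EXIT before the program
ends. The program name SIMPLE2 is made up for illustration.

```fortran
      program SIMPLE2
c     Sketch only: same test as SIMPLE, plus explicit BLACS teardown.
      integer iam, nprocs, nprows, npcols, ctxt, myprow, mypcol
      integer n, i
      integer junk(5000)

c     Process count, then a 1 x nprocs grid in the system context
      call BLACS_PINFO(iam, nprocs)
      nprows = 1
      npcols = nprocs
      call BLACS_GET(0, 0, ctxt)
      call BLACS_GRIDINIT(ctxt, 'Row', nprows, npcols)
      call BLACS_GRIDINFO(ctxt, nprows, npcols, myprow, mypcol)

      if ((myprow.eq.0) .and. (mypcol.eq.0)) then
c        Master: fill the array, broadcast its length, then the data
         n = 1000
         do i = 1, n
            junk(i) = i
         enddo
         call IGEBS2D(ctxt, 'All', ' ', 1, 1, n, 1)
         call IGEBS2D(ctxt, 'All', ' ', n, 1, junk, 5000)
         print *, 'Array sent from process', iam
      else
c        Slaves: receive from grid position (0,0), same LDA as sender
         call IGEBR2D(ctxt, 'All', ' ', 1, 1, n, 1, 0, 0)
         call IGEBR2D(ctxt, 'All', ' ', n, 1, junk, 5000, 0, 0)
         if ((junk(1).eq.1) .and. (junk(n).eq.n)) then
            print *, 'Array received at process', iam
         else
            print *, 'Error receiving at process', iam
         endif
      endif

c     Assumed fix: release the grid, then shut BLACS down.
c     BLACS_EXIT(0) also finalizes the underlying message layer.
      call BLACS_GRIDEXIT(ctxt)
      call BLACS_EXIT(0)
      end
```

It would be compiled and run the same way as the original, e.g.
pgf90 -Mscalapack -o simple2.opt simple2.f followed by
mpirun -np 4 simple2.opt.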