[Beowulf] Please help test compiler/hardware issue

Robert G. Brown rgb at phy.duke.edu
Thu May 3 17:11:45 PDT 2007


On Thu, 3 May 2007, Orion Poplawski wrote:

>
> Okay, I have a test case for the problem I reported before that I've 
> attached.
>
> We have two pairs of identical machines:
>
> - 2 Tyan S2882 Dual processor Opteron 244 stepping 10
> - 2 Tyan S2882-D Dual processor dual core Opteron 275 stepping 2
>
> The attached code when compiled with the Portland Group Fortran compiler with 
> -O2 and run on either of the 244's will abort in random locations:

What about gfortran?  Or pathscale?

Mind you, I made myself actually look at the code below (shudder) in
spite of it being Fortran, and it looks OK as far as >>I<< can tell,
given that I haven't done Fortran unless my life depended on it for
twenty years or so.  To me it is weird to use a(1) both as the address
of a(1) (as an argument to the subroutines) and as the contents of
a(1) = 1, but hey.

It seems really really odd that any compiler or any program would fail
on this piece of code, though.  I wonder if a C memcpy would fail?  Or
what does stream (with a check) do?  Stream's copy isn't much more than
this.
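
A minimal sketch of the C memcpy check wondered about above.  The
attached testatob.f90 is not reproduced in this message, so the function
names, the fill pattern, and the use of single-precision floats are all
assumptions made for illustration, not a translation of the original
test.  It fills one array, copies it with memcpy, and compares the two
element by element:

```c
/* Sketch only: names, fill pattern, and element type are assumptions;
   this is not a port of the attached Fortran test. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Fill a[] with a pass-dependent pattern, memcpy it to b[], and compare.
   Returns the index of the first mismatch, or -1 if the copy is clean. */
long copy_and_check(float *a, float *b, size_t n, long pass)
{
    size_t i;

    for (i = 0; i < n; i++)
        a[i] = (float)(pass + (long)i);

    memcpy(b, a, n * sizeof *a);

    for (i = 0; i < n; i++)
        if (a[i] != b[i])
            return (long)i;
    return -1;
}

/* Run `passes` copies of n floats, stopping at the first mismatch.
   Returns 0 if every pass was clean, 1 on a mismatch, 2 if malloc failed. */
int run_copy_test(size_t n, long passes)
{
    float *a = malloc(n * sizeof *a);
    float *b = malloc(n * sizeof *b);
    long pass, bad;
    int status = 0;

    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return 2;
    }

    for (pass = 0; pass < passes && status == 0; pass++) {
        bad = copy_and_check(a, b, n, pass);
        if (bad >= 0) {
            printf("copy check failed: pass=%ld n=%zu i=%ld a(i)=%f b(i)=%f\n",
                   pass, n, bad, a[bad], b[bad]);
            status = 1;
        }
    }

    free(a);
    free(b);
    return status;
}
```

Calling run_copy_test(246500, passes) from a small main(), with the
array length taken from the abort messages below and a pass count large
enough to run for a minute or more, and building it at both -O1 and -O2,
would be one way to see whether a plain C copy misbehaves on the 244s in
the same way.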

Maybe somebody who has used Fortran more recently than the mid-eighties
can comment further on the code, but to me it looks like a very odd
compiler bug.

    rgb

>
> [orion@coop00 rams.debug]$ pgf95 -O2 -o testatob testatob.f90
> [orion@coop00 rams.debug]$ ./testatob
> checkatob abort n=       246500 , i=         4685  a(i)=    8712085.
>  b(i)=    8465585.
> Abort
> [orion@coop00 rams.debug]$ ./testatob
> checkatob abort n=       246500 , i=       145817  a(i)=    9592717.
>  b(i)=    8853217.
> Abort
>
> [orion@coop01 rams.debug]$ time ./testatob
> checkatob abort n=       246500 , i=       118169  a(i)=    9565069.
>  b(i)=    8825569.
> Aborted
>
> real    0m31.842s
> user    0m16.476s
> sys     0m0.060s
>
>
> Haven't seen it run longer than 1 minute yet.
>
> However, it runs fine on the 275's (or at least I haven't seen it crash yet). 
> It also runs fine on the 244's when compiled with -O1.
>
> So, I guess this points to a hardware issue, but it may be a somewhat 
> generalized hardware issue.  I'd love to hear reports on other (particularly 
> other Tyan S2882 dual 244's) systems.
>
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




