[Beowulf] hang-up of HPC Challenge

Mikhail Kuzminsky kus at free.net
Wed Aug 20 10:52:51 PDT 2008


In message from Greg Lindahl <lindahl at pbm.com> (Tue, 19 Aug 2008 
19:39:38 -0700):
>On Wed, Aug 20, 2008 at 03:45:43AM +0400, Mikhail Kuzminsky wrote:
>> To localize the possible cause of the problem, I ran the pure HPL test
>> instead of HPCC. HPL writes its output directly to the screen instead
>> of to a file.
>>
>> Using MPICH with np=8 I obtained a normal HPL result for N=35000,
>> including 3 "PASSED" strings for the ||Ax-b|| calculations. BUT! Linux
>> hangs immediately after these strings are printed.
>
>Well, what did your configuration file tell HPL to do? Does it have
>another test, perhaps a bigger one, or is it supposed to exit? We
>aren't mind-readers.

Sorry, I should have given the details. I have now performed 2 HPL runs,
both with N=10000:

(1st) a "single" HPL run, i.e. ONE N=10000, ONE blocksize value, and
ONE value for every other HPL.dat parameter.

(2nd) a "multiple" HPL run with the same (single) N=10000 and
blocksize=100, but with sets of values for PFACT, NBMIN and RFACT (see
the output below).

The 1st run finished successfully; the 2nd led to a Linux hang.

Yours
Mikhail 

"single" HPL run :
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   10000
NB     :     100
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 16 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR11C2R4       10000   100     2     4              23.32          2.859e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0767386 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0181586 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0040588 ...... PASSED
============================================================================

Finished      1 tests with the following results:
               1 tests completed and passed residual checks,
               0 tests completed and failed residual checks,
               0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
[1]+  Done                    mpirun -np 8 xhpl
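
Note on reading the T/V column: the label in the result line packs the run parameters into 8 characters. The following is a minimal Python sketch for unpacking it, assuming the field order that the parameter lists in this post suggest (W, process mapping, DEPTH, BCAST index, RFACT, NDIV, PFACT, NBMIN). The helper name decode_variant and the lookup tables are illustrative assumptions, not part of HPL; only BCAST indices 0 and 1 are confirmed by the output shown here.

```python
# Decode an HPL "T/V" label such as WR11C2R4 into named parameters.
# Field order is inferred from the parameter lists printed in this post.

BCAST_NAMES = {0: "1ring", 1: "1ringM", 2: "2ring",
               3: "2ringM", 4: "Blong", 5: "BlongM"}
FACT_NAMES = {"L": "Left", "C": "Crout", "R": "Right"}
PMAP_NAMES = {"R": "Row-major", "C": "Column-major"}


def decode_variant(label):
    """Split an 8-character T/V label into fields (hypothetical helper)."""
    if len(label) != 8 or label[0] != "W":
        raise ValueError(f"unexpected T/V label: {label!r}")
    return {
        "PMAP": PMAP_NAMES[label[1]],
        "DEPTH": int(label[2]),
        "BCAST": BCAST_NAMES[int(label[3])],
        "RFACT": FACT_NAMES[label[4]],
        "NDIV": int(label[5]),
        "PFACT": FACT_NAMES[label[6]],
        "NBMIN": int(label[7]),
    }


if __name__ == "__main__":
    for label in ("WR11C2R4", "WR00L2L2"):
        print(label, decode_variant(label))
```

Applied to WR11C2R4 this gives DEPTH=1, BCAST=1ringM, RFACT=Crout, NDIV=2, PFACT=Right, NBMIN=4, which agrees with the parameter list of the "single" run.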

"multiple" HPL run:
HPLinpack 1.0a  --  High-Performance Linpack benchmark  --   January 20, 2004
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   10000
NB     :     100
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :    Left    Crout    Right
NBMIN  :       2        4
NDIV   :       2
RFACT  :    Left    Crout    Right
BCAST  :   1ring
DEPTH  :       0
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 16 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
    1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
    2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
    3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L2       10000   100     2     4              23.02          2.897e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0980967 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0232126 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0051885 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2L4       10000   100     2     4              22.97          2.903e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0832258 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0196937 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0044019 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
WR00L2C2       10000   100     2     4              22.95          2.905e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0980967 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0232126 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0051885 ...... PASSED

... and here Linux hangs ...
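
For context on how far the "multiple" run got: its parameter list gives 3 RFACT x 3 PFACT x 2 NBMIN values, so 18 tests would be expected in total. The following is a minimal Python sketch that enumerates the expected T/V labels. It assumes the loop order suggested by the three labels printed above (RFACT outermost, then PFACT, NBMIN innermost); that ordering and the helper name expected_labels are assumptions drawn from this output, not from HPL documentation.

```python
from itertools import product

# Values taken from the parameter list of the "multiple" run.
RFACT = ["L", "C", "R"]
PFACT = ["L", "C", "R"]
NBMIN = [2, 4]
DEPTH, BCAST, NDIV = 0, 0, 2  # BCAST index 0 corresponds to 1ring


def expected_labels():
    """List the T/V labels in the assumed execution order (hypothetical helper)."""
    # product() varies its last argument fastest, so NBMIN is the inner loop.
    return [
        f"WR{DEPTH}{BCAST}{rf}{NDIV}{pf}{nb}"
        for rf, pf, nb in product(RFACT, PFACT, NBMIN)
    ]


if __name__ == "__main__":
    labels = expected_labels()
    print(len(labels), "tests expected")
    print("first four:", labels[:4])
```

If that ordering holds, the three reported tests are the first 3 of 18, and the next label would have been WR00L2C4 (RFACT=Left, PFACT=Crout, NBMIN=4).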





