HPL residual check failure

Antoine Petitet petitet at cs.utk.edu
Mon Sep 3 09:06:44 PDT 2001


  Hi,
  
> ||Ax-b||_oo / ( eps * ||A||_1  * N        ) =     3255.3898794 ...... FAILED

  means that the last 4 digits of the solution vector are incorrect,
  since 3255.389... = O(10^3). This also means that the first 12 digits
  are correct ...
  
  This residual should be O(1). Because it exceeds the threshold value
  given in HPL.dat (16.0 is the default), the test is flagged as failed.
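
  To make the check concrete, here is a minimal sketch of the same scaled
  residual in Python with NumPy. It is not HPL's code: the function name
  scaled_residual, the uniform test matrix, and the use of NumPy's
  double-precision eps are assumptions made for illustration (HPL computes
  its own eps and generates its own matrix). The last lines repeat the
  digit-counting argument above, assuming about 16 significant decimal
  digits in double precision.

```python
import math
import numpy as np

THRESHOLD = 16.0  # default threshold quoted above from HPL.dat


def scaled_residual(A, x, b):
    """||Ax-b||_oo / (eps * ||A||_1 * N), the quantity quoted above."""
    n = A.shape[0]
    eps = np.finfo(A.dtype).eps  # assumption: NumPy's machine epsilon
    r = np.linalg.norm(A @ x - b, np.inf)
    return r / (eps * np.linalg.norm(A, 1) * n)


# Hypothetical test problem: a small random system, not HPL's generator.
rng = np.random.default_rng(0)
n = 200
A = rng.uniform(-0.5, 0.5, size=(n, n))
b = rng.uniform(-0.5, 0.5, size=n)

x = np.linalg.solve(A, b)
res_ok = scaled_residual(A, x, b)

# Simulate a corrupted entry in the solution vector.
x_bad = x.copy()
x_bad[0] += 1e-6
res_bad = scaled_residual(A, x_bad, b)

print("clean solve    :", res_ok, "PASSED" if res_ok < THRESHOLD else "FAILED")
print("corrupted solve:", res_bad, "PASSED" if res_bad < THRESHOLD else "FAILED")

# Digit counting for the residual reported in the question.
reported = 3255.3898794
digits_lost = math.ceil(math.log10(reported))  # order of magnitude, rounded up
digits_correct = 16 - digits_lost              # assumes ~16 digits in a double
print("digits lost:", digits_lost, "digits correct:", digits_correct)
```

  A single perturbed entry is enough to push the residual far above the
  threshold, which is why this test is a useful detector of corrupted data.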
  
  These residuals are computed because, if you were to report the
  performance achieved by your system to, say, the Top 500 list, you
  would be asked for them.

  
  Such a failure may happen for one of the two following reasons:
  
  1) The random matrix generator may produce a poorly conditioned matrix,
  such that a more accurate result cannot be produced with the algorithm
  used in HPL.

  One way to check for this would be to estimate the condition number of
  the randomly generated matrix. HPL does not do it, because such an
  operation is time-consuming, and also because a large number of those
  randomly generated matrices have been shown to be sufficiently well-
  conditioned.
  
  In short: On one hand, I cannot prove that this generator produces 
  well-conditioned matrices, and on the other hand, not a single case 
  of failure due to this generator has been reported so far.
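
  For a small problem, the condition number estimate mentioned above can
  be sketched as follows. This is only an illustration: NumPy's uniform
  generator stands in for HPL's own generator (an assumption), and
  np.linalg.cond forms the inverse explicitly, which is exactly the kind
  of cost that makes the check impractical at benchmark sizes. The
  log10 rule of thumb for digits at risk is a common heuristic, not a
  bound.

```python
import numpy as np

# Hypothetical stand-in for the HPL matrix: uniform entries in [-0.5, 0.5).
rng = np.random.default_rng(1)
n = 300
A = rng.uniform(-0.5, 0.5, size=(n, n))

# 1-norm condition number: ||A||_1 * ||A^-1||_1.
kappa = np.linalg.cond(A, 1)

# Rule of thumb: roughly log10(kappa) decimal digits may be lost.
digits_at_risk = np.log10(kappa)

print("cond_1(A)      =", kappa)
print("digits at risk ~", digits_at_risk)
```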

  
  2) For some reason, a bit or a byte is being corrupted during the
  computations or communications. Such a problem may be caused by the
  hardware or the software. For example: a memory bank corrupts data, a
  network transmission fails, or a computation gets the wrong result, say
  because of a data alignment issue.
  
  Software problems are relatively easy to track down: multiple
  implementations of MPI and of the BLAS are available, so one can swap
  one for another and see whether the failure persists. The problem could
  be in HPL as well, but with the source available, one can potentially
  investigate.
  
  Hardware failures are more problematic. They are often not repeatable,
  and they rarely occur during a short run. They are also rare.
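
  One simple way to look for non-repeatable behaviour is to run the same
  deterministic computation several times and compare the results bit for
  bit. The sketch below is hypothetical throughout: is_repeatable,
  healthy and flaky are names invented for this example, and the "fault"
  is simulated in software. On real hardware a long run would take the
  place of the toy computation.

```python
import numpy as np


def is_repeatable(compute, runs=3):
    """Call compute() `runs` times; True if all results are bitwise equal."""
    reference = compute()
    return all(np.array_equal(reference, compute()) for _ in range(runs - 1))


rng = np.random.default_rng(2)
data = rng.uniform(-0.5, 0.5, size=1000)


def healthy():
    # Deterministic computation: same input, same output on every call.
    return np.cumsum(data)


calls = {"n": 0}


def flaky():
    # Simulated transient fault: one entry is corrupted on the second call.
    calls["n"] += 1
    out = np.cumsum(data)
    if calls["n"] == 2:
        out[10] += 1e-9
    return out


print("healthy repeatable:", is_repeatable(healthy))
print("flaky repeatable  :", is_repeatable(flaky))
```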


  
  Cheers,
  
  Antoine





