HPL residual check failure
petitet at cs.utk.edu
Mon Sep 3 09:06:44 PDT 2001
> ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 3255.3898794 ......
means that the last 4 digits of the solution vector are incorrect, since
3255.389... = O(10^3). This also means that the first 12 digits are
correct (double precision carries about 16 significant decimal digits).
This residual should be O(1), and because it is found to be larger
than the threshold value given in HPL.dat (16.0 is the default),
the test is flagged as failed.
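
To make the check above concrete, here is a minimal sketch in Python of
the quoted quantity, ||Ax-b||_oo / ( eps * ||A||_1 * N ), for a small dense
system. This is not HPL's code. The function names, the matrix size, and
the use of sys.float_info.epsilon for eps are assumptions made for
illustration (HPL determines its own eps, which may differ by a factor
of 2), and the solver is plain Gaussian elimination with partial pivoting
rather than HPL's distributed LU.

```python
import math
import random
import sys

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        # Bring the largest entry of column k (rows k..n-1) to the pivot row.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def scaled_residual(a, x, b):
    """Return ||Ax-b||_oo / (eps * ||A||_1 * N)."""
    n = len(a)
    eps = sys.float_info.epsilon  # assumption: stands in for HPL's eps
    r_inf = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                for i in range(n))
    # The 1-norm of a matrix is its largest absolute column sum.
    a_one = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    return r_inf / (eps * a_one * n)

THRESHOLD = 16.0  # default threshold mentioned in the post

rng = random.Random(0)
n = 50  # hypothetical problem size, tiny compared to a real HPL run
a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
res = scaled_residual(a, solve(a, b), b)
print(res, "PASSED" if res < THRESHOLD else "FAILED")

# Order of magnitude of the reported residual: about 3.5 decimal digits,
# which the post rounds up to "the last 4 digits".
digits_lost = math.log10(3255.3898794)
print(digits_lost)
```

The scaling by eps * ||A||_1 * N is what makes a value of O(1) the
expected outcome independent of the problem size, so a fixed threshold
such as 16.0 can be used.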
These residuals are computed because, if you were to report the
performance achieved by your system to, say, the Top 500 list, you would
be asked for them.
Such a failure may happen for one of the two following reasons:
1) The random matrix generator may produce a poorly conditioned matrix,
such that a more accurate result cannot be produced with the algorithm
used in HPL.
One way to check for this would be to estimate the condition number of
the randomly generated matrix. HPL does not do it, because such an
operation is time-consuming, and also because a large number of those
randomly generated matrices have been shown to be sufficiently
well-conditioned.
In short: On one hand, I cannot prove that this generator produces
well-conditioned matrices, and on the other hand, not a single case
of failure due to this generator has been reported so far.
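
For a small matrix, the condition number check mentioned above can be
sketched as follows. This is an illustration only: the helper names
(norm1, inverse, cond1) are hypothetical, and forming the explicit
inverse costs O(N^3) operations, which is exactly the kind of expense
that makes the check impractical at HPL problem sizes. Cheaper
estimators that reuse an existing LU factorization exist (LAPACK's
DGECON is one), but they are not shown here.

```python
def norm1(a):
    """Matrix 1-norm: the largest absolute column sum."""
    n = len(a)
    return max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))

def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    # Augment a with the identity; the right half becomes the inverse.
    m = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        piv = m[k][k]  # a zero pivot here means the matrix is singular
        m[k] = [v / piv for v in m[k]]
        for i in range(n):
            if i != k:
                f = m[i][k]
                m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    return [row[n:] for row in m]

def cond1(a):
    """1-norm condition number ||A||_1 * ||A^-1||_1."""
    return norm1(a) * norm1(inverse(a))

# A badly scaled diagonal matrix has a large condition number.
print(cond1([[1.0, 0.0], [0.0, 1e-8]]))
```

A large value from cond1 would point to reason 1), while a modest value
together with a failed residual check would make reason 2) the more
likely suspect.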
2) For some reason, a bit or a byte is being corrupted during the
computations or communications. Such a problem may be caused by the
hardware or the software. Examples: a memory bank corrupts data, a network
transmission fails, or a computation gets the wrong result, say because
of a data alignment problem.
Software problems are relatively easy to track down: multiple
implementations of MPI and of the BLAS are available, so one can be swapped
for another to see whether the failure persists. The problem could be in
HPL as well, but with the source available, one can potentially investigate.
Hardware failures are more problematic. They are often not repeatable,
and they rarely occur during a short run. They are also rare.
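
One way to probe for the non-repeatable corruption described above is to
run the same computation several times and compare the results bit for
bit. The sketch below is a hypothetical helper, not part of HPL. It
assumes the computation is deterministic for a fixed configuration,
which is not guaranteed for every MPI or BLAS implementation, and it
simulates a hardware fault by flipping a single bit in one value.

```python
import hashlib
import struct

def fingerprint(values):
    """Hash the exact bit patterns of a list of doubles."""
    raw = struct.pack("<%dd" % len(values), *values)
    return hashlib.sha256(raw).hexdigest()

def is_repeatable(compute, runs=3):
    """Return True if every run of compute() yields identical bits."""
    prints = {fingerprint(compute()) for _ in range(runs)}
    return len(prints) == 1

def flip_bit(x, bit):
    """Return x with one bit of its 64-bit representation inverted."""
    (i,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", i ^ (1 << bit)))[0]

# A stand-in for a solver run: the same input gives the same output.
def stable_run():
    return [1.0, 2.0, 3.0]

# A stand-in for faulty hardware: the second call corrupts one bit.
calls = {"n": 0}
def flaky_run():
    calls["n"] += 1
    out = [1.0, 2.0, 3.0]
    if calls["n"] == 2:
        out[0] = flip_bit(out[0], 0)
    return out

print(is_repeatable(stable_run), is_repeatable(flaky_run))
```

Because such failures rarely show up in a short run, a check like this
is only informative when the repeated runs are long enough to exercise
the memory and network the way the failing run did.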