HPL residual check failure
Antoine Petitet petitet at cs.utk.edu
Mon Sep 3 09:06:44 PDT 2001
Hi,

> ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 3255.3898794 ...... FAILED

means that the last 4 digits of the solution vector are incorrect, since 3255.389... = O(10^3). It also means that the first 12 digits are correct. This residual number should be O(1), and because it is larger than the threshold value given in HPL.dat (16.0 is the default), the test is flagged as failed. These residuals are computed because you would be asked for them if you were to report the performance achieved by your system to, say, the Top 500 list.

Such a failure may happen for one of the following two reasons:

1) The random matrix generator may produce a poorly conditioned matrix, such that a more accurate result cannot be produced with the algorithm used in HPL. One way to check for this would be to estimate the condition number of the randomly generated matrix. HPL does not do this, because such an operation is time-consuming, and also because a large number of these randomly generated matrices have been shown to be sufficiently well-conditioned. In short: on the one hand, I cannot prove that this generator produces well-conditioned matrices; on the other hand, not a single case of failure due to this generator has been reported so far.

2) For some reason, a bit or a byte is being corrupted during the computations or communications. Such a problem may be caused by the hardware or by the software. For example: a memory bank corrupts data, a network transmission fails, or a computation gets the wrong result, say because of a data alignment issue.

Software problems are relatively easy to track down: multiple implementations of MPI and of the BLAS are available, so one can be substituted for another to isolate the fault. The problem could be in HPL as well, but with the source available, one can potentially investigate. Hardware failures are more problematic. They are often not repeatable, and they rarely occur during a short run. They are also rare.

Cheers,
Antoine
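For readers who want to see the check from the message in action, here is a minimal sketch in plain Python of the quantity ||Ax-b||_oo / ( eps * ||A||_1 * N ) and its comparison against the 16.0 threshold. It is not HPL's code. The helper names (solve, scaled_residual), the small problem size, the right-hand side built from a known solution of all ones, and the use of sys.float_info.epsilon for eps are all assumptions made for illustration; HPL's own definition of eps may differ by a constant factor. The "corrupted" case simply nudges one entry of the solution by 1e-8 as a stand-in for the bit or byte corruption described in reason 2.

```python
import random
import sys

THRESHOLD = 16.0  # default threshold quoted in the message


def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data stays untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def scaled_residual(a, x, b):
    """Return ||Ax-b||_oo / (eps * ||A||_1 * N)."""
    n = len(a)
    eps = sys.float_info.epsilon  # assumption: stand-in for HPL's eps
    # Infinity norm of the residual: largest absolute component of Ax-b.
    r_inf = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i]) for i in range(n))
    # 1-norm of A: largest absolute column sum.
    a_one = max(sum(abs(a[i][j]) for i in range(n)) for j in range(n))
    return r_inf / (eps * a_one * n)


random.seed(0)
n = 50
a = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [sum(row) for row in a]  # right-hand side whose exact solution is all ones

x = solve(a, b)
clean = scaled_residual(a, x, b)

# Stand-in for corruption: perturb a single entry of the solution.
x_bad = x[:]
x_bad[0] += 1e-8
corrupted = scaled_residual(a, x_bad, b)

for label, value in (("clean", clean), ("corrupted", corrupted)):
    print(label, value, "PASSED" if value < THRESHOLD else "FAILED")
```

The point of the sketch is the scaling: dividing by eps * ||A||_1 * N turns a residual at the level of rounding error into a number of order one, so an error as small as 1e-8 in a single entry pushes the value far past the threshold.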
More information about the Beowulf mailing list
