Clarification: [Beowulf] hpl - large problems fail

Craig Tierney ctierney at HPTI.com
Fri Mar 11 07:06:21 PST 2005


On Fri, 2005-03-11 at 07:46, Guy Coates wrote:
> > the command prompt when I run it.  It fails when it checks the solution
> > to linear equations.  The residual is too high and fails.  This is part
> > of the data from my HPL.out file:
> >
> 
> This could still be dodgy memory; if bits get flipped then you can expect
> those sorts of numerical instabilities.
> 
> Try running a single HPL job on each machine. If you get the correct
> answer on 3 machines and the wrong answer on one, then you've narrowed it
> down to hardware.
> 
> If you get the wrong answer on all your machines then you probably have a
> software problem. Try recompiling HPL with no compiler optimisations, or
> with a different compiler and/or BLAS library.
> 
> 
> If that doesn't work, then it might just be possible that you are into
> weird hardware/kernel bug territory.  I ran into similar HPL problems
> whilst benchmarking a rather large hardware purchase we made several years
> ago. The HPL residuals were coming out as NaN.  Recompiling with a
> different compiler gave the same result. Rather worryingly, the same
> binaries ran correctly when run on different hardware. After a lot of head
> scratching and phone calls to an extremely worried vendor ("Hey, this kit
> you sold us can't do maths properly!") the problem was tracked down to a
> dodgy kernel module. It turned out that the module provided by the vendor
> to do console-over-lan stomped over the floating point registers under
> certain circumstances.
> 

It could also be the interconnect.  If you are using Ethernet, I
think that is unlikely, but I have seen high-speed interconnects
where a problem with the PCI slot gave us wrong answers whenever
we ran HPL on more than 2 systems.
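
A minimal sketch of the per-node triage described above, assuming each
node's single-machine run has been saved to its own HPL.out and that
every residual check line ends in a run of dots followed by PASSED or
FAILED, as in stock HPL output.  The names node_passed and diagnose are
hypothetical helpers for illustration, not part of HPL.

```python
import re
from typing import Dict, Optional

# Assumption: each HPL residual check line ends with dots and then
# either PASSED or FAILED, e.g. "... = 0.0123456 ...... PASSED".
CHECK_RE = re.compile(r"\.{3,}\s*(PASSED|FAILED)\s*$", re.MULTILINE)


def node_passed(hpl_out: str) -> Optional[bool]:
    """True if all checks passed, False if any failed, None if none found."""
    verdicts = CHECK_RE.findall(hpl_out)
    if not verdicts:
        return None
    return all(v == "PASSED" for v in verdicts)


def diagnose(outputs: Dict[str, str]) -> str:
    """Map per-node single-machine HPL output to a likely culprit."""
    results = {host: node_passed(text) for host, text in outputs.items()}
    unknown = sorted(h for h, ok in results.items() if ok is None)
    if unknown:
        return "no residual checks found for: " + ", ".join(unknown)
    bad = sorted(h for h, ok in results.items() if ok is False)
    if not bad:
        # Every node is fine alone, so the fault only shows up multi-node.
        return "all nodes pass alone; suspect the interconnect"
    if len(bad) == len(results):
        return "all nodes fail; suspect software (compiler, BLAS, kernel module)"
    return "suspect hardware on: " + ", ".join(bad)
```

Feed it a dict of hostname to the text of that node's HPL.out.  If
every node passes alone but the combined run still fails, that points
at the interconnect rather than at memory on any one machine.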

Craig



