Clarification: [Beowulf] hpl - large problems fail
Craig Tierney ctierney at HPTI.com
Fri Mar 11 07:06:21 PST 2005
On Fri, 2005-03-11 at 07:46, Guy Coates wrote:
> > the command prompt when I run it. It fails when it checks the solution
> > to linear equations. The residual is too high and fails. This is part
> > of the data from my HPL.out file:
> >
>
> This could still be dodgy memory; if bits get flipped then you can expect
> those sorts of numerical instabilities.
>
> Try running a single HPL job on each machine. If you get the correct
> answer on 3 machines and the wrong answer on one, then you've narrowed it
> down to hardware.
>
> If you get the wrong answer on all your machines then you probably have a
> software problem. Try recompiling HPL with no compiler optimisations, a
> different compiler, and/or a different BLAS library.
>
>
> If that doesn't work, then it might just be possible that you are into
> weird hardware/kernel bug territory. I ran into similar HPL problems
> whilst benchmarking a rather large hardware purchase we made several years
> ago. The HPL residuals were coming out as NaN. Recompiling with a
> different compiler gave the same result. Rather worryingly, the same
> binaries ran correctly when run on different hardware. After a lot of head
> scratching and phone calls to an extremely worried vendor ("Hey, this kit
> you sold us can't do maths properly!") the problem was tracked down to a
> dodgy kernel module. It turned out that the module provided by the vendor
> to do console-over-lan stomped over the floating point registers under
> certain circumstances.
>
It could also be the interconnect. If you are using Ethernet,
I would think that is unlikely, but I have seen high-speed
interconnects with a PCI slot problem, where we would get wrong
answers when running HPL on more than 2 systems.
Craig
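
The per-machine test Guy describes above (one HPL job per node, then
compare which nodes get the wrong answer) could be scripted roughly as
follows. This is only a sketch in Python, under some assumptions: each
node's HPL.out has already been collected as a string, and every residual
check line has the shape "<label> = <value> ...... PASSED" or
"... FAILED". The function names and the verdict strings are made up for
illustration and are not part of HPL.

```python
import re

# Assumed shape of a residual check line in HPL.out:
#   <residual label> = <value> ...... PASSED   (or FAILED)
RESIDUAL_LINE = re.compile(r"=\s*(\S+)\s+\.{3,}\s+(PASSED|FAILED)\s*$")


def residual_checks(hpl_output):
    """Return a list of (value, status) pairs, one per residual check line."""
    checks = []
    for line in hpl_output.splitlines():
        match = RESIDUAL_LINE.search(line)
        if match:
            checks.append((match.group(1), match.group(2)))
    return checks


def suspect_hosts(outputs_by_host):
    """Hosts with a FAILED check, a NaN residual, or no checks at all."""
    bad = []
    for host in sorted(outputs_by_host):
        checks = residual_checks(outputs_by_host[host])
        wrong = any(
            status == "FAILED" or "nan" in value.lower()
            for value, status in checks
        )
        if wrong or not checks:
            bad.append(host)
    return bad


def diagnose(outputs_by_host):
    """Apply the rule of thumb from the thread to single-node results.

    Returns (verdict, bad_hosts), where verdict is a hypothetical label:
      "ok"       - every node produced a correct answer
      "hardware" - only some nodes are wrong, so suspect those nodes
      "software" - every node is wrong, so suspect compiler, BLAS or kernel
    """
    bad = suspect_hosts(outputs_by_host)
    if not bad:
        return "ok", bad
    if len(bad) == len(outputs_by_host):
        return "software", bad
    return "hardware", bad
```

If every single-node run comes back "ok" and the residual only fails on
multi-node runs, that would point towards the interconnect case mentioned
above rather than memory or the compiler.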
