[Beowulf] AMD64 results...

Sun Dec 19 07:43:18 PST 2004

On Thu, 16 Dec 2004, Josip Loncaric wrote:

> Robert G. Brown wrote:
> > [...]   One can see how having 64 bits would really
> > speed up 64 bit division compared to doing it in software across
> > multiple 32 bit registers...
> 
> Correct me if I'm wrong, but doesn't the floating point unit normally 
> use an internal iterative process to perform the division?  This would 
> not involve 32-bit registers...
> 
> I'm not so sure about *integer* 64-bit division.  Integer division may 
> involve multiple 32-bit integer registers.
> 
> Good ole' Cray-1 used an iterative process for floating point division 
> which worked like this: given a floating point number x, use the first 8 
> bits of the mantissa to index into a lookup table containing initial 
> guesses, then do a few steps of Newton-Raphson iteration involving only 
> multiply-add operations to get the fully converged reciprocal mantissa, 
> fix the exponent, thus obtaining 1/x, then multiply y*(1/x) to get y/x.
> 
> As I recall, the famous Pentium FDIV bug involved some corner cases in a 
> similar iterative process, all of which is internal to the floating 
> point unit.  Moreover, in addition to following the 32/64-bit IEEE 754 
> standard for floating point arithmetic, some implementations (e.g. 
> Pentium, Opteron) support x87 legacy internal 80-bit representations of 
> floating point numbers, which can really help when accumulating long 
> sums and computing square roots, etc.  Prof. Kahan has numerous 
> arguments in favor of this internal 80-bit representation...
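
The table-plus-Newton-Raphson reciprocal scheme described in the quote can
be sketched roughly as follows.  This is only an illustration in Python,
not the actual Cray-1 hardware algorithm: the table contents (reciprocals
of slice midpoints), the step count, and the names TABLE, reciprocal() and
divide() are all assumptions made for the example.

```python
import math

TABLE_BITS = 8  # index on the leading 8 mantissa bits, as in the description

# Hypothetical lookup table: one initial guess for 1/m per slice of [0.5, 1).
# Each entry is the reciprocal of the slice midpoint (an assumed choice).
TABLE = [1.0 / (0.5 + (i + 0.5) / 2 ** (TABLE_BITS + 1))
         for i in range(2 ** TABLE_BITS)]


def reciprocal(x, steps=3):
    """Approximate 1/x using a table lookup plus Newton-Raphson steps.

    No special handling of subnormals or overflow; this is a sketch.
    """
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("sketch only handles finite, nonzero inputs")
    # Split |x| into mantissa m in [0.5, 1) and exponent e: |x| = m * 2**e.
    m, e = math.frexp(abs(x))
    # Leading mantissa bits select the initial guess r ~ 1/m.
    r = TABLE[int((m - 0.5) * 2 ** (TABLE_BITS + 1))]
    # Each step needs only multiplies and a subtract, and roughly
    # squares the relative error of r.
    for _ in range(steps):
        r = r * (2.0 - m * r)
    # Fix the exponent (1/x = (1/m) * 2**-e) and restore the sign.
    return math.copysign(math.ldexp(r, -e), x)


def divide(y, x):
    """Compute y/x as y * (1/x), as in the scheme described above."""
    return y * reciprocal(x)
```

With an 8-bit table the initial guess is good to about 9 bits, so three
steps are enough to reach double precision in this sketch, although the
result is not guaranteed to be correctly rounded the way an IEEE 754
divide is.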

This may well be -- I used to hand code the 8087 back on the IBM PC and
thought that the 80 bit internal representation was peachy keen at the
time.  I haven't tracked precisely how the x87 coprocessor model has
evolved (legacy or not) into P6-class processors, though -- the mixing
of RISC, CISC, CISC-interpreted-to-RISC-onchip left me confused years
ago.
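
For what it's worth, the benefit of a wider internal accumulator for long
sums is easy to demonstrate.  Python has no portable 80 bit type, so the
sketch below uses 32 bit vs 64 bit as a stand-in for 64 bit vs 80 bit;
that substitution, and the helper names to_f32(), sum_narrow() and
sum_wide(), are assumptions made for the illustration.

```python
import struct


def to_f32(v):
    """Round a Python float (64 bit) to the nearest 32 bit float."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def sum_narrow(values):
    """Accumulate with the running sum rounded to 32 bits after every add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_wide(values):
    """Accumulate in 64 bits and round to 32 bits only once, at the end."""
    acc = 0.0
    for v in values:
        acc += to_f32(v)
    return to_f32(acc)


values = [0.1] * 100000
# Reference value: the sum of the 32 bit inputs, computed in 64 bits.
exact = len(values) * to_f32(0.1)
err_narrow = abs(sum_narrow(values) - exact)
err_wide = abs(sum_wide(values) - exact)
print("narrow accumulator error:", err_narrow)
print("wide accumulator error:  ", err_wide)
```

The narrow accumulator rounds on every add, so its error grows with the
length of the sum; the wide one rounds once at the end.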

I was really just making an empirical observation, and struggling to
understand it.  As I pointed out yesterday, transcendental evaluations seem to
be much faster as well, which would certainly be consistent with a
resurrection of an efficient internal x87 architecture.  If so, I'm all
for it -- HPC code (at least MY HPC code:-) tends to have more than just
triad-like operations on vectors -- things like the trig functions,
exponentials and logs, floating point division.  I remember when my Sun
386i could turn in a Savage benchmark result that compared pretty well
with the otherwise much faster Sun 110 and Sparc 1, because it had a real
CISC 80387 and Sun was doing all of its transcendental calls in (RISC)
software.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu