[Beowulf] AMD64 results...
Robert G. Brown
rgb at phy.duke.edu
Sun Dec 19 07:43:18 PST 2004
On Thu, 16 Dec 2004, Josip Loncaric wrote:
> Robert G. Brown wrote:
> > [...] One can see how having 64 bits would really
> > speed up 64 bit division compared to doing it in software across
> > multiple 32 bit registers...
> Correct me if I'm wrong, but doesn't the floating point unit normally
> use an internal iterative process to perform the division? This would
> not involve 32-bit registers...
> I'm not so sure about *integer* 64-bit division. Integer division may
> involve multiple 32-bit integer registers.
> Good ole' Cray-1 used an iterative process for floating point division
> which worked like this: given a floating point number x, use the first 8
> bits of the mantissa to index into a lookup table containing initial
> guesses, then do a few steps of Newton-Raphson iteration involving only
> multiply-add operations to get the fully converged reciprocal mantissa,
> fix the exponent, thus obtaining 1/x, then multiply y*(1/x) to get y/x.
> As I recall, the famous Pentium FDIV bug involved some corner cases in a
> similar iterative process, all of which is internal to the floating
> point unit. Moreover, in addition to following the 32/64-bit IEEE 754
> standard for floating point arithmetic, some implementations (e.g.
> Pentium, Opteron) support x87 legacy internal 80-bit representations of
> floating point numbers, which can really help when accumulating long
> sums and computing square roots, etc. Prof. Kahan has numerous
> arguments in favor of this internal 80-bit representation...
This may well be -- I used to hand code the 8087 back on the IBM PC and
thought that the 80 bit internal representation was peachy keen at the
time. I haven't tracked precisely how the x87 coprocessor model has
evolved (legacy or not) into P6-class processors, though -- the mixing
of RISC, CISC, CISC-interpreted-to-RISC-onchip left me confused years
ago.
I was really just making an empirical observation, and struggling to
understand it. As I pointed out yesterday, transcendental evals seem to
be much faster as well, which would certainly be consistent with a
resurrection of an efficient internal x87 architecture. If so, I'm all
for it -- HPC code (at least MY HPC code:-) tends to have more than just
triad-like operations on vectors -- things like the trig functions,
exponentials and logs, floating point division. I remember when my Sun
386i could turn in a Savage benchmark that compared pretty well with the otherwise
much faster Sun 110 and Sparc 1 because it had a real CISC 80387 and Sun
was doing all of its transcendental calls in (RISC) software.
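
For anyone who wants to play with the reciprocal-by-iteration scheme
Josip describes above, here is a rough sketch in C. It is an
illustration only, not the Cray-1 algorithm or anything a real FPU
does: the function names nr_reciprocal and nr_divide are made up, a
linear initial guess stands in for the 8-bit lookup table, the
iteration count is fixed at four, and only finite x > 0 is handled.

```c
#include <math.h>

/* Hypothetical helper: approximate 1/x for finite x > 0.
 * Assumption: a linear first guess replaces the lookup table
 * mentioned in the quoted text. */
static double nr_reciprocal(double x)
{
    int e;
    double m = frexp(x, &e);   /* x = m * 2^e with m in [0.5, 1) */

    /* Initial guess for 1/m on [0.5, 1); relative error at most 1/17. */
    double r = 48.0 / 17.0 - (32.0 / 17.0) * m;

    /* Newton-Raphson on f(r) = 1/r - m, using only multiplies and
     * adds. The relative error roughly squares on each pass, so four
     * passes take 1/17 down below double precision. */
    for (int i = 0; i < 4; i++)
        r = r * (2.0 - m * r);

    /* Fix the exponent: 1/x = (1/m) * 2^(-e). */
    return ldexp(r, -e);
}

/* Hypothetical helper: y/x computed as y * (1/x). */
static double nr_divide(double y, double x)
{
    return y * nr_reciprocal(x);
}
```

Calling nr_divide(1.0, 3.0) should agree with 1.0 / 3.0 to within a
few units in the last place, but the result is not guaranteed to be
correctly rounded the way a hardware divide is. Link with -lm on
systems that need it.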
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu