p4 v itanium

Robert G. Brown rgb at phy.duke.edu
Tue May 14 14:59:59 PDT 2002


On Tue, 14 May 2002, Mark Hahn wrote:

> > > can a p4 access more than 4GB memory - some of the codes have a 
> > > requirement to access large memory
> 
> depends on what you mean by "access".  if you mean "faster than disk",
> then yes, that's possible.  you'll wind up using some form of OS-supported
> bank-switching, which considering that Linux syscalls are <1us, can
> be fairly acceptable.  but a single ia32 process can never
> (not even with icky 16:32 segmentation) directly access >4G, since the 
> address-mapping hardware goes through a strictly 32b stage.
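
(For concreteness, here is a rough sketch -- my own, not anything from
Mark's message -- of the sort of OS-supported "bank switching" he
describes: keep the big data set in a file, ideally on a RAM-backed
filesystem so a PAE kernel can cache it in high memory, and mmap() a
sliding window of it into the 32-bit address space.  The file path and
window size below are just placeholders.)

/* Sketch only: reach a data set bigger than a 32-bit address space by
 * mapping one window ("bank") of a file at a time.  Assumes the file
 * already exists and is large enough, and large-file support so that
 * offsets past 4 GB are representable. */
#define _FILE_OFFSET_BITS 64      /* 64-bit off_t on 32-bit systems */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

#define WINDOW_BYTES (256UL * 1024 * 1024)  /* 256 MB window (placeholder) */

int main(void)
{
    const char *path = "/dev/shm/bigdata";  /* placeholder tmpfs-backed file */
    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    off_t offset = 0;   /* which window ("bank") we want, page aligned */
    double *window = mmap(NULL, WINDOW_BYTES, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, offset);
    if (window == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    window[0] += 1.0;   /* work on the currently mapped window */

    /* To reach a different region, munmap() and mmap() again at a new
     * offset -- the "bank switch".  Only WINDOW_BYTES of the data are
     * ever directly addressable at once, which is Mark's point. */
    munmap(window, WINDOW_BYTES);
    close(fd);
    return 0;
}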

To amplify Mark's "faster than disk" reply just a bit in the strict
context of beowulfery, one of several reasons one might build a beowulf
is to aggregate memory, not just parallelize CPU (indeed, in the best of
worlds one might do both at the same time).  One fundamental advantage
of a cluster is that sixteen dual-processor nodes with 2 GB each
represent perhaps 30 GB of usable memory -- yes, it may be difficult to
just "address the memory" as a single virtual space (although, see
below), but for many tasks it is not all that difficult to design an
application that permits the space to be used nonetheless.

This is an area that has Real Computer Scientists working on it.  The
Trapeze project at Duke is one example of a setup designed to make
aggregate memory available to a single-threaded task at network-bound
speeds, which are typically much slower than real memory but much MUCH
faster than virtual memory (disk).  Once upon a time there was even a
list thread discussing the possibility of building a big ramdisk on one
node and NFS-exporting it to a client to use as swap -- this wouldn't
work in 2.2.x kernels (if I correctly recall the discussion) but might
be workable in 2.4.x kernels.  Even as inefficient as this probably
would be, it would still likely beat local disk swap by 2-3 orders of
magnitude in speed.

However, a better match to the beowulf architecture might require some
task redesign.  Monte Carlo people often want to do really big lattices.
In recent years, accessible memory has generally exceeded what the CPU
can usefully work through (for my own personal MC tasks, at least), but
not long ago system memories were only a few MB -- hard as that is to
believe -- and a quite small 3d lattice of "sites", each site
represented by a vector of double precision numbers, sufficed to use it
all up.  One solution was lattice-partitioned Monte Carlo, where one
took a large space of "sites" with some nearest-neighbor interaction,
too big to fit on one system, and partitioned it into sublattice blocks
that did fit, distributing the sites among nodes and working on them in
parallel with the node CPUs.  This often scaled up quite well, with no
worse than surface-to-volume scaling of IPCs (so bigger codes spent
proportionally more time working on the interior of their sublattice
and less time sending information about the boundary sites to the node
that shared the boundary).
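
(A bare-bones sketch of what such a sublattice decomposition can look
like -- mine, with all names and sizes just placeholders -- is a 1-D
slab decomposition in MPI: each node owns a block of z-planes of the
global lattice plus one "ghost" plane on each side, and trades boundary
planes with its neighbors every sweep.)

/* Sketch: L x L x L lattice of double-valued sites cut into z-slabs,
 * one slab per MPI rank, with periodic boundaries.  Assumes nprocs
 * divides L evenly; the Monte Carlo update itself is omitted. */
#include <stdlib.h>
#include <mpi.h>

#define L     64                 /* global lattice edge (placeholder) */
#define PLANE (L * L)            /* sites per z-plane */

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int slab = L / nprocs;       /* z-planes owned by this rank */
    /* Owned planes live at indices 1..slab; planes 0 and slab+1 hold
     * ghost copies of the neighbors' boundary planes. */
    double *site = calloc((size_t)(slab + 2) * PLANE, sizeof(double));

    int up   = (rank + 1) % nprocs;          /* periodic neighbors */
    int down = (rank + nprocs - 1) % nprocs;

    for (int sweep = 0; sweep < 100; sweep++) {
        /* Surface-to-volume cost: two L x L boundary planes per sweep,
         * while the update work grows with the slab volume. */
        MPI_Sendrecv(site + slab * PLANE, PLANE, MPI_DOUBLE, up,   0,
                     site,                PLANE, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(site + PLANE,        PLANE, MPI_DOUBLE, down, 1,
                     site + (slab + 1) * PLANE, PLANE, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* ... update the interior sites of the slab here ... */
    }

    free(site);
    MPI_Finalize();
    return 0;
}

The bigger the lattice per node, the smaller the fraction of each sweep
spent in those two boundary exchanges -- exactly the scaling described
above.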

For some tasks the orders-of-magnitude slowdown in off-node memory
access can be a killer; for others it is no big deal.  For still
others, being able to do the large memory tasks at >>all<< is the big
win.  (The
amazing thing about a dancing bear isn't how gracefully it dances but
that it dances at all:-)

So one thing to think about in the event that you cannot easily find a
single system architecture that will hold your task all at once is --
design a cluster (probably one with very low latency, very fast
networking) that WILL hold your task all at once, and try to get back
some of what you lose on IPCs by parallelizing the task itself so it
doesn't all HAVE to live on a single processor.  A remarkable number of
tasks, especially those which live in a very large volume of memory, CAN
be efficiently partitioned and parallelized with a little thought.

A good book on parallel program design could probably help with this.
As is often the case, there are many ways to partition memory and tasks
with wildly different scaling properties and overhead costs, and there
are likely tasks for which no solution currently exists that is a winner
in the sense that it accomplishes the work you desire in the time you
require.  Even for those tasks, I've found that simply waiting a couple
of years for Moore's Law to catch up sometimes helps.  Right now Moore's
Law is enabling me to do larger lattices at equivalent precision every
few years (every major cluster upgrade, for sure).  In a decade we'll
have tens or hundreds of GB of active memory on a typical system, in all
probability, just as a decade ago we had ones to tens of MB.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





