p4 v itanium

Ricky Rankin r.rankin at qub.ac.uk
Fri May 17 01:54:42 PDT 2002


Robert,

Thanks for the response.

My requirement is for people with 'old' Fortran programs, Atomic
and Molecular Physics codes that they know work and do not wish
to even think of developing or parallelising. In some instances
they may be looking for 8 GB to 16 GB or more.

I am concerned about how these would perform on a P4 compared to
an Itanium, which can handle the memory.

Ricky


On Tue, 14 May 2002 17:59:59 -0400 (EDT) "Robert G. Brown" 
<rgb at phy.duke.edu> wrote:

> On Tue, 14 May 2002, Mark Hahn wrote:
> 
> > > > can a p4 access more than 4GB memory - some of the codes have a 
> > > > requirement to access large memory
> > 
> > depends on what you mean by "access".  if you mean "faster than disk",
> > then yes, that's possible.  you'll wind up using some form of OS-supported
> > bank-switching, which considering that Linux syscalls are <1us, can
> > be fairly acceptable.  but a single ia32 process can never
> > (not even with icky 16:32 segmentation) directly access >4G, since the 
> > address-mapping hardware goes through a strictly 32b stage.
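
A minimal sketch of the "OS-supported bank-switching" idea above: a
32-bit process maps one window of a larger-than-4 GB data file into its
address space at a time, so only the current window ever occupies the
32-bit address space. The file name and window size are illustrative
assumptions, not anything from the original post.

    /* Window through a data set larger than the 4 GB ia32 address
     * space: map, process, and unmap one chunk at a time. */
    #define _FILE_OFFSET_BITS 64          /* 64-bit off_t on ia32/glibc */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #define WINDOW (64UL * 1024 * 1024)   /* 64 MB window per mapping */

    int main(void)
    {
        int fd = open("bigdata.bin", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        double sum = 0.0;
        for (off_t pos = 0; pos < st.st_size; pos += WINDOW) {
            size_t len = (st.st_size - pos < (off_t)WINDOW)
                             ? (size_t)(st.st_size - pos) : WINDOW;
            /* Only this window is mapped; the rest of the file stays
             * outside the 32-bit address space until requested. */
            double *w = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, pos);
            if (w == MAP_FAILED) { perror("mmap"); return 1; }
            for (size_t i = 0; i < len / sizeof(double); i++)
                sum += w[i];
            munmap(w, len);
        }
        close(fd);
        printf("sum = %g\n", sum);
        return 0;
    }

Each map/unmap costs a syscall or two, which is the "fairly acceptable"
overhead Mark refers to; the data still has to come from wherever it
physically lives.
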
> 
> To amplify Mark's "faster than disk" reply just a bit in the strict
> context of beowulfery, one of several reasons one might build a beowulf
> is to aggregate memory, not just parallelize CPU (indeed, in the best of
> worlds one might do both at the same time).  One fundamental advantage
> of a cluster is that sixteen two-processor, 2 GB nodes represent perhaps 30 GB
> of usable memory -- yes, it may be difficult to just "address the
> memory" as a single virtual space (although, see below) but for many
> tasks designing an application that permits the space to be used
> nonetheless is not all that difficult.
> 
> This is an area that has Real Computer Scientists working on it.  The
> Trapeze project at Duke is one example of a setup designed to make
> aggregate memory available to a single threaded task at network-bound
> speeds, which are typically much slower than real memory but much MUCH
> faster than virtual memory (disk).  Once upon a time there was even a
> list thread where the possibility of building a big ramdisk on a node
> and NFS exporting it to a client as swap was discussed -- this wouldn't
> work in 2.2.x kernels (if I correctly recall the discussion) but might
> be workable in 2.4.x kernels.  Even as inefficient as this probably
> would be, it would still likely beat local disk swap by 2-3 orders of
> magnitude in speed.
> 
> However, a better match to the beowulf architecture might require some
> task redesign.  Monte Carlo people often want to do really big lattices.
> In recent years, accessible memory has generally exceeded what one can
> accomplish with the CPU (for my own personal MC tasks, at least) but not
> long ago system memories were only a few MB -- hard as that is to
> believe -- and a quite small 3d lattice of "sites", each site
> represented by a vector of double precision numbers, sufficed to use it
> all up.  One solution was lattice partitioned Monte Carlo, where one
> took a large space of "sites" with some nearest neighbor interaction too
> big to fit on one system and partitioned it into sublattice blocks that
> did, distributing the sites among nodes and working on them in parallel
> with the node CPUs.  This often scaled up quite well, with no worse than
> surface to volume scaling of IPCs (so bigger codes spent proportionally
> more time working on the interior of their sublattice and less time
> sending information on the boundary sites to the node that shared the
> boundary).
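
A minimal sketch of the sublattice partitioning described above,
assuming MPI and a 1-d strip decomposition; N_LOCAL, the number of
sweeps, and the toy relaxation update are invented placeholders
standing in for a real Monte Carlo step.

    /* Each node owns N_LOCAL sites plus one "ghost" site at each end,
     * which mirrors the boundary site of the neighbouring node. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_LOCAL 1000000              /* sites owned by this node */

    static double site[N_LOCAL + 2];     /* [0] and [N_LOCAL+1] are ghosts */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

        for (int sweep = 0; sweep < 100; sweep++) {
            /* Exchange only the two boundary sites -- O(surface) traffic,
             * while the update below is O(volume) work. */
            MPI_Sendrecv(&site[1], 1, MPI_DOUBLE, left, 0,
                         &site[N_LOCAL + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&site[N_LOCAL], 1, MPI_DOUBLE, right, 1,
                         &site[0], 1, MPI_DOUBLE, left, 1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

            /* Placeholder nearest-neighbour update of the interior. */
            for (int i = 1; i <= N_LOCAL; i++)
                site[i] = 0.5 * (site[i - 1] + site[i + 1]);
        }

        if (rank == 0) printf("done\n");
        MPI_Finalize();
        return 0;
    }

Only the boundary sites cross the network each sweep, so as the
sublattice grows the communication cost shrinks relative to the local
work -- the surface-to-volume scaling mentioned above.
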
> 
> For some tasks the orders-of-magnitude slowdown in off-node memory
> access can be a killer; for others it is no big deal.  For still others
> to be able to do the large memory tasks at >>all<< is the big win. (The
> amazing thing about a dancing bear isn't how gracefully it dances but
> that it dances at all:-)
> 
> So one thing to think about in the event that you cannot easily find a
> single system architecture that will hold your task all at once is --
> design a cluster (probably one with very low latency, very fast
> networking) that WILL hold your task all at once, and try to get back
> some of what you lose on IPC's by parallelizing the task itself so it
> doesn't all HAVE to live on a single processor.  A remarkable number of
> tasks, especially those which live in a very large volume of memory, CAN
> be efficiently partitioned and parallelized with a little thought.
> 
> A good book on parallel program design could probably help with this.
> As is often the case, there are many ways to partition memory and tasks
> with wildly different scaling properties and overhead costs, and there
> are likely tasks for which no solution currently exists that is a winner
> in the sense that it accomplishes the work you desire in the time you
> require.  Even for those tasks, I've found that simply waiting a couple
> of years for Moore's Law to catch up sometimes helps.  Right now Moore's
> Law is enabling me to do larger lattices at equivalent precision every
> few years (every major cluster upgrade, for sure).  In a decade we'll
> have tens or hundreds of GB of active memory on a typical system, in all
> probability, just as a decade ago we had ones to tens of MB.
> 
>    rgb
> 
> -- 
> Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
> 
> 
> 

----------------------
Ricky Rankin
Principal Analyst
Computing Services
Queen's University Belfast

tel: 02890 273819
fax: 02890 230592
email: r.rankin at qub.ac.uk



