[Beowulf] Parallel memory

Robert G. Brown rgb at phy.duke.edu
Wed Oct 19 05:52:39 PDT 2005


On Wed, 19 Oct 2005, Bogdan Costescu wrote:

> On Tue, 18 Oct 2005, Robert G. Brown wrote:
>
>> There was once also an online list discussion about swapping to an NFS 
>> mounted remote exported ramdisk
>
> There are some problems with this approach, which is in principle very
> similar to swapping over NBD or iSCSI. A discussion about iSCSI happened
> just 1-2 months ago on the netdev and linux-kernel mailing lists; the
> main problem is what to do when memory becomes tight: a part of memory
> has to be sent to the block device, but sending it requires allocating
> more memory for the network transfer; sometimes one part of memory is
> sent out just so that another part can be copied back in, in which case
> still more memory is needed for the network transfer (for the reception).
> Both of these situations can result in a deadlock.
> So swapping over the network is not (yet) that reliable - doing this at
> the application level, and therefore knowing more or less the pattern of
> transfers, seems much wiser.

Ah, very interesting.  So they've gotten it to the point where it works,
but there are circumstances where the kernel itself becomes unstable as
it juggles memory: if one runs a ramdisk that is too large a fraction of
total memory while simultaneously driving the network hard enough that a
large backlog of packets is in transit.

Presumably this means that there is a range of tolerance in ramdisk size
where this WILL work robustly even at a maximally driven rate of data
exchange.  Given the disparity in speed (2-3 orders of magnitude)
between memory and the network device, one would expect writes to swap
never to be a problem.  An incoming stream would be cleared to ramdisk
much faster than it could accumulate.
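
A quick back-of-the-envelope version of that argument (the figures below
are illustrative numbers I'm assuming, not measurements from this
thread):

net_mb_s = 12       # ~100 Mbit Ethernet payload rate, MB/s
mem_mb_s = 3000     # rough main-memory copy bandwidth, MB/s
print(mem_mb_s // net_mb_s)   # ~250x: incoming pages drain into the
                              # ramdisk far faster than the wire fills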

The outgoing stream, OTOH, is the problem -- the remote kernel could (in
principle) request the entire contents of swap, which drains over the
wire 100x more slowly than it is read from the ramdisk, so all of it is
effectively double buffered in memory pending delivery.  Allocating a
ramdisk of (total memory/2 - kernel/libraries) therefore ought to "work"
nearly all of the time.  It is not a terribly efficient use of space --
buying 4 GB of memory per node to provide perhaps 1.9 GB of remote swap
-- but it could work, and might even be worth it in the short run for
certain classes of problem.
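
For concreteness, a sketch of that sizing rule (the 4 GB node is the
example above; the ~150 MB kernel/library overhead is just a figure I'm
assuming):

total_mem_mb = 4096      # physical RAM per node (the 4 GB example above)
overhead_mb  = 150       # assumed: kernel, libraries, resident daemons
ramdisk_mb   = total_mem_mb // 2 - overhead_mb
print(ramdisk_mb)        # ~1898 MB, i.e. roughly the 1.9 GB of usable
                         # remote swap mentioned above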

However, I agree that the better solution in the long run is to do this
as a fairly simple parallel application, with routines that do the data
management for you: create a big scratch space and constantly call
routines to fill and empty it as one works through the application data
(see the sketch below).  This seems like it would be a REAL PITA to
program for certain apps, but one could look at e.g. ATLAS for guidance,
since this is the kind of thing ATLAS does to optimize its use of
disparate memory speeds and operational sizes.  The situation is a lot
closer to managing/optimizing CPU caches than to a parallel application
per se, and although you'll have to reinvent wheels that work
automagically inside the CPU/cache/memory subsystem on most systems,
there are algorithmic solutions for this stuff that will work.  OR you
can see whether trapeze or one of the other virtualizing shmem projects
still has a kernel-level solution that can be built and implemented.  I
kinda doubt it, unfortunately.
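
To make the "fill it and empty it" idea concrete, here is a minimal
single-node sketch in Python -- my own illustration, not code from any
of the projects mentioned.  It is a blocked scratch area that keeps only
a small LRU window of blocks in local RAM and explicitly spills the rest
to a backing store; an ordinary temporary file stands in for the remote
node's memory, and in a real cluster the block reads/writes would go
over MPI or a socket instead, but the blocking pattern (the ATLAS-like
part) is the same:

import os
import tempfile
from collections import OrderedDict

BLOCK_SIZE   = 1 << 20    # 1 MiB blocks (illustrative)
MAX_RESIDENT = 8          # keep at most 8 blocks in local RAM

class BlockedScratch:
    """A big scratch space, mostly held in a backing store, with only a
    small LRU window of blocks resident in local memory."""

    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.resident = OrderedDict()       # block index -> bytearray
        fd, self.path = tempfile.mkstemp()
        os.close(fd)
        with open(self.path, "wb") as f:    # pre-size the "remote" store
            f.truncate(nblocks * BLOCK_SIZE)
        self.store = open(self.path, "r+b")

    def _evict_one(self):
        # "Empty": push the least recently used block back to the store.
        idx, data = self.resident.popitem(last=False)
        self.store.seek(idx * BLOCK_SIZE)
        self.store.write(data)

    def get(self, idx):
        # "Fill": bring a block into local RAM, evicting one if needed.
        if idx in self.resident:
            self.resident.move_to_end(idx)
            return self.resident[idx]
        if len(self.resident) >= MAX_RESIDENT:
            self._evict_one()
        self.store.seek(idx * BLOCK_SIZE)
        data = bytearray(self.store.read(BLOCK_SIZE))
        self.resident[idx] = data
        return data

    def close(self):
        while self.resident:
            self._evict_one()
        self.store.close()
        os.unlink(self.path)

if __name__ == "__main__":
    scratch = BlockedScratch(nblocks=64)  # 64 MiB scratch, 8 MiB resident
    for i in range(64):                   # sweep through the whole data set
        scratch.get(i)[0] = i & 0xFF      # touch each block as an app would
    assert scratch.get(1)[0] == 1         # early blocks return from the store
    scratch.close()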

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





More information about the Beowulf mailing list