Network RAM for Beowulf

Robert G. Brown rgb at phy.duke.edu
Thu Aug 16 05:38:05 PDT 2001


On Thu, 16 Aug 2001, Amber Palekar wrote:

> hi,
>    We, a group of four students, are *also* thinking
> of implementing network RAM for a Beowulf cluster
> (assuming 100Mbps Ethernet), whereby each node in the
> cluster will donate some part of its RAM to be used
> by all other nodes.  So we will basically be mapping
> this shared RAM into the address space of the current
> node.  One of the uses that we're thinking of is for
> journaling (as in file systems).  We'll be maintaining
> the journals in the network RAM instead of writing
> them to the local disks.  As we are completely new to
> this, it is very difficult for us to determine
> statistics like the overhead of writing to network
> RAM.  Any info or pointers to these stats would be
> highly appreciated.

Check out the Trapeze project at Duke:

  http://www.cs.duke.edu/ari/trapeze/

This is a high end version of the project you are suggesting.  I suspect
that you can implement a simpler (but slower) version of this out of
component parts with existing kernels (or almost so).  If current
kernels support swap over NFS, for example, you can build a large
ramdisk on each node (and otherwise leave them just enough memory to
function comfortably without swapping) and export them all to a central
node for use as swap.  This would effectively extend the size of VM on
the central node, but swapping would occur at network-limited speeds
instead of disk-hardware limited speeds.  And of course you can
definitely export a set of ramdisks and NFS mount them for regular
"network bound" file I/O, and might even be able to glue them together
into a big striped filesystem (I don't think md will do this, but you
MIGHT be able to hack it so that it would).
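
In outline, the recipe might look like the sketch below (Python as
glue around the standard commands; hostnames, sizes, and paths are
placeholders, and the swapon step works only IF your kernel really
does permit swap files on NFS):

  #!/usr/bin/env python
  # Sketch only: glue around standard commands (mke2fs, mkswap,
  # exportfs, mount, swapon).  Assumes root, a ramdisk device big
  # enough (e.g. booted with ramdisk_size=262144), and a kernel that
  # permits swapon() on an NFS file -- the big "if" above.

  import subprocess

  def run(cmd):
      print("+", cmd)
      subprocess.check_call(cmd, shell=True)

  def donor_node():
      """On each donating node: carve out a ramdisk and export it."""
      run("mke2fs -q /dev/ram0")            # filesystem living in RAM
      run("mkdir -p /export/ramswap")
      run("mount /dev/ram0 /export/ramswap")
      run("dd if=/dev/zero of=/export/ramswap/swapfile bs=1M count=200")
      run("mkswap /export/ramswap/swapfile")
      # /etc/exports needs a line like:
      #   /export/ramswap  head(rw,no_root_squash)
      run("exportfs -a")

  def central_node(donors):
      """On the central node: mount each ramdisk and swap onto it."""
      for i, host in enumerate(donors):
          mnt = "/mnt/ramswap%d" % i
          run("mkdir -p %s" % mnt)
          run("mount -t nfs %s:/export/ramswap %s" % (host, mnt))
          run("swapon %s/swapfile" % mnt)   # fails on a stock kernel

  if __name__ == "__main__":
      central_node(["node1", "node2", "node3"])  # donors: donor_node()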

Disk bandwidth, of course, has gotten so much better over the years
that this might not make sense from a raw BW point of view.  Something
doing a lot of random accesses to disk, though, where the performance
is dominated by latency, might well benefit, as the combination of
memory latency on the nodes plus network latency plus NFS latency will
still likely be an order of magnitude less than the seek time on a
disk (which requires the physical movement of big chunks of mass).  Order of
milliseconds for a disk seek, order of 100 microseconds for the
network+memory hit, a rough factor of ten improvement.  Note that even
if you hack the kernel and eliminate all the kludginess from this
approach (NFS swap?), you're still going to be limited by raw socket
latency and will therefore probably not improve on this estimate by as
much as a factor of 2.
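
If you want to put a number on that raw socket floor, a throwaway
probe along these lines gives the flavor (Python; the port and message
size are arbitrary, and loopback numbers are of course optimistic
relative to a real 100Mbps wire):

  #!/usr/bin/env python
  # Toy probe of raw TCP round-trip latency -- the floor under any
  # socket-based network RAM.  Spawns an echo server in a thread and
  # times small round trips over loopback; against a real peer on
  # 100Mbps Ethernet expect numbers closer to the 100 microseconds
  # quoted above.

  import socket, threading, time

  HOST, PORT = "127.0.0.1", 9099       # invented local endpoint
  NTRIPS = 1000

  def echo_server():
      srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
      srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
      srv.bind((HOST, PORT))
      srv.listen(1)
      conn, _ = srv.accept()
      while True:
          data = conn.recv(4096)
          if not data:
              break
          conn.sendall(data)           # bounce each message back

  threading.Thread(target=echo_server, daemon=True).start()
  time.sleep(0.2)                      # let the server bind first

  cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
  cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
  cli.connect((HOST, PORT))

  t0 = time.perf_counter()
  for _ in range(NTRIPS):
      cli.sendall(b"x" * 64)           # one tiny "page request"...
      cli.recv(4096)                   # ...and its reply (toy: assumes
                                       # one recv per message, true in
                                       # practice on loopback)
  rtt = (time.perf_counter() - t0) / NTRIPS
  cli.close()

  print("mean TCP round trip: %6.1f us" % (rtt * 1e6))
  print("vs ~1 ms disk seek:  %6.0fx faster" % (1e-3 / rtt))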

Of course this works another order of magnitude better with Myrinet
(used in the Duke project), which has latency of less than 10
microseconds and BW that even compares decently with local disk.  This is the primary
motivation for this project -- creating very large network-distributed
ramdisk-like storage with a very low level (and hence efficient)
implementation.

The PRIMARY place where this is useful is in projects that require far
more memory and/or faster (lower latency) disk than one can physically
add to a system, as it will almost always be cheaper and better to add
memory to a local system than to build a network of virtual memory IF
you can get to the desired memory regime by adding sticks to your own
box.  This is especially true now with memory prices in free fall (512MB
PC2100 DDR only $168 as of this morning on pricewatch.com -- it was over
$600 at the beginning of the summer -- and 512MB PC133 only >>$30<<
ditto).  Ditto with disk BW -- high end disk storage units now can
provide quite a lot of BW (with terrible latency of course).

Building a system with (say) 10 GB of network/virtual memory is a lot
more challenging.  To start with, this is more than the unhacked kernel
can address, I believe.  So there you'd need to build BOTH the
socket-based memory subsystem AND hack the kernel so that it could
somehow address it.  You should definitely look over the Trapeze site to
see how they attempt to finesse this problem via a higher level API (if
I understand their papers).  That is, they don't tamper with the
existing VM so much as graft a set of hooks into a special library so
that the remote "memory" can be accessed via a file interface.  Or
something like that.  You'll need to read about it yourself, and might
even want to talk to the primary researchers as they'd likely have some
very sensible suggestions and direction to give you.
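
To make the "remote memory behind a file interface" idea concrete,
here is a toy of the general shape of the thing -- emphatically NOT
the Trapeze API, just an invented wire protocol in which one process
donates a slab of its RAM and another addresses it through file-like
seek/read/write calls:

  #!/usr/bin/env python
  # Toy "file interface onto remote RAM" -- invented protocol, not
  # Trapeze.  One thread donates a slab of its RAM; a client addresses
  # it through file-like seek()/read()/write() calls over TCP.

  import socket, struct, threading, time

  HEADER = struct.Struct("!cQI")       # op ('R'/'W'), offset, length

  def recv_exact(sock, n):
      buf = b""
      while len(buf) < n:
          chunk = sock.recv(n - len(buf))
          if not chunk:
              raise ConnectionError("peer closed")
          buf += chunk
      return buf

  class RamServer(threading.Thread):
      """Donates `size` bytes of local RAM on `port`."""
      def __init__(self, port, size):
          super().__init__(daemon=True)
          self.port, self.slab = port, bytearray(size)
      def run(self):
          srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
          srv.bind(("127.0.0.1", self.port))
          srv.listen(1)
          conn, _ = srv.accept()
          try:
              while True:
                  op, off, ln = HEADER.unpack(
                      recv_exact(conn, HEADER.size))
                  if op == b"R":       # read: ship ln bytes back
                      conn.sendall(bytes(self.slab[off:off + ln]))
                  else:                # write: payload follows header
                      self.slab[off:off + ln] = recv_exact(conn, ln)
          except ConnectionError:
              conn.close()

  class RemoteRamFile:
      """File-like handle onto a remote RAM slab."""
      def __init__(self, host, port):
          self.sock = socket.create_connection((host, port))
          self.pos = 0
      def seek(self, off):
          self.pos = off
      def write(self, data):
          self.sock.sendall(
              HEADER.pack(b"W", self.pos, len(data)) + data)
          self.pos += len(data)
      def read(self, n):
          self.sock.sendall(HEADER.pack(b"R", self.pos, n))
          data = recv_exact(self.sock, n)
          self.pos += n
          return data

  if __name__ == "__main__":
      RamServer(9100, 1 << 20).start()  # donate 1 MB of RAM
      time.sleep(0.2)                   # let it bind before connecting
      f = RemoteRamFile("127.0.0.1", 9100)
      f.seek(4096)
      f.write(b"journal record 1")      # a journaling-style write
      f.seek(4096)
      print(f.read(16))                 # -> b'journal record 1'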

   rgb

>    TIA
>     Amber

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu