[Beowulf] single machine with 500 GB of RAM

Mark Hahn hahn at mcmaster.ca
Wed Jan 9 14:53:30 PST 2013


> However, looking at the user manual for this application however, I 
> suspect the bulk of the work can be made parallel, in contrast to the 
> original post:

yes - the very first page of preface mentions a straightforward
decomposition that scales to 40 processors.  that would be "shared-nothing",
so in this context could mean 40 separate machines.

as for the "ginormous machine" approach, it's going to lose.
you can indeed put O(TB) into a single box, single address space,
by using ccnuma (popularized by AMD, now supported by Intel.)
the problem is that a single thread sees only a modest increase
in memory performance (local+remote versus local alone).  so you've
got something like a 4-socket box where three of the sockets do
nothing but serve memory through their controllers - not to mention
that in the active socket, all the cores but one are idle too.

given the "slicing" methodology that this application uses for 
decomposition, I wonder whether it's actually closer to sequential
in its access patterns, rather than random.  the point here is that 
you absolutely must have ram if your access is random, since your
only constraint is latency.  if a lot of your accesses are sequential,
then they are potentially much more IO-like - specifically disk-like.

in short, suppose you set up a machine with a decent amount of ram
(say, lga2011, 8x8G dimms) and a lot of swap.  then just run your 
program that uses 512G of virtual address space.  depending on the 
pattern in which it traverses that space, the results will either 
be horrible (not enough work per page) or quite decent (enough work 
in the set of hot pages that the kernel can cache in 64G.)  of course,
swap to SSD reduces the latency of thrashing and is pretty easy to 
configure.  the real appeal of this approach is that it doesn't need
any special hardware to test (you wouldn't bother with a raid controller,
since they're absolutely useless for raid0-type patterns.)
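
to get a feel for the two outcomes before touching real hardware, here is
a minimal sketch of a toy page-cache model.  it assumes strict LRU
replacement (the real kernel's reclaim policy is more complicated), and the
sizes, trace generators and function names are all made up for
illustration: 512 pages stand in for 512G of address space, 64 resident
pages for 64G of ram.

```python
from collections import OrderedDict
import random

def count_faults(trace, resident_pages):
    """Count page faults for an access trace under a simple LRU policy."""
    cache = OrderedDict()
    faults = 0
    for page in trace:
        if page in cache:
            cache.move_to_end(page)        # mark as most recently used
        else:
            faults += 1
            cache[page] = True
            if len(cache) > resident_pages:
                cache.popitem(last=False)  # evict least recently used
    return faults

def sequential_trace(total_pages, touches_per_page):
    # sweep the space once, doing all the work on a page before moving on
    for page in range(total_pages):
        for _ in range(touches_per_page):
            yield page

def random_trace(total_pages, touches_per_page, seed=0):
    # same number of accesses, but scattered uniformly over the space
    rng = random.Random(seed)
    for _ in range(total_pages * touches_per_page):
        yield rng.randrange(total_pages)

# hypothetical, scaled-down stand-ins for 512G virtual / 64G resident
TOTAL, RESIDENT, TOUCHES = 512, 64, 100

seq_faults = count_faults(sequential_trace(TOTAL, TOUCHES), RESIDENT)
rnd_faults = count_faults(random_trace(TOTAL, TOUCHES), RESIDENT)
print("sequential sweep faults:", seq_faults)
print("random access faults:   ", rnd_faults)
```

in this model the sweep faults once per page no matter how much work is
done on it, while the random trace faults on most accesses.  that is the
"enough work per page" versus "not enough work per page" distinction.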

> have proper memory, it isn't optimized, and as a result you're 
> constantly swapping.  Merges are a good example of what /should/ work

if the domain is sliced in the "right" direction, merging should be 
very efficient.  even if sliced in the wrong direction, merging should
at least be block-able (and thus not terrible.)
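
as a sketch of what "block-able" could mean here: assume (this is my
assumption, not something the manual says) that the output is row-major
and each slice holds a band of columns, i.e. the "wrong" direction.
merging a block of rows at a time turns many tiny strided reads into a few
larger ones per slice.  the function name and the 2-D list layout are
hypothetical.

```python
def merge_column_slices(slices, block_rows=2):
    """Merge column-band slices into full rows, one block of rows at a time.

    slices: list of 2-D lists with the same number of rows; slice k holds a
    contiguous band of columns of the final row-major result.
    """
    n_rows = len(slices[0])
    out = []
    for r0 in range(0, n_rows, block_rows):
        # one largish read per slice covers block_rows rows at once
        blocks = [s[r0:r0 + block_rows] for s in slices]
        for i in range(len(blocks[0])):
            row = []
            for b in blocks:
                row.extend(b[i])   # stitch the bands back into a full row
            out.append(row)
    return out

left = [[0, 1], [4, 5], [8, 9]]
right = [[2, 3], [6, 7], [10, 11]]
merged = merge_column_slices([left, right])
print(merged)
```

memory in use at any moment is about block_rows rows from each slice, so
the block size can be tuned to whatever fits comfortably in ram.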

> merging on just one of them that is also outfitted with a ramdisk'd 0.5 
> TB Fusion-IO PCI-E flash device.  If I am not wildly off the mark on the

I wouldn't bother with PCI-E flash, myself.  they tend to have
dumb/traditional raid controllers on them.  doing raid0 across a 
handful of cheap 2.5" SATA SSDs is ridiculously easy to do and will 
scale up fairly well (with some attention to the PCIE topology
connecting the controllers, of course.)
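
for reference, the reason raid0 spreads a large sequential transfer over
all the devices is just the address mapping.  a minimal sketch, assuming
equal-sized devices and a fixed chunk size (the function name and the
numbers are illustrative, not taken from any particular raid
implementation):

```python
def stripe_location(logical_block, n_devices, chunk_blocks):
    """Map a logical block to (device index, block offset on that device)
    for a plain raid0 layout with fixed-size chunks."""
    chunk = logical_block // chunk_blocks    # which chunk overall
    within = logical_block % chunk_blocks    # offset inside that chunk
    device = chunk % n_devices               # chunks rotate across devices
    device_chunk = chunk // n_devices        # chunk index on that device
    return device, device_chunk * chunk_blocks + within

# hypothetical layout: 4 SSDs, 2 blocks per chunk
locations = [stripe_location(b, n_devices=4, chunk_blocks=2) for b in range(10)]
for block, (dev, off) in enumerate(locations):
    print(f"logical block {block} -> device {dev}, offset {off}")
```

consecutive chunks land on different devices, so a long sequential read
keeps every SSD busy at once.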

regards, mark hahn.


