[Beowulf] BIG 'ram' using SSDs - was single machine with 500 GB of RAM

Stuart Barkley stuartb at 4gh.net
Wed Jan 9 16:09:57 PST 2013


On Wed, 9 Jan 2013 at 08:27 -0000, Vincent Diepeveen wrote:

> What would be a rather interesting thought for building a single box
> dirt cheap with huge 'RAM' is the idea of having 1 fast RAID array
> of SSD's function as the 'RAM'.

We recently had a chance to look at something like this at a smaller
scale.

Most of our nodes are diskless.  People have expressed interest in
having local disk and/or SSD, so we have one test node with a local
hard disk (no RAID) and one test node with a local SSD.  I have
generally left them configured as swap, with the option of manually
mounting them as local disk.
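
(For what it's worth, the sort of check I mean is roughly the sketch
below -- listing what swap a node currently has active by parsing
/proc/swaps on Linux.  Python just for brevity; not anything we
actually run in production:)

    #!/usr/bin/env python3
    # Sketch: list the swap devices active on a Linux node by parsing
    # /proc/swaps.  The Size and Used columns there are in KiB.
    def swap_devices():
        devices = []
        with open("/proc/swaps") as f:
            next(f)                              # skip the header line
            for line in f:
                name, _type, size_kib, used_kib, _prio = line.split()
                devices.append((name, int(size_kib), int(used_kib)))
        return devices

    if __name__ == "__main__":
        for name, size_kib, used_kib in swap_devices():
            print(f"{name}: {used_kib // 1024} MiB used"
                  f" of {size_kib // 1024} MiB")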

Until recently the only place these nodes actually provided any
benefit was with jobs which mapped large files into virtual memory.
Having the large swap space allowed the scheduler to place these jobs
on these nodes, even though the swap was never used at all.  This is a
scheduler issue, not really a hardware issue.  When the swap space
actually was used by jobs, the performance was terrible.
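
(The swap never gets touched presumably because those mappings are
file backed: read-only mapped pages can simply be dropped and re-read
from the file.  A minimal sketch of the pattern, in Python, with a
made-up file name:)

    #!/usr/bin/env python3
    # Sketch of the mmap pattern: map a large file into the process's
    # address space.  The whole file counts toward the job's virtual
    # size immediately, but pages only become resident when touched,
    # and read-only file-backed pages never need swap -- the kernel
    # can drop them and re-read them from the file.
    import mmap

    with open("/data/big_input.dat", "rb") as f:   # hypothetical file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            header = m[:4096]          # touch only a small slice
            print(f"{len(m)} bytes mapped, {len(header)} bytes touched")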

More recently we had a user job which was not working on either of
these test nodes (actually for more basic reasons) until I got
involved.

I had access to a new test node (intended for cloud-ish stuff) so I
was able to run and watch the (slightly fixed) application with 200G+
of hard-disk-based swap, 200G+ of SSD-based swap, and 200G of real RAM
(and 48 cores).
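
("Watch" here was nothing sophisticated -- something along the lines
of the sketch below, sampling the job's memory counters out of
/proc/<pid>/status on Linux; not the exact tooling I used:)

    #!/usr/bin/env python3
    # Sketch: periodically sample a process's memory counters from
    # /proc/<pid>/status (Linux).  VmRSS is resident memory, VmSwap is
    # swapped-out memory, VmHWM is peak RSS; all are reported in kB.
    import sys, time

    def mem_kb(pid, fields=("VmRSS", "VmSwap", "VmHWM")):
        values = {}
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                key, _, rest = line.partition(":")
                if key in fields:
                    values[key] = int(rest.split()[0])
        return values

    if __name__ == "__main__":
        pid = int(sys.argv[1])
        while True:
            print(time.strftime("%H:%M:%S"), mem_kb(pid))
            time.sleep(60)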

For this application, all three ran fine for the first day or so,
until the application crossed the first memory boundary.  After that
both swap solutions slowed down significantly while the RAM system
kept chugging along.  We saw another memory plateau (which had been
seen in previous runs).  However, after another day or so the
application's memory use took another large jump, and it ran for
another couple of days until completing successfully.

The application only succeeded on the large memory node.  It might
have eventually completed on the SSD-based swap node, but it would
have taken significantly more time (I didn't even bother to estimate
how much).

Takeaways:

- When you want RAM, you really want RAM, not something else (swap,
even to SSD, is still swap).  This actually reinforces my belief in
diskless (and thus swapless) nodes.

- Having a couple of test nodes with a different/larger configuration
may allow an application to run to completion (with associated
monitoring).  We now know this specific application is mostly
single-threaded (there was one short period where it used all the
available cores).  We know how much memory the application actually
uses (between 72G and 96G).  Prior testing (which did not go to
completion) had only shown the second plateau and had suggested a
smaller memory need.

- We have better information about what applications like this might
require of our next generation nodes (the ones just delivered will
also be short on memory for this specific application).  We can feed
this information into future expansion/upgrade procurement.

- After doing this one run, the user can answer some basic questions
of their own: Is the application even useful?  What is the value of
the application versus the cost to acquire/upgrade hardware to allow
for additional runs of the application?

All the other discussion in these threads is useful, but sometimes a
basic brute force approach is sufficient (either it works or it
doesn't).  The practical ideas about what is required to build a
500GB node are useful (e.g. you may need multiple CPU sockets just to
be able to attach that much memory at all).

I wish I had more time/need to understand and use some of these lower
level performance improvement techniques.  However, brute force is all
most people are actually interested in, and sometimes the answer is to
wait for time/other uses to drive up the capability of commodity
systems.

I also continue to believe that pushing back on application
programmers to build knowledge about scaling issues is necessary.  I'm
not fully fanatical about it, but I don't like wasting environmental
resources when more intelligent programming would reduce the need.  I
like the idea of large HPC systems helping to supply this back
pressure by getting more work out of existing systems.  Unfortunately,
this often seems to work the other way, in that people now think
(smaller) HPC systems are becoming commodity items.

Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone


