[Beowulf] Definition of HPC

Wed Apr 24 07:05:04 PDT 2013

>> Because it stopped the random out of memory conditions that we were having.
>
> aha, so basically "rebooting windows resolves my performance problems" ;)

in other words, a workaround.  I think it's important to note when behavior
is a workaround, so that it doesn't get ossified into SOP.

> Mark, I don't understand your forcefulness here.

it's very simple: pagecache and VM balancing is a very important part of the
kernel, and has received a lot of quite productive attention over the years.
I question the assumption that "rebooting the pagecache" is a sensible way to 
deal with memory-tuning problems.  it seems very passive-aggressive to me:
as if there is an assumption that the kernel isn't or can't Do The Right Thing
for HPC.

I think this is a good discussion - let's keep the thread going. Any other contributions welcomed!
And apologies if my quoting is screwed up - blame Outlook.

for sites where a single job is rolled onto all nodes and runs for a long 
time, then is entirely removed, sure, it may make sense.  

I agree there. The assumption being made here is that you have one job per node - which may not be the case.

rebooting entirely
might even work better.  

Actually, that is well worth discussing. What would be so bad about rebooting between jobs? That would also
clear out any orphaned processes and any shared memory segments left by past jobs.
And I have had to work on clearing orphaned processes and running cleanipcs on nodes in the past.

> All modern compute nodes are essentially NUMA machines (I am assuming all are dual or more socket machines).

it depends.  we have some dual E5-2670 nodes that have two memory nodes - 
I strongly suspect that they do not need any pagecache-reboot, since 
they have just 2 normal zones to balance.  obviously, 4-chip nodes
(including AMD dual-G34 systems) have an increased chance of fragmentation.
similarly, if you shell out for a MANY-node system, and run a single job
at a time on it, you should certainly be more concerned with whether the 
kernel can balance all your tiny little memory zones.  standard statistics
apply: if the kernel balances a zone well .99 of the time, anyone with 
a few hundred zones will be very unhappy sometimes.
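
To put rough numbers on that, here is a minimal sketch. It assumes every zone is balanced independently with the same probability, which is a simplification for illustration and not a claim about how the kernel behaves. The function name is made up for this example. Under that assumption, at .99 per zone the chance of at least one badly balanced zone is about 63% with 100 zones and about 95% with 300.

```python
def p_any_zone_unbalanced(p_zone_ok: float, n_zones: int) -> float:
    """Chance that at least one of n_zones is badly balanced.

    Assumes zones are independent and share the same per-zone
    success probability p_zone_ok (an illustrative simplification).
    """
    return 1.0 - p_zone_ok ** n_zones


if __name__ == "__main__":
    # 2 zones ~ a dual-socket node; a few hundred ~ a many-node system
    for n in (2, 8, 100, 300):
        print(f"{n:4d} zones: {p_any_zone_unbalanced(0.99, n):.3f}")
```

So a dual-socket node has little to worry about, while a machine with hundreds of zones will see an unhappy zone most of the time.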

in short, all multi-socket (>1s) servers are NUMA, but that doesn't mean you should drop_caches.

> If caches are a large fraction of memory then you have increased memory
> requests from the foreign node.

wait, slow down.  first, why are you assuming remote-node access?  do your 
jobs specifically touch a file from one node, populating the pagecache,
then move to another node to perform the actual IO?  we normally have a rank
wired to a particular core for its life.

yes, it's certainly possible for high IO to consume enough pagecache to also 
occupy space on remote nodes.  are you sure this is bad though?  pagecache is 
quite deliberately treated as a low-caste memory request - normally pagecache
scavenges its own current usage under memory pressure.  and the alternative 
is to be doing uncached IO (or pagecache misses).

I often also meet people who think that having free memory is a good thing,
when in fact it means "I bought too much ram".  that's a little over the top,
of course, but the real message is that having all your ram occupied,
even or especially by pagecache, is good, since pagecache is so efficiently
scavenged.  (no IO, obviously - the Inactive fields in /proc/meminfo are 
lists dedicated to this sort of easy scavenging.)
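
If you want to see how much memory sits on those lists on a given node, here is a minimal sketch that pulls the Inactive fields out of /proc/meminfo. The function names and the SAMPLE numbers are made up for illustration. The per-type fields Inactive(anon) and Inactive(file) are assumed to be present, as they are on reasonably recent kernels; older kernels may only report a single Inactive line. If /proc/meminfo cannot be read, the sketch falls back to the sample text.

```python
# Made-up sample in /proc/meminfo format, used only as a fallback.
SAMPLE = """\
MemTotal:       65932156 kB
MemFree:         1203412 kB
Cached:         41230016 kB
Active(file):   18200000 kB
Inactive(file): 22100000 kB
Inactive(anon):   350000 kB
Inactive:       22450000 kB
"""


def parse_meminfo(text: str) -> dict:
    """Map each field name to its number (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def inactive_fields(fields: dict) -> dict:
    """Keep only the Inactive* entries (the easy-to-scavenge lists)."""
    return {k: v for k, v in fields.items() if k.startswith("Inactive")}


def read_meminfo_text(path: str = "/proc/meminfo") -> str:
    """Read the live file, or fall back to SAMPLE if it is unavailable."""
    try:
        with open(path) as f:
            return f.read()
    except OSError:
        return SAMPLE


if __name__ == "__main__":
    fields = parse_meminfo(read_meminfo_text())
    total = fields.get("MemTotal", 0)
    for name, kb in sorted(inactive_fields(fields).items()):
        share = 100.0 * kb / total if total else 0.0
        print(f"{name:16s} {kb:12d} kB  ({share:.1f}% of MemTotal)")
```

A large Inactive(file) figure on a busy node is the "occupied but cheap to reclaim" memory described above, not memory that is unavailable to jobs.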

> Surely for HPC workloads resetting the system so that you get deterministic run times is a good thing?

who says determinism is a good thing?  I assume, for instance, you turn off 
your CPU caches to obtain determinism, right?  I'm not claiming that variance
is good, but why do you assume that the normal functioning of the pagecache 
will cause it?

I can refer you to section 8.3.2 of 'Introduction to High Performance Computing for Scientists and Engineers' by Georg Hager
and Gerhard Wellein
There they have a benchmark of vector triads run on a dual-socket Intel system with uniform memory access, versus a dual-socket Opteron
with NUMA characteristics.
They see about a 2x variation depending on the buffer size filled.

Also look at pages 96-98 of this set of slides (same graph as in the book)
http://www.multicore-challenge.org/resources/FMCCII_Tutorial_Treibig.pdf

Definitely interested in your response - I am certainly learning here!
And yes - running a vector triad benchmark is artificial, and any effects seen would depend on the real code you are running.
