[Beowulf] cloudy HPC?

Sun Feb 9 11:03:40 PST 2014

On Fri, 7 Feb 2014, Bogdan Costescu wrote:
> On Fri, Jan 31, 2014 at 4:30 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
>> it would split the responsibility into one organization concerned
>> only with hardware capital and operating costs, and another group
>> that does purely os/software/user support.
>
> Well, in a way this split already exists in larger HPC centers, where
> there are different people taking care of the hardware and software
> sides. Except that they are part of the same organization, probably
> have easy ways to communicate and work together, and a single boss :)

perhaps VERY large centers.  box-monkeying probably requires one person
per 10k or so nodes, perhaps less, depending on the organization's
attitude towards vendor service contracts, time-to-repair, whether nodes
are repaired at all, etc.  I would argue that such activity should really be
regarded as "Facilities", and needs essentially no contact or communication
with the software side, let alone a shared boss.

I think this is a far more natural division 
than the usual sysadmin vs user/app-specialist.

> If I understand correctly your idea, the two organizations would be
> separate entities, possibly with different bosses, and the
> virtualization layer would separate them at runtime.

sure, let's call it Facilities and Everyoneelse (EOE).

> In my view, this
> isn't terribly different from having 2 organizations with a complete
> stack (HW+SW) each, where the internal communication/workflow/etc. in

I don't follow you at all.  to me, Facilities has essentially no SW,
and to EOE, hardware is almost invisible.  (well, either HW is available
or it is not).  Amazon/Azure/GCE/etc seem to agree that this is a useful
dividing point (though of course all such IaaS providers also attempt to add
value (extract revenue) via storage, bandwidth, monitoring/automation, etc.)

> each of them is better, but there's an overall performance loss as the
> two HPC installations can't be used as a single one. And here we come
> to the difference between grid (2 HPC installations) and cloud (single
> HPC installation with VMs). Pick your poison...

yeah, you lost me.  I'm talking about a horizontal partition in the stack.
I don't see any relation between outdated notions of Grid and what I'm 
talking about.  (the division I'm proposing isn't really even dependent
on using VMs: EOE could offer an API to boot on metal, for instance.)

>> we wouldn't, obviously.  owning an IB card is only relevant for an MPI
>> program, and one that is pretty interconnect-intensive.  such jobs
>> could simply be constrained to operating in multiples of nodes.
>
> I somehow thought that the cluster would be homogeneous.

again, I don't follow you.  if you're buying a cluster for a single dedicated
purpose, then there is no real issue.  my context is "generic" academic
research HPC, which is inherently VERY high variance.

> If you operate
> with constraints/limits/etc, then I would argue that the most cost
> effective way is to buy different types of nodes for different types
> of jobs:

that's fine if you have a predictable job mix - you can partition it
and specialize each node.  though I'd argue that this is a bit deceptive:
if your workload is serial, you might well still want IB just to provide
decent IO connectivity (in spite of not caring about latency) so the node
might be identical to an MPI workload.

memory-per-core is certainly a parameter that matters, and to some
extent cpu-memory-bandwidth can be optimized (mainly by varying the
number and clock of cores, of course: at any moment there is pretty
much one reasonable choice of memory.)

hmm, I suppose add-in-cards are another anti-generic dimension, since 
jobs in general don't want a GPU, but some really do.  I would claim that
add-in accelerators/coprocessors are not a permanent feature, and don't
really change the picture.  (that is, the field is attempting to become
more generally useful in directions such as Phi and AMD-APU, and that 
will make GP-GPU no longer a thing anyone talks about eventually.)

> IB-equipped for MPI jobs,

IB is still desired for IO, even for non-MPI.

> many-core nodes for threaded-but-not-MPI jobs,

I don't think so - MPI and serial jobs still want manycore nodes, since the
main point of manycore is compute density/efficiency.  in a very broad
sense, synchronization is not dramatically faster amongst cores on a node
versus a fast inter-node fabric (or conversely, message passing can
efficiently use shared memory.)

> high-GHz few-core nodes for single CPU jobs.

I don't see that happening much.  if people have only a few serial jobs,
they'll run them on their 3.7 GHz desktop, and people who have many
serial jobs would rather have 32 2.4 GHz cores than 4 at 3.7 GHz.
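
a back-of-envelope version of that comparison, as a rough sketch only.  it
assumes throughput for independent serial jobs scales as cores x clock,
which is an assumption of this example (it ignores IPC, turbo, cache and
memory bandwidth); `core_ghz` is just an illustrative helper name.

```python
# back-of-envelope throughput for a pile of independent serial jobs.
# assumption (not from the thread): throughput scales as cores x clock,
# ignoring IPC, turbo, cache and memory-bandwidth effects.

def core_ghz(cores, clock_ghz):
    """Aggregate clock capacity of one node, in core-GHz."""
    return cores * clock_ghz

many_slow = core_ghz(32, 2.4)   # the 32 x 2.4 GHz node
few_fast = core_ghz(4, 3.7)     # the 4 x 3.7 GHz node
ratio = many_slow / few_fast

print(f"32 x 2.4 GHz: {many_slow:.1f} core-GHz")
print(f" 4 x 3.7 GHz: {few_fast:.1f} core-GHz")
print(f"ratio: {ratio:.1f}x")
```

under that crude model the many-core node has roughly 5x the aggregate
capacity, which is the whole argument for anyone with many serial jobs.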

> Also I somehow thought that the discussion was mostly about
> tight-coupled communicating jobs.

no.

> If jobset is very heterogeneous, are
> all types of jobs amenable to run in a VM ? F.e. a job taking one full
> node both in terms of cores and RAM and with little communication
> needs, could run on the physical HW or in a VM and use the same node.

sure.  traditionally, running on bare metal has been somewhat slower to
set up, but could be faster when running due to virtualization overheads
(if any).

> In this case the cost of virtualization (more below) directly
> translates to the organizational split, i.e. there is no technical
> advantage of running in a VM.

well, not quite: it's always potentially handy to have the hypervisor:
let it step in to provide isolation, or to perform checkpoints/migration...

>> I don't know why you ask that.  I'm suggesting VMs as a convenient way
>> of drawing a line between HW and SW responsibilities, for governance
>> reasons.
>
> You are indeed drawing a line, what I was arguing about is its
> thickness :)

sorry, lost me again.

> Let's talk about some practical issues:
> - the HW guys receive a VM which requires 1CPU and 64GB RAM.

no, HW guys have a datacenter filled with 10k identical nodes, each
with 20 cores and 64G ram, qdr IB, 4x1T local disks.  they are responsible
for keeping it powered, cooled and repaired.

> This is a
> hard requirement and the VM will not run on a host with less than
> this.

memory is a parameter that might well justify "de-homogenizing" nodes,
but the problem is that this (partitioning in general) always introduces
the opportunity for inefficiency when supply and demand don't match.
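
a toy illustration of that mismatch, with made-up numbers.  the node
counts, the partition names and both helper functions are hypothetical;
the pooled case assumes every node is adequate for every job, which is
exactly the homogeneous-cluster assumption.

```python
# toy model: how much demand gets served when capacity is split into
# fixed partitions versus kept as one pool.  all numbers are invented.

def served_partitioned(capacity, demand):
    """Each partition can only serve demand aimed at that partition."""
    return sum(min(capacity[k], demand.get(k, 0)) for k in capacity)

def served_pooled(capacity, demand):
    """Any node can serve any job (homogeneous nodes assumed)."""
    return min(sum(capacity.values()), sum(demand.values()))

capacity = {"bigmem": 200, "standard": 800}    # nodes per partition
demand = {"bigmem": 50, "standard": 1100}      # node-equivalents wanted

split = served_partitioned(capacity, demand)
pooled = served_pooled(capacity, demand)
print(f"partitioned: {split} nodes busy, {sum(capacity.values()) - split} idle")
print(f"pooled:      {pooled} nodes busy, {sum(capacity.values()) - pooled} idle")
```

with these numbers the partitioned layout leaves 150 bigmem nodes idle
while standard jobs queue; the pooled layout keeps everything busy.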

> This VM might come as a result of some user previously running
> through the queuing system on the physical HW and not specifying the
> amount of required memory - which is very often the case;

never for us: we require a hard memory limit at submit time.  it would 
be amusing to use VMs to address this though, if users really didn't want
to predict limits or couldn't.  (arguably our experience is that lots of 
users are totally useless in setting this parameter...)

one could imagine launching such processes on a box that has vast memory,
then, once you think the process has stabilized, migrating it to a box 
where it just fits.  fundamentally, the question is how much you're going
to save by doing this - is memory cheap?  (yes, you'd also have to deal
with the issue of jobs that have multiple phases with different memory 
use, so might be migrated ("repacked") more than once.)
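
the "repack" step could be sketched as a best-fit choice: once a peak has
been observed, pick the node with the least free memory that still fits.
this is only a sketch; the node names, the 20% headroom and the function
name are all assumptions of the example, and it says nothing about how
the migration itself would be done.

```python
# best-fit target selection for migrating a process once its memory use
# looks stable.  node names and the headroom factor are hypothetical.

def best_fit_node(free_mem_gb, peak_gb, headroom=1.2):
    """Return the node with the least free memory that still fits
    peak_gb * headroom, or None if nothing fits."""
    need = peak_gb * headroom
    fits = [(free, name) for name, free in free_mem_gb.items() if free >= need]
    if not fits:
        return None
    return min(fits)[1]

free = {"big0": 1024, "n17": 64, "n42": 24, "n99": 12}

target = best_fit_node(free, peak_gb=18)   # needs ~21.6 GB
print(target)
```

a job with several phases would simply go through this selection again
each time its observed peak changes.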

> - the HW guys have to provide some kind of specifications for the VMs.

facilities guys just run the hardware; yes, some sort of negotiation needs
to take place to ensure that the hardware can actually be used.

> It will make a large difference in performance whether the VM (say
> KVM) will expose a rtl8139 or an optimized virtio device. Same whether
> the VM will just provide an Ethernet device or a passthrough-IB one.

will it?  do you have numbers?

> Also same whether the VM exposes a generic CPU architecture or gives
> access to SSE/FMA/AVX/etc.

I can't imagine any reason to hide physical capabilities.  if one had 
heterogeneous HW, it might be valuable to track cpu feature usage of each
job, to maximize scheduling freedom, of course.  same as tracking memory
usage or MPI intensity or IO patterns.  but in none of those cases would 
you actually lie to clients about availability.
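
a minimal sketch of what tracking feature usage buys the scheduler: if
you know which cpu features a job actually uses, the set of nodes it may
run on is a subset test.  the node names and feature sets are invented,
and how the usage would be measured is left open.

```python
# scheduling freedom from tracked cpu-feature usage: a job may run on any
# node whose feature set contains the features the job was seen to use.
# node names and feature sets are hypothetical.

def eligible_nodes(job_features, node_features):
    """Names of nodes that offer every feature the job uses, sorted."""
    return sorted(name for name, feats in node_features.items()
                  if job_features <= feats)

nodes = {
    "old": {"sse2"},
    "mid": {"sse2", "avx"},
    "new": {"sse2", "avx", "avx2", "fma"},
}

print(eligible_nodes({"sse2"}, nodes))          # can go anywhere
print(eligible_nodes({"avx2", "fma"}, nodes))   # pinned to the newest nodes
```

note that nothing is hidden from the guest here: the node advertises
what it has, and the bookkeeping only widens the placement choices.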

> or if it allows access (again passthrough?)
> to a GPGPU/MIC device.

high-cost heterogeneity is really a strategic question: do you think 
that demand will be predictable enough to justify hiving off a specialized
cluster?  using VM/containers makes it *easier* to manage a mixed/dynamic
cluster, since, for instance, most GPU jobs don't fully occupy the CPU cores,
which can then be used by migratory serial jobs.
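
as a rough sketch of the arithmetic: the cores a node's GPU jobs leave
idle are what is on offer to migratory serial work.  the 20-core node
matches the hypothetical node discussed in this thread; the job names and
core counts are invented.

```python
# cores left over on a GPU node once its GPU jobs have taken the few
# cpu cores they actually use.  job names and core counts are made up.

def spare_cores(total_cores, gpu_jobs):
    """Cores not used by the GPU jobs currently placed on the node."""
    used = sum(job["cores"] for job in gpu_jobs)
    if used > total_cores:
        raise ValueError("GPU jobs oversubscribe the node")
    return total_cores - used

node_cores = 20
gpu_jobs = [
    {"name": "md-run", "cores": 2},
    {"name": "train", "cores": 4},
]

backfill_slots = spare_cores(node_cores, gpu_jobs)
print(f"{backfill_slots} cores available for serial jobs")
```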

> - HW has failures. Do you use some kind of SLA to deal with this ?

I don't see this as a problem.  in my organization, none of the compute 
nodes even has UPS power, and our MTBF is high enough that people get good
work done.  in the governance structure I'm proposing, there would be 
some sort of interface between facilities and EOE, but there's nothing 
difficult there: either nodes work or they don't.  SLAs are a legalistic 
way to approach it, whereas shared monitoring would make it feel
less zero-sum.

> More technical, how does a failure in HW translate to a failure of the
> VM or in the VM ?

I don't know what you mean.  are you suggesting that byzantine failure modes
would be widespread and thus a concern?  ie, that facilities and EOE would
have a hard time agreeing on what constitutes failure?

>>  though it's true that this could all be done bare-metal
>> (booting PXE is a little clumsier than starting a VM or even container.)
>
> With the degree of automation we have these days, I don't think in
> terms of clumsiness but in terms of time needed until the job can
> start. It's true that a container can start faster than booting a node
> and VMs can be paused+resumed. But do a few seconds to tens of seconds
> make a huge difference in node repurposing time for your jobset ? If a

as I've said, we have no jobset - or rather *all* jobsets.  we, like most 
HPC centers, tell users to make their jobs last at least minutes, because 
we know that our (crappy old) infrastructure has seconds-to-minutes of 
overhead.  obviously this is not inherent, and if all our users suddenly
wanted to run only 5s jobs, we could figure out a way to do it.

let me put it this way: more isolation always costs more startup overhead
(and sometimes some ongoing speed cost.)  this is well-known, though not 
particularly well-handled anywhere to my knowledge.  most Compute Canada
centers do whole-node scheduling, and some provide layered systems for 
doing single-user queueing of sub-jobs for this reason.  obviously, it 
would be far better for the user not to need to get dragged down to this
level, assuming the system could manage it efficiently.  (I also really 
hate schedulers that have "array jobs" because it's usually nothing more 
than admission that the scheduler's per-job overhead is too high.)
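
to put rough numbers on why short jobs hurt: the useful fraction of a
job slot is runtime / (runtime + overhead).  the 30s per-job overhead is
an assumed figure for illustration only, not a measurement of any real
scheduler.

```python
# fraction of wall time spent doing real work when every job pays a
# fixed startup overhead.  the 30 s overhead is a hypothetical figure.

def useful_fraction(runtime_s, overhead_s):
    """runtime / (runtime + overhead), for one job."""
    return runtime_s / (runtime_s + overhead_s)

overhead = 30.0
for runtime in (5, 60, 3600):
    frac = useful_fraction(runtime, overhead)
    print(f"{runtime:5d} s job: {frac:.1%} useful")
```

with that overhead a 5s job wastes most of its slot, while an hour-long
job barely notices, which is why users get told to batch their work.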

> typical job runtime is in the hours to days range, I would say that it
> doesn't...

we have no discernable job mixture, and I think that accurately reflects 
the reality of HPC (or "Advanced Research Computing") today.  a particular
center may decide "we want nothing to do with anything other than
tight-coupled MPI jobs of 10k ranks or greater that run for at least 1d".
bully for them.  I'm talking about the whole farm, not just cherry picking.

>> and that many jobs don't do anything that would stress the interconnect
>> (so could survive with just IP provided by the hypervisor.)
>
> This makes a huge difference.

debatable; do you have any data?

> If many/most jobs have this pattern,

as I've said several times, I'm talking about a job stew that has no 
discernable "many/most".

you can force users to sort themselves into partitions based on the kind
of resources they use.  I'd argue that this is irresponsibly BOFHish,
and simply becomes an even less tractable partitioning problem.

that is, if you decide to cherry-pick big-HPC into a single tier1 center,
and then apple-pick all the small-serial into another, separate center,
you still have to decide how much cash to spend on each - and MORE 
importantly, how much to spend on the "everyone else" center.

> then a traditional HPC installation is probably the wrong solution, a
> deal with one of the large cloud providers would probably provide much
> better satisfaction.

the interesting thing here is that you seem to think there's something 
special about HPC or "large cloud providers".  I'm suggesting there is not:
running datacenters is a very straightforward facilities challenge
(with, incidentally, little economy of scale), and a Facilities
organization (like Amazon) could do HPC perfectly well.  (Amazon happens
to make an obscene profit on their Facilities business, which is why it's
not a realistic or rational choice to replace forms of HPC.)

> But in this case the HW guys (which side are you
> ? :)) will remain jobless...

I'm full-stack, myself: from dimms to collaborating with users on research.

regards, mark hahn.