[Beowulf] cloudy HPC?

Fri Feb 7 02:42:21 PST 2014

On Fri, Jan 31, 2014 at 4:30 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
> it would split the responsibility into one organization concerned
> only with hardware capital and operating costs, and another group
> that does purely os/software/user support.

Well, in a way this split already exists in larger HPC centers, where
there are different people taking care of the hardware and software
sides. Except that they are part of the same organization, probably
have easy ways to communicate and work together, and a single boss :)

If I understand your idea correctly, the two organizations would be
separate entities, possibly with different bosses, and the
virtualization layer would separate them at runtime. In my view, this
isn't terribly different from having 2 organizations with a complete
stack (HW+SW) each, where the internal communication/workflow/etc. in
each of them is better, but there's an overall performance loss as the
two HPC installations can't be used as a single one. And here we come
to the difference between grid (2 HPC installations) and cloud (single
HPC installation with VMs). Pick your poison...

> we wouldn't, obviously.  owning an IB card is only relevant for an MPI
> program, and one that is pretty interconnect-intensive.  such jobs
> could simply be constrained to operating in multiples of nodes.

I somehow thought that the cluster would be homogeneous. If you operate
with constraints/limits/etc, then I would argue that the most cost
effective way is to buy different types of nodes for different types
of jobs: IB-equipped for MPI jobs, many-core nodes for
threaded-but-not-MPI jobs, high-GHz few-core nodes for single CPU
jobs. And still allow them to operate on physical hardware...

Also I somehow thought that the discussion was mostly about
tightly coupled communicating jobs. If the jobset is very heterogeneous,
are all types of jobs amenable to running in a VM? For example, a job
taking one full node both in terms of cores and RAM and with little
communication needs could run on the physical HW or in a VM and use the
same node. In this case the cost of virtualization (more below) is paid
purely for the organizational split, i.e. there is no technical
advantage of running in a VM.

> I don't know why you ask that.  I'm suggesting VMs as a convenient way
> of drawing a line between HW and SW responsibilities, for governance
> reasons.

You are indeed drawing a line; what I was arguing about is its
thickness :) Let's talk about some practical issues:
- the HW guys receive a VM which requires 1 CPU and 64GB RAM. This is a
hard requirement and the VM will not run on a host with less than
this. Such a VM might be the result of a user who previously ran
through the queuing system on the physical HW without specifying the
amount of required memory, which is very often the case; when moving
to VMs the user just puts in the figure for the whole node (s)he was
previously running on, without knowing the actual requirements.
Such VMs will create serious bottlenecks in scheduling, e.g. on nodes
with 12-24 cores and 128GB RAM.
- the HW guys have to provide some kind of specifications for the VMs.
It will make a large difference in performance whether the VM (say
KVM) will expose an rtl8139 or an optimized virtio device. Same whether
the VM will just provide an Ethernet device or a passthrough-IB one.
Also same whether the VM exposes a generic CPU architecture or gives
access to SSE/FMA/AVX/etc. or if it allows access (again passthrough?)
to a GPGPU/MIC device.
- HW has failures. Do you use some kind of SLA to deal with this?
More technically, how does a failure in HW translate to a failure of the
VM or in the VM?
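
To put rough numbers on the first point above, here is a minimal
sketch of the packing arithmetic. The function names are made up for
illustration, and it assumes no memory overcommit and no RAM reserved
for the hypervisor itself, so a real host would likely do worse.

```python
def vms_per_node(node_cores, node_ram_gb, vm_cores, vm_ram_gb):
    """How many identical VMs fit on one node.

    Assumes no overcommit and ignores hypervisor overhead
    (both are simplifying assumptions, not how any particular
    scheduler works).
    """
    return min(node_cores // vm_cores, node_ram_gb // vm_ram_gb)


def idle_cores(node_cores, node_ram_gb, vm_cores, vm_ram_gb):
    """Cores left unusable once the node is full of such VMs."""
    n = vms_per_node(node_cores, node_ram_gb, vm_cores, vm_ram_gb)
    return node_cores - n * vm_cores


if __name__ == "__main__":
    # The 1 CPU / 64GB VM from the example, on 128GB nodes
    # with 12 and 24 cores.
    for cores in (12, 24):
        n = vms_per_node(cores, 128, 1, 64)
        idle = idle_cores(cores, 128, 1, 64)
        print(f"{cores} cores: {n} VMs fit, {idle} cores stranded")
```

With these assumptions memory runs out after two VMs, so 10 of 12
(or 22 of 24) cores can only be used by jobs that need almost no RAM.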

>  though it's true that this could all be done bare-metal
> (booting PXE is a little clumsier than starting a VM or even container.)

With the degree of automation we have these days, I don't think in
terms of clumsiness but in terms of time needed until the job can
start. It's true that a container can start faster than booting a node
and VMs can be paused+resumed. But do a few seconds to tens of seconds
make a huge difference in node repurposing time for your jobset? If a
typical job runtime is in the hours to days range, I would say that it
doesn't...
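
A back-of-the-envelope sketch of that ratio, with numbers I picked
purely for illustration (a 30 second difference in start-up time,
jobs running for one hour and for one day); the function name is
hypothetical too.

```python
def extra_wait_fraction(extra_startup_s, job_runtime_s):
    """Extra start-up delay expressed as a fraction of the job runtime."""
    return extra_startup_s / job_runtime_s


if __name__ == "__main__":
    extra_s = 30  # assumed difference between booting a node and starting a VM
    for label, runtime_s in (("1 hour", 3600), ("1 day", 86400)):
        frac = extra_wait_fraction(extra_s, runtime_s)
        print(f"{label} job: {frac:.3%} of the runtime spent on extra start-up")
```

Under these assumed numbers the extra delay stays below 1% of the
runtime even for the one hour job.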

> and that many jobs don't do anything that would stress the interconnect
> (so could survive with just IP provided by the hypervisor.)

This makes a huge difference. If many/most jobs have this pattern,
then a traditional HPC installation is probably the wrong solution; a
deal with one of the large cloud providers would probably give much
better satisfaction. But in this case the HW guys (which side are you
on? :)) will remain jobless...

Cheers,
Bogdan