[Beowulf] clustering using xen virtualized machines
Mark Hahn hahn at mcmaster.ca
Tue Jan 26 15:18:40 PST 2010
>> Is it just me, or does HPC clustering and virtualization fall on
>> opposite ends of the spectrum?

depends on your definitions. virtualization certainly conflicts with those
aspects of HPC which require bare-metal performance. even if you can reduce
the overhead of virtualization, the question is why? look at the basic
sort of HPC environment: compute nodes running a single distro, controlled
by a scheduler. from the user's or job's perspective, there are just some
nodes - which ones doesn't matter, or even how many in total. the user
_should_ be able to assume that when they land on a node, it behaves as
if freshly installed and booted de novo. we don't reboot nodes between
jobs, of course, or even make much effort towards preventing a serial job
from noticing other serial jobs on the same node (as containers would,
let alone VMs). but we could, without tons of effort - just with lower
utilization.

virtualization is about a few things:
- improve utilization by coalescing low-duty-cycle services.
- isolate services from each other - either to directly arbitrate
  runtime resource contention, or to disentangle configurations.
- encapsulate all the state of a server so it can be moved.

I think the first axis is quite non-HPC, since I don't think of HPC jobs
as being like idle services. (OTOH, many clusters have good utilization
because multiple workloads get interleaved _above_ the processor level.)
the second factor is not often an HPC problem, at least not in my
experience, where J Random Fortran user doesn't really care that much
about the environment (ie, they want f77 and lapack and empty queues).
migration has some HPC appeal, since it permits defragmenting a cluster,
as well as better preemption.

> Gavin, not necessarily. You could have a cluster of HPC compute nodes
> running a minimal base OS.
> Then install specific virtual machines with different OS/software stacks
> each time you run a job.

or for each job, just install the provided OS image on the bare metal...
your job's done, have it halt or reboot the node ;)

> OK, this is probably more relevant for grid or cloud computing - I first

grid and cloud computing are all part of the same game, no? along with
massively parallel low-latency MPI, old-style vector supercomputing,
GPU-assisted computing, throughput serial farming, etc.

> thought this would be a good idea when seeing
> that (at the time) the CERN LHC Grid software would only run with Redhat
> 7.2
> So you could imagine 'packaging up' a virtual machine which has your
> particular OS flavour/libraries/compilers and shipping
> it out with the job.

right, that's one of the axes of the problem-space: whether the app gets
its own custom runtime environment (in the sense of kernel, libc, etc).
another axis is the degree to which the app has to contend for resources
(as in an overcommitted normal cluster, or a VM without guaranteed
resources.)

> Another reason could be fault tolerance - you run VMs on the compute
> nodes. When you detect a hardware fault is coming along
> (eg from ECC errors or disk errors) you perform a live migration from
> one node to another - and your job keeps on trucking.
> (In theory, checkpointing needed etc. etc.)

I'm pretty skeptical about this - the main issue with checkpointing is
when there are external side-effects. checkpointing networked apps
(including MPI) is hard because you have state "in flight", so you can
only freeze-dry the state by quiescing (letting the messages land, etc).
the "live migration" demos I've seen have been apps that are tolerant of
the loss of in-flight transactions (or which retry automatically).

so I don't think virt is any kind of paradigm-changer, just like manycore
merely stretches existing definitions.

-mark
