[Beowulf] VMC - Virtual Machine Console
Robert G. Brown
rgb at phy.duke.edu
Wed Jan 16 06:55:28 PST 2008
On Wed, 16 Jan 2008, Douglas Eadline wrote:
> I get the desire for fault tolerance etc. and I like the idea
> of migration. It is just that many HPC people have spent
> careers getting applications/middleware as close to the bare
> metal as possible. The whole VM concept seems orthogonal to
> this goal. I'm curious how people are approaching this
As previously noted, however, YMMV and one size does not fit all. There
are two distinct ways of managing the heterogeneous environments that
some cluster applications might require. One is indeed the creation of
VMs -- running an extremely thin toplevel operating system that does
little else but run the host VM and respond to provisioning requests,
as is the case in many corporate HA environments. The other is to
create a similar provisioning system that works at the level of e.g.
grub and/or PXE to provide the ability to easily boot a node into a
unique environment that might last only for the duration of a particular
computation. Neither is particularly well supported in current
clustering, although projects for both have been around for some time
(Duke's Cluster On Demand project and wulfware being examples of one,
Xen and various VMs as examples of the other).
There are plenty of parallel chores that are tolerant of poor latency --
the whole world of embarrassingly parallel computations plus some
extension up into merely coarse grained, not terribly synchronous real
parallel computations. Remember, people did parallel computation
effectively with 10Base ethernet for many years (more than a decade)
before 100Base came along, and cluster nodes would now ROUTINELY be
provisioned with at least 1000Base. Even a 1000Base VM is going to have
better latency in most cases than a 10Base ever did on the old hardware
it ran on, and it might well compete with early 100Base latencies. It
isn't exactly like running in a VM is going to cripple all code.
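To put rough numbers on that, here is a minimal sketch in Python of a toy
model, assuming efficiency is simply compute time divided by compute time
plus total message latency. The function name, the latency figures, and
the message counts are all made up for illustration. They are not
measurements of any real network or VM.

```python
# Toy model: fraction of wall-clock time spent computing rather than
# waiting on the network. All numbers below are illustrative assumptions.

def parallel_efficiency(compute_s, messages, latency_s):
    """Compute time divided by compute time plus total message latency."""
    return compute_s / (compute_s + messages * latency_s)

# Hypothetical per-message latencies in seconds (not measurements).
BARE_METAL = 50e-6
IN_VM = 200e-6

# Coarse grained: 10 s of computation per 100 messages.
coarse_bare = parallel_efficiency(10.0, 100, BARE_METAL)
coarse_vm = parallel_efficiency(10.0, 100, IN_VM)

# Fine grained: 10 s of computation per 100,000 messages.
fine_bare = parallel_efficiency(10.0, 100_000, BARE_METAL)
fine_vm = parallel_efficiency(10.0, 100_000, IN_VM)

print(f"coarse: bare {coarse_bare:.4f}  vm {coarse_vm:.4f}")
print(f"fine:   bare {fine_bare:.4f}  vm {fine_vm:.4f}")
```

With these assumed numbers the coarse grained job loses well under one
percent of its efficiency inside the VM, while the fine grained job drops
from about two thirds to about one third.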
VMs can also be wonderful for TEACHING clustering and for managing
"political" problems. In many environments there are potential nodes
with lots of spare cycles that "have to run Windows" 24x7 and have a
Windows console available at the desktop at all times (and thus cannot
be dual booted) but which CAN run e.g. VMware and an "instant node" VM
under Windows. Having any sort of access to a high-latency Linux VM
node running on a Windows box beats the hell out of having no node at
all or having to port one's code to work under Windows.
Clearly, then, there are environments where the bulk of the work being
done is latency tolerant and where VMs may well have benefits in
administration, security, fault tolerance, and local politics that make
them a great boon in clustering. Just as clearly, there are computations
for which latency is the devil, where any suggestion of adding a layer
of VM latency on top of what is already inherent to the device and
minimal OS will bring out the peasants with pitchforks and torches.
Multiboot systems, via grub and local provisioning or PXE and remote
(e.g. NFS) provisioning, are also useful but are not always politically
possible or easy to set up.
It is my hope that folks working on both sorts of multienvironment
provisioning and sysadmin environments work hard and produce spectacular
tools. I've done way more work than I care to setting up both of these
sorts of things. It is not easy, and requires a lot of expertise.
Hiding this detail and expertise from the user would be a wonderful
contribution to practical clustering (and of course useful in the HA
world as well).
>> I certainly cannot speak for the VMC project, but application migration
>> and fault tolerance (the primary benefits other than easy access to
>> heterogeneous environments from VMs) are always going to result in a
>> performance hit of some kind. You cannot expect to do more things with no
>> overhead. There is great value in introducing HA concepts into an HPC
>> cluster depending on the goals and configuration of the cluster in
>> question (as always).
>> I cannot count the number of times a long running job (weeks) crashed,
>> bumming me out as a result, even with proper checkpointing routines
>> integrated into the code and/or system.
>> As a funny aside, I once knew a sysadmin who applied 24 hour timelimits to
>> all queues of all clusters he managed in order to force researchers to
>> think about checkpoints and smart restarts. I couldn't understand why so
>> many folks from his particular unit kept asking me about arrays inside the
>> scheduler submission scripts and nested commands until I found that out.
>> Unfortunately I came to the conclusion that folks in his unit were
>> spending more time writing job submission scripts than code... well...
>> maybe that is an exaggeration.
>> On 16.01.2008, at 14:19, Douglas Eadline <deadline at eadline.org> wrote:
>>> Your project looks interesting and I like the idea of VMs. However,
>>> I have not seen a good answer to the fact that VM = layers and in HPC
>>> layers = latency. Any thoughts? Also, is it open source?
>>>> I would like to announce the availability of VMC (Virtual Machine
>>>> Console). VMC is an attempt to provide an opensource, web-based VM
>>>> management infrastructure. It uses libvirt as the underlying library
>>>> to manage para-virtualized Xen VMs. In time we intend to scale this to
>>>> manage VM clusters running HPC applications.
>>>> You can find out more on our "Introduction to VMC" page:
>>>> List of current features and future plans:
>>>> To get started, we have made available a "VMC Install" document:
>>>> We invite people to take a look at VMC and tell us what you like and
>>>> what you don't like. If you have any problems, questions or
>>>> suggestions please feel free to contact us at dev at sxven.com or post
>>>> them on our forum:
>>>> Best regards,
>>>> Meng Kuan
>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>> To change your subscription (digest mode or unsubscribe) visit
>> Geoff Galitz, geoff at galitz.org
>> Blankenheim, Deutschland
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Robert G. Brown Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977