[Beowulf] Re: Purdue Supercomputer

Mark Hahn hahn at mcmaster.ca
Sat May 10 17:28:03 PDT 2008


> clusters. What if you have 1 of the systems in the cluster down, or any
> network failures? Can we make our cluster (2-5 systems only) work properly?

normally, the cluster's management software will monitor and deal with
node failure.  at least that means noticing a failure and ensuring that the
node isn't used (until fixed) and dealing with any jobs that involved the
node.  it's also fairly common for server nodes (not just slave/compute
nodes) to have some failover/high-availability features.  (HA can also be
done for compute jobs, but IMHO it's not worth considering in normal cases,
i.e., when node failures are infrequent.)
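
as a rough sketch of the bookkeeping described above (not any particular
scheduler's API; the class, the heartbeat scheme and all names here are
hypothetical), a monitor can mark a node offline once it has been silent
for too long and report which jobs touched it:

```python
from dataclasses import dataclass, field


@dataclass
class NodeTracker:
    """Toy failure monitor: tracks heartbeats and the jobs placed on nodes."""
    timeout: float                                  # seconds of silence before a node counts as down
    last_seen: dict = field(default_factory=dict)   # node name -> time of last heartbeat
    jobs: dict = field(default_factory=dict)        # job id -> set of node names it runs on
    offline: set = field(default_factory=set)       # nodes withheld from scheduling until fixed

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def check(self, now):
        """Mark silent nodes offline and return the ids of jobs that involved them."""
        newly_down = {
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout and node not in self.offline
        }
        self.offline |= newly_down
        return sorted(job for job, nodes in self.jobs.items() if nodes & newly_down)

    def usable(self):
        """Nodes the scheduler may still hand out."""
        return sorted(set(self.last_seen) - self.offline)
```

what happens to the affected jobs (requeue, cancel, notify the user) is a
site policy decision and is left out of the sketch.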

> Also what about geographically distant cluster systems? Say 1 in USA

sure, there's nothing about clusters that really assumes locality,
though obviously geographic distribution has effects on achievable
performance for wide-area MPI or distant file access.  wide-area
clustering seems more of a political stunt to me (yes, including grids.)
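
to put a rough number on the performance effect: a back-of-envelope lower
bound on round-trip latency, assuming signals travel at about 2/3 of c in
fibre and taking ~13,000 km as an illustrative USA-to-India path length
(both figures are my assumptions, and real routes are longer and add
switching delay):

```python
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s


def min_rtt_ms(distance_km, velocity_factor=2 / 3):
    """Lower bound on round-trip time, in ms, over a path of the given length.

    velocity_factor is the assumed signal speed as a fraction of c
    (about 2/3 for optical fibre).
    """
    one_way_s = distance_km / (C_KM_PER_S * velocity_factor)
    return 2 * one_way_s * 1000.0


if __name__ == "__main__":
    paths = [
        ("same machine room (0.1 km)", 0.1),
        ("USA to India (13,000 km, assumed)", 13_000),
    ]
    for label, km in paths:
        print(f"{label}: at least {min_rtt_ms(km):.4f} ms round trip")
```

under those assumptions the wide-area round trip cannot be better than
roughly 130 ms, several orders of magnitude above what a local
interconnect delivers, which is why tightly-coupled MPI suffers.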

> and the other in India. How do we manage our cluster in mishaps or
> difficult conditions?

I find that with IPMI and console redirection, it's very rarely necessary to
care about where your nodes are, at least from a sysadmin perspective.
you need to ask what the benefit is, though, in a wide-area cluster
(versus separate, local ones.)  I wouldn't assume that management would
be easier, and obviously only gratuitously parallel apps (sometimes called
embarrassingly parallel) could use it.
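
for the IPMI side, a minimal sketch of driving a node's BMC remotely with
the ipmitool CLI.  the host name and user are made up, the helper names are
mine, and it assumes the BMC speaks IPMI-over-LAN 2.0 ("lanplus") with the
password supplied via the IPMI_PASSWORD environment variable (-E):

```python
import subprocess


def ipmi_argv(bmc_host, user, *subcommand):
    """Build an ipmitool command line for a remote BMC (password taken from IPMI_PASSWORD)."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_host, "-U", user, "-E", *subcommand]


def power_status(bmc_host, user):
    """Ask the BMC whether the chassis is powered on; needs ipmitool installed."""
    result = subprocess.run(
        ipmi_argv(bmc_host, user, "chassis", "power", "status"),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# hypothetical BMC address; "sol activate" attaches to the serial-over-LAN console
console_cmd = ipmi_argv("node17-bmc.example.org", "admin", "sol", "activate")
```

the same helper covers "chassis power cycle" for a hung node, so the
distance to the machine mostly stops mattering for routine admin work.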

> lastly, how about having beowulf cluster systems in space? putting 1 pc
> on each planet or celestial body that we want to track and the server
> in india.

just because it could be done doesn't mean it makes sense...

> is Linux the best choice in such cases...

your choice of OS depends primarily on your preference and experience.


