[Beowulf] Can one Infiniband net support MPI and a parallel
hahn at mcmaster.ca
Thu Aug 14 22:23:53 PDT 2008
> Gus' numbers make sense to me. I assume his workload consists of jobs of
> several sizes: serial, modestly parallel, and parallel jobs using all resources.
> Without pre-emptive scheduling, the batch queue system has to starve the
> system in order to run the larger jobs.
unless backfill can utilize those temporarily idle cpus.
> Obviously, before a job which
> consumes all resources starts, all resources have to be idle. Which
> means no jobs can be scheduled on those resources, even though they're idle.
true enough, but it does depend on the size of large, high-prio jobs
relative to the size of the cluster.
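
to make the backfill point concrete, here is a minimal sketch in Python. It is not how any particular scheduler implements it; the `Job` class, the function names, and the numbers in the usage note are all hypothetical. It assumes the simplest case: a high-prio job is waiting for cpus to drain, and a queued job may be backfilled only if it fits in the currently idle cpus and its requested walltime ends before the big job could start anyway.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cpus: int
    walltime: float  # requested hours (hypothetical unit)


def shadow_time(running, total_cpus, needed_cpus):
    """Earliest time (hours from now) at which needed_cpus are free.

    running: list of (cpus, hours_remaining) for jobs already executing.
    """
    free = total_cpus - sum(c for c, _ in running)
    if free >= needed_cpus:
        return 0.0
    # Release cpus in the order the running jobs are expected to finish.
    for cpus, remaining in sorted(running, key=lambda r: r[1]):
        free += cpus
        if free >= needed_cpus:
            return remaining
    raise ValueError("job needs more cpus than the cluster has")


def backfill(queue, running, total_cpus, big_job):
    """Pick queued jobs that fit in idle cpus and end before big_job starts.

    Simplification: a backfilled job must finish by the big job's start
    time, even if the big job would not need the cpus it occupies.
    """
    start = shadow_time(running, total_cpus, big_job.cpus)
    idle = total_cpus - sum(c for c, _ in running)
    chosen = []
    for job in queue:  # queue is assumed to be in priority order
        if job.cpus <= idle and job.walltime <= start:
            chosen.append(job)
            idle -= job.cpus
    return start, chosen
```

for example, on a 16-cpu cluster with `running = [(8, 4.0), (4, 1.0)]` and a 16-cpu job waiting, the big job cannot start for 4 hours, and any queued job needing at most 4 cpus and at most 4 hours can use the idle cpus in the meantime. How much this recovers depends on users giving realistic walltime requests.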
> Another interesting metric is of course how many of the jobs run to
> successful completion, i.e., are not killed due to resource limits, or
> crashes, or for other reasons. That's what I call net vs. gross utilization.
surely this survival rate is quite high, no? again, it depends largely
on the design of the cluster: I see few node crashes (maybe 1 of 768 nodes
per week) and few resource crashes (perhaps a couple of buggy jobs per week).
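
one way to put numbers on the net vs. gross distinction is sketched below in Python. The definitions are an assumption, not something the thread spells out: gross counts all cpu-hours consumed, net counts only the cpu-hours of jobs that ran to successful completion, and both are divided by the cpu-hours available in the period. The function name and the tuple layout are hypothetical.

```python
def utilization(jobs, total_cpus, period_hours):
    """Return (gross, net) utilization as fractions of available cpu-hours.

    jobs: iterable of (cpus, hours_run, succeeded) tuples.
    gross counts every job; net counts only jobs with succeeded=True.
    """
    jobs = list(jobs)
    available = total_cpus * period_hours
    gross = sum(c * h for c, h, _ in jobs) / available
    net = sum(c * h for c, h, ok in jobs if ok) / available
    return gross, net
```

with made-up accounting data such as `utilization([(50, 10, True), (20, 10, True), (10, 5, False)], 100, 10)`, the one failed job accounts for 50 of the 1000 available cpu-hours, so net comes out 5 percentage points below gross.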