Why not NT clusters? Need arguments.

Robert G. Brown rgb at phy.duke.edu
Fri Oct 6 13:09:42 PDT 2000


On Fri, 6 Oct 2000, Dan Yocum wrote:

> Jon Tegner wrote:
> > 
> > In a discussion of clusters I got the question of why not use systems
> > running Microsoft NT. I only came up with cost and stability in a
> > sweeping way, and I couldn't present more quantitative arguments. Later,
> 
> 
> And that wasn't enough?  

To be more specific, to have a good chance of COMPLETING a long-running
parallel computation on N hosts, it really helps if the probability of
single host failure is considerably less than 1/N over the time required
for completion.  This is extremely quantitative.  If you estimate that
the mean time between crashes of your NT boxes is ten days, then a
computation that runs for a day on 20 boxes should expect about two
host crashes before it finishes, so you will only rarely COMPLETE one.
This alone is why you won't see many really big clusters running NT.
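
A rough sketch of that arithmetic in Python, for anyone who wants to
plug in their own numbers.  It assumes hosts crash independently with
exponentially distributed times between crashes, that a single host
crash kills the whole job, and that there is no checkpointing; those
are simplifying assumptions, not measurements.  The function names are
made up for illustration.

```python
import math


def expected_crashes(n_hosts, run_days, mtbf_days):
    """Expected number of host crashes across the cluster during one run."""
    return n_hosts * run_days / mtbf_days


def completion_probability(n_hosts, run_days, mtbf_days):
    """Probability that no host crashes during the run.

    Assumes independent hosts with exponentially distributed time
    between crashes (an assumption, not a measured failure model).
    """
    return math.exp(-expected_crashes(n_hosts, run_days, mtbf_days))


if __name__ == "__main__":
    n_hosts, run_days = 20, 1.0
    # Ten days is the MTBF guessed above; 30 and 60 are the "well tuned"
    # figures; 365 is just an arbitrary much-more-stable comparison point.
    for mtbf_days in (10, 30, 60, 365):
        crashes = expected_crashes(n_hosts, run_days, mtbf_days)
        p = completion_probability(n_hosts, run_days, mtbf_days)
        print(f"MTBF {mtbf_days:3d} days: expect {crashes:.2f} crashes, "
              f"P(complete) = {p:.2f}")
```

With a ten day MTBF the expected number of crashes in a one day run on
20 hosts is 2, and under these assumptions the chance of finishing is
exp(-2), or roughly 14%.  The "considerably less than 1/N" rule of thumb
is the same thing as keeping the expected number of crashes well below 1.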

I've heard anecdotally that if an organization has highly skilled NT
people who devote enough time to the project, they can tune and
configure NT well enough to be stable out to 30-60 days (or even more)
and build a workable cluster out of it.  This often limits to some
extent the applications they'll allow to be run, as some applications
are more destabilizing than others.  If this point is raised, you can
counter that:

  a) That extra time and skill costs money.  Quite a lot of it -- humans
are often more expensive than hardware, and really skilled NT SEs are
no more common than any other variety, however many "MCSE"s there are
floating around in the world.  We all know that one cannot learn to
stabilize a complex operating system in a correspondence course or
community college type environment.

  b) Right out of the box, Linux is more stable than the most stable NT
platforms you're ever likely to see.  The later 2.2 kernels are simply
rock solid on all but a few very rare hardware combinations.  It still
requires a skilled individual to install and administer it, but
stabilizing it isn't rocket science.  It also scales very well
administratively, especially using tools like kickstart or some of the
diskless boot mechanisms described on this list.

  c) THEN you can point out the hundreds of dollars per platform you
save on the OS and other software.  This is actually not that much
compared to the human costs unless you have a lot of platforms, which
is one reason a lot of institutions might reasonably give for not
making a switch.

> > I even found that an NT cluster sits at place 207 on the top500 list
> > (see http://www.top500.org/lists/TOP500List.php3?Y=2000&M=06)
> > is that an exception, or are there many of these beasts around?
> 
> 
> Check out how many Linux clusters are far above that on the list...
> actually, I'm sure you'll find many more Linux clusters there than NT
> clusters.

The top500 ranking per se also doesn't address the usability of a
cluster.  There have been systems on that list before that were so
unstable they (according to rumor, anyway) could barely get through
benchmarking runs and were used more for computer science (short runs)
than for parallel application production (long runs).  I don't know what
fraction of them were NT systems (if any), but the preponderance of
Linux on the list is due to its lower cost and higher stability.  People
"vote" with their purchase decisions, and your people would be most
unwise to ignore the wisdom of the masses, especially when the masses
who build top500 machines in the first place are among the best and
brightest cluster computing people in the world.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu






