[Beowulf] SGI to offer Windows on clusters

Robert G. Brown rgb at phy.duke.edu
Mon Apr 16 10:14:38 PDT 2007


On Sun, 15 Apr 2007, Mike Davis wrote:

> My number one desire is supported stability. My number two desire is
> speed.  Maybe this philosophy comes from all of my years in the unix
> world (21 and counting), but the idea of standardizing on something that
> has the limited long-term support of FC scares me. We regularly run
> nodes for years without reboot.

You do, of course, see the oxymoron inherent in this statement.  You
regularly run nodes for years without a reboot.  Hence after the nodes
are installed and stabilized (something that will almost certainly take
place during the period where FC is in fact well supported) support
becomes irrelevant, because the distro is stable and "just works".
Every version of FC I have used has without exception become stable by
this standard within six to eight months of release, certainly well
within a year.  And the problem persists even for the "supported"
distros -- we ran RH 7.3 on nodes way, way past the point where
"support" for 7.3 was in any way relevant -- I think we finally dumped
7.3 somewhere around FC2 or FC4 or thereabouts.

I think that a study of the number of actual updates of any distro,
"supported" or not, would show an exponentially decreasing distribution
of events with all updates in the primary set of applications and
libraries of importance to "most users" being quickly resolved and then
gradually working down to less frequently traversed parts of system
phase space.  Cluster nodes are particularly simple to stabilize here as
they generally OMIT most of this phase space -- no X or X-based apps,
for example.  Indeed cluster nodes generally run little more than a
stripped core with a selection of add-on libraries that DEPEND only on
a stripped core -- a nearly ideal representation of the functional
dependency decomposition I discussed extensively in the previous
message.

Given that a cluster node requires little more than a stable kernel, the
basic libraries, and these add-ons, and that all of these are INVARIABLY
going to be stable long before an FC release stops being supported, the
PRACTICAL difference between FC and RHEL reduces to the number of
devices supported and, yes, the efficiency and quality of the libraries.
Historically RHEL/Centos has lagged well behind the bleeding edge of
hardware except for that brief period where FC and RHEL and Centos are
all the same (once every couple of years, basically, although this rule
appears to be broken as of the latest release of RHEL).  The quality of
the supported libraries has been an even bigger problem -- I literally
couldn't use Centos for many of the projects I've been working on
because they depend on the GSL and the GSL was frozen in RHEL at a
primitive state.  Laptop users are similarly REQUIRED to use FC in
almost all cases because both the supported device list and
NetworkManager were basically broken for two years or more under Centos for
nearly all laptops being sold.  Athlon 64 based systems would frequently
not install under Centos because their chipsets were too new to be
supported by the Centos kernels.  I could go on.

Again, this entire problem can be broken down into a fundamental failure
to properly organize ANY of these distributions in terms of hierarchical
decomposition of dependency.  This problem is getting worse, not better,
and is about to become MUCH worse if FC 7 indeed flattens out dependency
space, with 7-10 thousand packages "in" the distribution, effectively requiring RH
to follow suit within the next two years or face exponentially
increasing costs of providing a subset with acceptable decomposition
planes and supporting just that.  It is further complicated by (as Joe
notes) the lack of support for LSB.

IMO the right way to solve ALL of these problems is to DECOMPOSE FC and
RH and SLES and so on into:

   a) A standard, LSB-compliant minimal core, one that installs quickly
and in a self-contained manner in all cases into a fully functional
system.  Obviously its dependency set is a closed subset of the complete
dependency space -- it builds using only itself, a true bootstrappable
minimum.  Obviously the kernel, python, yum, and libc should probably
live therein.

   b) 1-N second-stage package sets, primarily containing libraries and
development resources.  These package sets would "ideally" be
LSB-compliant and fully decomposed -- each one should contain
dependencies only in the closed set of the core plus the package set
itself.

These are the ONLY things that should have long term support in ALL
distros.  That support should be all but unnecessary after a very short
shakedown period, though, because problems therein will be reflected
basically "everywhere" up the dependency tree.

   c) 1-M third stage package sets containing both libraries and
applications that have relatively "simple" vertical decompositions of
their dependencies -- they can depend freely on any package sets of a
lower level (a or b) and on internal packages, but should avoid
depending on anything at their own level or higher that isn't
self-contained in their own package.  Again, LSB compliance of all
libraries would be lovely, although most of the important library
dependencies are probably already accomplished in a) and b).

Commercial distro vendors will likely support this level for an extended
time, but things like FC probably will not.  It will, however, support
it for long enough to almost certainly stabilize it with respect to the
core!  This is enough because, with the package decompositions in a) and
b) maintained rigorously (and LSB adhered to), fixes in future FC
releases will have a high probability of backporting trivially for
anybody who really needs them, and they are all "elective" packages
that can be managed and tested with yum without risking the
destabilization of a large production facility.  The a-b core stands
alone -- the c-level packages may break, but as a general rule they
won't bring down a system when they do, and they should be easy enough
to fix because they are still decomposed in terms of a small
self-contained list of packages and libraries and the a-b core.

   d) Everything else, with full mixed dependencies but STILL with a
FUNCTIONAL hierarchical organization that no longer strictly reflects
dependencies.  Obviously dependencies should still be self-avoiding (and
hence not form dependency loops!) but this is all STRICTLY elective
application space -- again, problems out here can exist but they should
not destabilize a system, only break the application (set) itself.  This
will ALWAYS be a problem.  Again, only serious commercial distributions
will provide long term support for chunks of this space, and they will
almost always do their damnedest to do so after functionally decomposing
it into at worst an extended c-prime set -- a set of c-level package
sets that close internally in their dependency loops (and hence place
boundaries on the maintenance problem).  d-level packages will generally
NOT be "package sets" in the dependency sense at all -- they will be
individually maintained task-specific packages that either work or
don't, but it isn't the "responsibility" of RH or FC to make them work; it is
the responsibility of the package authors and maintainers or even the
package users.
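The closure rule that runs through levels a) through c) -- every package
set may depend only on the core plus itself -- is mechanically checkable.
A minimal sketch in Python, using an invented toy dependency graph (all
package names and dependencies below are hypothetical, chosen only to
illustrate the rule, not taken from any real distro metadata):

```python
# Check rgb's decomposition rule: a package set at level b) or c) is
# "closed" if every dependency of every package in it falls inside the
# a) core plus the set itself.  Names below are invented for illustration.

def closed_over(pkg_set, deps, allowed):
    """Return the (package, dependency) pairs that escape the allowed set."""
    escapes = set()
    for pkg in pkg_set:
        for dep in deps.get(pkg, ()):
            if dep not in allowed:
                escapes.add((pkg, dep))
    return escapes

# Toy a) core and a toy b)-level math-library package set.
core = {"kernel", "libc", "python", "yum"}
deps = {
    "gsl":     {"libc"},          # fine: depends only on the core
    "lapack":  {"libc", "blas"},  # fine: blas is in the same set
    "blas":    {"libc"},
    "badmath": {"libc", "gtk"},   # violation: gtk is outside core + set
}
mathlibs = {"gsl", "lapack", "blas", "badmath"}

violations = closed_over(mathlibs, deps, core | mathlibs)
for pkg, dep in sorted(violations):
    print(f"{pkg} escapes the decomposition via {dep}")
```

Run against the toy graph, this flags only "badmath", whose dependency
on "gtk" crosses a decomposition plane; the other three packages close
over the core plus their own set, which is exactly the property that
makes a package set safe to support (or drop) as a unit.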

Obviously there is a serious problem with this.  None of the distros
currently have a functional hierarchical decomposition that precisely
reflects this as a strict set of design/decomposition rules, although
e.g. RH and FC do have "core" groups that perhaps come close.  It would
also be initially painful to move things into this sort of hierarchical
decomposition, as I'm certain there are plenty of "critically important"
packages that for a variety of reasons violate this standard.  However,
these packages are PART OF THE PROBLEM and as long as they are there,
they make maintaining linux in a cost effective and scalable way much
more difficult.

Fortunately, it needn't happen overnight.  Studying the problem in
empirical detail, working out a precise standard for the solution, and
starting with the core and working outward, it could be implemented over
a couple or three releases of FC, with the consequence that FC and RHEL
could be RE-UNIFIED by the next RHEL release -- the "a-b core" of
FC-core and the core of RHEL would be identical and very slowly varying;
most of the "FC variation" and toplevel development work would be
happening at the c and d levels and would proceed to stability (or not) in
FC and then be adopted (or not) into RHEL for longer term support, with
a very high probability of backporting or forwardporting most RHEL fixes
to associated FC releases without problem.

> Right now there are projects that are still at FC2  for their stable versions 
> (open-ssi for instance). But FC2 is dead. I think that the dev version of 
> open-ssi is FC3, but it's dead too. 
> I do not run diskless nodes. I test sample hardware for power usage, heat 
> output, and stability before purchasing it in quantity. A working cluster is 
> not the place that I want to experiment with "maybe's." Just because an OS 
> isn't the newest is not a bad thing.

There is no NEED to experiment with a working cluster.  Only with a
cluster you are currently building, and that need is there independent
of the distro used.

> Stability, stability, stability.

     rgb

>
>
> Mike Davis
>
>
>
>
>
> Joe Landman wrote:
>
>> 
>> 
>> Jeffrey B. Layton wrote:
>> 
>>> Robert G. Brown wrote:
>>> 
>>>> On Sun, 15 Apr 2007, John Hearns wrote:
>>>> 
>>>>> And re. the future version of Scientific Linux, there has been debate on 
>>>>> the list re. co-operating with CENTos and essentially using CENTos as a 
>>>>> base, and SL being an overlay of specific application and library RPMs.
>>>>> Pros and cons either way there.
>>>> 
>>>> 
>>>> 
>>>> IMO, most cluster builders will find it more advantageous to track the
>>>> FC releases instead of using RHEL or Centos or things derived therefrom.
>>> 
>>
>>  Only if they are not building clusters for commercial customers, or 
>> customers with specific OS (distro) requirements.  FC simply will not fly 
>> in a shop that demands long term support.  We deal with lots of these.
>> 
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




