[Beowulf] hpl size problems

Wed Sep 28 14:54:01 PDT 2005

In re: the xfs example:

xfs 2648 0.0 0.1 11804 1336 ?  Ss Jun20 0:00 xfs -droppriv -daemon

or the accumulation of less than a second of CPU time since June 20 when
the node was rebooted after its FC4 upgrade.  In fact, on this
kickstart-installed X86_64 (dual opteron) node all the processes but
cluster-specific processes (applications or monitoring tools) put
together have used less than 30 minutes of CPU time over three months.
Almost all of THAT is by one daemon (hald) that is, as it turns out,
probably not necessary.  So in our next cluster kickstart we leave it
off (or turn it off) or we use one of our drop-in tools to have it
turned off/removed overnight.  xfs is indeed a perfect example -- of
something that it isn't worth worrying about.  It's not clear that it's
worth the effort to even turn it off.
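
For anyone who wants to put a number on "not worth worrying about", here is a minimal sketch in Python. It assumes `ps aux`-style lines, where the tenth column is cumulative CPU time in the usual "MMM:SS" form. The function names are made up for illustration, and the 100 days is simply June 20 to September 28 counted off the calendar.

```python
# Rough sketch: how much CPU has the non-cluster "overhead" actually used?
# Assumes ps aux column order:
#   USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND...
# and a TIME field in "MMM:SS" form.  Helper names are hypothetical.

def bsdtime_to_seconds(field):
    """Convert a ps 'MMM:SS' cumulative CPU time field to seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + int(seconds)


def cpu_seconds(ps_line):
    """Pull the cumulative CPU time (in seconds) out of one ps aux line."""
    # Split at most 10 times so the command and its arguments stay together.
    fields = ps_line.split(None, 10)
    return bsdtime_to_seconds(fields[9])


def overhead_fraction(cpu_secs, uptime_days, ncpus=1):
    """Fraction of the available CPU time consumed over the uptime."""
    return cpu_secs / (uptime_days * 86400 * ncpus)


xfs_line = "xfs 2648 0.0 0.1 11804 1336 ?  Ss Jun20 0:00 xfs -droppriv -daemon"
xfs_secs = cpu_seconds(xfs_line)

# 30 minutes of CPU over roughly 100 days on a dual-CPU node.
everything_else = overhead_fraction(30 * 60, uptime_days=100, ncpus=2)
```

With those inputs, xfs comes out at zero whole seconds and the 30 minutes for everything else works out to about a hundredth of a percent of the node's available CPU time, under the assumptions above.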

In re: the comment on e.g. installation time:  Most of the clusters I've
run have spent anywhere from 70-80% to 100% of their time running a handful of
applications.  Sometimes only one application (the primary research tool
of the cluster owner).  The efficiency of that application has generally
not been a strong function of operating system -- we put a reasonably
but not pathologically stripped image on the nodes and then just plain
forget them (aside from yum-driven automagical updates of their
installed base) for as long as their entire service lifetime, although
more often we'll update them once or twice, when convenient.  So while
very short install times, instant dynamical provisioning and perfectly
minimal application support sounds very reasonable (and is, for the most
part, what the kernel already does, right?) in practice it would save us
pretty much zero time and very likely wouldn't make our CPU-bound code
run significantly faster.  YMMV, of course -- I'm not claiming that our
operation is typical or ideal.

Then, to continue...

Donald Becker writes:

>> So I agree, I agree -- thin is good, thin is good.  Just avoid needless
>> anorexia in the name of being thin -- thin to the point where it saps
>> your nodes' strength.
> 
> You've got the wrong perspective: you don't build a thin compute node from 
> a fat body.  You develop a system that dynamically provisions only the 
> needed elements for the applications actually run.  That takes more than a 
> single mechanism to do correctly, but you end up with a design that has 
> many advantages.  (Sub-second provisioning, automatic update consistency, 
> no version skew, high security, simplicity...)
> 
> -- 
> Donald Becker				becker at scyld.com
> Scyld Software	 			Scyld Beowulf cluster systems
> 914 Bay Ridge Road, Suite 220		www.scyld.com
> Annapolis MD 21403			410-990-9993

Ah!  THAT'S the problem.  I have the wrong perspective.

I knew it was something like that...;-)

Just to be clear about my mistake, if one builds a cluster that is a
single computer as you describe, either it is on an open network or it
is inside a head node that is a de facto firewall.  Since the motivation
for such a cluster is often to enable it to scale effectively with
tightly coupled code, one usually chooses the latter.  In such a cluster
(as you've pointed out a number of times over the years) one doesn't
think of the cluster as having "nodes" that you can login to, one thinks
of it being a single computer you can login to with lots of CPUs and a
message passing paradigm for IPCs across the CPUs.  In such a cluster
internal security is usually all but nonexistent, because security costs
performance and because it is a "computer", all real security lives on
the head node.  One doesn't provide internal interCPU security in an MP
operating system as processor 3 isn't generally trying to take over
processor 4 -- it is more a question of who owns the kernel and root
process structure that controls what runs on ANY processor.

You work very hard (usually) to create such a system that scales well
and maintain it on top of or on the side of an update stream connected
back to the kernel, to the libraries, to hardware drivers, to essential
tools and utilities, and to provide users of the system with specialized
tools and utilities to replace standard tools and utilities to provide
the illusion (or rather the layered, network-implemented reality) of a
single MP computer.  This work doesn't end -- you are basically
supporting a separate distribution that shares memes with linux/gnu and
you MUST co-develop with the base code set or in a couple of years
you're screwed as the code base you rely on to not have to reinvent or
coinvent numberless wheels diverges.  Very labor intensive, very
expertise dependent -- not a lot of people CAN do the work.  Very, if
you like, "expensive" in every sense of the word, but justified by the
absolute benefits it provides to users of the appropriate kind of code
and maybe (less certainly) by the relative benefits it can confer in
terms of ease of use or installation.  

Where "less certainly" is because there you have to do a CBA against all
the various competing paradigms, and there are a lot of them.  Not an
easy assessment given wildly varying real costs, opportunity costs, and
cost scalings per environment per paradigm.  Given also a rather wide
range of applications that one might run on a cluster where the marginal
benefits (in terms of speedup, time to completion of a project, etc.)
are likely to be very slender or even (in for example the case of an
application that doesn't easily build and run because of divergence from
or a lack of support for some tool or library available in one of the
competing distributions) negative benefits, at least pending the
investment of still more labor. Which economically can easily turn out
to be more labor than it is worth for a small enough class of tasks,
which leads to yet ANOTHER economic problem -- do you invest your
limited resources working to support a small class of applications that
cannot in some sense "pay their own way" in terms of overall
contribution to cost-benefit...

In competition with this there is (primarily) the "use an existing
operating system distribution for your cluster" paradigm, which can be
implemented on top of pretty much any Unixoid operating system
distribution or even (shudder) on WinXX.  This approach deliberately
minimizes the utilization of cluster-specific modifications of the
actual core operating system -- kernel, libraries, binaries,
documentation -- because in this way it avoids the immense cost of that
divergence.  It seeks instead to add "portable" cluster tools on TOP of
the distribution -- ideally tools that will build and run on ANY
distribution with minimal hacking and #ifdef'ing and autoconf'ing (a
fantasy that of course sometimes works and sometimes doesn't:-).  This
general paradigm is broad enough to encompass fat nodes and thin nodes,
diskful nodes and diskless nodes, and even internal protected
synchronous clusters with no internal security and clusters that span
Internet-wide authentication and administrative domains where security
is a sine qua non.  BECAUSE the "cost" of the approach is deliberately
held to a minimum and there is a reasonable degree of portability in the
cluster toolset, there is much competition between fat and thin, between
diskless and diskful, between "beowulf" and NOW/COW and Grid.  

With this degree of competition one gets many benefits.  

First, a would-be cluster builder can choose what makes sense to them in
their particular environment for their particular application space.
There is a high probability that they'll be able to build a cluster that
uses any hardware supported by the general distribution(s) that might
form a base, with any library(s) that it might require, with
installation, administration, maintenance, and authentication mechanisms
that are familiar to them because of their EXISTING investment in e.g.
supporting a LAN on top of some distribution.  This in turn may minimize
retraining costs or permit them to leverage existing resources or may
permit them to comply with some externally imposed constraint: the use
of some particular authentication mechanism, integration with some
organization-wide resource, the ability to run particular applications
with strong constraints on their support libraries.

Where the latter is no joke -- I'm not KIDDING about the RH 7.1 in the
ATLAS gridware, even though it SEEMS like a joke.  Maybe they've finally
bitten the bullet and started the arduous process of porting their code
to a modern linux.  Maybe not.  Really, this is just another example of
the danger of forking something off into a separately maintained branch
without the resources to properly codevelop it with the main branch.
If you don't CONSTANTLY feed the source base money and human time, you
ultimately pay even more to realign them.

Second, the competition in an open source environment with free user
choice (where they vote by using or not using anything that is developed
and offered up by developers) is a near-optimal genetic optimization
process.  Code (memes) are shared.  Successful code fragments survive
and are written into more complex applications and spawn new species of
application; unsuccessful ones or obsolete ones are gradually pruned
from the tree.  Warewulf is but one example although there are other
emerging diskless linuces and it may well be that in a year or two
diskless linux will emerge as the paradigm of choice for LAN
workstations.  Cross-fertilization works both ways.  CaOSity, Scientific
Linux, Mandriva, FC, RHEL, SuSE, Debian -- you can build a cluster on
top of any or all of these, using (in many cases) familiar tools and
installation/maintenance paradigms AND being certain that any advance or
improvement or bugfix or security patch that makes it into your base
distro will make it into your cluster as well.  

Third, the alternatives have very different up-front costs so that at
least some of the alternatives are not de facto exclusionary -- too
expensive for students, for cluster builders in the third world, for
hobbyists.  Exclusionary expense is one of the biggest PROBLEMS with the
high cost of divergence -- only the "rich" can afford it, so it ends up
being useful only to those with high-benefit problems in real cash
terms.  Distributions that serve as the base for clusters are such that
at any given time a few of them cost money and have sufficiently large
installed bases and associated revenue streams that they attract
commercial developers and can get certified for various kinds of
government restricted usage.  Others are free and more suited for
hacker/hobbyists.  Some are in between.  The key thing to remember is
that hacker/hobbyists with limited disposable resources for software
BUILT linux and most of gnu and a lot of the existing cluster support
software -- in part because they didn't want to or weren't able to
afford the cost of paying for the commercial alternatives of the day --
and continue to be a rich source of new ideas and applications today.
So this isn't a bad thing.

So finally -- my perspective is, in a nutshell, this last statement.

This isn't a bad thing.  

In my opinion, when one considers the range of applications to be run on
ANY kind of cluster, the range of administrative expertise likely to be
available to a would-be cluster builder, the costs and benefits of the
two primary cluster paradigms (on top of a minimally diverged distro
mostly in userspace or as a significantly deviated and customized
codeveloped branch with significant alterations in the kernel and
rootspace), most builders of and users of clusters will benefit from
sticking close to a distro and THEN making their choices -- diskless or
not, network isolated or not, rsh or ssh, and so on.  That is, building
a cluster on "top" of a standard distro, as fat or thin as they like,
with minimal divergence (most clusterware on TOP of the distro, as in
warewulf).

Whether they run the cluster as a "beowulf" with a batch queueing system
and a head node or as an undifferentiated NOW, MOST of their operating
environment will update from the primary tree/mirrors of their
distribution and will hence evolve along with that distribution (which
rate may be fast or may be slow according to what they choose) and
remain as patch-current as its maintainers can keep it with a much
larger user base participating.

Heck, I think that this is a GOOD thing, and that most, but not all,
would-be cluster builders are well served by it.

    rgb