[Beowulf] OS for 64 bit AMD

Joe Landman landman at scalableinformatics.com
Sun Apr 3 17:55:06 PDT 2005



Mark Hahn wrote:
> this is utterly pointless, since we seem to disagree on axioms:

Not really.  We disagree on basic definitions.  Axioms are accepted 
"truths".

> correct code conforms to the standard; it is buggy if it depends 
> on undefined (outside-the-standard) behavior.

We agree on this.

> the platform is the ABI, not the distribution.  if you believe that 
> the ABI doesn't cover enough, talk to the organization that manages it.

We disagree on this.  This is not an axiom.  RH9 is the prototypical
case: it changed the ABI in a manner incompatible with an existing,
functional ABI.  At that point the platform became the distribution, as
commercial vendors target the platforms (specifically RH) with the
largest installed base.  If the Linux platform were truly
distribution-independent, then it would not matter what it was compiled
for, and frankly vendors would not need to QA against multiple
distributions, as the ABI would be enough.  Unfortunately this is not
how it works.  I would like it to work like this.  It would be great if
the LSB did in fact require certification, and if application vendors
were required to code to certification levels (I have been arguing this
for years).  Not likely to happen, but it would be nice.
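
As an aside, the RH9 break I have in mind is the one usually attributed
to the glibc switch from LinuxThreads to NPTL; that attribution is my
reading of it, not something stated above.  A minimal sketch of how an
application can ask glibc which threading implementation it is actually
running against:

#define _GNU_SOURCE            /* expose _CS_GNU_LIBPTHREAD_VERSION */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    char buf[128];

    /* glibc >= 2.3.2 reports the pthread implementation it provides */
    if (confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, sizeof(buf)) > 0)
        printf("threading implementation: %s\n", buf);
    else
        printf("threading implementation: unknown\n");
#else
    printf("this libc does not report a pthread implementation\n");
#endif
    return 0;
}

On an NPTL system this reports something like "NPTL 2.3.x"; a
LinuxThreads-based glibc new enough to have the call reports
"linuxthreads-x.y" instead.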

> productionworthiness (PW) is behavioral stability, not some vendor's 
> assertion about "support".

It is *long term* behavioral, driver, and interface stability.
Changing an ABI midway through (4k stacks) is *not* behavioral
stability.  You have no real reason to expect a code to work correctly
when you alter one of the critical underlying structures that it relies
upon.  Many drivers rested on 8k kernel stacks; that was in the ABI as
a (de facto) standard.  RHEL3 did not (and properly so) change its
underlying kernel structures in such a way as to render some portions
of the system unworkable.  RHEL4 is not likely to change its underlying
kernel structures in such a way as to render some portions of the
system unworkable.  FC-x is likely to change (and has changed) its
underlying kernel structures.
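
To make the 4k stack point concrete, here is a schematic sketch (plain
user-space C, not a real driver, and the buffer/overhead numbers are
made up for illustration) of why code budgeted against an 8 KB kernel
stack can stop fitting when the stack shrinks to 4 KB (the
CONFIG_4KSTACKS option on i386):

#include <stdio.h>

#define OLD_KERNEL_STACK    (8 * 1024)  /* traditional x86 kernel stack    */
#define NEW_KERNEL_STACK    (4 * 1024)  /* 4k-stacks kernel                */
#define DRIVER_SCRATCH      5000        /* hypothetical on-stack buffer    */
#define CALL_CHAIN_OVERHEAD 1024        /* assumed frames above the driver */

int main(void)
{
    int used = DRIVER_SCRATCH + CALL_CHAIN_OVERHEAD;

    printf("stack used by hypothetical driver path: %d bytes\n", used);
    printf("fits in 8k stack: %s\n", used <= OLD_KERNEL_STACK ? "yes" : "no");
    printf("fits in 4k stack: %s\n", used <= NEW_KERNEL_STACK ? "yes" : "no");
    return 0;
}

The arithmetic is the whole point: the same consumer that fit
comfortably in 8k simply does not fit in 4k, and nothing in the
driver's own code changed.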

> there is no data to suggest that a "supported" configuration 
> is actually more stable - support is a matter of CYA and risk aversion. 
> (not the actual risk; PW is the actual risk (well, inverse of it).)

It may in fact be less stable, though the likelihood is that a
supported configuration is more conservative (which makes support
easier), so the implication is that it is more stable.  The fact is
that supported configurations are fundamentally averse to changing the
underlying internals.  This is not the case in FC-x (nor should it be,
given its purpose).

> Fedora has normal release management, with pre-release testing
> as well as post-release updates.  the pre-release testing is 
> also known as "beta-testing".

So you have pre-release testing as "beta-testing", but you deny that a
"proving ground" is beta-testing?  That seems to be the same side of
the coin.  Having normal release management does not a
production-quality system make.  It is most definitely one of the
requirements for such a system, but it does not, in and of itself, make
the OS a production-class OS.  A reasonable definition of a
production-class OS will likely incorporate inherent stability of the
underlying structures of the system, and a guarantee that they will not
change for some fixed interval.  Production specifically implies
repetitive behavior; for HPC, specifically, a cycle shop.  If the next
incompatible change in FC-x renders the IB drivers for your cluster
unworkable, does that make the OS you have installed on the system
production-ready or not?  If you have to continuously chase
hacks/patches/etc. to keep your system operational after every upgrade,
does that make your system production-ready?

> 
> --
> 
> the existence of commercial products which specify RH-whatever vX.Y
> does not magically turn FC into a beta-test.  if you redefine words
> that way, you might as well call all of SunOS a beta for Solaris.

Er... you are the only one who indicated this, so if you want to argue
it, I would suggest you contact the person who generated the idea (that
commercial products dependent upon RH make FC a beta test), who can be
found at hahn _at_ physics _dot_ mcmaster _dot_ ca.

I said "My customers care about running on distributions (whoops, there 
we go with that word again) on which their apps are supported.  I am not 
aware of active support for FC-x for applications from commercial 
program providers.  If I am incorrect about this, please let me know 
(seriously, as FC-3++ looks to be pretty good)."   Prior to this I said 
"It is by Redhat's definition, a rolling beta (proving ground)."  The 
two are specifically independent ideas.  I know of few commercially 
supported applications that will accept support calls from FC-x running 
users.

Note:  Debian has very little in the way of commercial support (none
from the distributor).  It is most definitely not a beta.  You can use
the beta version in unstable.  This is analogous to Fedora.

What makes FC a beta is that Redhat itself specifically notes that it
is using Fedora as a "proving ground" (cf.
http://dictionary.reference.com/search?q=proving+ground ), as in "It is
also a proving ground for new technology that may eventually make its
way into Red Hat products." (from http://fedora.redhat.com/ ).  From
the reference.com site: "proving ground, n.  A place for testing new
devices, weapons, or theories."  Would you call a system that is
defined by its maker to be a proving ground a production environment
(i.e. stable, unchanging)?

> the customer needs to evaluate how fragile a commercial product is:
> how well it conforms to the ABI.  NVidia is a great example of 
> an attractive product which is inherently fragile since NVidia 
> chooses to hide trade secrets in a binary-only, kernel-mode driver
> which (by definition and example) depends on undefined behavior.  
> VMWare is another good (flawed) example.

Hmmm.  I hear this argument time and again from people about the
closed-source nature of nVidia's drivers.  nVidia does not (as far as I
know) own all the intellectual property in their driver, and they do
not have the right to give that IP away via the GPL or any other
mechanism.  The fundamental flaw in the arguments against the nVidia
driver is an inherent presumption that nVidia is hiding trade secrets
in order to make its life better and get end-user lock-in.  The
behavior it (the driver) depends upon had been built into the kernel,
and when that behavior suddenly changed, nVidia's was not the only
driver affected.  Many open-source drivers were impacted.  Are you
going to argue that this makes them (the open-source drivers)
inherently fragile?  That is a natural extension and simple application
of your argument.  This is a weak argument at best, and some of its
fundamental premises are fatally flawed.  If nVidia owned all the IP in
everything they released, and chose simply to release binary-only
drivers, that would be a completely different case.  Unfortunately, a
fair amount of the IP in OpenGL and other related standards is owned by
companies that have no interest in open source other than demolishing
it.  SGI sold off most of its IP in OpenGL to another outfit.

> "supported configuration" is nothing more or less than a way to 
> "download" support costs to the platform vendor (PV).  it's a lever,
> acting on the customer as a pivot, to force the PV to avoid changes
> of any sort, since its impossible to tell what internals the proprietary
> product depends on.

Uh....  I think we disagree again.  A supported configuration is
something that a customer, an end user, or a developer should have a
reasonable, fighting chance of getting to work right.  This means that
the internals exposed to developers (including driver developers) will
not change.  This means that end users and customers have a reasonable
expectation that a configuration on the supported list should work, and
the onus is on the platform vendor (nice to see you switched to the
definition of platform that I was using, BTW) to make it work without
breaking other stuff.

> similarly, SOP in the Fibrechannel world is to provide only negative
> definitions of support (nothing but HP disks in HP SANs.)  this can be 
> seen as a flaw in standard-defining, since Ethernet provides a fairly
> decent counterexample where interoperability is the norm because 
> products need to conform, not "qualify".

A standard is only useful if people pay attention to it and
engineer/design/build to it.  Standards are very useful to developers:
if they code in a manner that adheres to the standard, they have a
fighting chance of developing something that will work.  If the
standard suddenly changes on them, and their stuff breaks, whom do they
turn to?  If the target is moving, how much time/effort will they
expend to chase it?
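
As a rough illustration of what "coding to the standard" can look like
in practice (a sketch, nothing more): request a specific POSIX level
through a feature-test macro and ask the runtime what it actually
provides, rather than poking at any vendor's internals:

#define _POSIX_C_SOURCE 200112L   /* ask for POSIX.1-2001 interfaces */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ver = sysconf(_SC_VERSION);   /* POSIX level the system reports */

    if (ver >= 200112L)
        printf("POSIX.1-2001 or later reported (%ld)\n", ver);
    else
        printf("older POSIX level reported (%ld)\n", ver);
    return 0;
}

Nothing here depends on which distribution is underneath; that is the
fighting chance the standard buys you.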

In some cases (development tools) it makes sense to chase specific
moving targets (though it costs time/effort and therefore real money).
In other cases it makes sense to wait for stable releases where things
will not change, so your customers/end users can get your stuff and
make it work, because you then have a fighting chance of making it
work.

Greg's company (and the folks at the Portland Group) have to chase
these targets... many of their customers are there.  (I'd bet that a
small fraction of their collective customer base is using the
development tools to generate commercial code; most are using the tools
for their own research/development tasks.)

Yeah, there are significant interoperability problems in things like
SANs and whatnot.  These are unfortunate.  This is part of the reason
why I try to avoid such things (I don't like vendors locking me in, and
I know my customers don't like being locked in, so I don't waste my
company's time trying to figure out how to do this).  Don't assume that
a company's or end user's misapplication of a standard, hijacking of a
standard, or abuse of a standard somehow makes all standards bad.  They
are not.  Standards are sometimes the only lever you have in a
commercial closed-source context... demanding that a company adhere to
what it claims to sell is sometimes a necessary path.  Interoperability
means that when people interpret the standards, all parties agree on
the definitions; that they guarantee their products will in fact
conform to the standard; that there will be tests of standards
compliance; that out-of-compliance systems will be adjusted to be in
compliance; and that interoperability with other standards will be
guaranteed.  This is why IDE, SCSI, and Ethernet work so well, and why
some others do not.  IB is likely to work quite well going forward.
This is why the SAMBA folks are chasing a moving target, as the CIFS
"standard" is a moving one (just go ahead and update that XP box with a
SAMBA server around .... grrrrr).

I like and use FC-x; we run FC-2 and FC-3 on various machines (AMD64,
my laptop as part of a triple boot, and x86).  I make sure our software
runs on them; we compile and test on FC as well as on others
(RH/CentOS, SuSE, and we are looking at Ubuntu/Debian).  I am happy
that our binary packages seem to work nicely across multiple
distributions (though we usually bring the source along to be sure),
and our large systems are built from source, so they should work (as
long as the underlying technology works).  Our software works at a high
level and depends upon lower-level bits.  I don't see the effect of the
OS changes as much as the tool/hardware vendors do, though every now
and then something breaks a driver.  But, and this is the critical
point for us, if our software breaks at our customer's site, we own the
fixes; it is our job to make them happen.  More importantly, if
something breaks in the chain of software (whether we own it or not),
we try to help, as it is critical to make sure that failure modes are
understood and problems are resolved.  We have been and will be helping
our customers resolve problems with third-party software, commercial
and otherwise.  If our target platform were moving, so that the C
compiler structures were changing and we had to rebuild time and time
again with each OS update, I would wait until we saw this settle out.
Otherwise we are spinning our wheels, as each change is more work, and
in the end it should converge to a final state.  It is the final state
that is worth targeting (for us; others such as PathScale have to
follow what their customers use).
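
For what it is worth, the sort of check a binary package can do to make
those dependencies visible looks roughly like this (a glibc-specific
sketch of the general idea, not something we actually ship): record the
compiler the binary was built with and report the C library it finds at
run time, which is where cross-distribution binary compatibility
usually succeeds or fails.

#include <stdio.h>
#include <gnu/libc-version.h>   /* gnu_get_libc_version(), glibc only */

int main(void)
{
#ifdef __GNUC__
    /* compile-time record of the compiler used to build this binary */
    printf("built with gcc %d.%d\n", __GNUC__, __GNUC_MINOR__);
#endif
    /* run-time report of the C library actually underneath us */
    printf("running against glibc %s\n", gnu_get_libc_version());
    return 0;
}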

The issue with FC-x is that it is open to its internals changing.  I
think this is a good thing.  It is doing what it was intended to do,
and I like seeing the directions I need to worry about going forward.
I will not likely deploy it as an OS for a cluster customer without the
customer understanding exactly what they are getting, and without
making sure they understand what is needed to support it.  If they
really want a cheap RH, they can get CentOS/Tao.  If they want internal
structural stability, and support from commercial vendors for their
commercial codes, they will have to run something that the commercial
vendors will support.  PathScale and possibly the Portland Group (and I
am going to guess Etnus and a few others) do or will likely support it.
LSTC, MSC, Accelrys, Tripos, Oracle, ... likely will not (though their
codes will probably run fine with no issues).

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615


