[Beowulf] Clusters and Distro Lifespans

Wed Jul 19 10:55:01 PDT 2006

On Wed, 19 Jul 2006, Joe Landman wrote:

> Robert G. Brown wrote:
>> On Wed, 19 Jul 2006, Stu Midgley wrote:
>> 
>>> We also have our install process configured to allow booting different
>>> distros/images, which is useful to boot diagnostic cd images etc.
>> 
>> Good point and one I'd forgotten to mention.  It is really lovely to
>> keep a PXE boot image pointed at tools like memtest86, a freedos image
>> that can e.g. flash bios or do other stuff that expects an environment
>> that can execute a MS .exe, boot into a diskless config for repair
>> purposes (or to bring up a node diskless while waiting for a replacement
>> disk). 
>
> [...]
>
> The tools we set up do all of this, and for those who are brave (or foolish, 
> not sure which) we also have dban ... .  Still working on getting Knoppix to 
> do this; I know it's possible, but I haven't seen docs on how to do it.
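
A PXE menu covering the tools mentioned above might be generated along
these lines.  This is only a sketch: the function name pxe_menu, the image
file names, and the kernel arguments are hypothetical placeholders.  The
only things taken as given are the standard pxelinux configuration
keywords (DEFAULT, PROMPT, TIMEOUT, LABEL, KERNEL, APPEND) and the use of
memdisk to boot a floppy image such as FreeDOS.

```python
def pxe_menu(entries, default, timeout_tenths=100):
    """Render pxelinux.cfg-style text from (label, kernel, append) tuples.

    `timeout_tenths` is in tenths of a second, as pxelinux expects.
    """
    labels = [label for label, _, _ in entries]
    if default not in labels:
        raise ValueError(f"default {default!r} is not one of {labels}")

    lines = [f"DEFAULT {default}", "PROMPT 1", f"TIMEOUT {timeout_tenths}", ""]
    for label, kernel, append in entries:
        lines.append(f"LABEL {label}")
        lines.append(f"  KERNEL {kernel}")
        if append:
            lines.append(f"  APPEND {append}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical image names; substitute whatever lives in your tftp root.
ENTRIES = [
    ("memtest", "memtest86", None),
    ("freedos", "memdisk", "initrd=freedos.img"),
    ("diskless", "vmlinuz-diskless",
     "initrd=initrd-diskless.img root=/dev/nfs ip=dhcp"),
]

menu = pxe_menu(ENTRIES, default="memtest")
print(menu)
```

Keeping the entries in one list makes it cheap to add a label for a
repair image or a wipe tool without hand-editing every node's config.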

Yeah!  Nuking!  Just like those antique DOS postal bombs!

Actually, nuking isn't completely crazed -- if you do a COD-type cluster
design that permits users in a highly heterogeneous, test-heavy
environment to boot into custom disk images, then you effectively give
them low-level access to the node disks.  If the previous user was using
those disks to perform sensitive or valuable research, they might not
appreciate somebody being able to scrape the data out of individual
blocks that were never overwritten...
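
To make the point concrete, here is a minimal sketch of what "nuking"
amounts to: overwrite every byte, then check that nothing readable is
left.  It assumes a regular disk-image file standing in for a node disk;
the names wipe and is_wiped are made up for illustration.  It does a
single pass of zeros, which is not how dban itself works, and a real block
device would need its size determined some other way than
os.path.getsize.

```python
import os

CHUNK = 1024 * 1024  # write in 1 MiB pieces


def wipe(path, chunk=CHUNK):
    """Overwrite every byte of `path` with zeros in place; return its size."""
    size = os.path.getsize(path)
    zeros = bytes(chunk)
    with open(path, "r+b") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # push the zeros out of the page cache
    return size


def is_wiped(path, chunk=CHUNK):
    """Return True if `path` contains only zero bytes."""
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                return True
            if any(block):
                return False
```

Merely deleting files or re-imaging over part of the disk does neither of
these things, which is why leftover blocks remain readable to the next
user who boots a custom image.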

> Sadly, not all distros do yum, nor do all distros have sensible dependency 
> trees, nor even sane/common naming.

Yes, these things are all very sad.  Sad enough to be good reasons not
to use them professionally.  That way, the process of evolution can
selectively favor ones that are less sad, and the overall sadness level
in the Universe will decrease.  This implies at least a moderate degree
of relative growth in the mean gladness index of the Universe (however
much it is being reduced by little things like war, poverty, and
religious intolerance) which I think we'd all agree is a good thing;-)

> SuSE as of 10.0 can work with yum.  We have/host a repo for 
> ourselves/customers.  The problem is that yum is not a first class system 
> tool on SuSE like rug or zmd or whatever.  Which means that there are things 
> that break yum under SuSE that don't break running Yast/zmd/rug.  Grrrrr. 
> (If anyone from SuSE is reading, this was a really bad idea, go to yum, your 
> life, my life, and your customers' lives will be *much* easier).  Well there 
> is that and yum on 10.1 is slightly borked.

Yeah, I've heard that before for sure.  But SuSE is one of those
"commercial" distributions, and their view is doubtless a)
differentiation is good, because their customers become accustomed to
their way of doing things (however screwed up it might be) and will
resist change; b) as long as things "work well enough" they are unlikely
to get enough complaints to "force" them to work past a).  Instead
they'll expend energy continuing to make things work well enough.

Consequently, the universe remains a bit sadder than it might be.  Sob.

> I assume that rare == infrequent.  Basically the argument for production 
> cycle shops is that you don't upgrade unless there is a need to.  That is, 
> stuff could/does break with upgrades, and you have to be really careful. 
> Test test test.  If you need a security patch, I am not sure any production 
> cycle shop considers this an upgrade, but again, test test test.  The rule 
> of thumb that I see followed is "if it ain't broke, don't fix it".
>
> If you install new hardware, you likely need newer kernels and drivers to 
> deal with it (say like SATA and RHEL4 before U1).

I think Mark's point is the best one here.  Production or not, some
places are comfortable living closer to the edge than others.  This is
fine -- it generates a lovely differential economy, where some folks are
risk takers (so to speak) and gain many benefits in exchange for
their risks.  Others are risk averse, but have to live for years with
old hardware and antique kernels because newer hardware literally won't
work without a reasonably modern kernel.  I have several systems in my
house that won't even >>install<< an FC4 system, and have ended up
having to run FC5 (warts and all) as the only game in town.

Running FC5 as Hobson's choice, one would be foolish not to stay
connected to an update chain, as FC5 really NEEDS its updates...;-).

Although honestly, it is by now fairly stable.

    rgb

>
>>> 
>>> You can even do kernel upgrades to the file servers/front end nodes
>>> (which requires a reboot) without killing or disrupting jobs.  Having
>>> complete control has a lot of benefits.
>
> It does, and you often need a fairly competent staff around to make this 
> work.  There is a shortage of Mark Hahns in the world, so not every site 
> can work the stuff he does.  Similarly for other sites.
>
> [...]
>
>> On the whole, though, updates are there for a reason and STABILIZE
>> systems more often than they DESTABILIZE them.
>
> The last Centos 4.3 x86_64 kernel update almost nuked one of our very 
> important servers.  Had to back it out, and thankfully I had backups of the 
> affected files.  Updates are *supposed* to increase stability.  They don't 
> always do that.  Remember that an update is brain surgery; if you treat it 
> as anything less than that, you are going to be burned someday. The folks 
> advising caution are not advising it because they like to be cautious, but 
> because they have been burned before, and they don't want to see others fall 
> into the same behavior that burned them.
>
>>    rgb
>
>
>
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu