[Beowulf] Clusters and Distro Lifespans

Wed Jul 19 10:03:13 PDT 2006

Robert G. Brown wrote:
> On Wed, 19 Jul 2006, Stu Midgley wrote:
> 
>> We also have our install process configured to allow booting different
>> distros/images, which is useful to boot diagnostic cd images etc.
> 
> Good point and one I'd forgotten to mention.  It is really lovely to
> keep a PXE boot image pointed at tools like memtest86, a freedos image
> that can e.g. flash bios or do other stuff that expects an environment
> that can execute a MS .exe, boot into a diskless config for repair
> purposes (or to bring up a node diskless while waiting for a replacement
> disk).  

[...]

The tools we set up do all of this, and for those who are brave (or 
foolish, not sure which) we also have dban ... .  Still working on 
getting Knoppix to do this; I know it's possible, but I haven't seen 
docs on how to do it.
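
As a rough illustration of the multi-image PXE setup discussed above, 
here is a minimal sketch that renders a pxelinux-style menu.  The image 
paths, the server address, and the helper name render_pxe_menu are all 
made up for the example; only the DEFAULT/PROMPT/TIMEOUT/LABEL/KERNEL/
APPEND keywords and the memdisk trick for booting a floppy image are 
standard pxelinux usage.

```python
# Sketch only: every path, label, and address below is a hypothetical placeholder.

def render_pxe_menu(entries, default, timeout_tenths=100):
    """Return the text of a pxelinux.cfg/default style boot menu.

    entries: list of dicts with 'label', 'kernel', and optional 'append'.
    default: label booted when the timeout expires.
    timeout_tenths: pxelinux TIMEOUT value, in tenths of a second.
    """
    labels = [e["label"] for e in entries]
    if default not in labels:
        raise ValueError(f"default {default!r} is not one of {labels}")

    lines = [f"DEFAULT {default}", "PROMPT 1", f"TIMEOUT {timeout_tenths}", ""]
    for e in entries:
        lines.append(f"LABEL {e['label']}")
        lines.append(f"  KERNEL {e['kernel']}")
        if e.get("append"):
            lines.append(f"  APPEND {e['append']}")
        lines.append("")
    return "\n".join(lines)


ENTRIES = [
    # Memory tester booted directly as a kernel image.
    {"label": "memtest", "kernel": "images/memtest86"},
    # FreeDOS floppy image booted through memdisk (e.g. for BIOS flashing).
    {"label": "freedos", "kernel": "memdisk",
     "append": "initrd=images/freedos.img"},
    # Diskless repair environment with an NFS root (192.0.2.1 is a
    # documentation-only address).
    {"label": "rescue", "kernel": "images/vmlinuz",
     "append": "initrd=images/initrd.img root=/dev/nfs "
               "nfsroot=192.0.2.1:/export/rescue ip=dhcp"},
]

MENU = render_pxe_menu(ENTRIES, default="memtest")
```

Writing MENU to pxelinux.cfg/default under the TFTP root would offer all 
three images at the boot prompt.  Defaulting to something harmless like 
memtest, rather than an installer or a disk wiper, seems the safer 
choice.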

> Honestly, for MOST work people do with clusters, running pretty much the
> (PXE-installable) distro of your choice will almost certainly work.  I
> tend to use FC-even or Centos (a.k.a. FC-even-frozen) on cluster nodes
> simply because we have long since gotten to where we can make RH-derived
> distributions jump through hoops.  With Seth Vidal in charge of the core
> mirrors and repos, Duke is "Repo World" not just to campus but to much
> of the world.  Heck, I PXE-boot and kickstart install my systems at
> HOME using mirrors of the duke repos, and if I ever bothered to figure
> out Icon's toolset for customizing kickstart boots per system (using
> some very clever CGI scripts and a bit of XML) it would make those
> installs even easier than they are now.

Sadly, not all distros do yum, nor do all distros have sensible 
dependency trees, nor even sane/common naming.

SuSE as of 10.0 can work with yum.  We have/host a repo for 
ourselves/customers.  The problem is that yum is not a first class 
system tool on SuSE like rug or zmd or whatever.  Which means that there 
are things that break yum under SuSE that don't break running 
Yast/zmd/rug.  Grrrrr.  (If anyone from SuSE is reading, this was a 
really bad idea; go to yum, and your life, my life, and your customers' 
lives will be *much* easier).  Well, there is that, and yum on 10.1 is 
slightly borked.

>>> > iii) Do people regularly upgrade their clusters in relation to
>>> > distros?  I guess this is like asking how long is a piece of string
>>> > because everyone's needs are different.
>>>
>>> Cluster upgrades are rare unless you are missing functionality or
>>> something is broken.  That is of course one opinion, some here do
>>> upgrades nightly.  From a purely production oriented viewpoint, where
>>> downtime == lost money for our customers, we usually advise against 
>>> that.
>>
>> I think rare is a strong word.  Infrequent may be better.  We
>> regularly apply patches and upgrades to the front end nodes (globally
>> connected) and infrequently (~ every 6 months) upgrade all the cluster
>> nodes in the rolling fashion mentioned above.

I assume that rare == infrequent.  Basically the argument for production 
cycle shops is that you don't upgrade unless there is a need to.  That 
is, stuff could/does break with upgrades, and you have to be really 
careful.  Test test test.  If you need a security patch, I am not sure 
any production cycle shop considers this an upgrade, but again, test 
test test.  The rule of thumb that I see followed is "if it ain't 
broke, don't fix it".
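
One way to picture the "test test test" approach combined with the 
rolling upgrades mentioned in the quoted text is the sketch below.  It 
is only an illustration: rolling_upgrade, and the upgrade and healthy 
callables it takes, are hypothetical names standing in for whatever a 
site actually uses to push packages and run acceptance checks.

```python
# Sketch only: upgrade() and healthy() are caller-supplied placeholders.

def rolling_upgrade(nodes, upgrade, healthy, batch_size=4):
    """Upgrade nodes batch by batch, halting at the first failed check.

    Returns (upgraded, failed): the nodes that were upgraded and passed
    their check, and the nodes of the batch that failed it.  Nodes in
    later batches are never touched once a failure is seen.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    upgraded = []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        for node in batch:
            upgrade(node)
        bad = [n for n in batch if not healthy(n)]
        if bad:
            # Stop here so the damage is limited to a single batch.
            upgraded.extend(n for n in batch if n not in bad)
            return upgraded, bad
        upgraded.extend(batch)
    return upgraded, []


# Example run with stand-in functions: node "n4" fails its check.
touched = []
done, failed = rolling_upgrade(
    ["n1", "n2", "n3", "n4", "n5", "n6"],
    upgrade=touched.append,
    healthy=lambda n: n != "n4",
    batch_size=2,
)
```

In the example, n5 and n6 are never upgraded because the second batch 
failed.  Using a small first batch as a canary keeps most of the 
cluster running jobs if the update turns out to be bad.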

If you install new hardware, you likely need newer kernels and drivers 
to deal with it (say like SATA and RHEL4 before U1).

>>
>> You can even do kernel upgrades to the file servers/front end nodes
>> (which requires a reboot) without killing or disrupting jobs.  Having
>> complete control has a lot of benefits.

It does, and you often need a fairly competent staff around to make this 
work.  There is a shortage of Mark Hahns in the world, so not every 
site can do the stuff he does.  Similarly for other sites.

[...]

> On the whole, though, updates are there for a reason and STABILIZE
> systems more often than they DESTABILIZE them.

The last Centos 4.3 x86_64 kernel update almost nuked one of our very 
important servers.  Had to back it out, and thankfully I had backups of 
the affected files.  Updates are *supposed* to increase stability.  They 
don't always do that.  Remember that an update is brain surgery; if you 
treat it as anything less than that, you are going to be burned someday. 
The folks advising caution are not advising it because they like to be 
cautious, but because they have been burned before, and they don't want 
to see others fall into the same behavior that burned them.
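
Having backups of the affected files is what made backing out possible 
here.  A minimal sketch of that habit follows, assuming you know in 
advance which files an update might touch.  The function names 
backup_files and restore_files are invented for the example, and this 
is not a substitute for a real backup system or for the package 
manager's own rollback support.

```python
import shutil
from pathlib import Path

# Sketch only: which files to protect is an assumption the caller supplies.

def backup_files(paths, backup_dir):
    """Copy each existing file into backup_dir; return {original: copy}."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, path in enumerate(map(Path, paths)):
        if not path.is_file():
            continue  # nothing to save for missing paths
        # Prefix with an index so two files with the same name don't collide.
        copy = backup_dir / f"{index:03d}_{path.name}"
        shutil.copy2(path, copy)  # copy2 also preserves timestamps and mode
        saved[path] = copy
    return saved


def restore_files(saved):
    """Put every backed-up copy back over its original location."""
    for original, copy in saved.items():
        shutil.copy2(copy, original)
```

Calling backup_files before the update and restore_files only if the 
update misbehaves gives a cheap way to back it out.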

>    rgb

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615