disadvantages of linux cluster - admin

alvin at Maggie.Linux-Consulting.com
Wed Nov 6 15:56:13 PST 2002


hi ya robert

On Wed, 6 Nov 2002, Robert G. Brown wrote:

> On Tue, 5 Nov 2002 alvin at Maggie.Linux-Consulting.com wrote:
> 
> > - one person should be able to maintain 100-200 servers ... within an 8hr
> >   period .... ( to hit reset if needed and boot it properly )
... 
> > -- think google .. they have 10,000 machines... and at 5,000 servers,
> >    they still had 5-10 people maintaining those 5,000 servers
> 
> The limiting element (we've found) in either LAN or cluster is not
> software scaling at all.  We have OS installation down to a few minutes
> of work, and once installed tools like yum automate most maintenance.

I'd say ... one needs to have an "admin policy" put into place that all
users and root-admins must follow

> It is hardware, humans, and changes. 

hardware problems are sorta fixable ...
	- as you say... don't buy cheap hw to save a buck-or-two
	( it's NOT worth it )

	- buy stuff you're comfy with, so you won't be woken up at 3AM
	when the hw fails

	- a hardware model's retail lifespan is about 3 months.. after that,
	your favorite motherboard is out-of-stock or discontinued..

humans ... ( trainable and teachable and avoidable issues )
	- nobody has root passwd unless they are the ones receiving
	the 3AM phone calls to come fix the machines "now"

	- give users an html web page to clicky-t-click to their
	heart's content to run the jobs, or "command lines" that do the
	same

	- harden the server from the user standpoint
		- remove /usr/sbin and /usr/local/sbin from user access
		- no user has root passwd
		- remove passwd command, remove tar, remove make/gcc...
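
a rough sketch of that lockdown step ( my assumptions: stock redhat-ish
paths, and stripping group/other permissions instead of an outright "rm"
so root can still use the tools ... swap in os.remove() if you really
want them gone ):

#!/usr/bin/env python
# lockdown.py -- hypothetical sketch: pull admin dirs and a few
# user-abusable binaries out of ordinary users' reach
import os, stat

ADMIN_DIRS = ["/usr/sbin", "/usr/local/sbin"]
RISKY_BINS = ["/usr/bin/passwd", "/bin/tar",
              "/usr/bin/make", "/usr/bin/gcc"]

for d in ADMIN_DIRS:
    if os.path.isdir(d):
        # rwx for root only; users can no longer list or run anything here
        os.chmod(d, stat.S_IRWXU)

for b in RISKY_BINS:
    if os.path.exists(b):
        # strip read/execute for group and other; root keeps full use
        mode = stat.S_IMODE(os.stat(b).st_mode)
        os.chmod(b, mode & ~(stat.S_IRGRP | stat.S_IXGRP |
                             stat.S_IROTH | stat.S_IXOTH))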

changes...  ( most expensive if it breaks - depending on 3rd party sw )
	- after the server is built and patched and hardened..
		- you don't need to apply any new changes unless the change
		restores some functionality that is needed or closes a
		security vulnerability/exploit

> Hardware breaks and the
> probability of failure is proportional to the number of systems.  Humans
> have problems with this package or application or that printer and those
> problems also scale with the number of systems. 

if they have problems... they do NOT get to fix it..
	and have a redundant backup methodology and systems to do the
	same job ...

if done right ... i think things scale well... :-)

> Even if your
> OS/software setup is "perfect", you cannot avoid it costing minutes to
> hours of systems person time every time a system breaks, every time a
> human you're supporting needs help. You also cannot avoid constantly

if one human needs help... it's likely others will need the same help...
	- send um to the internal "help docs"

> working on the future -- preparing for the next major revision upgrade,
> installing new hardware, building a new tool to save yourself time on
> some specific task that isn't yet scalably automated.

> 
> This teaches us how to minimize administrative expense.
> 
>   * Buy high quality hardware with on-site service contracts (expensive

if you need "on-site contracts" ... buy other hardware instead ...
as it implies that the hw will die ...

if you need onsite contracts ... they are expensive ... and your systems
will be down till they show up ... can you afford to wait for them ???

lots of in-house people would probably like to fiddle with the
dead/dying/flaky box for the same money you're paying the outside service
contractors ( the techs themselves get $10/hr - $20/hr to go from
site-to-site ... while the contract costs run $150/hr - $300/incident
and upwards )

> up front but cheap later on) OR be prepared to deal with the higher rate
> of failure and increase in local labor cost.  Note that either strategy
> might be cost-benefit optimal depending on the number of systems in
> question and your local human resources and how well, quickly, and
> cheaply your vendor can provide replacement parts.  To achieve the
> highest number of systems per admin person, though, you'll definitely
> need to go with the high quality hardware option.

cheaper to buy 2 systems.... than it is to buy support contracts..
	( keep lots of "spare parts" floating around ...
	rough numbers below )

	- you need to back things up anyway.... those backup boxes are your
	spare parts and emergency replacements too
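
back-of-the-envelope, using the ballpark figures above ( all the numbers
here are assumptions ... plug in your own ):

#!/usr/bin/env python
# contract_vs_spares.py -- rough cost comparison, not real data
incidents_per_year = 10        # assumed hw failures across the farm
cost_per_incident = 200.0      # somewhere in the $150-$300 range quoted above
spare_box_cost = 500.0         # generic replacement machine
spares_on_shelf = 2

contract_cost = incidents_per_year * cost_per_incident   # recurring, per year
spares_cost = spares_on_shelf * spare_box_cost           # one-time outlay

print("on-site contract: $%.0f per year" % contract_cost)
print("cold spares:      $%.0f one-time" % spares_cost)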

>   * Shoot your users.  G'wan, admit it, you've thought about it. 

give um GUIs to use .. :-)

> They
> just clutter up the computing landscape.  Well, OK, so we can't do that
> <sigh>.  So user support costs are relatively difficult to control,
> especially since it is a well known fact that all the things one might
> think of to reduce user administrative costs (providing extensive online
> documentation, providing user training sessions, providing individual
> and personalized tutorial sessions) are metaphorically equivalent to
> pissing into a category 5 hurricane.  
> 
>   * Don't upgrade.  Don't change.  Don't customize.

unfortunately... customization is always needed ???

customization is how you keep the costs down to almost zero ??
and make things self-automating ... no matter which distro or compute
platform is used

>  It is a well-known
> fact that one could get as much work done with the original slackware or
> RH 5.2 -- or even DOS -- as one can today with RH 8.0 (scaled for CPU
> speed, of course).  A further advantage of never changing is that
> eventually even the dullest of users figures out pretty much everything
> that can be done with the snapshot you've stuck with for the last five
> years.
> 
> So Google can manage with a relatively few admin humans because they
> probably hide hardware expenses behind a fancy service contract (so that

they do NOT have service contracts...

	- if a pc dies for any reason... it goes offline and stays
	offline ... they have about 5,000 PCs that are just sitting there
	occupying space .. doing nothing... powered off

	it's cheaper for them to just buy a new P4-2GHz machine for $500
	than to go figure out what broke and why ... on an old server
	that is obsolete ?? ( some being Pentium-class pcs )

> they REALLY have another ten full time Dell maintenance folks who do
> nothing but pull and fix systems all day long)

last i heard .. a few months ago .. they buy generic PCs ... parts only ...
( think they might be farming out the generic physical pc-assembly part
of matching this mb w/ that cpu and xxx memory w/ a specific disk drive )
	- it needs to be able to do network boot/install
	- it needs to be able to run pxe
	- they build and install in something like 5 minutes
	and convert the box to a front end www machine or a "search index
	box" or user workstation or ?? ( see the provisioning sketch below )
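
the provisioning sketch ( none of this is google's actual setup ... just
one common way to netboot a box straight into an automated kickstart-style
install ... the tftp paths and the config server below are made up ):

#!/usr/bin/env python
# pxe_provision.py -- hypothetical sketch: stamp out per-machine
# PXELINUX entries so a node netboots into an automated install
import os

PXE_DIR = "/tftpboot/pxelinux.cfg"       # assumed tftp layout
TEMPLATE = """default install
label install
    kernel vmlinuz
    append initrd=initrd.img ks=http://%(server)s/ks/%(role)s.cfg
"""

def add_node(mac, role, server="10.0.0.1"):
    # pxelinux looks for a file named 01-<mac, lowercased, colons -> dashes>
    name = "01-" + mac.lower().replace(":", "-")
    path = os.path.join(PXE_DIR, name)
    f = open(path, "w")
    f.write(TEMPLATE % {"server": server, "role": role})
    f.close()

if __name__ == "__main__":
    # e.g. one front-end www box and one search-index box
    add_node("00:50:56:AA:BB:01", "www")
    add_node("00:50:56:AA:BB:02", "searchindex")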

fun stuff ... to try to keep machines up and running .. while keeping
the noise level from users at a minimum ...

c ya
alvin

> and because they don't
> have any users.  Well, they have LOTS of users but they're all far away
> and can't come into their offices ranting and don't expect their hands
> to be held while learning a simple command like ls with no more than a
> few dozen command line options.  And I'm sure that THEY never change a
> thing they don't have to, and dread the day they have to.
> 
> More realistically, we're finding that in an active LAN/cluster
> environment, two full time admins are a bit stretched when the total
> number of LAN seats plus cluster nodes reach up towards 400-500, over
> 200 apiece, with all of the above (HHC) being the limiting factors.  One
> reason the Tyans we opted for for our last round of cluster nodes have
> been a problem is the anomalously high costs of installing them (see
> ongoing discussion of their quirky BIOS) and their relatively high rate
> of hardware failure.  We're now considering going back to Intel Xeon
> duals and are evaluating a loaner -- they are a bit more expensive but
> if they reduce human costs they'll be worth it.
> 



