[Beowulf] RE: Capitalization Rates - How often should you replace a cluster? (resent - 1st sending wasn't posted ).

Thu Jan 15 13:26:00 PST 2009

Hello Again to all you Beowulf users!
Similar to Doug Eadline's poll....

I was a subscriber in the late 1990s, but had to sign off the list for several years as my job changed.
I used to follow many of your posts with great interest, and hope you old-timers (and newer subscribers as well) are all doing well!!

I'm working on a study on computer system recapitalization and thought this might make an excellent topic to engage on the Beowulf list.
We expect the results to have broad implications for US Federal Government IT planning as well.

I am researching the optimal life expectancy that a military program should plan for some of its commercial computer-based systems.
My customer has upgraded many systems to COTS-based products (Linux/Intel-based blade servers), and will upgrade more in the future.
They manage about 100 sites, each with a modest (1-room) sized installation of maybe 12 racks or so (200 to 300 nodes).
The question was raised as "When should all these servers be upgraded or replaced again?"

We know that from a purchasing perspective, the longer the life we plan/allow, the fewer systems to be bought per year, and the cheaper it gets.
But there are other factors - over time the "older systems" are harder to maintain: they don't run newer licenses of SW products,
and they need spare parts, some of which are hard or very hard to find (e.g. old RAM modules - on eBay?!).
Sometimes the newer technology uses less power and is cheaper to operate (anyone ever create a kW/MFLOP vs. time curve? Has that really gone down?).
After several years (e.g. 6, 7, or 8) the system admin costs on the older systems may be higher - e.g. more labor, specialized training, unique tools.
If we know the required life is a long time, our customer insists on tracking end-of-life points and buying spares to have on hand. That costs more for longer time spans.
At some point in time the reliability of the fans and disk drives starts to really impact computational production - downtime costs $$,
and at that point the repair means "complete replacement", but that cost may be lower (cheaper parts) or higher (new facility? rewrite code?) due to waiting so long to do it ...
We think that the "cost vs. length-of-life" curve has a parabolic shape, and the left side is dominated by the cost decrease noted above. The right side is the more troublesome.
In economics or manufacturing this is the classic capitalization problem - what is the life expectancy of the asset?  The optimal point is at the bottom of the parabola.
Do operating costs really go up as a cluster ages?  What other factors are there?
For some the upgrade is necessary to tackle a larger or more complex problem  - but others can just let the system churn a bit longer to get the answer.  Is that the real driver?
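
To make the curve concrete, here is a minimal sketch under assumptions that are NOT from any real data: the purchase price is spread evenly over the planned life, and the yearly operating cost starts at a base value and grows by a fixed amount each year. The function names and all three numbers are hypothetical placeholders. With a linear rise like this the curve is U-shaped rather than a true parabola, but it has the same kind of minimum.

```python
def annualized_cost(life_years, purchase_cost, base_opex, opex_growth):
    """Average cost per year of owning one system for life_years years.

    Assumed model: operating cost in year t (1-based) is
    base_opex + opex_growth * (t - 1).
    """
    total_opex = sum(base_opex + opex_growth * (t - 1)
                     for t in range(1, life_years + 1))
    return (purchase_cost + total_opex) / life_years


def best_life(purchase_cost, base_opex, opex_growth, max_years=15):
    """Planned life (whole years) with the lowest average yearly cost."""
    return min(range(1, max_years + 1),
               key=lambda n: annualized_cost(n, purchase_cost,
                                             base_opex, opex_growth))


# Placeholder values in arbitrary cost units, chosen only for illustration.
PURCHASE = 100.0   # one-time hardware cost
BASE_OPEX = 10.0   # operating cost in the first year
GROWTH = 5.0       # extra operating cost added per year of age

if __name__ == "__main__":
    for n in range(1, 11):
        cost = annualized_cost(n, PURCHASE, BASE_OPEX, GROWTH)
        print(f"{n:2d} years: {cost:6.2f} per year")
    print("lowest-cost life:", best_life(PURCHASE, BASE_OPEX, GROWTH), "years")
```

In this toy model the left side of the curve is the purchase cost divided by the life, and the right side is driven entirely by the assumed growth rate. That growth rate is exactly the number I do not have good data for.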

I recall that one Beowulf user facility operated both a new and an old production cluster, and replaced the "old one" with a "newer" one (the new "new one" ) on a regular basis.
Is that common?  Do most of you as users see the old system just chucked out as the new one is brought online?
What are the advantages or disadvantages of replacing maybe 50% of the hardware at 25 sites per year, vs. replacing all (100%) of the system at 12 sites per year?
(On a coarse level, both plans cost the same in hardware purchasing terms.)
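
The arithmetic behind the two plans can be sketched as follows. The function and key names are hypothetical, and the sketch assumes a steady state in which each partial visit replaces the oldest share of a site's nodes.

```python
def plan_summary(total_sites, sites_per_year, fraction_replaced):
    """Steady-state figures for a rolling replacement plan.

    Assumes every visit to a site replaces the oldest
    fraction_replaced share of that site's hardware.
    """
    # Years between successive visits to the same site.
    visit_interval = total_sites / sites_per_year
    # Years a given node stays in service before it is replaced.
    hardware_life = visit_interval / fraction_replaced
    # Purchases per year, measured in whole-site equivalents.
    site_equivalents = sites_per_year * fraction_replaced
    return {
        "visit_interval_years": visit_interval,
        "hardware_life_years": hardware_life,
        "site_equivalents_per_year": site_equivalents,
    }


if __name__ == "__main__":
    half_plan = plan_summary(100, 25, 0.5)   # 50% new at 25 sites per year
    full_plan = plan_summary(100, 12, 1.0)   # 100% new at 12 sites per year
    for name, plan in (("half", half_plan), ("full", full_plan)):
        print(name, {k: round(v, 2) for k, v in plan.items()})
```

Under these assumptions both plans buy roughly 12 site-equivalents of hardware per year and keep a node in service for about 8 years. The difference is that the half-replacement plan touches every site every 4 years, so each site always has some nodes no more than 4 years old, at the price of running mixed hardware generations side by side.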

I expect that many of you have experience in upgrading and replacing clusters - I'd love to hear the feedback.
Could any/some of you please respond with information or help?
Maybe in one of three ways ....

1) Reply to this thread, with general comments, especially if you know of a study, presented paper, or archived thread
(if you recall a month I'll go scouring the archives again - my effort so far has been fruitless!).   Has anyone modeled this problem?
I know I covered a lot of different aspects - feel free to comment on whatever issue or item you wish.

2) Send me info (directly to DLechner at mitre.org) on how old your current operational cluster or system is (and a little about it - e.g. # nodes).
I'll tally those #s and come up with an average age of operating clusters (I'll post a summary of all the responses I get in a week).

3) Send me info on the age of a cluster you just recently replaced - if you send that directly to me (DLechner at mitre.org) I'll post a generic table of data back to the list in a week or so.
This will at least tell us the "hard facts" of when maybe 20 or 30 commercial systems were last replaced. The assumption is that most of you or your lab managers were making a best-case judgment that it was time for the unit to be replaced, so this gives us statistical comparison data drawn from the Beowulf user community's experience.
I think actual (factual) replacement #s are more valid - that shows the result of collective decisions to actually spend $ and take action.
Please tell me, if you know:
a) How many nodes there were in the "old system" ?
b) How old was that "old system" when replaced?
c) How many nodes does the "new cluster" have?
d) If known, what was the biggest reason for the replacement?

If responding directly, please put "CLUSTER RECAPITALIZATION" in the title of the email.

Hopefully this wasn't done too recently (last few years) - if it was, can someone please send a pointer and I'll check it out!
I'm also reviewing published data on this question, but would specifically like to get the Beowulf community view.

Thanks again & in advance
Dave Lechner
Principal Systems Engineer
MITRE Corp.
DLechner at mitre.org