Small clusters Re: [Beowulf] Why I want a microsoft cluster...

Sun Nov 27 17:04:43 PST 2005

At 07:40 PM 11/26/2005, Joe Landman wrote:

>Jim Lux wrote:
>>At 09:04 AM 11/26/2005, Mark Hahn wrote:
>
>[...]
>
>>>why?  because low-end clusters are mostly a mistake.
>
>I disagree on this for a number of reasons.  It may make sense for some 
>university and c

<snip>

>>However, this sort of logic (the economies of scale) pushes clusters
>>towards the model of the "mainframe supercomputer" in its special
>>machine room shrine tended by white garbed acolytes making sure the 
>>precious bodily fluids continue circulating.
>
>Yes!!!  Precisely.    But worse than this, it usually causes university 
>administrators to demand to know why department X is going off on their 
>own to get their own machine versus using the anointed/blessed hardware
>in the data center.  As someone who had to help explain in the past why 
>our university did not want us to be running on the same million dollar 
>plus IBM 43xx mainframe that ran student records, and buy a real 
>supercomputer, I can tell you that this is a painful battle at best. 
>Department X may be getting its own machine due to the central resource 
>being inappropriate for the task, or the wait times being unacceptably 
>long, or the inability to control the software load that you need on it.
>
>There are possibilities to find a happy middle ground, where the central 
>folks manage the resource, and allow cycles to be used by others when the 
>owners of the machine are not using so much of it.   Moreover they can 
>connect it to central disks, authentication, and so forth.  That is, the 
>value of the centralized IT is realized, even if it is a separate resource.
>
>Forcing all users to use the same resource usually winds up with uniformly 
>unhappy users (apart from the machine designers who built it for a 
>particular purpose).
>
>>One of the biggest values of the early clusters was that they let people 
>>"fool around" with supercomputing and get real work done, without 
>>hassling with the institutional overheads.  Sure, they may have been
>>non-optimized (in a FLOPS/dollar sense, certainly, and in others), but 
>>because they were *personal supercomputers*, nobody could complain.
>>They were invisible.
>
>Not really invisible.  And not all workloads make sense to characterize in 
>terms of FLOPS/$.

I suppose I used FLOPS/$ as a shorthand for whatever optimization cluster 
builder A optimizes for, in distinction with whatever optimization cluster 
builder B cares about, which are all probably different from what the 
actual user cares about, which is getting *easy* access to more 
computational horsepower than they can get from their desktop workstation.

I think (and according to what Joe cites in the market studies) that there 
are a lot of people who have problems that are "just a little bit bigger" 
than what is practical on a single desktop machine (say, 1 or maybe 2 
orders of magnitude).

Sometimes, it's from a lack of vision (or well-justified skepticism of 
availability) on the part of the person asking the question.  Researchers 
tend not to ask for huge leaps from what they have available... be it 
computing time or data rates from outer space, because historically, a
little bit more right now has a higher chance of becoming reality than a
whole lot more in some speculative future.

It's also in the nature of how answering questions works out: one tends to
want to scale up gradually, rather than going for the whole shooting match 
all at once.  If for no other reason than you need to validate the output 
of your big new code, and one way is to run a formerly small problem on the 
new code and see if it matches what you got with the old code, and then do 
some scalability of results tests.  (make the polygons smaller, make the 
mesh a bit finer, do a few more variables, etc.)

[An exception might be a Monte Carlo type analysis, where you're just
trying to get large numbers of "tries"...  ]

>>There is, I maintain, a real market for smallish clusters intended to be 
>>operated by and under the control of a single person.  In one sense, I'm 
>>wasting compute resources, but, I'm also doing this because my desktop 
>>CPU spends most of its time at <5% utilization.  Having that desktop 
>>under my personal control means that if something isn't working right, I 
>>can just abort it. Or, if I am seized by the desire to dump all the other 
>>stuff running temporarily and run that big FEM model, I can do that 
>>too.  No calling up the other users and asking if it's ok to kill their 
>>jobs. No justifying processor utilization to a section level committee, etc.
>
>The interesting thing is that if you look at the numbers from IDC and 
>others you realize something very interesting.  First, the real HPC 
>hardware market (this is a 7B$ market today, growing at > 10% CAGR) has 
>its largest section by volume and largest growth in the 25-50k$ 
>region.  Second, the large market is shrinking.  Again, this is not a slap 
>against Jim and Mark.  The real HPC cluster market is moving down scale, 
>and the larger ones are growing more slowly or shrinking.  This is going 
>to apply some rather Darwinian forces to the market (has been for a while).

Well, I'd hardly consider it a slap, since I too think that smaller is where
it's at in the near term.  I personally LOVE the idea of personal
supercomputing.

>This is not to say that there is not a big cluster market.  There is. It's
>real.  It's just not growing in dollar volume.  Some of us wish it would,
>but budgets get trimmed, and hard choices are made.  You have to go 
>through fewer levels of management to justify spending 50k$ than you do 
>500k$, or 5M$.

In rough terms, I'd say that the number of levels scales as 
log10(dollars).  It would be the rare researcher who has to get approval
to spend $1K.  Even 10K is usually within the "I can decide to spend it, 
but I have to write a justification, or get multiple bids to make sure 
we're not getting ripped off".  At 100K, you're starting to look at things 
that would be considered "capital equipment" and at 1E6 and up, you're in 
the "big investment with justifications" regime.
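
As a rough sketch of that log10 rule of thumb (in Python; the function names 
and the exact cut-offs at each power of ten are assumptions for 
illustration, not anybody's actual procurement policy):

```python
import math

# Regimes paraphrased from the paragraph above. Treating each power of ten
# as the start of a new regime is an assumption made for this sketch.
REGIMES = {
    3: "no approval needed",
    4: "own decision, with written justification or multiple bids",
    5: "capital equipment",
    6: "big investment with justifications",
}

def approval_magnitude(dollars):
    """Return floor(log10(dollars)), the rough 'number of levels' index."""
    if dollars <= 0:
        raise ValueError("dollars must be positive")
    return math.floor(math.log10(dollars))

def approval_regime(dollars):
    """Map a purchase price to one of the regimes above (hypothetical helper)."""
    # Clamp so anything under $1K or over $1M falls into the end regimes.
    level = min(max(approval_magnitude(dollars), 3), 6)
    return REGIMES[level]

if __name__ == "__main__":
    for price in (1_000, 10_000, 50_000, 100_000, 5_000_000):
        print(f"${price:>9,}: {approval_regime(price)}")
```

By this reckoning a 50k$ box sits in the same regime as a 10k$ one, while 
500k$ is already capital equipment.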

It's probably correlated to the annual wages of the person making the 
purchase.

>>To recapitulate from previous discussions: a cluster appliance
>
>Yup.  A 100k$ cluster appliance won't sell well according to IDC.  If your 
>appliance (board/card/box/...) sits in the 25-50k$ region and delivers 
>10-100x the processing power of your desktop, then you should see them 
>sell well.  This is where the market growth is.

And this is a fairly straightforward price:performance point to hit... The 
$2500 DIY cluster was probably in the 5-6x performance area, and there was
an uncompensated labor component in the build.  Figure 16 nodes instead of
the 8 in that cluster, and double or triple the price to account for 
packaging, sales, configuration and support, etc. and you're right at 
20-30k with 12-15x improvement in power.

>Companies retreating to the high end of the market risk the same fate as 
>all the other companies that have tried this before (e.g. any exec pushing 
>that strategy should take a long hard look at the HPC market and the 
>players that have retreated to the high end, most of them are gone and buried).
>
>[...]
>
>>this is the real MS/Cluster disconnect... Clusters just don't seem to fit 
>>in the MS .NET world.  On the other hand, MS has a lot of smart people 
>>working there, and it's not out of the question that they'd figure out 
>>some way to leverage the idea of "commodity computers with commodity 
>>interconnects" in some MS useful way.
>
>For one thing, I would expect that .NET will eventually == grid. Doesn't 
>make it HPC, but this is what I expect.
>
>What Microsoft could bring to the table is forcing people to build 
>reasonable interfaces for the HPC systems.  Today we have lots of APIs for 
>MPI, DRMAA for schedulers,....  Technology doesn't become useful for the 
>masses until it becomes easy for them to use.  If Microsoft makes this 
>simpler by creating or forcing a standard upon the cluster world, this
>might not be such a bad thing in the longer term.  I find nothing wrong 
>with the notion that more HPC users means more HPC.
>
>That said, I am not sure that they can do this without trying to force 
>Windows upon the users.

That kind of depends on whether MS really believes "the road ahead" is the 
internet or Windows.
Certainly, the "windows-ness" of all the backend stuff is pretty buried:
once you get that 1000 processor SQL server farm up and running, the fact
that the OS is "Windows" is sort of a non-issue.  You've bought a pile of
hardware and software and people who know how to feed and care for it at
some price that you presumably thought was reasonable.  All you really care
about is that it responds to those billions of transactions.  A high level IT
manager is just buying SQL queries, and the gory details of the
underlying technology are just that: "underlying details", to which they
give the same amount of thought you give to what sort of padding is installed
under the carpet in your new house.  A lot when you buy it, not much after
that, other than to budget and plan for ongoing costs, and perhaps to consider
whether it was a good choice when you go out to buy the next batch in a few
years.

Just in case my wife, who IS a high-level IT manager and spends quite a lot
of time in purchase decisions, reads this, I concede I might have
oversimplified the comments about how much thought they put into the
technology, but it's in illustration of a point: computing resources
are perceived as a commodity, and the details of how a commodity is
manufactured or grown don't usually have a big influence on how purchase
decisions get made.  Marketing (as opposed to sales) is, in large part, an
effort to de-commoditize the thing you want to sell.

>The demo at SC was ok, but you can do stuff like that today with web 
>services systems.  The issue is that there are no real standards 
>associated with that.   Not w3c stuff, but higher level standards that 
>make programming these things easier for end users.

It's tough to say what level of abstraction those standards should be 
at.  Make them too low, and they might not accommodate the next new
technology.  Make them too high, and they can be either so general they 
aren't really a standard or so specific to a given application that they're 
of limited utility.

The standards that endure are, oddly, low-level things that have a high
degree of commonality: think of RS-232 or the fact that everywhere in the
US has 60Hz power.

>Joe

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875