[Beowulf] Why I want a microsoft cluster...

Wed Nov 23 17:15:46 PST 2005

Jim Lux wrote:
> At 01:30 PM 11/23/2005, Joe Landman wrote:

[...]

> This is particularly pernicious for documents that get viewed with a 
> projector, and then get zoomed to look at the details.

Agreed.  This is a good reason to use vector formats in general whenever 
possible.

[...]

> 
> Yes, assuming your netops folks allow you to have a "public" sharepoint 
> that is visible from your desktop machine, something that requires a 
> fair amount of institutional negotiation in a MS centric shop.  Many 
> aspects of FUD will be raised about whether that's secure, etc.

Hmmm.... We have set up a cluster in a tightly locked down environment, 
and not even heard a peep out of them (apart from IP, DNS, ...).  I 
suspect this varies by institution.

>> We set up the transparent access for our customers, and I had assumed 
>> that all cluster vendors did.  Maybe I am wrong.
> 
> Many "locked down" shops want a fair amount of control over the software 
> that's hanging on their internal network, and if your particular cluster 
> distro isn't one they "trust", then it's a problem.

For us that's not an issue, as we use anything our customers tell us they 
want, though if they don't have a preference we will argue the technical 
merits and other bits of the most suitable distro.

We do have customers using and building RH7.3 based clusters due to 
application support issues.  Whether we like it or not is not an issue; 
it's what runs the software correctly that matters.

> Clearly, if a vendor is providing a turnkey system to someone, then 
> they've taken on the responsibility (and expense) of complying with all 
> the rules for the customer.  That can be a major project in itself (it 
> can also be easy.. depends a lot on the customer...)

Yes it does.  If the customer tells us "use X with patch set Y", we do 
it.  If it is inadvisable, we tell them why.  Compliance with this is 
not hard, unless they have hardwired loading processes at their factory. 
Our cluster load process starts out using whatever distro you want, and 
loads from there.

[...]

> But not vector formats as easily.. And, yes, you can interconvert on the 
> Linux side to get something useful, but you've also lost a lot of the 
> "seamless user interaction" and starting to get more like a batch 
> oriented sequence of operations: I set up my cluster job, run it, get 
> the output file, then run this script to convert it to a format that's 
> Windows pastable, then copy the file to my windows box, then do "insert 
> file...." in ppt...  A lot more steps than "Click on graph, Ctrl-C (or 
> Ctl-Ins), Click on ppt slide, Shift-Ins"..

Hmmm.... I could tell you things but I would be spamming.  Short version 
is that for our customers, it's fill out a web form for their job and 
then after it finishes, open up your post analysis tool on the desktop, 
pointing over to the result.  This is IMO how it should be for most 
folks who don't care about the inner workings of clusters, batch 
systems, etc.  And most don't.  And they shouldn't.

> In the Linux cluster but Windows desktop scenario, you're running on two 
> different machines, with two different user interfaces, and you have a 
> cognitive "context switch" every time you go back and forth.  (Just the 
> "single click" in kde vs "double click" in Windows distinction is bad 
> enough... how many times have I opened multiple copies of the editor on 
> the headnode?)

Hmmm... we are hiding the cluster from the casual user.  All they see is 
a web page and a big disk.  Everything else is magic.  They do not log 
into the cluster.  They log onto the web page.  It takes quite a bit of 
work to make this simple for them, but this is the goal.

>> As for "pandering", well, that may be an option for some, but my 
>> customers and users need to get work done.  The issue is how do you 
>> get it done in the least amount of time and effort, and the greatest 
>> impact.
> 
> And my contention is that in some circumstances, that might be with a 
> cluster using Windows rather than Linux.

My point is that it almost doesn't matter what's on the cluster as long 
as the application software runs and runs well on it.  In many cases you 
cannot use windows due to application compatibility issues with service 
packs.  In some cases you cannot use Linux due to glibc issues.  To a 
large degree the job and workload decide what is on it, and yes, you are 
right in that if you use windows OS, you could take a fairly substantial 
hit in performance.

Oddly enough, most of the folks I have spoken to are unwilling to take 
that hit.  I am sure that there are some out there that are willing to 
pay more for less.

[...]

>> Also true, quite true of my customers.  They are looking for minimum 
>> time to insight (hey look, it's a marketing phrase).  Fussing around 
>> with transport and conversion of data is not an option.  So they don't 
>> have to.
> 
> presumably, though, your customers are Linux centric?  That is, they are 
> doing their analysis and reporting in the Linux environment, not the 
> Windows environment?

Nope.  For most of our customers, everything is in windows.   Seamless 
is the order of the day.

The unix folk are far easier to support and don't want any of this web 
page stuff.  Fine.  They get a command line.  The rest of them get a web 
page and a big disk.

> Indeed.. and an essential resource they are.  However, if you're a naive 
> buyer of cluster computing, you might think that it's easier to just buy 
> that MS cluster and have it supported by the inhouse MS support folks.  

Yup, and that is a danger.  Just look at SQL server.

> Fear/Uncertainty/Doubt are powerful, powerful forces when it comes to 
> support.  There's a big psychological difference between having your 
> boss call the boss of the support division (keeping it within the 
> company) and calling your outside support vendor.

Somewhat, though for people interested in performance (most everyone 
using a cluster) you have to balance many issues.  Tight interconnection 
or not.  Intel or AMD.  Disks or diskless.  Windows or Linux.   You will 
see costs and benefits to every decision.  I simply don't see the points 
you are raising as being as significant to my customer base, though it 
is possible that there are windows only shops who cannot wait to 
replicate the CTC machine.

In that case, why haven't they?

> Practically speaking, the actual support might be identical, but it's 
> the thought that "I can get one of those PC support guys over here in an 
> hour to fix my dead node" without spending a fortune (because you've 
> already spent the money for them to be always there for all those 
> thousands of desktops).

Hmmmm.... we have trained the PC support guys on how to fix the nodes, 
right at the level they are comfortable with.  No issues from them. 
They seem to like playing with the stuff.  Their bosses like them 
learning it.

There are other processes at work here.

> 

>>> The odds of getting someone to fix my broken cluster, today or 
>>> tomorrow, are much higher if it's Windows based, just because there's 
>>> more folks around who are capable of doing it. If that 1 Linux 
>>> cluster weenie happens to be on vacation, I'm dead... the odds of all 
>>> 10 Windows cluster weenies being on vacation simultaneously is much 
>>> lower.
>>
>> Hmmmm.  Again, that's why there are external experts who do this stuff. 
>> As for the 50-100, I think the number is closer to 20-50 desktops per 
>> admin.  I have seen 4000 node clusters supported by 2 people full 
>> time.    I am not going to comment on the other aspects of this.
> 
> Typical desktop support costs are around $150/month for a large shop 
> (paying for the help desk and the roving techs)... That's roughly 
> $2K/yr.  Figuring a support person costs $200K, that's 100 systems per 
> person.  But, as you say, it could easily be a factor of 5 either way, 
> and I think a very good case can be made that the boxes/body ratio for a 
> cluster (which, by definition is all identical boxes configured the 
> same) might be higher.

We have seen in-company costs of $800/month and higher for boxes sitting 
in a server room at a number of our customers.  We have seen somewhat 
higher desktop/laptop support costs.
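
For what it's worth, the arithmetic behind these ratios is easy to redo. 
A quick sketch in Python, using only the figures quoted in this thread 
($150/month and $800/month per box, $200K per support person).  The 
function name is mine, and treating the whole per-box charge as paying 
for staff is a simplifying assumption:

```python
def boxes_per_admin(cost_per_box_per_month, admin_cost_per_year=200_000):
    """Boxes one support person's yearly cost covers at a given monthly charge.

    Assumes (simplification) that the whole per-box charge pays for staff.
    """
    yearly_per_box = cost_per_box_per_month * 12
    return admin_cost_per_year / yearly_per_box


if __name__ == "__main__":
    for monthly in (150, 800):
        print(f"${monthly}/month -> {boxes_per_admin(monthly):.0f} boxes per person")
```

Without rounding $1800/yr up to $2K, the $150/month figure works out to 
about 111 boxes per person.  At $800/month it is about 21.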

> OTOH, a lot of big corporate desktop management systems do the "You can 
> have system configuration A, B, or C, and no other" thing, so they are 
> pretty homogeneous too.

Heh... It keeps the support costs down.

[...]

> yes, huge firedrills on Patch Tuesday.. BUT... the MS cluster won't 
> incur *additional* hassles, because it's just the same as everything 
> else. (Well, it might.. the applications are different)

Again, I am going to disagree with you on this, as I have seen the 
admins who are spending all their time working on keeping the existing 
systems up and going after the patch bolus, who have no time to handle 
incoming priority one tickets (things which are deemed critical loss of 
business functionality), because they are busy trying to deal with the 
issues that the patches create.  In fact this problem is so bad at one 
customer (and I am told it is the same pretty much everywhere), that 
they pull the unix admins and others over during these sessions to help 
them.

Basically the current patch management system is horribly broken.  I 
know lots of folks will disagree with this.   I purposely avoid patching 
any of my windows machines before major events, even if there are day-0 
nasties running around, as I cannot deal with the critical loss of 
functionality at inopportune times.  The risk from a day-0 nasty is far 
lower for us than the risk of a patch borking a working system.

While it might be nice to have everyone doing exactly the same patches, 
I would argue that the HPC would not be considered a priority-1 or 
priority-2 system in most cases (unlike mail, dns, web, ...).  It is not core 
infrastructure for these admins.  No one will lose their jobs if it 
fails.  It would be patched as they have time.  This is how it is now, 
as the admins struggle to deal with the major issues.

That said, we have some neat ideas on how to effectively mitigate these 
issues (completely on a cluster, and almost completely on a desktop). 
They require that you take a performance hit on your systems running 
windows.  The worst that would happen is you need to copy a file back 
over, even if a virus hit.

>> Look carefully at what the CTC has to go through with their systems.  
>> If you are running 1000 copies of Norton on your disks, with each one 
>> loaded up with a personal fire wall, anti spyware and virus ... do you 
>> really have a cluster?  I don't think so.
> 
> Sure.. why wouldn't it be a cluster?  Sure, you've encountered some 
> inefficiencies in having all those antivirus programs, etc. running 
> where they're not really needed, but still, you're doing cluster work.  
> Mostly, all that dreck fills up disk space and doesn't hugely affect 
> computational performance most of the time, except once a night when SAV 
> phones home for the latest virus pattern files, etc.

No... if you have it scan all files, and all IO, the performance hit is 
*huge*.   And that nice security center, will happily block all those 
nice outgoing sockets, for stuff like MPI ...

>>> Say I wanted to install a Linux cluster.  Ooops.. they're not quite 
>>> as familiar with that.
>>
>> And this is why there is a market for this expertise.
> 
> But then, you have two sources of expense:  The outside expertise PLUS 
> the inside people who have to deal with them.  Yes, the inside expense 
> is less than it would have been before, but it might well be that the 
> total cost is higher.

Our customers tell us otherwise.  It costs them less to handle problems 
correctly the first time.

> 
> 
>>> They don't have all the patch rollout stuff, they don't have a patch 
>>> validation methodology, etc.  Sure, there's all kinds of patch 
>>> management stuff for Linux, in a bewildering variety of options, but 
>>> now we've got to have a Linux security expert, in addition
>>
>> ... all of this outsourceable for a tiny fraction of what they pay for 
>> their required in house windows staff ...
> 
> Is it really that much cheaper?  I suspect not.  Large windows shops 
> aren't all that inefficient.. they can't be.  It's not going to be, say, 
> 10 times more expensive to manage windows boxes than Linux boxes in an 
> apples/apples comparison.  And, there's a huge commercial pressure to 
> come up with ways to reduce the management costs of Windows boxes (so 
> you have products like Patchlink).

There are some real and unbiased studies out there somewhere.   It is 
significantly less expensive for a well designed and built infrastructure. 
For a cluster, and this is the point of the thread, the cost 
differences are going to be huge, and from what I can see, massively in 
favor of linux.

Per node base software costs scale linearly on windows clusters.  They 
are flat on linux as RGB pointed out.  The management systems for linux 
clusters are fairly well honed right now.  If you want to manage a 1000 
node cluster, reload every node using a toolkit like Rocks, Warewulf or 
similar, you can do it with a single command line (or even hook it 
together in a web page, though I would advise against this particular 
one).    It is unknown how many mouseclicks per compute node you will 
need to do this under windows.
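
To make the scaling argument concrete, here is a rough sketch.  The 
dollar figures are made-up placeholders, not real license or support 
prices, and the function names are mine.  Only the shapes of the two 
curves (per-node cost growing linearly versus a flat cost) come from 
the argument above:

```python
def per_node_licensed_cost(nodes, license_per_node):
    """Base software cost that grows linearly with node count."""
    return nodes * license_per_node


def flat_cost(nodes, fixed_cost):
    """Base software cost that does not depend on node count."""
    return fixed_cost


if __name__ == "__main__":
    LICENSE_PER_NODE = 100  # hypothetical per-node price
    FIXED_COST = 5_000      # hypothetical flat cost, e.g. a support contract
    for n in (16, 128, 1000):
        linear = per_node_licensed_cost(n, LICENSE_PER_NODE)
        flat = flat_cost(n, FIXED_COST)
        print(f"{n:5d} nodes: per-node licensed = {linear:7d}, flat = {flat:7d}")
```

Whatever the real prices are, the gap between the two widens with every 
node you add.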

Now this is not to say that Microsoft might not have something up its 
sleeve.  I am keeping my mind open on this.  But the pain and cost of 
management for things like linux clusters is fairly well known and 
fairly low.   The tools are very powerful and very flexible (Warewulf 
and others).   Huge and functionally complex computing infrastructures can 
be built, rebuilt, re-rebuilt in very little time (Rocks, Warewulf).

If they can show a convincing argument that they can do a better job at 
lower cost, that's when it starts to get interesting.  Some folks may 
deny the better job.  I am thinking that the cost is just not going to 
be in line with what it needs to be, which is flat.

> 
> And once you add in the overhead to make the inhouse compliance folks 
> feel warm and fuzzy, outsourcing might not be as cost effective.

Hmmm.   This is dependent upon the organization.  I don't expect the 
NNSA to outsource.  I would expect the FDA, HUD and others to.

[...]

> 
> But this is essentially the same choice.. either your software works on 
> your configuration (be it Linux distro or Windows build) or it doesn't. 
> In the Linux case, you've got the potential option of spending serious 
> time convincing the powers that be that you can make it work and still 

Rarely do you have that luxury.   All projects have a choice A and a 
choice B, with choice B being the default choice you make if A does not 
work in the allotted time.  Choice B must work.  Choice A is the thing 
you would like to work because it is better than B.  Call it a fallback 
plan.

In most cases, our customers use a very simple rubric.  If the thing is 
not up and functional in very few hours (to days for complex things) it 
ain't gonna work.   Pursuing something to the point of nearly infinite 
time might be great for academic circles :) , but it doesn't fly in 
industry.

> be in conformance with the institutional computing rules.  In the 
> Windows case, you're just plain out of luck.

I have yet to see an individual contributor told that they are 
being let go due to the inability of the IT staff to make the IT 
infrastructure work for them (and therefore rendering them effectively 
unable to do their work).  I have seen IT missioned to make something 
work, regardless of how they have to do it.

> Of course, in the Windows world, most shops have dealt with the "how to 
> support multiple SP levels of Windows" problem.. and, of late, this 
> problem is much, much reduced from the days of Win95,Win98,WinME,NT4.0. 
> I haven't had anything break with Win2K or WinXP patches in a long time, 

Heh... about 3 weeks ago, a major customer had all sorts of joy.  LOTS 
of stuff was hosed.  All "Standard" stuff.  Delayed a project we were 
working on with them (HPC, go figure) almost to the point of missing it 
for this year.

I am having trouble remembering a clean and simple patch install under 
windows, either at any of my customers sites or at my site, which just 
worked.  You have to watch the system for days.

I built one XP system (my desktop XP system) which was rock solid stable 
until that last patch bolus a month ago.  It now crashes after about 4 
days.  Like clockwork.  My laptop is suffering under the same patch 
stream.  A number of tools no longer function.

> with the exception of a weird interaction between the OpenGL drivers in 
> Matlab R13 and Win2K.

I don't have what one might call a positive sense about the Microsoft 
patches.  I wish they would stick to bug fixes, and not change 
functionality on me.

> As far as the "MS only supports 2 versions back".. that comes with the 
> territory in Windows applications. If you're selling applications for 
> Windows based clusters, you'd have to factor that into your support 
> strategy, same as if you're selling applications for any other flavor of 
> Windows.

So customers with business dependencies upon w95 (hey, could happen) are 
toast if they cannot upgrade.  Yet customers with business dependencies 
upon RH5 could probably run just fine.

[...]

> 
> I think so... and you've got to "look" to the Windows centric world 
> appropriately "trustworthy", just like that network connected HP printer 
> down the hall.  What you can't do is "look like a computer", because 
> that will scare them.

This isn't actually a problem in most cases.  A well implemented SAMBA 
system will appear to be a NAS.  A well implemented cluster will appear 
to be a web page and a NAS.  You can direct both systems to authenticate 
against the corporate oracles of truth.

[...]

> Indeed... I agree... the problem is that someone has to pay for building 
> that "web appliance" and I'm not entirely sure that the market for 
> clusters is big enough to allow that (substantial) expense to be spread 
> thin enough.

So what if it already exists?  What if a fortune 500 or two were using it?

> My Linksys WRT54G has Linux inside, as does my Moxi digital cable box, 
> but the vast majority of users of either are not aware of it.  They're 
> also being produced in million scale quantities, so the considerable 
> work required to "hide" Linux (or equally valid for some other devices, 
> WinCE) and make it truly "appliance like" is small in a per unit sense.

:)

> 
> I think there IS a market for an appliance with some applications 
> tailored to use it.  Say you have someone doing a lot of work with 
> NASTRAN or HFSS (both are computationally intensive FEM programs).  You 
> could give them their familiar windows executable that just happens to 
> feed work off to a networked attached computational engine.  I've 
> contemplated doing something like this with a program called 4NEC2, 
> which wraps a nice Windows front end around a compiled FORTRAN backend 
> program (NEC -Numerical Electromagnetics Code).  NEC runs just fine on 
> Linux (heck, there's even cluster versions of it) and the "interface" is 
> just text files in 80 column card images and 132 character printer output.

Would you like a URL?

> 
> 
> Jim

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615