[Beowulf] MS Cray

Wed Sep 17 10:21:24 PDT 2008

Joe Landman wrote:
> Eric Thibodeau wrote:
>> Joe Landman wrote:
>>> Gus Correa wrote:
>>>
>>>> Otherwise, your "newbie scientist" can put his/her earbuds and pump 
>>>> up the volume on his Ipod,
>>>> while he/she navigates through the Vista colorful 3D menus.
>>>
>>> Owie .... I can just imagine the folks squawking about this at SC08 
>>> "Yes folks, you need a Cray supercomputer to make Vista run at 
>>> acceptable performance ..."
>> Maybe they have a "tune options for performance" option ;)
>>>
>>> The machine seems to run w2k8.  My own experience with w2k8 is that, 
>>> frankly, it doesn't suck.  This is the first time I have seen a 
>>> windows release that I can say that about.
>> A few questions (not necessarily expecting a response):
>>
>> POSIX?
>> VERBS?
>> Kernel latency and scheduler control?
>
> Don't mistake me for a w2k8 apologist.  I reamed them pretty hard on 
> the lack of a real posix infrastructure (they claim SUA, but frankly 
> it doesn't build most of what we throw at it, so it really is a 
> non-starter and not worth considering IMO).  They need to pull Cygwin 
> in close and tight to get a good POSIX infrastructure.  It is in their 
> best interests.  Sadly, I suspect the ego driven nature of this will 
> pretty much prevent them from doing this.  Can't touch the "toxic" OSS 
> now, can they ...
Cygwin...yerk, that emulation bloat is slow as hell and can barely run 
some of my basic scripts. A simple find would put the CPU at 100% usage. 
Now I don't blame Cygwin for this per se but rather the way that 
windows (probably) runs this as a DOS(ish) app in a somewhat polled 
mode. My lack of interest in that OS stopped me from digging deeper into 
why Cygwin is so slow. Given it's a more than mature project, I'd have 
expected such poor performance to have been addressed by now.
>
> IB Verbs?  Well through OFED, yes.  Through the windows stack?  Who 
> knows.  We were playing with it on JackRabbit for a customer 
> test/benchmark.
...and...the results? ;)
>
> Kernel latency?  Much better/more responsive than w2k3.  Scheduler 
> control?  Not sure how much you have.  I don't like deep diving into 
> registries ... that is a pretty sure way to kill a windows machine.
Well, let's just say these are mechanisms I expect an HPC machine to 
have when "squeezing the last drop of performance" is mentioned.
>>
>> These are the real barriers IMHO, without minimally supporting POSIX 
>> (threads), there is very little incentive to use the machine for 
>> development unless you're willing to accept the code will _only_ run 
>> on your "desktop".
>>>
>>> The low end economics probably won't work out for this machine 
>>> though, unless it is N times faster than some other agglomeration of 
>>> Intel-like products.  Adding windows will add cost, not performance 
>>> in any noticeable way.
>>>
>>> The question that Cray (and every other vendor building 
>>> non-commodity units) faces is how much better is this than a small 
>>> cluster someone can build/buy on their own?  Better as in faster, able to 
>>> leap more tall buildings in a single bound, ... (Superman TV show 
>>> reference for those not in the know).  And the hard part will be 
>>> justifying the additional cost.  If the machine isn't 2x the 
>>> performance, would it be able to justify 2x the price?  Since it 
>>> appears to be a somewhat well branded cluster, I am not sure that 
>>> argument will be easy to make.
>> I just rebuilt a 32 core cluster for ~5k$ (CAD) (8*Q6600 1Gig 
>> RAM/node + gige networking). Bang for the buck? I can't wait to see 
>> the CX1's performance specs under _both_ windows and Linux.
>
> The desktop CPUs/MBs will get you best bang per buck, as long as you 
> don't mind no ECC, and 8GB ram limits per node.  For your 
> applications, this might be fine.  For others, with large memory 
> footprint and long run times, I see people need/require ECC (as memory 
> density increases, ECC becomes important .... darned cosmic 
> rays/natural decays/noisy power supplies/...)
Well, the nodes I built have MBs that state that they can go as high as 
8Gigs, but anyone with a little experience with the 8Gig mix + 800+MHz 
RAM knows that very little hardware (MB) can actually do it in a stable 
fashion. My recent experience with "fast" RAM (800-1066MHz) is that 
it ends up costing you time ($$$), since most MBs claim to support it 
but they all seem to have impedance problems of some sort (totally 
unstable). And if one reads the fine print and the QVLs (Qualified 
Vendor Lists), you notice these really high throughputs are for low 
memory density of the banks (a _total_ of 2-4 Gigs max). I'd 
personally say this is more of an issue than the ECCism of RAM.

I work on "Clustering Algorithms", and so as not to confuse people, I 
mean the k-means type which we could call "data aggregation/mining" 
algorithms. They are long running and applied to sizable databases 
(1.2Gigs) which need to be loaded onto each node. This is where having 
multi-core nodes comes in quite handy as there is way less time lost in 
data propagation and loading (the databases).
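
As a rough back-of-the-envelope sketch of that point (assuming each node 
loads one full copy of the database and the cores on a node share it; 
the function name is purely illustrative, and the 32 cores / 4 cores per 
node / 1.2 Gig figures are the ones mentioned in this thread):

```python
def total_data_loaded_gb(total_cores, cores_per_node, db_size_gb):
    """Total volume pushed out when every node loads one full copy."""
    nodes = -(-total_cores // cores_per_node)  # ceiling division
    return nodes * db_size_gb

# Same 32 cores, packaged as quad-core nodes vs. single-core nodes.
quad = total_data_loaded_gb(32, 4, 1.2)
single = total_data_loaded_gb(32, 1, 1.2)
print(f"quad-core nodes:   {quad:.1f} GB")
print(f"single-core nodes: {single:.1f} GB")
```

Under those assumptions the quad-core layout moves a quarter of the 
data for the same core count.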

...which brings me to wonder how the I/O is managed under the CX1...is 
it as basic as one I/O node and GFS, or do all nodes have their own I/O 
paths? I mention this since I've too often seen people ignore the I/O 
(load times) in their performance assessments ;)
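
For what it's worth, a minimal sketch of what I mean by counting the 
load time: time the load phase and the compute phase separately and 
report both. The load_data and compute functions below are hypothetical 
stand-ins, not anyone's actual benchmark code:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def load_data(n):
    # Stand-in for reading the database from disk or over the network.
    return [float(i % 97) for i in range(n)]

def compute(data):
    # Stand-in for the actual number crunching.
    return sum(x * x for x in data)

data, t_load = timed(load_data, 200_000)
result, t_compute = timed(compute, data)
t_total = t_load + t_compute
load_share = t_load / t_total
print(f"load {t_load:.4f}s  compute {t_compute:.4f}s  "
      f"load share {load_share:.0%}")
```

Reporting only the compute figure hides whatever fraction of the wall 
clock the load phase takes.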