[Beowulf] Shared memory

Thu Jun 23 10:33:15 PDT 2005

Of course, apart from the embarrassingly parallel nature of certain software,
which game tree search and artificial intelligence in general don't fall under,
there is a simple programming skill difference between the average unix tool
and commercial software.

Probably the difference is that I've put 7 years of programming into the
parallel algorithm of DIEP and 2 years fulltime into modifying an excellent
algorithm to work fine under NUMA conditions. It is commercial software,
the best you can get in its field, especially with respect to the actual
speedup the algorithm gets out of a NUMA environment such as a quad dual
core opteron provides.

A single dual core cpu clocked at 1.8GHz costs $823 or something similar;
I'll leave it up to you to call that 'expensive'.

This, whereas the average 'mpi' program first gets slowed down by a factor
40 or so, but then the scientist in question just puts down some signatures
and gets 512 processors for a long period of time. Effectively 512 / 40 =
roughly a factor 12 speedup.
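
As a rough sketch of that arithmetic (the function name is made up for
illustration, and the factor 40 slowdown is only the estimate given above,
not a measurement):

```python
def effective_speedup(processors, slowdown_factor):
    """Net speedup when a program first loses `slowdown_factor` in
    single-cpu speed and is then spread over `processors` cpus,
    assuming (optimistically) perfect scaling after that."""
    return processors / slowdown_factor

# Figures from the paragraph above: 512 processors, slowed down 40x.
speedup = effective_speedup(512, 40)
print(f"net speedup: about {speedup:.1f}x")
```

That comes out at about 12.8, i.e. the "roughly factor 12" above.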

The scientist just doesn't care, takes a holiday, comes back, and
writes a positive report.

That's the difference between that scientist and me. I try to make software
that also runs real fast on a SINGLE cpu for my clients.

If you try to run software on a cluster that has been optimized to the
utmost, with 50 algorithms, to run as fast as possible on a single cpu, then
getting a good speedup out of that cluster is not so easy.

It's a simple programming skill difference, nothing else.

Now it is of course possible to get some sort of speedup out of a cluster,
but you cannot compare a 500MHz MIPS R14000 cluster of 512 processors with
a quad dual core 2.2GHz opteron. The quad dual core 2.2GHz opteron just
eats it alive.

This while the quad opteron is a factor 1000 cheaper than the SGI origin3800
cluster.

I deliberately call it a cluster because at 512 cpu's the latency is similar
to what today's network cards deliver, and a factor 50 away from the latency
a quad dual core opteron delivers (one-way ping-pong, TLB thrashing
memory reads and writes).

Just take the speedup of, say, 10%-30% that I effectively managed to get at
such a cluster at slow time controls (let's not discuss the first
30 seconds, as that's not fair to a cluster which needs 3 hours of startup
time just to allocate shared memory, let alone wake up 500 processors). I
actually used 460 cpu's when running on that partition, as running with 500
cpu's didn't work out real well. The scientist will then claim 20% in his
report, which indeed was the average speedup I had, but the worst case is
what counts in competitive environments.

The worst case was in fact around 10% (still a guess, could be worse; I
didn't have enough system time to do ANY statistically significant test, as
that would run for a week).

10% * 460 * 0.5GHz = 23GHz

By any measure my average speedup of 20% was real good. The Deep Blue team
claimed around 5% speedup (no evidence given though). The 20% speedup I
calculated for my program on the big machine is also based upon a lot of
statistical inaccuracy; at such big chicken machines you never get enough
system time to do some serious testing!

Now convert that to opteron speeds. With the recent improvements in
compilers, an opteron is 2 times faster per cycle than that R14000
(off chip L2 cache, YES BABY!).

So that 23GHz ==> 11.5GHz opteron.

Actually a quad opteron dual core 2.2GHz = 8 x 2.2 = 17.6GHz

And the speedup, even worst case, is real good on it, for sure far superior
to 11.5GHz effectively.
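
The back-of-envelope comparison above can be sketched as follows. All
names here are invented for illustration, and the inputs are just the
estimates quoted in this post (10% worst case, 20% average, factor 2 per
cycle), not measurements:

```python
# Back-of-envelope comparison using the figures quoted in this post.
# All names are illustrative; nothing here is measured.

def effective_ghz(efficiency, cpus, clock_ghz):
    """Clock-equivalent throughput: parallel efficiency x cpu count x clock."""
    return efficiency * cpus * clock_ghz

# Origin 3800 partition: 460 R14000 cpus at 0.5 GHz.
origin_worst = effective_ghz(0.10, 460, 0.5)    # worst case 10% -> 23 GHz
origin_average = effective_ghz(0.20, 460, 0.5)  # average 20% -> 46 GHz

# The post's estimate: an opteron does about 2x the work per cycle.
OPTERON_PER_CYCLE_FACTOR = 2.0
origin_worst_in_opteron_ghz = origin_worst / OPTERON_PER_CYCLE_FACTOR  # 11.5

# Quad dual core opteron: 8 cores at 2.2 GHz, at 100% efficiency.
quad_opteron_raw = effective_ghz(1.0, 8, 2.2)

# Parallel efficiency the quad needs to match the Origin's worst case.
break_even = origin_worst_in_opteron_ghz / quad_opteron_raw

print(f"Origin worst case: {origin_worst:.1f} GHz R14000 "
      f"= {origin_worst_in_opteron_ghz:.1f} GHz opteron")
print(f"Quad opteron raw:  {quad_opteron_raw:.1f} GHz")
print(f"Break-even efficiency on the quad: {break_even:.0%}")
```

Under these assumptions 8 x 2.2 works out to 17.6GHz raw, so the quad needs
roughly 65% parallel efficiency to match the worst case of the big machine.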

I'll have exact speedup numbers for you within a few days.

So a quad opteron dual core just completely outperforms such hardware,
simply because you never can test seriously at a 512 processor origin3800.

Of course this 512 processor machine would completely outperform the quad
dual core opteron on the left and on the right, if each processor inside
that machine were a good cpu... like a dual core opteron!

However that's not the case; such big iron machines usually have outdated
cpu's. Whereas such a quad dual core opteron you order, and you have it
at home within a few work days. That is what the beowulf system
engineers have to fight against.

"If you were plowing a field, which would you rather use? Two strong oxen
or 1024 chickens?"
	Seymour Cray

Vincent

At 09:14 AM 6/23/2005 -0700, Michael Will wrote:
>Michael Will wrote:
>
>> I was just yesterday benchmarking our A3400 quad-opteron with dual cores
>> using UnixBench 4.1, which is not really an SMP benchmark except for the
>> 8 and 16-concurrent shell script runs, and was not too impressed with
>> the speed increase of those runs either, judging by how much more the
>> CPUs cost.
>>
>> Compare A3140 (dual opteron 248 single core) with A3400 (quad opteron
>> 875 dual core):
>>
>>
>> A3150/raid5     dual opteron 248     8G     FC3     668     859     443
>> A1300     dual opteron 852     4G     FC3     806     964     497
>> A1300     dual opteron 875     4G     RHEL3u5     724     1329     744
>> A3400     quad opteron 875     32G     RHEL3u5     736     1691     1030
>
>
>Sorry for the messed up table. Here we go. Index Score is a compound of
>the weighted results of UnixBench 4.1; 8-scripts and 16-scripts are the
>specific results in lines per second achieved when running 8 resp. 16
>shell scripts concurrently, which are two partial tests of the benchmark
>suite.
>
>Machine      CPU               RAM  OS       Index Score  8-scripts lps  16-scripts lps
>A3150/raid5  dual opteron 248  8G   FC3      668          859            443
>A1300        dual opteron 852  4G   FC3      806          964            497
>A1300        dual opteron 875  4G   RHEL3u5  724          1329           744
>A3400        quad opteron 875  32G  RHEL3u5  736          1691           1030
>
>Michael Will
>
>
>