[Beowulf] CCL:Question regarding Mac G5 performance

Joe Landman landman at scalableinformatics.com
Mon May 24 15:40:45 PDT 2004


Konstantin Kudin wrote:

>--- Joe Landman <landman at scalableinformatics.com> wrote:
>
>>>It is unlikely that one can gain much speed from going to 64 bits,
>>>but the support for larger memory and unlimited scratch files is
>>>very worthwhile in itself.
>>
>>I have seen in md43, moldy, and a few others, about 20-30% under gcc
>>recompilation with -m64 on Opteron.  For informatics codes it was
>>about the same.
>
> Well, bioinformatics codes presumably run mostly integer operations.
>This is very different from heavy floating point calculations in G03
>which are double precision in all architectures and run with
>approximately the same speed regardless of 32/64 bitness.

md43 and moldy are molecular dynamics codes.  I had been thinking of 
running some tests with GAMESS or some other codes, just to get a sense 
of how electronic structure codes do on the system.  For computationally 
intensive codes, going to 64 bits on the Opteron gets you

a) double the number of general purpose registers, and
b) double the number of SSE registers.

This means that the optimizer, when it is dealing with code under heavy 
register pressure, can do a better job of scheduling those resources.  
It also means that some codes may be able to keep more instructions in 
flight per cycle because of the added resources.  The address space is 
also flat, as compared to the segmented space of 32-bit mode.  It is 
most definitely not a simple case of a 32 vs 64 bit address space: that 
advantage is there, but it is not the only one.

One of the interesting side effects of NUMA systems has to do with 
memory bandwidth per CPU as you increase the number of CPUs in a node.  
In a correctly populated system (i.e. memory evenly distributed among 
the CPUs), each CPU has full bandwidth to its local memory, plus one 
additional latency hop to remote memory on the same node.  If you stack 
all the memory on a single CPU (as I have seen many folks do before 
running benchmarks and reporting their results), every CPU shares that 
one memory controller's bandwidth.  In that case you get the sort of 
results we see occasionally reported here.  Similar results occur if you 
have a kernel (say an ancient one like 2.4.18) that doesn't know much 
about NUMA and the related scheduling.


>>Your mileage will vary of course, but I expect with Gaussian and
>>others that overflow memory, the overall system design will be as
>>important to the overall performance as the CPU architecture (if not
>>more so), unless you can somehow isolate the computation to never
>>spill to disk.
>
> With G03, some types of jobs will mostly be compute bound, and others
>will be mostly I/O bound. This is reasonably trivial to predict
>beforehand. I've tested jobs which were compute bound because testing
>the other side of the equation is more difficult due to more factors.
>
> For I/O bound jobs a box with loads of RAM and fast sequential I/O is
>the best. Something like a dual-quad Opteron with 16-32 GB of RAM and
>2-4 ATA disks with RAID0 (striping) is a good choice these days.

:)  I might suggest the 3ware folks for their controllers.  Just pick 
your file system and stripe width carefully.

Joe

