[Beowulf] programming multicore clusters

Sat Jun 16 05:36:17 PDT 2007

Greg Lindahl wrote:
> On Fri, Jun 15, 2007 at 01:49:49PM +0200, Toon Knapen wrote:
> 
>> AFAICT this is not always the case. E.g. on systems with glibc, this 
>> functionality (set_process_affinity and such) is only available starting 
>> from libc-2.3.4.
> 
> Nearly every statement about Linux is untrue at some point in the
> past.

Indeed, this is true for every system that is still in development.
But as I responded to Mark Hahn, there are still many Linux 
distributions deployed that ship libc-2.3.3 or older. I guess your 
products (I had a quick look but could not find the info directly) 
also still support Linux distributions with libc-2.3.3 or older.
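
To make the version dependency concrete, here is a minimal sketch of the 
kind of check a code has to carry around if it wants to use affinity 
only where it is available. The 2.3.4 threshold is the one I mentioned 
above; the helper names (parse_version, glibc_supports_affinity) are 
made up for illustration and are not part of any library.

```python
import platform

# Threshold taken from the discussion above: the affinity calls are
# assumed to be usable only from glibc 2.3.4 onwards.
MIN_GLIBC_FOR_AFFINITY = (2, 3, 4)


def parse_version(text):
    """Turn a dotted version string such as '2.3.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def glibc_supports_affinity(version_text):
    """Hypothetical helper: True if the glibc version meets the threshold."""
    return parse_version(version_text) >= MIN_GLIBC_FOR_AFFINITY


if __name__ == "__main__":
    # platform.libc_ver() returns a (library name, version) pair of strings.
    lib, version = platform.libc_ver()
    if lib == "glibc" and version:
        print(version, glibc_supports_affinity(version))
    else:
        print("not a glibc system, or version could not be determined")
```

On a system that fails this check, the application simply has to run 
without pinning its processes.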

> 
>> E.g. you can obtain a big boost when running an 
>> MPI-code where each process performs local dgemm's for instance by using 
>> an OpenMP'd dgemm implementation. This is an example where running 
>> mixed-mode makes a lot of sense.
> 
> First off, I see people using *threaded* DGEMM, not OpenMP. 

I did not differentiate between these two in my previous mail because to 
me it's an implementation issue. Both come down to using multiple threads.

> Second,
> I've never seen anyone show an actual benefit -- can you name an
> example? i.e. "for N=foo, I get a 13% speedup on..."

We have benchmarked our code with several BLAS implementations, and so 
far GotoBLAS has come out as the clear winner. Next we tested GotoBLAS 
with 1, 2 and 4 threads. Depending on the linear solver (one of which 
is http://graal.ens-lyon.fr/MUMPS/), we saw a speedup of between 30% 
and 70% when using 2 or 4 threads.
The scalability of GotoBLAS itself with respect to the number of threads 
is actually much better than that. But once it is integrated in a solver, 
the speedup depends strongly on the size of the matrices being passed 
to BLAS: the larger, the better.
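
One rough way to see why the solver gains less than BLAS alone is an 
Amdahl-style estimate: only the fraction of the run time spent inside 
BLAS benefits from the extra threads. The sketch below is just that 
model. The function name and the fractions in the loop are illustrative 
assumptions, not measurements from our benchmarks.

```python
def overall_speedup(blas_fraction, blas_speedup):
    """Amdahl-style estimate of the whole-solver speedup.

    blas_fraction: share of single-threaded run time spent in BLAS (0..1).
    blas_speedup:  factor by which the BLAS part alone gets faster.
    """
    if not 0.0 <= blas_fraction <= 1.0:
        raise ValueError("blas_fraction must be between 0 and 1")
    if blas_speedup <= 0.0:
        raise ValueError("blas_speedup must be positive")
    # The non-BLAS part is unchanged; the BLAS part shrinks by blas_speedup.
    return 1.0 / ((1.0 - blas_fraction) + blas_fraction / blas_speedup)


if __name__ == "__main__":
    # Hypothetical BLAS shares, assuming BLAS itself runs 2x faster.
    for fraction in (0.25, 0.5, 0.75, 1.0):
        print(fraction, overall_speedup(fraction, 2.0))
```

Larger matrices push more of the run time into BLAS, which is why the 
overall figure moves closer to what BLAS achieves on its own.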

toon