[Beowulf] programming multicore clusters
toon.knapen at fft.be
Sat Jun 16 05:36:17 PDT 2007
Greg Lindahl wrote:
> On Fri, Jun 15, 2007 at 01:49:49PM +0200, Toon Knapen wrote:
>> AFAICT this is not always the case. E.g. on systems with glibc, this
>> functionality (set_process_affinity and such) is only available starting
>> from libc-2.3.4.
> Nearly every statement about Linux is untrue at some point in the [...]
Indeed, that is true of every system still under active development.
But as I responded to Mark Hahn, there are still many Linux
distributions deployed that ship libc-2.3.3 or older. I assume your
products (I had a quick look but could not find the information
directly) also still support Linux distributions with libc-2.3.3 or
older.
>> E.g. you can obtain a big boost when running an
>> MPI-code where each process performs local dgemm's for instance by using
>> an OpenMP'd dgemm implementation. This is an example where running
>> mixed-mode makes a lot of sense.
> First off, I see people using *threaded* DGEMM, not OpenMP.
I did not differentiate between the two in my previous mail because to
me it is an implementation detail: both come down to using multiple
threads.
> I've never seen anyone show an actual benefit -- can you name an
> example? i.e. "for N=foo, I get a 13% speedup on..."
We have benchmarked our code against multiple BLAS implementations,
and so far GotoBLAS came out the clear winner. Next we tested GotoBLAS
with 1, 2, and 4 threads; depending on the linear solver (one of which
is http://graal.ens-lyon.fr/MUMPS/), we saw a speedup of between 30%
and 70% when using 2 or 4 threads.
The scalability of GotoBLAS with respect to the number of threads is
actually much better in isolation. But when it is integrated into a
solver, the speedup depends strongly on the size of the matrices passed
to BLAS: the larger, the better, of course.