[Beowulf] programming multicore clusters

Mon Jun 18 17:00:50 PDT 2007

> Indeed, this is true for every system that is still in development.
> But as I responded to Mark Hahn, there are still many linux 
> distributions deployed that have libc-2.3.3 or older. I guess your 
> products (I had a quick look but could not find the info directly) are 
> also still supporting linux distributions with libc-2.3.3 or older.

My memory is that older versions of x86_64 libc have a different set
of affinity functions (a different number of arguments). PathScale
supported both.
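
For reference, and hedged accordingly: as far as I can tell from the
sched_setaffinity(2) man page, the glibc wrapper dropped its cpusetsize
argument in 2.3.3 and got it back in 2.3.4, which matches the "different
number of arguments" recollection above. The sketch below only shows what
pinning looks like from a caller's point of view, using Python's os module
on Linux. The helper name pin_to_cpu is made up for illustration, and this
is not how PathScale (or any BLAS) does it.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling process to a single CPU and return the new affinity set.

    Returns None on platforms without the Linux affinity calls.
    pin_to_cpu is a hypothetical helper name, not part of any library.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # affinity calls are Linux-only in the os module

    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    if cpu not in allowed:
        raise ValueError("CPU %d is not in the allowed set %s" % (cpu, sorted(allowed)))

    os.sched_setaffinity(0, {cpu})
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    if hasattr(os, "sched_getaffinity"):
        first = min(os.sched_getaffinity(0))
        print("pinned to:", pin_to_cpu(first))
    else:
        print("no affinity support on this platform")
```

A threaded BLAS or an MPI launcher would do the equivalent per thread or
per rank; the point is only that the caller names a set of CPUs and the
kernel restricts scheduling to that set.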

> >First off, I see people using *threaded* DGEMM, not OpenMP. 
> 
> I did not differentiate between these two in my previous mail because to 
> me it's an implementation issue. Both come down to using multiple threads.

It's extremely inconvenient to express an efficient DGEMM in OpenMP,
just like it's pretty inconvenient to express an efficient serial DGEMM.
So you won't find anyone using an OpenMP DGEMM. You can call
everything in the universe an implementation issue if you like.

> We have benchmarked our code with using multiple BLAS implementations 
> and so far GotoBLAS came out as a clear winner. Next we tested GotoBLAS 
> using 1,2 and 4 threads and depending on the linear solver (of which one 
> is http://graal.ens-lyon.fr/MUMPS/) we had a speedup of between 30% and 
> 70% when using 2 or 4 threads.

Sorry, did you compare against a pure MPI implementation? For example
the HPL code can run either way, so it's easy to compare. But if
you're comparing a serial code to a threaded code, it's no surprise
that the threaded code can be faster, especially when solving a problem
which is not memory-intensive. In fact, I'd expect an even bigger win
than 1.7X; perhaps you aren't using Opterons ;-)
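
To put numbers on that, here is a small sketch that turns the quoted
30% to 70% gains (speedups of 1.3X and 1.7X over the serial run) into
parallel efficiency for 2 and 4 threads. The function name is made up for
illustration, and "efficiency" here is just speedup divided by thread
count, assuming the serial run is the baseline.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear scaling achieved: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


if __name__ == "__main__":
    # Speedups taken from the quoted message: 30% and 70% faster than serial.
    for speedup in (1.3, 1.7):
        for threads in (2, 4):
            eff = parallel_efficiency(speedup, threads)
            print("%.1fX on %d threads: efficiency %.3f" % (speedup, threads, eff))
```

Even the best case, 1.7X on 2 threads, is 0.85 of linear scaling, and 1.7X
on 4 threads is only 0.425. A pure MPI run on the same cores is the
comparison that would show whether threading is the better use of them.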

-- greg