[Beowulf] Re: OT: informatics software for linux clusters

Joe Landman landman at scalableinformatics.com
Mon May 15 13:09:18 PDT 2006

David Mathog wrote:
>>    Scalable Informatics has released Scalable HMMer, an optimized 
>> version of HMMer 2.3.2 that is 1.6-2.5x faster per node on benchmark 
>> tests run on Opteron systems.
> Did you remove the memory organization changes SE put in to make
> it run better on the Altivec Macs?  Those really made life hard when I
> was trying to optimize this code to run

Hi Dave:

   We didn't start from the Altivec patch; that code sits inside a large 
#ifdef in fast_algorithms.c.  I didn't see memory organization changes 
in the non-Altivec code (though there was a comment about some issue 
with the Intel compilers).

   We started from the base p7Viterbi in fast_algorithms, and rewrote 
the loops a bit.
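   To give the flavor of what I mean by rewriting the loops, here is a 
toy sketch (my own names and a made-up max-plus recurrence, not the 
actual p7Viterbi code): hoisting the row indexing out of the inner loop 
so it walks two flat arrays sequentially.

```c
#define M 64  /* toy model length, nothing to do with HMMer's real layout */

/* Straightforward form: index the 2-D score array on every iteration. */
static int score_naive(int dp[2][M + 1], const int tr[M]) {
    for (int k = 1; k <= M; k++) {
        int s = dp[0][k - 1] + tr[k - 1];
        if (dp[0][k] > s)
            s = dp[0][k];
        dp[1][k] = s;
    }
    return dp[1][M];
}

/* Rewritten form: hoist the row pointers so the inner loop streams
   through two flat arrays -- same arithmetic, same results, but the
   addressing is simpler and the access pattern is cache-friendlier. */
static int score_hoisted(int dp[2][M + 1], const int tr[M]) {
    const int *prev = dp[0];  /* previous row, read-only  */
    int *cur = dp[1];         /* current row, write-only  */
    for (int k = 1; k <= M; k++) {
        int s = prev[k - 1] + tr[k - 1];
        if (prev[k] > s)
            s = prev[k];
        cur[k] = s;
    }
    return cur[M];
}
```

Nothing deep there, but on a DP table that barely fits in cache, that 
kind of change is where the time goes.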

> on our Beowulf with Athlon MP processors.  The problem was the
> P7Viterbi data structures didn't fit entirely into cache (no matter

   I was worried about cache thrashing with our changes, and still am. 
The code isn't complex, but the particulars of the original 
implementation weren't terribly cache-friendly.

> how it was organized) and this resulted in toxic query lengths that ran
> several times slower.  That is, take a query sequence
> of length 1000, run hmmpfam, nip off the last character, run it again,
> etc.  It was anything but a smooth function of execution time vs. query

Ohhh.... I would love a test like that.  Is this something that you 
found in general with the baseline code or with the Altivec'ed code? 
This would be very good to include in our regression testing...

> length.  Working around the Altivec stuffed helped some but didn't
> entirely eliminate the effect.  Probably the bigger cache on the
> Opteron would eliminate this effect for smaller sequences but I'm
> guessing you could still run into it with a long query.

We ran an 8000 letter query length as our longest test.  If you have 
some specific test cases which exercise bugs, please let me know what 
they are and I will see if we can use them.

> This has nothing to do with the Parallel implementation though, it
> was a data size vs. cache size effect.

That is an issue with this code.  Last I remember, the Athlon has a 
256 KB L2 and a 128 KB L1, so it is rather hard to keep much of the 
working set in cache.

Right now the big issue we are running into for another aspect of this 
project is the lack of a vector max/min function in SSE*.  (If anyone 
from AMD/Intel is listening, this is a *big* issue, and I even have a 
rough idea how to do it "quickly" in SSE at the expense of many SSE 
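registers.  To be precise about what's missing: SSE2 gives you pmaxsw 
for signed 16-bit lanes and maxps for floats, but there is no packed 
max for 32-bit integers until SSE4.1's pmaxsd.  The usual emulation is 
a compare-and-select, roughly like this (a sketch with my own helper 
name, not our production code):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Emulate a packed signed 32-bit max, which SSE2 lacks: compare to get
   a per-lane mask, then bitwise-select the winner lane by lane. */
static inline __m128i max_epi32_sse2(__m128i a, __m128i b) {
    __m128i gt = _mm_cmpgt_epi32(a, b);              /* 0xFFFFFFFF where a > b */
    return _mm_or_si128(_mm_and_si128(gt, a),        /* keep a where a > b     */
                        _mm_andnot_si128(gt, b));    /* keep b elsewhere       */
}
```

That is four instructions where a single pmaxsd would do, in the 
innermost loop, which is exactly the overhead I am complaining about.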


> Regards,
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615
