[Beowulf] Stroustrup regarding multicore

Tim Cutts tjrc at sanger.ac.uk
Thu Aug 28 08:43:11 PDT 2008


On 27 Aug 2008, at 11:47 am, Vincent Diepeveen wrote:

>> There is an unwritten recruitment rule, certainly in my field of  
>> science, that the programmer "must understand the science", and  
>> actually being able to write good code is very much a secondary  
>> requirement.
>
> I couldn't disagree more.

Actually, I think we're furiously agreeing with each other.

> Maybe your judgement is not objective.

I did say "in my field of science", which is bioinformatics.

>
> For any serious software, let's be objective: only a few people will
> learn to program really well and manage to find their way around
> complex code. That typically takes ten years or so to learn. Someone
> with a PhD or a Master's in some science is usually capable of
> explaining and understanding things.

>
> Becoming a very good low-level programmer is a lot harder than
> learning a few more algorithms that can solve a specific problem.
> Understanding how to program efficiently in parallel is especially
> not easy. Last week I spoke with a guy who examined some code that
> was used on supercomputers in the 90s, and he concluded it was very
> inefficient.

Wasn't that exactly my point?  That what we need to be hiring are very
smart scientists who can design the analyses (check - we've got
those), and then a smaller number of very smart programmers who can
actually implement them efficiently.  And that's precisely what the
bioinformatics world hasn't been doing much of up until now; it has
just hired the very smart scientists, and not bothered with the
programmers.

>
> A single good low-level programmer can quite often speed things up by
> a factor of 50.

Indeed.  There have been several cases where I've improved scientists'  
code by a factor of 100 or more.  My record so far is seven orders of  
magnitude, for a widely used piece of code from a certain high-profile  
bioinformatics centre in the US.
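
To give a flavour of where that sort of factor typically comes from -
this is a made-up Python sketch, not the actual code in question - the
big wins are nearly always algorithmic rather than clever
micro-optimisation, e.g. replacing a repeated linear scan with a hash
lookup:

    # Hypothetical example: annotate sequence IDs with descriptions.
    # Both functions return the same result; only the lookup differs.

    def annotate_slow(ids, records):
        # O(n*m): rescans the whole record list for every query ID.
        out = []
        for seq_id in ids:
            for rec_id, desc in records:
                if rec_id == seq_id:
                    out.append((seq_id, desc))
                    break
        return out

    def annotate_fast(ids, records):
        # O(n+m): build a dict once, then do constant-time lookups.
        index = dict(records)
        return [(i, index[i]) for i in ids if i in index]

On a few hundred thousand records that change alone can easily be two
orders of magnitude.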

> In his opinion: "getting a speedup of less than a factor of 1000 in
> scientific number-crunching software is something you can do with
> your eyes closed".

A slight exaggeration, I think, but if you said a factor of 100  
instead, I'd agree completely.

> Knowing everything about efficient caching and hashing, and how to
> divide that over the nodes without paying the full latency or losing
> a factor of 50+ just to MPI messaging, is simply a full-time
> expertise in itself. Far fewer people can do that than can explain
> the field's science to you.

Agreed.
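
As an aside, and purely to illustrate that MPI point with a made-up
mpi4py sketch (nothing to do with any particular code discussed here):
the big losses usually come from paying the full interconnect latency
on every tiny message, and batching the work amortises most of it.

    # Toy producer/consumer on two ranks; run with: mpirun -n 2 python demo.py
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    items = list(range(100000))

    if rank == 0:
        # Naive approach: one message per item = 100000 latencies.
        #   for item in items:
        #       comm.send(item, dest=1, tag=0)
        # Batched: one message per 10000 items = a handful of latencies.
        batch = 10000
        for i in range(0, len(items), batch):
            comm.send(items[i:i + batch], dest=1, tag=0)
        comm.send(None, dest=1, tag=0)      # sentinel: no more data
    elif rank == 1:
        while True:
            chunk = comm.recv(source=0, tag=0)
            if chunk is None:
                break
            # ... do the real work on the whole chunk locally ...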

> Note that bio-informatics is a bad example to mention; it eats a
> grand total of < 0.5% of system time on supercomputers, and even that
> time is hardly used efficiently. There is just not much to calculate
> there compared to maths, physics and everything to do with the
> weather, from the climate X years from now to earthquake prediction.

I think here you're showing a little old-time HPC bias.  There *is* a  
lot to calculate in bioinformatics, but what's different about it,  
compared to the other fields that you mention, is the ratio of  
computation to data.  Most bioinformatics code is very data intensive,  
and that's why it runs so badly on traditional HPC rigs (and as a  
result you see so little time devoted to it on supercomputers). It's  
actually a result of exactly what I initially said in this thread; the  
computations are out there to do, but bioinformatics sites have not  
generally been hiring people capable of writing code which runs well  
on traditional HPC systems.

> Physics by itself eats 50%+ of all supercomputer time.

Most large bioinformatics sites have their own clusters and don't use
the large national facilities; that's part of the reason you don't see
them there.  Our sites tend not to appear in the Top 500 either, not
because they aren't powerful enough to be in there, but because they're
built differently and don't usually run the required benchmarks that
well, mostly because we lack low-latency interconnects.

The large quantities of data with, as you say, relatively modest
amounts of CPU make it impractical to use the national centres, because
we just get swamped by the overhead of shipping data into and out of
the facility.  We have that problem badly enough within our own LAN,
let alone shipping stuff out to remote sites.  :-)  For example, a
single run on a current-technology sequencer is about 2TB of data,
which then requires of the order of 24 CPU hours to process.  The
sequencer then produces another chunk of data the same size two days
later.  It just wouldn't be practical to shunt all of that offsite for
processing.
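
To put some rough numbers on that (assumed link speeds, purely for
illustration):

    # Back-of-envelope: time to move one ~2 TB run vs time to process it.
    run_bytes = 2e12        # ~2 TB per sequencing run
    cpu_hours = 24          # rough processing cost quoted above

    for gbit_per_s in (10, 1, 0.1):
        hours = run_bytes * 8 / (gbit_per_s * 1e9) / 3600
        print(f"{gbit_per_s:>4} Gbit/s: {hours:5.1f} h transfer "
              f"vs ~{cpu_hours} h compute")

Even on a dedicated 1 Gbit/s link the transfer alone is around 4-5
hours each way, and at 100 Mbit/s it takes longer than the processing
itself - and there's another run right behind it.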

Tim




