[Beowulf] Glenn Lockwood's Thoughts on the NSF Future Directions Interim Report

Mon Feb 2 08:37:13 PST 2015

On 02/02/2015 08:38 AM, Michael Di Domenico wrote:
> Glenn's article is good and hits on many topics correctly (of which
> i've seen, having sat on the vendor side of NSF proposals in a former
> life).  However I'm a little concerned by what i perceive of his
> attitude towards stripping funding from centers that don't have the
> technical prowess to run an HPC resource.
>
> NSF's goal is to further science.  stripping funding, i don't believe
> is the correct solution.  if a center isn't keeping up or doesn't have
> the skills from the start, there should be a mentor put in place from
> one of the other bigger centers.  stripping funding is only going to
> shrink the pool of knowledge to a few key installations around the US,
> which probably isn't the best way to spread knowledge.  but i do
> concur there is a point where the NSF would probably/already has
> spread itself too thin
>
> seems to me NSF needs to get back into building the HPC community of
> PEOPLE rather than building hero machines at six or seven
> installations across the us.
>

I interpreted it differently. I think he was saying that the NSF funding 
for HPC should be concentrated in fewer sites, similar to what the DOE 
has done with their leadership computing facilities (LCFs): Argonne 
Leadership Computing Facility (ALCF) and Oak Ridge Leadership Computing 
Facility (OLCF). By concentrating their resources in fewer locations, 
they can take advantage of economies of scale:

1. Pay for two large data centers instead of 5 or 10
2. Hire a somewhat larger, but much more talented staff whose talents 
can be spread out over several clusters and storage systems rather than 
many smaller support staffs with (most likely) fewer capabilities for 
each site.

And on, and on.

By committing heavily to fewer sites, it's easier for the NSF to focus on 
providing a stable financial footing than to constantly spread the money 
around many different sites like they're broadcast seeding a lawn.

TL;DR: Put all your eggs into 2-3 baskets, and keep a really good eye on 
those baskets.

Regarding your comment about 'hero' systems: I read a paper a couple of 
years ago arguing that the large majority of computational scientists 
don't need 
these massive exascale systems - most only need a 'department'-sized 
cluster with ~1024 cores. I believe SDSC did their own study with XSEDE 
data and came to the same conclusion (Glenn actually told me this. I'm 
not sure if this is published anywhere).

This reminds me of 'The Long Tail' 
(http://en.wikipedia.org/wiki/Long_tail): The hero systems cater to the 
small percentage of extremely talented computational scientists at the 
top of their fields, and the long tail, which is your 'average' 
computational science PI or grad student at universities around the 
world, still has to rely on an antiquated small departmental cluster 
because the NSF focuses on the hero users to the detriment of the long 
tail, which actually represents the bulk of their funded scientists.