[Beowulf] CSharifi Next generation of HPC

Donald Becker becker at scyld.com
Tue Dec 4 14:00:52 PST 2007


[[[ Hmmmm, OK, I seem to have moderation-approved pretty much a repeat of 
a wide-spread posting.  So I'll answer with the response I was planning a 
few days ago. ]]]

On Tue, 4 Dec 2007, Ehsan Mousavi wrote:

> C-Sharifi Cluster Engine: The Second Success Story on "Kernel-Level
> Paradigm" for Distributed Computing Support
> 
> Contrary to two schools of thought in providing system software support for
> (like MPI, Kerrighed and Mosix), Dr. Mohsen Sharifi hypothesized another
> school of thought as his thesis in 1986 that believes all distributed
> systems software requirements and supports can be and must be built at the
> Kernel Level of existing operating systems;

In 1986 I had been working for a few years on shared memory systems with 
a hefty proportion of custom-designed hardware.

I learned from that experience.  That's why I now work on distributed 
memory systems based on off-the-shelf commodity hardware.

I also think that there are some important aspects of cluster 
infrastructure that (at present) can only be implemented by tweaking the 
kernel.  But most of the features to make a cluster easy to use don't need 
special kernel support, and indeed can't be implemented inside the kernel 
at all.

You might initially think "you can put any program inside the kernel, 
therefore you can do everything inside the kernel".  But as a 
counter-example consider name services.  Essentially all programs use the 
standard library interface to name services, which in turn uses the Name 
Service Switch.  You can add a bunch of really powerful features by 
using a cluster-specific name service.  And this can only be done by 
working with the existing user-level library code.  (Well, unless you 
build a new library within your kernel.)
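To make the name-service point concrete: a cluster-wide name service plugs in entirely at user level through the Name Service Switch configuration.  The `cluster` service name below is hypothetical, shown only to sketch the idea; a real module would be a user-level shared library (e.g. a libnss_cluster.so), not kernel code.

```
# /etc/nsswitch.conf -- "cluster" is a hypothetical NSS module name
# glibc tries each listed service in order for every lookup
hosts:    cluster files dns
passwd:   cluster files
group:    cluster files
```

Every program that goes through the standard library interface picks up the cluster-wide names automatically, with no kernel changes at all.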



This argument almost misses the main point:
Cluster systems exist to simplify the system for the end users.

When you think in terms of kernel modifications, most of the changes end 
up being tricks to prove to other developers how clever you are, not 
features that make the system easier to use (example: Plan 9). And most of 
the clever tricks end up getting in the way of the developer, rather than 
speeding up the application or really simplifying the programming model.

DSM / Distributed Shared Memory (which I prefer to call NVM, Network 
Virtual Memory) is a perfect example of this.  It certainly doesn't help 
the end user.  The only aspect an end user or system administrator sees is 
that NVM causes cascading system failures when one machine drops out of 
the cluster.

The programmer doesn't benefit either.  They initially 
think that NVM gives them an easy-to-use shared memory model.  They 
quickly find that it only appears to be normal memory.  To get even barely 
acceptable performance they have to treat the shared memory very 
differently than regular memory.  Variables written by different processes 
have to be segregated into different pages.  Writes have to be grouped.  You 
have to think about when to manually cache structures to avoid a re-read 
that might trigger a network page fault, but refresh that structure when 
you need potentially updated values.  Many independent attempts have 
concluded that most application ports take a long time to tune for NVM, 
and almost all end up using NVM as a stylized message passing mechanism.



-- 
Donald Becker				becker at scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com		www.scyld.com
Annapolis MD and San Francisco CA
