[Beowulf] SATA II - PXE+NFS - diskless compute nodes

Michael Will mwill at penguincomputing.com
Fri Dec 8 13:14:31 PST 2006


Geoff Jacobs wrote:
> Mark Hahn wrote:
>   
>> it's interesting that SAS advertising has obscured the fact that SAS is
>> just a further development of SCSI, and not interchangeable
>> with SATA.  for instance, no SATA controller will support any SAS disk,
>> and any SAS setup uses a form of encapsulation to communicate with
>> the foreign SATA protocol.  SAS disks follow the traditional price
>> formula of SCSI disks (at least 4x more than non-boutique disks),
>> and I suspect the rest of SAS infrastructure will be in line with that.
>>     
> Yes, SAS encapsulates SATA, but not vice-versa. The ability to use a
> hardware raid SAS controller with large numbers of inexpensive SATA
> drives is very attractive. I was also trying to be thorough.
>
>   
>>> and be mindful of reliability issues with desktop drives.
>>>       
>> I would claim that this is basically irrelevant for beowulf.
>> for small clusters (say, < 100 nodes), you'll be hitting a negligible
>> number of failures per year.  for larger clusters, you can't afford
>> any non-ephemeral install on the disks anyway - reboot-with-reimage
>> should only take a couple minutes more than a "normal" reboot.
>> and if you take the no-install (NFS root) approach (which I strongly
>> recommend) the status of node-local disks can be just a minor node
>> property to be handled by the scheduler.
>>     
> PXE/NFS is absolutely the slickest way to go, but any service nodes
> should have some guarantee of reliability. In my experience, disks
> (along with power supplies) are two of the most common points of failure.

Most of the clusters we configure for our customers use diskless compute
nodes to minimize compute node failures, for precisely the reason you
mention, unless either the application can benefit from additional local
scratch space (e.g. software RAID0 over four SATA drives lets a 1U server
read/write large data streams at 280MB/s with 3TB of disk space on each
compute node), or the customer sometimes needs to run jobs that require
more virtual memory than they can afford to put in physically, in which
case local disks provide swap space.
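
A quick back-of-the-envelope sketch of those scratch-space figures (the
per-drive streaming rate and capacity below are assumptions chosen for
illustration, not measured values):

    # Back-of-the-envelope arithmetic for striped (RAID0) local scratch space.
    # Per-drive figures are illustrative assumptions, not measurements.
    drives = 4                # SATA drives per 1U compute node, as above
    per_drive_mb_s = 70       # assumed sustained streaming rate per drive
    per_drive_tb = 0.75       # assumed capacity per drive (750GB)

    aggregate_mb_s = drives * per_drive_mb_s   # RAID0 stripes I/O across all drives
    total_tb = drives * per_drive_tb

    print(f"~{aggregate_mb_s} MB/s streaming, ~{total_tb:.1f} TB scratch per node")
    # -> ~280 MB/s streaming, ~3.0 TB scratch per node

With four spindles striped, per-drive numbers in roughly the 70MB/s and
750GB range are enough to reproduce the 280MB/s and 3TB quoted above.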

We find that customers typically don't want to pay the premium for
redundant power supplies, PDUs, and cabling on the compute nodes, though;
that is something usually requested only for head nodes and NFS servers.

Also, we find that NFS offloading on the NFS server with the RapidFile
card helps avoid scalability issues where the NFS server bogs down under
massively parallel requests from, say, the 128 cores of a 32-node
dual-CPU, dual-core cluster. The RapidFile card is a PCI-X card with two
Fibre Channel ports, two GigE ports, and an NFS/CIFS offload processor on
the same card. Since most bulk data transfer is redirected from Fibre
Channel to the GigE NFS clients without passing through the NFS server's
CPU and RAM, the server's CPU load does not become the bottleneck; in our
experience the limit is rather the number of spindles needed to saturate
the two GigE ports.
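
For a rough sense of the fan-in involved (the usable-GigE throughput
below is an assumed figure, roughly line rate minus protocol overhead):

    # Rough look at why a single NFS server bogs down under massively
    # parallel clients. Usable GigE throughput is an assumed figure.
    nodes, sockets_per_node, cores_per_socket = 32, 2, 2
    total_cores = nodes * sockets_per_node * cores_per_socket   # 128 requesters

    gige_ports = 2
    mb_s_per_port = 117                       # assumed usable GigE throughput
    server_mb_s = gige_ports * mb_s_per_port  # ~234 MB/s aggregate

    print(f"{total_cores} cores sharing ~{server_mb_s} MB/s "
          f"=> ~{server_mb_s / total_cores:.2f} MB/s per core if all stream at once")

Even with the protocol work offloaded, 128 streaming clients only have a
couple of MB/s each across two GigE ports; whether those ports can
actually be saturated then comes down to how many spindles sit behind
them, as noted above.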

We configure clusters for our customers with Scyld Beowulf, which,
because of its particularly lightweight compute node model (PXE booting
into RAM), does not NFS-mount root but only NFS-mounts the home
directories, and so does not run into the typical NFS-root scalability
issues.

Michael

Michael Will
SE Technical Lead / Penguin Computing / www.penguincomputing.com


