[Beowulf] Infiniband: MPI and I/O?

Greg Keller greg at keller.net
Thu May 26 15:30:52 PDT 2011



On 5/26/2011 4:23 PM, Mark Hahn wrote:
>> Agreed.  Just finished telling another vendor, "It's not high-speed
>> storage unless it has an IB/RDMA interface."  They love that.  Except
>
> What does RDMA have to do with anything?  Why would straight 10G Ethernet
> not qualify?  I suspect you're really saying that you want an efficient
> interface, as well as enough bandwidth, but that doesn't necessitate RDMA.
>
RDMA over IB is definitely a nice feature.  Not strictly required, but IP
over IB has enough limitations that we prefer to avoid it.
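(For context on the IPoIB limits: a lot depends on whether the interface is
running in datagram or connected mode, since that caps the MTU.  A trivial
way to check -- a sketch only, assuming Linux sysfs and an interface named
"ib0":)

    # Sketch: inspect IPoIB mode and MTU via sysfs.  Assumes Linux and an
    # interface named "ib0".  In datagram mode the payload MTU is capped by
    # the IB MTU (~2044 bytes); connected mode allows up to 65520, which
    # makes a large difference to IPoIB throughput.
    def ipoib_info(iface="ib0"):
        base = "/sys/class/net/%s/" % iface
        mode = open(base + "mode").read().strip()   # "datagram" or "connected"
        mtu = int(open(base + "mtu").read())
        return mode, mtu

    mode, mtu = ipoib_info()
    print("ib0: mode=%s, mtu=%d" % (mode, mtu))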
>> for some real edge cases, I can't imagine running I/O over GbE for
>> anything more than trivial I/O loads.
>
> Well, it's a balance issue.  If someone were using lots of Atom boards
> lashed into a cluster, 1 Gb apiece might be pretty reasonable.  But for
> fat nodes (say, 48 cores), even one QDR IB pipe doesn't seem all that
> generous.
>
> As an interesting case in point, SeaMicro was in the news again with a
> 512-Atom system: either 64 1Gb links or 16 10G links.  The former
> (0.125 Gb/core) seems low even for Atoms, but ~0.31 Gb/core might be
> reasonable.
>
Agreed.
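The back-of-envelope math, for anyone checking (a trivial sketch; link
counts taken from Mark's numbers above):

    # Per-core bandwidth for the 512-core SeaMicro box quoted above.
    cores = 512
    for label, links, gbit in [("64 x 1GbE", 64, 1), ("16 x 10GbE", 16, 10)]:
        total = links * gbit
        print("%s: %3d Gb aggregate, %.4f Gb/core" % (label, total, total / float(cores)))
    # -> 64 x 1GbE:   64 Gb aggregate, 0.1250 Gb/core
    # -> 16 x 10GbE: 160 Gb aggregate, 0.3125 Gb/core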
>> I am curious if anyone is doing I/O over IB to SRP targets or some
>> similar "block device" approach.  The integration into the filesystem by
>> Lustre/GPFS and others may be the best way to go, but we are not 100%
>> convinced yet.  Any stories to share?
>
> You mean you _like_ block storage?  How do you make a shared FS namespace
> out of it, manage locking, etc.?
Well, it's a use-case issue for us.  You don't make a shared FS on the
block devices (well, maybe you could, just not in a scalable way)... but
we envision leasing block devices to customers with known
capacity/performance characteristics.  Then the customer can decide
whether to use them as a CIFS/NFS backend through a single server,
possibly even lashed together via MD, or to lease multiple block devices
and build a Lustre-type system on top.
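To make the MD case concrete, here's roughly what we've been playing
with -- a sketch only; every GUID/DGID/device name below is a made-up
placeholder (in practice ibsrpdm/srp_daemon hands you the real target
strings):

    # Sketch: log in to two leased SRP targets via the kernel initiator's
    # sysfs interface, then stripe the resulting disks with md and put a
    # local FS on top for an NFS/CIFS front end.  HCA name/port and all
    # target identifiers below are hypothetical.
    import subprocess

    ADD_TARGET = "/sys/class/infiniband_srp/srp-mlx4_0-1/add_target"
    targets = [
        "id_ext=200100A0B8100049,ioc_guid=00A0B8100049,"
        "dgid=fe800000000000000002c9030001a2b1,pkey=ffff,service_id=200100A0B8100049",
        "id_ext=200100A0B810004A,ioc_guid=00A0B810004A,"
        "dgid=fe800000000000000002c9030001a2b2,pkey=ffff,service_id=200100A0B810004A",
    ]
    for t in targets:
        open(ADD_TARGET, "w").write(t)   # LUNs show up as ordinary SCSI disks

    # Assume the two LUNs came up as /dev/sdb and /dev/sdc.
    subprocess.check_call(["mdadm", "--create", "/dev/md0", "--level=0",
                           "--raid-devices=2", "/dev/sdb", "/dev/sdc"])
    subprocess.check_call(["mkfs.xfs", "/dev/md0"])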

The flexibility is that if they disappear and come back they may not get
the same compute/storage nodes, but they can attach any server to their
dedicated block-storage devices.  There are also some multi-tenancy
security options that can be handled more definitively if they have
absolute control over a block device.  So in this case, they would
semi-permanently lease the block devices, and then fire up front-end
storage nodes and compute nodes on an "as needed / as available" basis
anywhere in our compute farm.  Effectively we get the benefits of a
massive Fibre Channel-style SAN over the IB infrastructure we already
have to every node.  If we can get the performance and cost of the block
storage right, it will be compelling for some of our customers.
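(As an example of the kind of multi-tenancy handle I mean -- and this is
one option, not a settled design -- IB partitions: give each tenant its
own pkey so its traffic is fenced at the fabric level.  A port's
partition membership is visible straight from sysfs; a quick sketch, with
the HCA name and port number assumed:)

    # Sketch: list the IB partition keys (pkeys) a port belongs to.
    # HCA "mlx4_0" and port 1 are assumptions; unused table slots read 0.
    import os

    pkey_dir = "/sys/class/infiniband/mlx4_0/ports/1/pkeys"
    for idx in sorted(os.listdir(pkey_dir), key=int):
        val = open(os.path.join(pkey_dir, idx)).read().strip()
        if val != "0x0000":
            print("pkey[%s] = %s" % (idx, val))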

We are still prototyping how it would work and characterizing 
performance options...  but it's interesting to us.

Cheers!
Greg
>
> regards, mark hahn.


