[Beowulf] Re: Lustre failover

Greg Keller Greg at keller.net
Wed Sep 10 08:08:27 PDT 2008

I've worked on a number of largish Lustre configs over the years, and
all of them have been configured with Active/Active-type mappings.
There are a few issues being confused here:

1) Active/Active does not mean both OSS nodes are accessing the same
LUNs at the same time.  Each "pair" of OSS nodes can see the same
group of OST LUNs on shared storage, but each OSS normally accesses
only its half of them until an OSS dies.
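As a sketch of the ownership scheme in (1) — the node names and OST indices here are made up for illustration, not from any real config — each OST LUN has exactly one primary OSS, with the pair partner listed only as the failover owner:

```python
# Hypothetical active/active layout: an OSS pair shares *visibility* of
# all four OST LUNs, but each LUN has a single primary owner at a time.
OST_MAP = {
    "OST0000": {"primary": "oss1", "failover": "oss2"},
    "OST0001": {"primary": "oss1", "failover": "oss2"},
    "OST0002": {"primary": "oss2", "failover": "oss1"},
    "OST0003": {"primary": "oss2", "failover": "oss1"},
}

def active_owner(ost, dead_nodes):
    """Return which OSS should be serving this OST, given the set of dead nodes."""
    entry = OST_MAP[ost]
    if entry["primary"] not in dead_nodes:
        return entry["primary"]
    if entry["failover"] not in dead_nodes:
        return entry["failover"]
    return None  # both members of the pair are down

# Normal operation: each OSS serves only its own half of the LUNs.
print(active_owner("OST0000", set()))      # oss1
# After oss1 dies, its LUNs migrate to the surviving partner.
print(active_owner("OST0000", {"oss1"}))   # oss2
```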

2) Active/Active does not mean *automatic* failover.  In all but one
case I have worked on, the choice was made to have a rational
Human-Being-like creature decide whether the safest/fastest repair was
to bring back the original OSS or to fail over the orphaned LUNs to
their alternate OSS node.  When the phone rings at 3am the "rational"
and "human-being-like" descriptors are diminished, but still smarter
than most scripts at assessing the best response to a failure.

3) Automatic failover is completely doable if you can STONITH the
failed node (Shoot The Other Node In The Head).  With a good
network-controlled power strip you can kill the failed node so it
can't come back and continue writing to the OST LUNs it used to own
(which turns your 1s and 0s into confetti).  Linux-HA with heartbeats
on serial and TCP/GbE is the most common approach to automation.  Once
the failed/suspect node is guaranteed not to make a surprise comeback,
the OSTs it left behind will need to be started by the surviving OSS.
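A minimal sketch of the fence-before-takeover ordering in (3).  The `pdu-ctl` command is a placeholder for whatever your PDU vendor ships, and the mount step stands in for starting the OSTs on the survivor (`mount -t lustre` on the real OST devices); the fencing and start hooks are injectable so the ordering itself can be exercised without hardware:

```python
import subprocess

def fence_node(node):
    """Cut power to the suspect node via a network-controlled power strip.
    'pdu-ctl' is a hypothetical CLI; substitute your PDU's actual tool."""
    return subprocess.run(["pdu-ctl", "off", node]).returncode == 0

def take_over_osts(dead_node, osts, fence=fence_node, start_ost=None):
    """Start the orphaned OSTs only AFTER the dead node is confirmed fenced."""
    if not fence(dead_node):
        # If we can't guarantee the peer is gone, never touch its LUNs:
        # two nodes writing the same OSTs shreds the data.
        return []
    started = []
    for ost in osts:
        if start_ost is not None:
            start_ost(ost)  # e.g. mount -t lustre /dev/<ost-dev> /mnt/<ost>
        started.append(ost)
    return started

# With a fake fence hook, takeover proceeds only when fencing succeeds:
print(take_over_osts("oss1", ["OST0000", "OST0001"], fence=lambda n: True))
print(take_over_osts("oss1", ["OST0000", "OST0001"], fence=lambda n: False))
```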

4) IPMI power control has just enough lag/inconsistency that the
"shooting draw" between two functional OSS servers can result in BOTH
servers (or neither) powering down... don't depend on it unless your
IPMI implementation is ultra-responsive and reliable.  Make sure your
script verifies it's "triple dog sure" the other node can't come back
before taking over the abandoned OSTs.
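One common way to defuse the mutual "shooting draw" in (4) is a static tie-break (only one node of the pair shoots immediately) plus repeated power-status reads before takeover.  This is a hedged sketch under my own assumptions — the check count, delay, and ordering rule are illustrative, not prescriptions from the post, and `power_status` stands in for an IPMI/PDU query:

```python
import time

def i_shoot_first(my_node, peer_node):
    """Static tie-break: only the lexically lower-named node fences right
    away; the other waits out a delay.  A symmetric failure then can't
    power down BOTH servers at once."""
    return my_node < peer_node

def peer_confirmed_dead(power_status, checks=3, delay=0.01):
    """Be 'triple dog sure': require several consecutive 'off' readings
    before taking over the abandoned OSTs.  power_status() would wrap
    an IPMI chassis-power or PDU status query."""
    for _ in range(checks):
        if power_status() != "off":
            return False
        time.sleep(delay)
    return True

# Exactly one member of the pair wins the draw:
print(i_shoot_first("oss1", "oss2"))          # True
print(i_shoot_first("oss2", "oss1"))          # False
# A single flaky 'on' reading blocks the takeover:
print(peer_confirmed_dead(lambda: "off"))     # True
print(peer_confirmed_dead(lambda: "on"))      # False
```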

**Shameful Plug** DataDirect has a whitepaper that demonstrates many
Lustre failover best practices, complete with pictures etc., that is
spot on in my experience.  Here's the link:


On Sep 10, 2008, at 8:05 AM, beowulf-request at beowulf.org wrote:

>> With OST servers it is possible to have a load-balanced active/active
>> configuration.
>> Each node is the primary node for a group of OSTs, and the failover
>> node for other
> ...
>> Anyone done this on a production system?
> we have a number of HP's Lustre (SFS) clusters, which use
> dual-homed disk arrays, but in active/passive configuration.
> it works reasonably well.
>> Experiences? Comments?
> active/active seems strange to me - it implies that the bottleneck
> is the OSS (OST server), rather than the disk itself.  and a/a means
> each OSS has to do more locking for the shared disk, which would seem
> to make the problem worse...

More information about the Beowulf mailing list