[Beowulf] Re: Lustre failover
Greg at keller.net
Wed Sep 10 08:08:27 PDT 2008
I've worked on a number of largish Lustre configs over the years, and
all of them have been configured with Active/Active-type mappings.
There are a few issues being confused here:
1) Active/Active does not mean both OSS nodes are accessing the same LUNs at
the same time. Each "pair" of OSS nodes can see the same group of OST
LUNs on shared storage, but each OSS normally accesses only its
half of them until an OSS dies.
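That pairing is set up at format time. A minimal sketch, assuming a hypothetical pair oss1/oss2 sharing OSTs over multipathed storage (all hostnames, NIDs, and device paths below are made-up examples, not from the original post):

```shell
# Each OST is formatted so clients know BOTH possible servers:
# oss1 is the normal server for this OST, oss2 the failover partner.
mkfs.lustre --ost --fsname=testfs --index=0 \
    --mgsnode=mgs@tcp0 \
    --failnode=oss2@tcp0 \
    /dev/mpath/ost0

# Normal operation: oss1 mounts ost0/ost1, oss2 mounts ost2/ost3.
# Running on oss1:
mount -t lustre /dev/mpath/ost0 /mnt/ost0
mount -t lustre /dev/mpath/ost1 /mnt/ost1
```

On failover, nothing needs reformatting: the surviving node simply mounts its dead partner's OST devices in addition to its own.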
2) Active/Active does not mean *automatic* failover. In all but one
case I have worked on, the choice was made to have a rational human-
being-like creature decide whether the safest/fastest repair was to bring
back the original OSS or to fail over the orphaned LUNs to their
alternate OSS node. When the phone rings at 3am the "rational" and
"human-being-like" descriptors are diminished, but still smarter than
most scripts at assessing the best response to a failure.
3) Automatic failover is completely doable if you can STONITH the
failed node (Shoot The Other Node In The Head). With a good network-
controlled power strip you can kill the failed node so it can't come
back and continue writing to the OST LUNs it used to own (which
turns your 1s and 0s into confetti). Linux-HA with heartbeats on serial and
TCP/GbE is the most common approach to automation. Once the failed/
suspect node is guaranteed not to make a surprise comeback, the OSTs
it left behind will need to be started by the surviving OSS.
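The fence-then-adopt sequence above can be sketched roughly as follows. This is a hedged example, not anyone's production script: "pdu-ctl" is a stand-in for whatever CLI your power strip vendor ships, and the outlet number and device paths are invented:

```shell
#!/bin/sh
# 1) Fence the suspect node at the power strip -- never via the node
#    itself, since a sick node can't be trusted to shoot itself.
pdu-ctl --outlet 7 off || exit 1          # refuse to continue if fencing fails
sleep 5
pdu-ctl --outlet 7 status | grep -q OFF || exit 1   # verify it really went dark

# 2) Only now adopt the orphaned OSTs on the survivor.
for ost in /dev/mpath/ost2 /dev/mpath/ost3; do
    mount -t lustre "$ost" "/mnt/$(basename "$ost")" || exit 1
done
```

The important design point is the ordering: the mounts happen only after power-off has been commanded *and* independently confirmed, so there is no window where both nodes can write to the same OST.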
4) IPMI power control has just enough lag/inconsistency that the
"shooting draw" between two functional OSS servers can result in BOTH
servers (or neither) powering down... don't depend on it unless your
IPMI implementation is ultra responsive and reliable. Make sure your
script verifies it's "triple-dog sure" the other node can't come back
before taking over the abandoned OSTs.
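One way to be "triple-dog sure" with a laggy BMC is to require several consecutive consistent readings before trusting the power state. A sketch, assuming `ipmitool` (real tool, real `chassis power status` subcommand) and a placeholder BMC hostname and credentials:

```shell
#!/bin/sh
# Poll the dead node's BMC; require three consecutive "off" readings,
# a few seconds apart, before declaring it safe to take over its OSTs.
confirmed=0
for i in 1 2 3 4 5; do
    state=$(ipmitool -H bmc-oss1 -U admin -P secret chassis power status)
    case "$state" in
        *off*) confirmed=$((confirmed + 1)) ;;
        *)     confirmed=0 ;;     # any "on" or garbled reading resets the count
    esac
    [ "$confirmed" -ge 3 ] && break
    sleep 5
done
[ "$confirmed" -ge 3 ] || exit 1  # not sure enough: do NOT take over the OSTs
```

A single reading from an inconsistent IPMI implementation is exactly the kind of answer that loses the shooting draw; the reset-on-any-doubt counter errs on the side of not mounting.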
**Shameless Plug** DataDirect has a whitepaper that demonstrates many
Lustre failover best practices, complete with pictures etc., that is spot-
on in my experience. Here's the link:
On Sep 10, 2008, at 8:05 AM, beowulf-request at beowulf.org wrote:
>> With OST servers it is possible to have a load-balanced active/active
>> configuration. Each node is the primary node for a group of OSTs, and the
>> failover node for others.
>> Anyone done this on a production system?
> we have a number of HP's Lustre (SFS) clusters, which use
> dual-homed disk arrays, but in active/passive configuration.
> it works reasonably well.
>> Experiences? Comments?
> active/active seems strange to me - it implies that the bottleneck
> is the OSS (OST server), rather than the disk itself. and a/a means
> each OSS has to do more locking for the shared disk, which would seem
> to make the problem worse...