[Beowulf] Re: Lustre failover
Greg Keller Greg at keller.net
Wed Sep 10 08:08:27 PDT 2008
- Previous message: [Beowulf] Lustre failover
- Next message: [Beowulf] GPU boards and cluster servers
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
I've worked on a number of large-ish Lustre configs over the years, and all of them have been configured with Active/Active type mappings. There are a few issues being confused here:

1) Active/Active does not mean both OSS nodes are accessing the same LUNs at the same time. Each "pair" of OSS nodes can see the same group of OST LUNs on shared storage, but each OSS normally accesses only its half of them until an OSS dies.

2) Active/Active does not mean *automatic* failover. In all but 1 case I have worked on, the choice was made to have a rational, human-being-like creature decide whether the safest/fastest repair was to bring back the original OSS or to fail over the orphaned LUNs to their alternate OSS node. When the phone rings at 3am the "rational" and "human-being-like" descriptors are diminished, but that person is still smarter than most scripts at assessing the best response to a failure.

3) Automatic failover is completely doable if you can STONITH the failed node (Shoot The Other Node In The Head). With a good network-controlled power strip you can kill the failed node so it can't come back and continue writing to the OST LUNs it used to own (which confettis your 1s and 0s). Linux HA with heartbeats on serial and TCP/GbE is the most common approach to automation. Once the failed/suspect node is guaranteed not to make a surprise comeback, the OSTs it left behind will need to be started by the surviving OSS.

4) IPMI power control has just enough lag/inconsistency that the "shooting draw" between 2 functional OSS servers can result in BOTH servers (or neither) powering down.... Don't depend on it unless your IPMI implementation is ultra responsive and reliable. Make sure your script verifies it's "triple dog sure" the other node can't come back before taking over the abandoned OSTs.

**Shameful Plug** DataDirect has a whitepaper that demonstrates many Lustre failover best practices, complete with pictures etc., that is spot on from my experience.
Here's the link:
http://www.datadirectnet.com/resource-downloads/best-practices-for-architecting-a-lustre-based-storage-environment-download
*****

Cheers!
Greg

On Sep 10, 2008, at 8:05 AM, beowulf-request at beowulf.org wrote:

>>
>> With OST servers it is possible to have a load-balanced active/active
>> configuration.
>> Each node is the primary node for a group of OSTs, and the failover
>> node for other
> ...
>> Anyone done this on a production system?
>
> we have a number of HP's Lustre (SFS) clusters, which use
> dual-homed disk arrays, but in active/passive configuration.
> it works reasonably well.
>
>> Experiances? Comments?
>
> active/active seems strange to me - it implies that the bottleneck
> is the OSS (OST server), rather than the disk itself. and a/a means
> each OSS has to do more locking for the shared disk, which would seem
> to make the problem worse...
>
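
The ordering described in the reply above (each OSS serves only its own OSTs, and a survivor takes over only after the failed node is fenced and confirmed off) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Lustre or Linux HA code: `OssPair`, `fail_over`, the `power_off` and `is_powered_on` callbacks, and the node and OST names are all hypothetical, and a real script would drive an actual power controller and pause between status checks.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class OssPair:
    """Two OSS nodes that can both see the same OST LUNs on shared storage."""
    primary: Dict[str, str]                               # OST -> OSS that normally serves it
    active: Dict[str, str] = field(default_factory=dict)  # OST -> OSS serving it right now

    def __post_init__(self) -> None:
        if not self.active:
            # Normal operation: every OST is served by its primary OSS only.
            self.active = dict(self.primary)

    def osts_served_by(self, oss: str) -> List[str]:
        return sorted(ost for ost, owner in self.active.items() if owner == oss)


def fail_over(
    pair: OssPair,
    failed: str,
    survivor: str,
    power_off: Callable[[str], None],      # hypothetical power-control hook
    is_powered_on: Callable[[str], bool],  # hypothetical power-status hook
    checks: int = 3,
) -> List[str]:
    """Fence `failed`, confirm it is off, then hand its OSTs to `survivor`."""
    power_off(failed)
    # Do not trust a single reading: require `checks` consecutive "off" results.
    # (A real script would also wait between readings.)
    for _ in range(checks):
        if is_powered_on(failed):
            raise RuntimeError(f"{failed} is not confirmed off; refusing takeover")
    moved = pair.osts_served_by(failed)
    for ost in moved:
        pair.active[ost] = survivor
    return moved


# Demo with a fake power strip (a dict) standing in for real hardware.
pair = OssPair(primary={"ost0": "oss1", "ost1": "oss1", "ost2": "oss2", "ost3": "oss2"})
powered = {"oss1": True, "oss2": True}
moved = fail_over(
    pair, "oss1", "oss2",
    power_off=lambda node: powered.update({node: False}),
    is_powered_on=lambda node: powered[node],
)
print(moved)                        # ['ost0', 'ost1']
print(pair.osts_served_by("oss2"))  # ['ost0', 'ost1', 'ost2', 'ost3']
```

The point of the design is that the ownership table is touched only after the verification loop passes; if the power status ever reads "on", the function raises and the OSTs stay where they were, which is the safe outcome when a human still has to decide.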
