[Beowulf] Re: IBRIX Experiences (Wally Edmondson)

Naveed Near-Ansari naveed at caltech.edu
Mon Jun 4 15:08:07 PDT 2007


On Sat, 2007-06-02 at 20:47 -0700, beowulf-request at beowulf.org wrote:

> Date: Fri, 01 Jun 2007 11:24:40 -0400
> From: Wally Edmondson <Wally-Edmondson at utc.edu>
> Subject: Re: [Beowulf] IBRIX Experiences
> To: beowulf at beowulf.org
> Message-ID: <46603A38.3060006 at utc.edu>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> On Thu, 10 May 2007, Ian Reynolds wrote:
> 
>  > Hey all -- we're considering IBRIX for a parallel storage cluster
>  > solution with an EMC Clarion CX3-20 at the center, as well as a handful
>  > of storage servers -- total of roughly 40 client servers, mix of 32 and
>  > 64 bit OSs.
>  >
>  > Can anyone offer their experiences with IBRIX, good or bad? We have
>  > worked with gpfs extensively, so any comparisons would also be helpful.
> 
> It looks like you aren't getting many answers to your question, Ian.  I'll quickly share 
> my IBRIX experiences.  I have been running IBRIX since late 2004 on around 540 
> diskless clients and 50 regular servers and workstations with 8 segment servers and a 
> Fusion Manager connected to a DDN S2A 3000 couplet with 20TB of usable storage.  The 
> storage is 1Gb FibreChannel to the Segment Servers and it's non-bonded GigE for 
> everything else.
> 
> I'll start with the bad, I guess.  We had our share of problems with the 1.x version 
> of the software in the early days.  I suppose all parallel filesystems with 600 
> clients are going to hit bumps.  That's what CFS said back then, anyways.  Stability 
> wasn't a problem, but occasionally a file wouldn't be readable and to fix it you had 
> to copy the file, stuff like that.   This was no longer an issue beginning with 
> version 2.0.  You have to get a new build of the software if you want to change 
> kernels.  There are two RPMs: one generic for the major kernel number and the other 
> specific to your kernel, containing some modules.  They only support RHEL/CentOS and 
> SLES as far as I know, and SLES was only recently added.  I asked about Ubuntu and 
> they don't yet support it, which sucks because I would like to use it on some 
> workstations.  Oh, and make sure that the segment servers can always see each other. 
> Use at least two links through different switches.  We had some bad switch ports 
> that caused the segment servers to miss heartbeats.  This caused automatic failovers 
> to segment servers that also couldn't be seen.  This is a disaster.  I thought it was 
> IBRIX's fault the whole time.  Turned out to be intermittent switch port problems. 
> It was avoidable with a little bit more planning and a better understanding of how 
> the whole thing worked.  Redundancy is set up with buddies rather than globally, so 
> you tell it that one server should watch some other server's back.  It works, but it 
> could be a problem if a failing server's buddy is down or a server goes down while it 
> owns a failed segment.  In either case, some percentage of your files won't be 
> accessible until one of the servers is fixed.  It hasn't happened to me, but it is a 
> possibility.  I can bring down four of my eight servers without a problem, for 
> instance, but it needs to be the right four.  Servers have failed and it has never 
> been a problem for me.  The running jobs never know the difference.
> 
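To make the "right four of eight" point above concrete, here is a minimal Python sketch; it is purely illustrative (not IBRIX tooling) and assumes eight segment servers in fixed buddy pairs, each owning one segment:

    # Hypothetical model of buddy-pair failover, not IBRIX code.
    # Eight segment servers paired as (0,1), (2,3), (4,5), (6,7); a segment
    # stays reachable if its owning server or that server's buddy is still up.
    from itertools import combinations

    buddies = {i: i + 1 if i % 2 == 0 else i - 1 for i in range(8)}

    def all_segments_accessible(down):
        # Every segment is reachable unless both its owner and the owner's
        # buddy are in the set of down servers.
        return all(s not in down or buddies[s] not in down for s in range(8))

    # Taking down one server from each pair is fine ("the right four")...
    print(all_segments_accessible({0, 2, 4, 6}))   # True
    # ...but losing both servers in any one pair strands that pair's segments.
    print(all_segments_accessible({0, 1, 4, 6}))   # False

    # Count how many four-server outages are actually safe.
    safe = sum(all_segments_accessible(set(c)) for c in combinations(range(8), 4))
    print(safe, "of 70")   # 16 of 70

Under those assumptions, only 16 of the 70 possible four-server outages leave every segment reachable, which is why it has to be the right four.
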
> Support has been top-notch.  Last year, we had a catastrophic storage controller 
> failure following a scheduled power outage, major corruption, the works.  A guy at 
> IBRIX stayed with me all weekend on the phone and AIM.  He logged in and remotely 
> restored all the files he could (tens of thousands).  Apparently he could have 
> restored more if I had already been running 2.0 or higher.  They know their product 
> very well.  I'm not sure if I am the right person to compare it to GPFS or Lustre 
> since I looked into those products back in 2004 and haven't really researched them 
> since.  My setup is simple, too, so I only use the basics.  The performance is fine, 
> using nearly all of my GigE pipes.  With more segment servers and faster storage you 
> could get some pretty amazing speeds.  I don't use the quotas or multiple interfaces. 
> Their GUI looks nice at first but you really don't need it because their 
> command-line tools make sense and have excellent help output if you forget something. 
> Adding new clients is a breeze.  There is a Windows client now but I haven't used 
> it.  I use CIFS exports and it works just fine.  I also use NFS exports for my few 
> remaining Solaris clients.  Everything is very customizable and the documentation 
> seems pretty thorough.  You can put any storage you like behind it, which is nice.  I 
> think I could use USB keys if I felt like it.  I have been very pleased with IBRIX 
> overall, especially since we upgraded out of 1.x land.  It's usually the last thing 
> on my mind, so I guess that's a good thing.  That's all I have time for right now. 
> Let me know if you have any specific questions.
> 
> Wally
> 

I would agree with some of this.  The support is indeed top notch, but
our switch to 2.x wasn't as smooth.  We have had some problems with files
not writing and some performance issues; this is being used on 520
nodes.  For us, a lot of our (recent) problems have been related to
IBRIX.

IBRIX has been very good about helping fix things.  I have had the same
experience with IBRIX being there when I needed them.  When I have a
problem, they work on it until it is fixed, regardless of whether it is
nighttime or a weekend.

At this point, I think we are stable, and you probably would not have the
same issues on a new system.


-- 

Naveed Near-Ansari
California Institute of Technology
Division of Geological and Planetary Sciences




