[Beowulf] Third-party drives not permitted on new Dell servers?
michf at post.tau.ac.il
Tue Feb 16 04:53:28 PST 2010
On Mon, 15 Feb 2010 20:41:08 -0500
Joe Landman <landman at scalableinformatics.com> wrote:
> Rahul Nabar wrote:
> > This was the response from Dell, I especially like the analogy:
> > [snip]
> >> There are a number of benefits for using Dell qualified drives in
> >> particular ensuring a ***positive experience*** and protecting
> >> ***our data***. While SAS and SATA are industry standards there are
> >> differences which occur in implementation. An analogy is that
> >> English is spoken in the UK, US and Australia. While the language
> >> is generally the same, there are subtle differences in word usage
> >> which can lead to confusion. This exists in storage subsystems as
> >> well. As these subsystems become more capable, faster and more
> >> complex, these differences in implementation can have greater
> >> impact.
> > [snip]
> > I added the emphasis. I am in love with Dell-disks that get me "the
> > positive experience". :)
> Please indulge my taking a contrarian view based upon the products we
> I see significant derision heaped upon these decisions, which are called
> "marketing decisions" by Dell and others. It couldn't be possible, in
> most commenters' minds, that they might actually have a point ...
> ... I am not defending Dell's language (I wouldn't use this or allow
> this to be used in our outgoing marketing/customer communications).
> Let me share an anecdote. I have elided the disk manufacturer's name to
> protect the guilty. I will not give hints as to who they are, though
> some may be able to guess ... I will not confirm.
> We ship units with 2TB (and 1.5TB) drives among others. We burn in and
> test these drives. We work very hard to ensure compatibility, and to
> make sure that when users get the units, that the things work. We
> aren't perfect, and we do occasionally mess up. When we do, we own up
> to it and fix it right away. It's a different style of support. The
> buck stops with us. Period.
> So along comes a drive manufacturer, with some nice looking specs on 2TB
> (and some 1.5 and 1 TB) drives. They look great on paper. We get them
> into our labs, and play with them, and they seem to run really well.
> Occasional hiccup on building RAIDs, but you get that in large batches
> of drives.
> So now they are out in the field for months, under various loads. Some
> in our DeltaV's, some in our JackRabbits. The units in the DeltaV's
> seem to have a ridiculously high failure rate. This is not something we
> see in the lab. Even with constant stress, horrific sustained workloads
> ... they don't fail in our testing. But get these same drives out into
> the users' hands ... and whammo.
> Slightly different drives in our JackRabbit units, with a variety of
> RAID controllers. Same types of issues. Timeouts, RAID fall outs, etc.
> This is not something we see in the lab in our testing. We try
> emulating their environments, and we can't generate the failures.
> Worse, we get the drives back after exchanging them at our cost with new
> replacements, only to find out, upon running diagnostics, that the
> drives haven't failed according to the test tool. This failing drive
> vendor refuses to acknowledge firmware bugs, effectively refuses to
> release patches/fixes.
> Our other main drive vendor, while not currently with a 2TB drive unit,
> doesn't have anything like this manufacturer's failure rate in the field.
> When drives die in the field, they really ... really die in the field.
> And they do fix their firmware.
> So we are now moving off this failing manufacturer (it's a shame as they
> used to produce quality parts for RAID several years ago), and we are
> evaluating replacements for them. Firmware updates are a critical
> aspect of a replacement. If the vendor won't allow for a firmware
> update, we won't use them.
> So ... this anecdote complete, if someone called me up and said "Joe, I
> really want you to build us an siCluster for our storage, and I want you
> to use [insert failing manufacturer's name here] drives because we
> like them", what do you think my reaction should be? Should it be
> "sure, no problem, whatever you want" ... with the subsequent problems
> and pain, for which we would be blamed ... or should it be "no, these
> drives don't work well ... deep and painful experience at customer sites
> shows that they have bugs in their firmware which are problematic for
> RAID users ... we are attempting to get them to give us the updated
> firmware to help the existing users, but we would not consider shipping
> more units with these drives due to their issues."
> Is that latter answer, which is the correct answer, a marketing answer?
But what if the customer tells you, "Ship me your system without a drive; I'll
put whatever I want in there, so you are not my point of contact for failing
drives," but you say, "No, I won't allow them in my system, and I won't even
sell you a replacement of what I do allow in the system"?
> Yeah, SATA and SAS are standards. Yeah, in theory, they all do work
> together. In reality, they really don't, and you have to test.
> Everyone does some aspect slightly different and usually in software, so
> they can fix it if they messed up. If there is a RAID timeout bug due
> to head settling timing, yeah, this is fixable. But if the disk
> manufacturer doesn't want to fix it ... it's your company's name on the
> outside of that box. You are going to take the heat for their problems.
> Note: This isn't just SATA/SAS drives, there are a whole mess of things
> that *should* work well together, but do not. We had some exciting
> times in the recent past with SAS backplanes that refused to work with
> SAS RAID cards. We've had some excitement from 10GbE cards, IB cards,
> etc. that we shouldn't have had.
> I can't and won't sanction their tone to you ... they should have
> explained things correctly. Given that PERC are rebadged LSI, yeah, I
> know perfectly well a whole mess of drives that *do not* work correctly
> with them.
> So please don't take Dell to task for trying to help you avoid making
> what they consider a bad decision on specific components. There could
> be a marketing aspect to it, but support is a cost, and they want to
> minimize costs. Look at failure rates, and toss the suppliers who have
> very high ones.