[Beowulf] SSDs for HPC?

Tue Apr 8 08:25:38 PDT 2014

On 4/8/14, 11:05 AM, Michael Di Domenico wrote:
> On Tue, Apr 8, 2014 at 10:57 AM, Joe Landman
> <landman at scalableinformatics.com> wrote:
>>  From a general purpose point of view, Intel and Samsung make great lower end
>> devices.  SanDisk makes great higher end devices.  We are working on getting
>> some Toshiba's and a few others for enterprise to ultra-high-end testing.
>>
>> With some of the SSDs, we found that a hot plug event was permanently
>> terminal to the device.  Neat, huh?  Other SSDs we played with had 40+%
>> failure rates.
> is that 40% infant mortality or after some period of time?

That was 40% across a very large swath of parts, failing within a 2 week 
window of each other, for lightly used boot drive SSDs.  We ripped them 
out, globally, and replaced them, including the non-failed parts.

> i've held off on ssd's in our environment mostly because of the
> general feeling that ssd's still have a much shorter life expectancy
> than hdd's.  some anecdotal evidence would be helpful.
The cheap drives are crap.   The good drives will cost you.   The good 
drives will be as reliable as spinning rust, if not more so. The meh 
drives have 2-5 random drive writes per day (DWPD) over a 5 year 
window.  The crappy drives have sub 1 (usually sub 0.1). The good drives 
have 10+ DWPD.

Huge hint:  if they don't give explicit figures on durability, there is 
a very good reason for that.

Huge hint 2:  You can take the analysis Prentiss suggested to calculate 
the number of single block erasures that the drive can tolerate during 
its lifetime.  Crap drives are way sub 3k.  Meh drives are 3k-7k 
(nothing important on them, avoid them in write amplified ... RAID5/6 
... scenarios).  Good drives are 10+k erasures.

For 1PB of total writes during lifetime, a 100GB drive would be written 
10k times.  If this is over 5 years (call it 1825 days), then you get 
roughly 10k/1825 -> 5.5 DWPD.  Upper end of meh into "lower good" 
range.  This is 10k erasure/rewrite cycles.

Note that this analysis is *highly* oversimplified, and a good academic 
would take strong issue with it.  But it also appears to match reality 
quite well from what we observe.
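
As a minimal sketch of the back-of-envelope arithmetic above, here it is in 
Python.  It assumes decimal units (1PB = 10^15 bytes, 100GB = 10^11 bytes) 
and a flat write rate over the whole lifetime.  The function names are 
made up for illustration and do not come from any vendor tool.

```python
# Back-of-envelope SSD endurance arithmetic (decimal units, as in the post).
GB = 10**9
PB = 10**15


def full_drive_writes(total_bytes_written, capacity_bytes):
    """Number of times the whole drive is overwritten during its lifetime."""
    return total_bytes_written / capacity_bytes


def dwpd(total_bytes_written, capacity_bytes, lifetime_days):
    """Drive writes per day, averaged evenly over the lifetime."""
    return full_drive_writes(total_bytes_written, capacity_bytes) / lifetime_days


# The example from the text: 1PB written to a 100GB drive over 5 years.
cycles = full_drive_writes(1 * PB, 100 * GB)
rate = dwpd(1 * PB, 100 * GB, 1825)
print(f"{cycles:.0f} full-drive writes, {rate:.2f} DWPD")
```

Plug in a vendor's rated total bytes written and the drive capacity to get 
a DWPD figure you can compare against the rough tiers above.  It inherits 
all of the oversimplification already noted.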

Our high end SSDs in our siflash box have a lower average yearly failure 
rate than our high end spinning rust drives.

Good SSDs will cost you more than the crap ones.  But you will not 
regret buying the good ones.  You will regret buying the crap ones.

Just remember, if you are speccing out a new storage 
box/cluster/computing system, that you need to make engineering and cost 
tradeoffs.  And in the ultracompetitive academic cluster market, it just 
may be that the margins are so incredibly thin to begin with, that 
anything that helps increase the margin is a good thing for the company 
offering the system.  I know people here may not be sympathetic to this 
viewpoint, and that's OK.  Until, that is, you are on the other side, 
trying to pay your team with the slivers of margins you make on these 
sales.  I'd recommend, instead of automatically picking the item with 
the cheapest (acquisition) cost, that you focus upon the best.  The 
latter will cost you more, and you will have fewer headaches.

> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615