[Beowulf] Storage

Mark Hahn hahn at physics.mcmaster.ca
Thu Oct 7 15:08:27 PDT 2004


> that means:-) in addition to commensurate amounts of tape backup.  The

ick!  our big-storage plans very, very much hope to eliminate tape.

> tape backup is relatively straightforward -- there is a 100 TB library
> available to the project already that will hold 200 TB after an
> LTO1->LTO2 upgrade, and while tapes aren't exactly cheap, they are
> vastly cheaper than disk in these quantities.

hmm, LTO2 is $0.25/GB; disks are about double that.  considering the
issues of tape reliability, access time and migration, I think
disk is worth it.  from what I hear in the storage industry, this
is a growing consensus among, for instance, hospitals - they don't
want to spend their time reading tapes to check whether the media is
failing and the content needs to be migrated.  migrating content that's
online is, ah, easier.  in the commercial world, online data is also
attractive because its lifetime can be managed more explicitly
(i.e., deleted!)
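
(to put a number on that delta: at the 200 TB scale quoted above, and
taking disk at roughly $0.50/GB - just my reading of "about double",
not a quote from anyone - the media-only gap is something like:)

    # back-of-envelope: tape media vs raw disk at 200 TB
    # (the $0.50/GB SATA figure is an assumption, not a vendor quote)
    lto2_per_gb = 0.25          # $/GB, LTO2 media
    disk_per_gb = 0.50          # $/GB, commodity SATA, roughly double
    capacity_gb = 200 * 1000    # the 200 TB library mentioned above

    tape = lto2_per_gb * capacity_gb
    disk = disk_per_gb * capacity_gb
    print("tape media: $%dk   raw disk: $%dk   delta: $%dk"
          % (tape / 1000, disk / 1000, (disk - tape) / 1000))
    # -> tape media: $50k   raw disk: $100k   delta: $50k

whether that $50k is worth never worrying about tape migration again is
exactly the reliability/access-time argument above.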


> The disk is a real problem.  Raw disk these days is less than $1/GB for
> SATA in 200-300 GB sizes, a bit more for 400 GB sizes, so a TB of disk
> per se costs in the ballpark of $1000.  However, HOUSING the disk in
> reliable (dual power, hot swap) enclosures is not cheap, adding RAID is
> not cheap, and building a scalable arrangement of servers to provide
> access with some controllable degree of latency and bandwidth for access
> is also not cheap.

no insult intended, but have you looked closely, recently?  I did some 
quick web-pricing this weekend, and concluded:

vendor          capacity       size    $Cad list per TB    density (TB/U)
dell/emc        12x250 GB      3U      $7500               1.000
apple           14x250 GB      3U      $4000               1.167
hp/msa1500cs    12x250 GB x4   10U     $3850               1.200

(divide $Cad by 1.25 or so to get $US.)  all three plug into FC.
the HP goes up to 8 shelves per controller or 24 TB per FC port, though.
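
(for the curious, the density column is just raw marketing GB per rack
unit - no RAID or filesystem overhead - derived from the configs above:)

    # derive shelf capacity and TB-per-rack-unit from the table above
    shelves = [
        # vendor,         disks, GB each, rack units
        ("dell/emc",         12,     250,          3),
        ("apple",            14,     250,          3),
        ("hp/msa1500cs",   12*4,     250,         10),
    ]
    for vendor, ndisks, gb, units in shelves:
        tb = ndisks * gb / 1000.0
        print("%-14s %5.1f TB raw   %.3f TB/U" % (vendor, tb, tb / units))
    # -> densities come out 1.000, 1.167 and 1.200 TB/U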

> Management requirements include 3 year onsite
> service for the primary server array -- same day for critical
> components, next day at the latest for e.g. disks or power supplies that
> we can shelve and deal with ourselves in the short run.  The solution we

pretty standard policies.

> adopt will also need to be scalable as far as administration is
> concerned -- we are not interested in "DIY" solutions where we just buy
> an enclosure and hang it on an over the counter server and run MD raid,
> not because this isn't reliable and workable for a departmental or even
> a cluster RAID in the 1-8 TB range (a couple of servers) it isn't at all
> clear how it will scale to the 10-80 TB range, when 10's of servers
> would be required.

Robert, are you claiming that tens of servers are unmanageable
on a *cluster* mailing list!?!  or are you thinking of the number
of moving parts?

> Management of the actual spaces thus provided is not trivial -- there
> are certain TB-scale limits in linux to cope with (likely to soon be
> resolved if they aren't already in the latest kernels, but there in many
> of the working versions of linux still in use) and with an array of

I can understand and even empathize with some people's desire to
stick to old and well-understood kernels.  but big storage is a very
good reason to kick them out of that complacency - the old kernels are
justifiable only on not-broke-don't-fix grounds...

> partitions and servers to deal with, just being able to index, store and
> retrieve files generated by the compute component of the grid will be a
> major issue.

how so?  I find that people still use sensible hierarchical organization,
even if the files are larger and more numerous than in the past.

>   a) What are listvolken who have 10+ TB requirements doing to satisfy
> them?

we're acquiring somewhere between 0.2 and 2 PB, and are planning machine
rooms around the obvious kinds of building blocks: lots of servers in,
say, the 4-20 TB range, preferably connected by some fast fabric (IB seems
attractive, since it's got mediocre latency but good bandwidth.)
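
(to make the scale concrete, a rough spread of how many such building
blocks that range implies:)

    # rough server counts for 0.2-2 PB built from 4-20 TB boxes
    for total_pb in (0.2, 2.0):
        for server_tb in (4, 20):
            n = total_pb * 1000 / server_tb
            print("%.1f PB in %2d TB servers -> %3.0f servers"
                  % (total_pb, server_tb, n))
    # -> anywhere from 10 boxes (0.2 PB in 20 TB servers)
    #    to 500 boxes (2 PB in 4 TB servers)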

>   b) What did their solution(s) cost, both to set up as a base system
> (in the case of e.g. a network appliance) and

I'm fairly certain that if I were making all the decisions here, I'd
go for smallish modular servers plugged into IB.

>   c) incremental costs (e.g. filled racks)?

?

>   d) How does their solution scale, both costwise (partly answered in b
> and c) and in terms of management and performance?

my only real concern with management is MTBF: with a hypothetical
collection of 2 PB of 250 GB SATA disks at a 1M-hour MTBF, we'd be
replacing a disk every 5 days or so.  to me, this motivates designs in
which fairly large numbers of disks can share a hot spare (or maybe RAID6?)
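
(the arithmetic behind that 5-day figure, for anyone who wants to plug
in their own disk size or MTBF:)

    # expected interval between disk failures in a large population,
    # treating the vendor MTBF naively (independent, constant failure rate)
    total_tb   = 2000       # 2 PB
    disk_gb    = 250        # 250 GB SATA drives
    mtbf_hours = 1000000    # claimed 1M-hour MTBF

    n_disks = total_tb * 1000 // disk_gb        # 8000 drives
    interval = mtbf_hours / float(n_disks)      # 125 hours
    print("%d disks -> one failure every %.0f hours (%.1f days)"
          % (n_disks, interval, interval / 24))
    # -> 8000 disks -> one failure every 125 hours (5.2 days)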

>   e) What software tools are required to make their solution work, and
> are they open source or proprietary?

I'd be interested to know what problem you're actually asking to have
solved.  just that you don't want to run "find / -name whatever" on
a 20 TB filesystem?  or that you don't want 10 separate 2 TB filesystems?

>   f) Along the same lines, to what extent is the hardware base of their
> solution commodity (defined here as having a choice of multiple vendors
> for a component at a point of standardized attachment such as a fiber
> channel port or SCSI port) or proprietary (defined as if you buy this
> solution THIS part will always need to be purchased from the original
> vendor at a price "above market" as the solution is scaled up).

as far as I can see, the big vendors are somehow oblivious to the fact
that customers *HATE* the proprietary, single-source attitude.
        oh, you can plug any FC devices you want into your SAN,
        as long as they're all our products and we've "qualified" them.

> Rules:  Vendors reply directly to me only, not the list.  I'm in the
> market for this, most of the list is not.  Note also that I've already

I think you'd be surprised at how many, many people are buying
multi-TB systems for isolated labs.  there are good reasons why this
kind of scattershot approach is unwise in, say, a university setting,
where a shared resource pool copes better with bursty demand and gets
consistent maintenance, a stable environment, etc.

regards, mark hahn.



