[Beowulf] Doing i/o at a small cluster

Andrew Holway andrew.holway at gmail.com
Fri Aug 17 12:45:49 PDT 2012


How about something like putting all your disks in one basket and
getting a ZFS / NFS-over-RDMA solution such as Nexenta?

They have a nice open source distribution.

2012/8/17 Vincent Diepeveen <diep at xs4all.nl>:
> The idea someone brought up in a private email is to use a distributed
> file system and split each drive into two partitions:
>
> the outer partition, which is fastest, for local storage, and the inner
> partition for a global distributed volume that holds long-term end
> results, with scripts that automatically compress results and
> decompress a specific EGTB when it looks like it will be needed soon.
>
> Then, using 3 disks per node in RAID-0, I can get 133-150 MB/s on the
> outer zone of the drives. That will be around 3 TB, the minimum needed
> for generation.
>
> The inner zone then gets a partition that uses redundancy, maybe
> RAID-6? Any thoughts there?
>
> So say I dedicate 4 or so nodes to this; that's 12 drives.
>
> Generation will then take 6 months instead of 3, but I'll have 4 other
> nodes free for other jobs.
>
> I'll also need to spend less on hard drives then. The question now is
> whether to go for the 3 TB drives or the 2 TB ones.
>
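A rough back-of-envelope sketch of what that layout yields, reading the
133-150 MB/s as a per-drive outer-zone figure and assuming near-linear
RAID-0 scaling (both are assumptions, not measurements):

    # Back-of-envelope throughput for the proposed layout.  Assumes
    # near-linear RAID-0 scaling and purely sequential access on the
    # outer zone of each drive.
    drives_per_node = 3
    nodes = 4
    per_drive_mb_s = (133, 150)       # outer-zone rate quoted above

    for rate in per_drive_mb_s:
        per_node = drives_per_node * rate        # RAID-0 across 3 drives
        aggregate = nodes * per_node             # 4 storage nodes
        print(f"{rate} MB/s/drive -> {per_node} MB/s/node, "
              f"{aggregate / 1000:.1f} GB/s across {nodes} nodes")
    # -> roughly 0.4-0.45 GB/s per node and 1.6-1.8 GB/s aggregate at
    #    best, well short of the 4 GB/s figure mentioned elsewhere in
    #    the thread.
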
> As for which filesystem is most interesting for doing this: is Gluster
> a good idea here?
>
> Can it handle this split between local and global partitions? Does it
> have RAID-6, or some other sort of redundancy you'd advise?
>
> As for Hadoop, that's a Java thing, you know. If I want to get my
> cluster hacked from India I know an easier way to get that done :)
>
> Thanks in Advance,
> Vincent
>
>
> On Aug 17, 2012, at 4:42 PM, Ellis H. Wilson III wrote:
>
>> On 08/17/12 08:03, Vincent Diepeveen wrote:
>>> hi,
>>>
>>> Which free or very cheap distributed file system choices do I have
>>> for an 8-node cluster with QDR InfiniBand (Mellanox)?
>>> Each node could have a few hard drives, up to 8 or so SATA2. I could
>>> also use some RAID cards.
>>
>> Lots of choices, but are you talking about putting a bunch of disks in
>> all those PCs or having one I/O server?  The latter is the classic
>> solution but there are ways to do the former.
>>
>> Short answer is that there are complicated ways to fling your HDDs
>> into distributed machines using PVFS and get good performance,
>> provided you are okay with those non-POSIX semantics and guarantees.
>> There are also ways to get decent performance from the Hadoop
>> Distributed File System, which can handle a distributed set of nodes
>> and internal HDDs well, but only for a /constrained set of
>> applications./  Based on your previous posts about GPUs and whatnot,
>> I'm going to assume you will have little to zero interest in Hadoop.
>> Last, there's a new NFS version out (pNFS, or NFS v4.1) that you can
>> probably use to great effect with proper tuning.  No comments on
>> tuning it, however, as I haven't yet tried it myself.  That may be
>> your best out-of-the-box solution.
>>
>> Also, I assume you're talking about QDR 1X here, so just 8Gb/s per
>> node.
>> Correct me if that's wrong.
>>
>>> And I'm investigating what I need.
>>>
>>> I'm looking into generating the 7-men EGTBs on the cluster. This is
>>> a big challenge.
>>
>> For anyone who doesn't know (probably many who aren't into chess; I
>> had to look this up myself), EGTB stands for endgame tablebases, and
>> more info is available at:
>> http://en.wikipedia.org/wiki/Endgame_tablebase
>>
>> Basically it's just a giant exhaustive dump of moves for N men left on
>> the board.
>>
>>> Generating it is a high-I/O-load job. I'm looking at around 4 GB/s of
>>> I/O, of which a tad more than 1 GB/s is write speed and a tad less
>>> than 3 GB/s is read speed from hard drives.
>>>
>>> This runs for 3+ months nonstop, provided the CPUs can keep up with
>>> that; otherwise a few months more.
>>>
>>> This 4 GB/s of I/O is aggregate speed.
>>
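For scale, a rough sketch of how many spindles that aggregate figure
naively implies; the 100-133 MB/s sustained per-drive rates are
illustrative assumptions, not measurements from this cluster:

    # How many drives does a 4 GB/s aggregate stream naively require?
    # Ignores contention, parity overhead and mixed read/write
    # interference, so treat the result as an absolute floor.
    target_gb_s = 4.0                  # ~1 GB/s write + ~3 GB/s read
    for per_drive_mb_s in (100, 133):
        drives = target_gb_s * 1000 / per_drive_mb_s
        print(f"{per_drive_mb_s} MB/s/drive -> at least {drives:.0f} drives")
    # -> roughly 30-40 drives before any derating for seeks, RAID
    #    parity or filesystem overhead.
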
>> I would LOVE to hear what Joe has to say on this, but going out on a
>> limb here, it will be almost impossible to get that much out of your
>> HDDs with 8 nodes without serious planning and an extremely narrow
>> use-case.  I assume you are talking about putting drives in each node
>> at this point, because with just QDR you cannot feed an aggregate
>> 4GB/s from one node without bonding.
>>
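The link arithmetic behind that remark, as a small sketch using the
8 Gb/s QDR 1X figure quoted above:

    # Why a single QDR 1X link cannot feed 4 GB/s on its own.
    qdr_1x_gbit_s = 8                        # usable data rate per node
    nodes = 8
    per_node_gb_s = qdr_1x_gbit_s / 8        # ~1 GB/s of payload per node
    aggregate_gb_s = per_node_gb_s * nodes   # ~8 GB/s across the cluster
    print(f"per node: ~{per_node_gb_s:.0f} GB/s, "
          f"all {nodes} nodes: ~{aggregate_gb_s:.0f} GB/s")
    # A single node tops out around 1 GB/s, so 4 GB/s only works as an
    # aggregate across several nodes (or with bonding / wider ports).
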
>> We need to know more about generating this tablebase -- I can only
>> assume you are planning to do analyses on it after you generate all
>> possible combinations, right?  We need to know more about how that
>> follow-up analysis can be divided before commenting on possible
>> storage solutions.  If everything is totally embarrassingly parallel
>> you're in a good spot to not bother with a parallel filesystem.  In
>> that case you just might be able to squeeze 4GB/s out of your drives.
>>
>> But with all the nodes accessing all the disks at once, hitting 4GB/s
>> with just strung-together FOSS software is really tough for anything
>> but the most basic and most embarrassingly parallel stuff.  It
>> requires serious tuning over months, or buying a product that has
>> already done this (e.g. a solution like Joe's company Scalable
>> Informatics makes, or Panasas, the company I work for, makes).  People
>> always love to say, "Oh, that's 100MB/s per drive!  So with 64 drives
>> I should be able to get 6.4GB/s!  Yea!"  Sadly, that's really only the
>> case when these drives are accessed completely sequentially and
>> completely separately (i.e. not put together into a distributed
>> filesystem).
>>
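To make that point concrete, a sketch of the naive estimate against
derated ones; the 100 MB/s per-drive figure is from the paragraph above,
while the efficiency factors are purely illustrative assumptions, not
measurements:

    # Naive "sum of the datasheets" aggregate vs. arbitrarily derated
    # figures.  The efficiency values are illustrative only: concurrent
    # access, seeks and filesystem overhead eat into the naive number.
    drives = 64
    per_drive_mb_s = 100
    naive_gb_s = drives * per_drive_mb_s / 1000
    print(f"naive sum: {naive_gb_s:.1f} GB/s")
    for efficiency in (0.7, 0.5, 0.3):
        print(f"at {efficiency:.0%} efficiency: "
              f"{naive_gb_s * efficiency:.1f} GB/s")
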
>>> What RAID system would you recommend here?
>>
>> Uh, are you looking for software, hardware, or object RAID?
>>
>>> A problem is the combined write + read speed I need. From what I
>>> understand, at the outer edge of a drive the speed is roughly
>>> 133 MB/s on SATA2, dropping to around 33 MB/s on the inner tracks.
>>>
>>> Is that roughly correct?
>>
>> I hate this as much as anybody, but........ It Depends (TM).
>> If you're talking plain-jane "dd", sure, that might be reasonable for
>> certain vendors.
>>
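A minimal sketch of that kind of dd-style check in Python, for measuring
the outer-vs-inner zone difference; the device name is a placeholder and
the numbers should be treated as ballpark only:

    #!/usr/bin/env python3
    # Rough outer-vs-inner zone sequential read check.  /dev/sdX is a
    # hypothetical device name: point it at an idle drive, run as root,
    # and remember that page cache, zoning and vendor differences all
    # matter (O_DIRECT or dropping caches gives truer figures).
    import os, time

    DEV = "/dev/sdX"            # placeholder: replace before running
    BLOCK = 10 * 1024 * 1024    # 10 MB per read
    COUNT = 50                  # ~500 MB per sample

    def read_rate(offset_bytes):
        fd = os.open(DEV, os.O_RDONLY)
        try:
            os.lseek(fd, offset_bytes, os.SEEK_SET)
            start = time.time()
            for _ in range(COUNT):
                if not os.read(fd, BLOCK):
                    break
            return COUNT * BLOCK / (time.time() - start) / 1e6   # MB/s
        finally:
            os.close(fd)

    fd = os.open(DEV, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)   # device size in bytes
    os.close(fd)

    print(f"outer zone: {read_rate(0):.0f} MB/s")
    print(f"inner zone: {read_rate(size - COUNT * BLOCK):.0f} MB/s")
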
>>> Of course there will be many possible solutions. I could use some
>>> RAID cards, or I could equip each node with some drives. RAID cards
>>> are probably SATA-3 nowadays; I didn't check speeds there.
>>>
>>> Total storage is somewhere from a dozen to a few dozen terabytes.
>>>
>>> Does the filesystem automatically optimize by writing at the outer
>>> edge instead of starting at the inner tracks? Which RAID level would
>>> you recommend for this, if any is appropriate at all :)
>>
>> Again, depends on RAID card and whatnot.  Some do, some don't.
>>
>>> How many hard drives would I need? What failure rate can I expect
>>> with modern SATA drives there? I had several fail in a RAID 0+1
>>> system some years ago when generating some EGTBs.
>>
>> Yup, things will break, especially during the shakeout (first few days
>> or weeks).  I assume you're buying commodity drives here, not
>> enterprise, so you should prepare for maybe 4-8 of your drives,
>> /after the shakeout/, to fail or start throwing SMART errors in the
>> first year (ball-parking it here based solely on experience).
>> Rebuilds will suck for you with lots of data unless you have really
>> thought that out (typically limited to the speed of a single disk --
>> therefore a 2TB drive rebuilding itself at 50MB/s (best-case scenario)
>> is something like 11 hours).  I hope you haven't bought all your
>> drives from the same batch from the same manufacturer as well -- that
>> often results in very similar failure times (i.e. concurrent failures
>> in a day).  Very non-uniform.
>>
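The rebuild arithmetic from that parenthetical, spelled out with the
figures quoted above (a 2 TB drive rebuilt at a best-case 50 MB/s):

    # Best-case rebuild time for a 2 TB drive limited to ~50 MB/s.
    drive_tb = 2
    rebuild_mb_s = 50
    hours = drive_tb * 1e6 / rebuild_mb_s / 3600
    print(f"~{hours:.1f} hours")   # ~11.1 hours on an otherwise idle array
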
>>> Note there are more questions, like which buffer size I should use
>>> for reads and writes. Most files get streamed.
>>> From 2 of the files I read from, I read big blocks from a random
>>> spot in the file. Each file is a couple of hundred gigabytes.
>>>
>>> I used to grab chunks of 64KB from each file, but I don't see how to
>>> get to gigabytes per second of I/O with today's hardware that way.
>>>
>>> I'm now considering reading blocks of 10MB. Which block size will get
>>> me to the maximum bandwidth the I/O can deliver?
>>
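A minimal sketch of the access pattern described above: multi-megabyte
reads from random offsets in a very large file. The file name and block
size are placeholders, and whether 10 MB is really the sweet spot has to
be measured on the actual hardware:

    #!/usr/bin/env python3
    # Large-block random reads from a big tablebase file (sketch only).
    import os, random, time

    PATH = "egtb.bin"          # hypothetical name; a few hundred GB in practice
    BLOCK = 10 * 1024 * 1024   # 10 MB per read, vs. the old 64 KB chunks
    READS = 100

    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    start = time.time()
    for _ in range(READS):
        offset = random.randrange(0, size - BLOCK)
        data = os.pread(fd, BLOCK, offset)   # pread avoids a separate
                                             # seek and is safe with
                                             # concurrent readers
    os.close(fd)
    elapsed = time.time() - start
    print(f"{READS * BLOCK / elapsed / 1e6:.0f} MB/s with {BLOCK >> 20} MB blocks")

Sweeping BLOCK from 64 KB up to a few tens of MB on the real array is
the quickest way to see where throughput flattens out.
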
>> I actually do wonder if Hadoop won't work for you.  This sounds like a
>> very Hadoop-like workload, assuming you are OK with write-once,
>> read-many semantics.  But I need to know way more about what you want
>> to do with the data afterwards.  Moving data off of HDFS sucks.
>>
>> Best,
>>
>> ellis


