[Beowulf] Suggestions to what DFS to use

John Hanks griznog at gmail.com
Tue Feb 14 05:40:18 PST 2017


All our nodes, even most of our fileservers (non-DDN), boot statelessly
(Warewulf), and all local disks are managed by ZFS, either behind JBOD
controllers or behind non-JBOD controllers with each disk configured as a
single-drive RAID0. So, wherever possible, ZFS gets control of the raw disk.
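
A minimal sketch of what that looks like in practice, with a made-up pool
name and placeholder device paths rather than anything from our actual
configs:

  # Hypothetical example: hand ZFS whole disks, referenced by persistent
  # /dev/disk/by-id paths instead of sdX names.
  zpool create -o ashift=12 tank \
      /dev/disk/by-id/ata-EXAMPLE_DISK_A \
      /dev/disk/by-id/ata-EXAMPLE_DISK_B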

ZFS has been extremely reliable. The only problems we have encountered were
an underflow that broke quotas on one of our servers and a recent problem
using a zvol as swap on CentOS 7.x. The ZFS on Linux community is pretty
solid at this point, and it's nice to know that anything written to disk is
correct.
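
That "anything written is correct" property is also cheap to verify on a
schedule; a hedged sketch of the kind of routine check we mean, with the
pool name being just an example:

  # Example only: re-read every block in the pool, verify it against its
  # checksum, and report any errors found.
  zpool scrub tank
  zpool status -v tank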

Compute nodes use striping with no disk redundancy; storage nodes are
almost all raidz3 (3 parity disks per vdev). Because we tend to use large
drives, raidz3 gives us a cushion should a rebuild from a failed drive take
a long time on a full filesystem. There are some mirrors in a few places;
we even have the occasional workstation where we've set up a three-disk
mirror to provide extra protection for some critical data and work.
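
For reference, rough sketches of those three layouts; pool names, disk
counts and device names are invented for illustration, not our actual
configs:

  # Compute node scratch: plain stripe, no redundancy.
  zpool create scratch sdb sdc sdd sde

  # Storage node: a single raidz3 vdev (3 parity disks per vdev).
  zpool create bulk raidz3 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl

  # Workstation: three-way mirror for extra protection of critical data.
  zpool create safe mirror sdb sdc sdd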

jbh


On Tue, Feb 14, 2017 at 1:45 PM Jörg Saßmannshausen <j.sassmannshausen at ucl.ac.uk> wrote:

> Hi John,
>
> thanks for the very interesting and informative post.
> I am looking into large storage space right now as well, so this came at
> just the right time for me! :-)
>
> One question: I have noticed you were using ZFS on Linux (CentOS 6.8). What
> are your experiences with this? Does it work reliably? How did you
> configure the file space?
> From what I have read, the best way of setting up ZFS is to give it direct
> access to the disks and then build the ZFS 'raid5' or 'raid6' equivalents
> on top of that. Is that what you do as well?
>
> You can contact me offline if you like.
>
> All the best from London
>
> Jörg
>
> On Tuesday 14 Feb 2017 10:31:00 John Hanks wrote:
> > I can't compare it to Lustre currently, but in the theme of general, we
> > have 4 major chunks of storage:
> >
> > 1. (~500 TB) DDN SFA12K running GRIDScaler (GPFS), but without GPFS
> > clients on the nodes; this is presented to the cluster through cNFS.
> >
> > 2. (~250 TB) SuperMicro 72 bay server. Running CentOS 6.8, ZFS presented
> > via NFS
> >
> > 3. (~460 TB) SuperMicro 90-bay JBOD fronted by a SuperMicro 2U server
> > with 2 x LSI 3008 SAS/SATA cards. Running CentOS 7.2, ZFS and BeeGFS
> > 2015.xx. BeeGFS clients on all nodes.
> >
> > 4. (~ 12 TB) SuperMicro 48 bay NVMe server, running CentOS 7.2, ZFS
> > presented via NFS
> >
> > Depending on your benchmark, 1, 2 or 3 may be faster. GPFS falls over
> > wheezing under load. ZFS/NFS single server falls over wheezing under
> > slightly less load. BeeGFS tends to fall over a bit more gracefully under
> > load.  Number 4, NVMe doesn't care what you do, your load doesn't impress
> > it at all, bring more.
> >
> > We move workloads around to whichever storage has free space and works
> > best, and put anything metadata- or random-I/O-ish that will fit onto the
> > NVMe-based storage.
> >
> > Now, in the theme of specific, why are we using BeeGFS, and why are we
> > currently planning to buy about 4 PB of SuperMicro to put behind it? When
> > we asked about improving the performance of the DDN, one recommendation
> > was to buy GPFS client licenses for all our nodes. The quoted price was
> > about 100k more than we wound up spending on the additional 460 TB of
> > SuperMicro storage and BeeGFS, which performs as well as or better. I
> > fail to see the
> > inherent value of DDN/GPFS that makes it worth that much of a premium in
> > our environment. My personal opinion is that I'll take hardware over
> > licenses any day of the week. My general grumpiness towards vendors isn't
> > improved by the DDN looking suspiciously like a SuperMicro system when I
> > pull the shiny cover off. Of course, YMMV certainly applies here. But
> > there's also that incident where we had to do an offline fsck to clean up
> > some corrupted GPFS foo and the mmfsck tool had an assertion error, not a
> > warm fuzzy moment...
> >
> > Last example, we recently stood up a small test cluster built out of
> > workstations and threw some old 2TB drives in every available slot, then
> > used BeeGFS to glue them all together. Suddenly there is a 36 TB
> > filesystem where before there was just old hardware. And as a bonus, it'll
> > do
> > sustained 2 GB/s for streaming large writes. It's worth a look.
> >
> > jbh
> >
> > On Tue, Feb 14, 2017 at 10:02 AM, Jon Tegner <tegner at renget.se> wrote:
> > > BeeGFS sounds interesting. Is it possible to say something general
> > > about how it compares to Lustre regarding performance?
> > >
> > > /jon
> > >
> > >
> > > On 02/13/2017 05:54 PM, John Hanks wrote:
> > >
> > > We've had pretty good luck with BeeGFS lately, running on vanilla
> > > SuperMicro hardware with ZFS as the underlying filesystem. It works
> > > pretty well for the cheap end of the hardware spectrum, and BeeGFS is
> > > free and pretty amazing. It has held up to abuse under a very mixed and
> > > heavy workload, and we can stream large sequential data into it fast
> > > enough to saturate a QDR IB link, all without any in-depth tuning. While
> > > we don't have redundancy (other than raidz3), BeeGFS can be set up with
> > > some redundancy between metadata servers and mirroring between storage
> > > targets.
> > > http://www.beegfs.com/content/
> > >
> > > jbh
> > >
> > > On Mon, Feb 13, 2017 at 7:40 PM Alex Chekholko <alex.chekholko at gmail.com>
> > > wrote:
> > >> If you have a preference for Free Software, GlusterFS would work,
> > >> unless you have many millions of small files. It would also depend on
> > >> your available hardware, as there is not a 1-to-1 correspondence between
> > >> a typical GPFS setup and a typical GlusterFS setup. But at least it is
> > >> free and easy to try out. The mailing list is active, the software is
> > >> now mature (I last used GlusterFS a few years ago) and you can buy
> > >> support from Red Hat if you like.
> > >>
> > >> Take a look at the RH whitepapers about typical GlusterFS architecture.
> > >>
> > >> CephFS, on the other hand, is not yet mature enough, IMHO.
> > >> On Mon, Feb 13, 2017 at 8:31 AM Justin Y. Shi <shi at temple.edu> wrote:
> > >>
> > >> Maybe you would consider Scality (http://www.scality.com/) for your
> > >> growth concerns. If you need speed, DDN is faster in rapid data
> > >> ingestion and for extreme HPC data needs.
> > >>
> > >> Justin
> > >>
> > >> On Mon, Feb 13, 2017 at 4:32 AM, Tony Brian Albers <tba at kb.dk> wrote:
> > >>
> > >> On 2017-02-13 09:36, Benson Muite wrote:
> > >> > Hi,
> > >> >
> > >> > Do you have any performance requirements?
> > >> >
> > >> > Benson
> > >> >
> > >> > On 02/13/2017 09:55 AM, Tony Brian Albers wrote:
> > >> >> Hi guys,
> > >> >>
> > >> >> So, we're running a small (as in a small number of nodes (10), not
> > >> >> storage (170 TB)) Hadoop cluster here. Right now we're on IBM
> > >> >> Spectrum Scale (GPFS), which works fine and has POSIX support. On top
> > >> >> of GPFS we have a GPFS transparency connector so that HDFS uses GPFS.
> > >> >>
> > >> >> Now, if I'd like to replace GPFS with something else, what should I
> > >> >> use?
> > >> >>
> > >> >> It needs to be a fault-tolerant DFS, with POSIX support (so that
> > >> >> users can move data to and from it with standard tools).
> > >> >>
> > >> >> I've looked at MooseFS, which seems to be able to do the trick, but
> > >> >> are there any others that might do?
> > >> >>
> > >> >> TIA
> > >>
> > >> Well, we're not going to be doing a huge amount of I/O, so performance
> > >> requirements are not high. But ingest needs to be really fast; we're
> > >> talking tens of terabytes here.
> > >>
> > >> /tony
> > >>
> > >> --
> > >> Best regards,
> > >>
> > >> Tony Albers
> > >> Systems administrator, IT-development
> > >> Royal Danish Library, Victor Albecks Vej 1, 8000 Aarhus C, Denmark.
> > >> Tel: +45 2566 2383 / +45 8946 2316
> > >
> > > --
> > > ‘[A] talent for following the ways of yesterday, is not sufficient to
> > > improve the world of today.’
> > >
> > >  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
> > >
>
>
> --
> *************************************************************
> Dr. Jörg Saßmannshausen, MRSC
> University College London
> Department of Chemistry
> 20 Gordon Street
> London
> WC1H 0AJ
>
> email: j.sassmannshausen at ucl.ac.uk
> web: http://sassy.formativ.net
>
> Please avoid sending me Word or PowerPoint attachments.
> See http://www.gnu.org/philosophy/no-word-attachments.html
>
-- 
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC