Large FOSS filesystems, was Re: [Beowulf] 512 nodes Myrinet cluster Challanges

Michael Will mwill at
Thu May 4 11:36:29 PDT 2006

The question is:

Can you use LVM to do mirroring and striping at the same time, or do you
need to use software RAID1 below LVM for that?

With multiple chassis, the risk is that one chassis goes down and takes
part of a striped LVM volume with it, which can take the whole
filesystem offline. If the drives became corrupted for some reason, that
could also corrupt all of your data. If you mirror between two physical
volumes in two enclosures and then stripe across those mirrors, you
could mitigate that.
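
As a rough illustration of the point above, here is a minimal Python
sketch of which layouts stay online when one enclosure fails. It is a
toy availability model, not an LVM tool: the enclosure names, the helper
functions and the two-volumes-per-enclosure layout are all hypothetical.

```python
# Toy availability model: a physical volume is an (enclosure, index) pair.
# All names here are hypothetical and chosen only for illustration.

def pvs_in(enclosure, count):
    """Return the physical volumes housed in one enclosure."""
    return [(enclosure, i) for i in range(count)]

def stripe_online(failed, members):
    """A plain stripe is online only if every member volume is online."""
    return all(pv not in failed for pv in members)

def striped_mirrors_online(failed, mirrors):
    """A stripe across mirrors is online if each mirror keeps one leg."""
    return all(any(pv not in failed for pv in legs) for legs in mirrors)

enclosure_a = pvs_in("A", 2)
enclosure_b = pvs_in("B", 2)

# Layout 1: one stripe across all four volumes.
plain_stripe = enclosure_a + enclosure_b

# Layout 2: each mirror has one leg in A and one in B, then stripe.
cross_mirrors = list(zip(enclosure_a, enclosure_b))

# Layout 3: both legs of each mirror live in the same enclosure.
local_mirrors = [tuple(enclosure_a), tuple(enclosure_b)]

failed = set(enclosure_a)  # enclosure A goes down

print("plain stripe online:  ", stripe_online(failed, plain_stripe))
print("cross-enclosure RAID10:", striped_mirrors_online(failed, cross_mirrors))
print("same-enclosure RAID10: ", striped_mirrors_online(failed, local_mirrors))
```

Only the layout whose mirror legs sit in different enclosures survives
the loss of a whole chassis; mirroring inside one enclosure does not
help in that case.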



-----Original Message-----
From: beowulf-bounces at [mailto:beowulf-bounces at]
On Behalf Of Joe Landman
Sent: Thursday, May 04, 2006 11:17 AM
To: Dan Stromberg
Cc: beowulf at; Robert Latham
Subject: Re: Large FOSS filesystems,was Re: [Beowulf] 512 nodes Myrinet
cluster Challanges

Dan Stromberg wrote:

> On a somewhat related note, are there any FOSS filesystems that can 
> surpass 16 terabytes in a single filesystem - reliably?


Maximum File Size

For Linux 2.4, the maximum accessible file offset is 16TB on 4K page
size and 64TB on 16K page size. For Linux 2.6, when using 64 bit
addressing in the block devices layer (CONFIG_LBD), file size limit
increases to 9 million terabytes (or the device limits).

Maximum Filesystem Size

For Linux 2.4, 2 TB. For Linux 2.6 and beyond, when using 64 bit
addressing in the block devices layer (CONFIG_LBD) and a 64 bit
platform, filesystem size limit increases to 9 million terabytes (or the
device limits). For these later kernels on 32 bit platforms, 16TB is the
current limit even with 64 bit addressing enabled in the block layer.
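
The figures quoted above are consistent with simple index-width
arithmetic. The sketch below assumes (this is not stated in the quoted
text) that the 2.4 file offset limit comes from a 32-bit page index, the
2.4 filesystem limit from a 32-bit count of 512-byte sectors, and the
2.6 limit from a signed 64-bit byte offset. The function name is
hypothetical.

```python
# Size-limit arithmetic under the assumptions stated above.
TIB = 2**40  # binary terabyte

def limit_bytes(index_bits, unit_bytes):
    """Largest addressable size given an index width and a unit size."""
    return (2**index_bits) * unit_bytes

file_limit_4k = limit_bytes(32, 4 * 1024)    # 32-bit page index, 4K pages
file_limit_16k = limit_bytes(32, 16 * 1024)  # 32-bit page index, 16K pages
fs_limit_2_4 = limit_bytes(32, 512)          # 32-bit sector number
offset_limit_64 = 2**63                      # signed 64-bit byte offset

print(file_limit_4k // TIB)   # 16
print(file_limit_16k // TIB)  # 64
print(fs_limit_2_4 // TIB)    # 2
# In decimal terabytes (10**12 bytes) this is roughly 9.2 million.
print(offset_limit_64 // 10**12)
```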


You can put up to 48 drives in a single 5U chassis, so in theory each
chassis could give you 24 TB raw.  If you want hundreds of TB to PB and
larger in a single file system, you are going to have to go to a cluster
FS of some sort.
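
To put rough numbers on this, here is a small Python sketch of how many
such chassis a given usable capacity would take. The 48 drives and 24 TB
raw per chassis come from the text above; the 0.5 TB drive size is only
implied by those two figures, and the halving for mirroring, the target
sizes and the function name are assumptions for illustration.

```python
import math

DRIVES_PER_CHASSIS = 48    # from the text
RAW_TB_PER_CHASSIS = 24.0  # from the text

# Implied per-drive capacity (an inference, not stated in the text).
drive_tb = RAW_TB_PER_CHASSIS / DRIVES_PER_CHASSIS

def chassis_needed(target_usable_tb, mirrored=True):
    """Chassis count for a target usable size; mirroring halves capacity."""
    usable = RAW_TB_PER_CHASSIS / 2 if mirrored else RAW_TB_PER_CHASSIS
    return math.ceil(target_usable_tb / usable)

print(drive_tb)                               # 0.5
print(chassis_needed(100))                    # 9
print(chassis_needed(1000))                   # 84
print(chassis_needed(1000, mirrored=False))   # 42
```

This ignores filesystem overhead, hot spares and any RAID level other
than plain mirroring, so real counts would be somewhat higher.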

> Even something like a 64 bit linux system aggregating gnbd exports or 
> similar with md or lvm and xfs or reiserfs (or whatever filesystem) 
> would count, if it works reliably.

Testing the reliability of something like this would be hard.  I would
strongly suggest designing for reasonable failure modes (as compared to
spectacular ones) and graceful reduction in service (rather than
abrupt).  This means that you would likely want to do lots of mirroring
rather than RAID5/6.  You would also want private and redundant storage
networks behind your gnbd.  Not necessarily SAN level, though you could
do that.  lvm/md requires a local block device (last I checked) so you
would need a gnbd below it if you wanted to cluster things.  In this
case, I might suggest building large RAID10s (pairs of mirrors), and
having each unit do as much of the IO work on a dedicated and high
quality card.  Each RAID10 subunit would have about 4 "devices" attached
as a stripe across mirrors.  Without expensive hardware, the
stripe/mirror would need to be done in software (lvm level).  This may
have serious issues unless you can make your servers redundant as well.

If you use iSCSI or similar bits, or even AoE, you can solve the block
problem.  I can have each tray of Coraid  (for example, could be done
with iSCSI as well) disks appear as a single block.  I can then run lvm
and build a huge file system.  With a little extra work, we can build a
second path to another tray of disks, and set up an lvm mirror (RAID1).
That's 7.5TB mirrored.  Now add in additional trays in mirrored pairs,
and use LVM to stripe or concatenate across them.  In the iSCSI case and
in the AoE case, the issue will be having sufficient bandwidth to the
trays for the large file system.  You will want as many high speed
connections as possible to avoid oversubscribing one.  With IB
interconnects (or 10GbE) it shouldn't be too hard to have multiple trays
per network connection (disks will be slower than the net).  With AoE,
you will need a multiport GbE card or two (disks close to same speed as
the net).

We have built large xfs and ext3 file systems on such units.  I wouldn't
recommend the latter (or reiserfs) for this.  JFS is reported to be
quite good for larger file systems as well.

Basically, with the right FS and the right setup, it is doable, though
management will be a challenge.  Lustre may or may not help on this. 
Some vendors are pushing it hard.  Some are pushing GPFS hard.  YMMV.



Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at
web  :
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615
