Large FOSS filesystems, was Re: [Beowulf] 512 nodes Myrinet cluster Challenges

Thu May 4 11:17:24 PDT 2006

Dan Stromberg wrote:

> On a somewhat related note, are there any FOSS filesystems that can
> surpass 16 terabytes in a single filesystem - reliably?

http://oss.sgi.com/projects/xfs/

Quoting:

"
Maximum File Size

For Linux 2.4, the maximum accessible file offset is 16TB on 4K page 
size and 64TB on 16K page size. For Linux 2.6, when using 64 bit 
addressing in the block devices layer (CONFIG_LBD), file size limit 
increases to 9 million terabytes (or the device limits).

Maximum Filesystem Size

For Linux 2.4, 2 TB. For Linux 2.6 and beyond, when using 64 bit 
addressing in the block devices layer (CONFIG_LBD) and a 64 bit 
platform, filesystem size limit increases to 9 million terabytes (or the 
device limits). For these later kernels on 32 bit platforms, 16TB is the 
current limit even with 64 bit addressing enabled in the block layer.

"

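The quoted limits are consistent with simple index-width arithmetic.
The sketch below is not taken from the XFS page: it assumes the 16TB
and 64TB figures come from a 32-bit page index, the 2TB figure from a
32-bit count of 512-byte sectors, and the "9 million terabytes" figure
from a signed 64-bit byte offset.

```python
# Rough arithmetic behind the quoted limits (assumptions noted above).
KiB = 2**10
TiB = 2**40

def max_offset_32bit_page_index(page_size_bytes):
    """Bytes addressable when pages are indexed by a 32-bit number."""
    return 2**32 * page_size_bytes

# 32-bit page index: 4K pages -> 16 TiB, 16K pages -> 64 TiB
limit_4k_tib = max_offset_32bit_page_index(4 * KiB) // TiB
limit_16k_tib = max_offset_32bit_page_index(16 * KiB) // TiB

# 32-bit sector count with 512-byte sectors -> 2 TiB
limit_sectors_tib = (2**32 * 512) // TiB

# Signed 64-bit byte offset: 2**63 bytes, in millions of decimal TB
limit_64bit_million_tb = round(2**63 / 10**12 / 10**6, 1)

print(limit_4k_tib, limit_16k_tib, limit_sectors_tib, limit_64bit_million_tb)
# prints: 16 64 2 9.2
```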
You can put up to 48 drives in a single 5U chassis, so in theory each 
chassis could give you 24 TB raw.   If you want hundreds of TB to PB and 
larger in a single file system, you are going to have to go to a cluster 
FS of some sort.
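As a back-of-the-envelope check on the chassis numbers, here is a small
sketch. The 0.5 TB drive size is inferred from 24 TB / 48 drives and is
not stated in the post; the 1000 TB target and the chassis_needed
helper are purely illustrative, and filesystem overhead is ignored.

```python
import math

DRIVES_PER_CHASSIS = 48   # from the post
RAW_TB_PER_CHASSIS = 24   # from the post

# Implied drive size (an inference, not stated in the post)
tb_per_drive = RAW_TB_PER_CHASSIS / DRIVES_PER_CHASSIS

def chassis_needed(target_tb, mirrored=False):
    """Chassis count for target_tb usable; mirroring halves usable space."""
    usable = RAW_TB_PER_CHASSIS / 2 if mirrored else RAW_TB_PER_CHASSIS
    return math.ceil(target_tb / usable)

print(tb_per_drive)                          # 0.5
print(chassis_needed(1000))                  # 42
print(chassis_needed(1000, mirrored=True))   # 84
```

At roughly 1 PB the chassis count is already in the dozens, which is
where a single-server file system stops being practical.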

> Even something like a 64 bit linux system aggregating gnbd exports or
> similar with md or lvm and xfs or reiserfs (or whatever filesystem)
> would count, if it works reliably.

Testing the reliability of something like this would be hard.  I would
strongly suggest that you design for reasonable failure modes (as
compared to spectacular ones) and graceful reduction in service (rather
than abrupt).  This means that you would likely want to do lots of
mirroring rather than RAID5/6.  You would also want private and
redundant storage networks behind your gnbd.  Not necessarily SAN
level, though you could do that.  lvm/md requires a local block device
(last I checked), so you would need a gnbd below it if you wanted to
cluster things.  In this case, I might suggest building large RAID10s
(stripes across mirrored pairs), and having each unit do as much of the
IO work as possible on a dedicated, high quality card.  Each RAID10
subunit would have about 4 "devices" attached as a stripe across
mirrors.  Without expensive hardware, the stripe/mirror would need to
be done in software (lvm level).  This may have serious issues unless
you can make your servers redundant as well.
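To make the failure-mode argument concrete, here is a toy model of a
4-device RAID10 subunit (a stripe across two mirrored pairs).  It only
counts which combinations of failed devices lose data; the device
numbering and function names are hypothetical, and it says nothing
about rebuild times or controller failures.

```python
from itertools import combinations

def raid10_survives(failed, n_pairs):
    """True if no mirrored pair has lost both members.

    Devices are numbered 0..2*n_pairs-1; pair k holds devices 2k and 2k+1.
    """
    return all(not ({2 * k, 2 * k + 1} <= failed) for k in range(n_pairs))

def surviving_fraction(n_pairs, n_failures):
    """Fraction of n_failures-device combinations the array survives."""
    combos = list(combinations(range(2 * n_pairs), n_failures))
    ok = sum(raid10_survives(set(c), n_pairs) for c in combos)
    return ok / len(combos)

for n in (1, 2, 3):
    print(n, round(surviving_fraction(2, n), 3))
# prints:
# 1 1.0
# 2 0.667
# 3 0.0
```

For comparison, a single RAID5 set survives any one failure but no
double failure, so the mirrored layout degrades more gently at the cost
of half the raw capacity.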

If you use iSCSI or similar bits, or even AoE, you can solve the block
problem.  I can have each tray of Coraid disks (for example; this could
be done with iSCSI as well) appear as a single block device.  I can
then run lvm and build a huge file system.  With a little extra work,
we can build a second path to another tray of disks, and set up an lvm
mirror (RAID1).  That's 7.5TB mirrored.  Now add in additional trays in
mirrored pairs, and use LVM to stripe or concatenate across them.  In
the iSCSI case and in the AoE case, the issue will be having sufficient
bandwidth to the trays for the large file system.  You will want as
many high speed connections as possible to avoid oversubscribing any
one of them.  With IB interconnects (or 10GbE) it shouldn't be too hard
to have multiple trays per network connection (disks will be slower
than the net).  With AoE, you will need a multiport GbE card or two
(disks close to same speed as net).
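The oversubscription point can be sketched with rough numbers.  The
link rates below are raw line rates (1 Gb/s = 125 MB/s, before protocol
overhead); the 100 MB/s per-tray throughput is a hypothetical
placeholder, not a measured Coraid or iSCSI figure, so substitute your
own measurements.

```python
import math

GBE_MB_S = 1000 / 8        # 125.0 MB/s raw line rate
TEN_GBE_MB_S = 10000 / 8   # 1250.0 MB/s raw line rate
TRAY_MB_S = 100.0          # hypothetical sustained throughput per tray

def trays_per_link(link_mb_s, tray_mb_s=TRAY_MB_S):
    """Trays one link can feed at full speed without oversubscription."""
    return int(link_mb_s // tray_mb_s)

def links_needed(n_trays, link_mb_s, tray_mb_s=TRAY_MB_S):
    """Links required so n_trays can all stream at full speed."""
    return math.ceil(n_trays * tray_mb_s / link_mb_s)

print(trays_per_link(GBE_MB_S))        # 1
print(trays_per_link(TEN_GBE_MB_S))    # 12
print(links_needed(8, GBE_MB_S))       # 7
print(links_needed(8, TEN_GBE_MB_S))   # 1
```

Under these assumptions a GbE port is saturated by roughly one tray,
which is why a multiport card is needed, while a single 10GbE or IB
link can carry several trays.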

We have built large xfs and ext3 file systems on such units.  I wouldn't 
recommend the latter (or reiserfs) for this.  JFS is reported to be 
quite good for larger file systems as well.

Basically, with the right FS and the right setup, it is doable, though
management will be a challenge.  Lustre may or may not help with this.
Some vendors are pushing it hard.  Some are pushing GPFS hard.  YMMV.

Joe

-- 

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615