moloney.brendan at gmail.com
Tue Apr 21 19:20:27 PDT 2015
A bit late to the discussion, but I am currently setting up Ceph for our
cluster storage and wanted to throw in my 2 cents.
It is important to realize that Ceph provides many layers of storage
services. At the lowest level we have the object storage layer (RADOS)
which can be accessed through the "RADOS Gateway" (RGW) using an S3/Swift
compatible REST API. On top of the object storage there is the block
device layer "RADOS Block Device" (RBD) and the filesystem layer CephFS.
Out of these, CephFS is the only one that is not widely regarded as
production ready (though many people find it suitable for their needs).
I am planning to use the block layer to provide storage for virtual
machines that in turn provide various storage services (mostly NFS and
Samba but also more specialized storage). Storage performance for any
single job in our cluster is less of a concern than aggregate performance,
which makes it acceptable to extract additional concurrency by splitting
data across these virtual machines. We can essentially provide a storage VM
for each group and thus avoid issues with one group overloading the storage
system and causing problems for other users. The block devices can be thin
provisioned which allows you to allocate a large XFS or ext4 file system
but only store blocks that are actually used. We also get cheap COW
snapshots for short term backups and as the available space goes down we
can scale storage and performance by just adding nodes.
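To make the thin provisioning and snapshot behaviour concrete, here is a
minimal toy model in Python. It is only a sketch: the ThinImage class and its
methods are hypothetical names for illustration, not the RBD API, and a real
implementation defers copying far more aggressively than this does.

```python
class ThinImage:
    """Toy model of a thin-provisioned image with cheap snapshots.

    Hypothetical illustration only; this is not how RBD is implemented.
    """

    def __init__(self, size_blocks):
        self.size_blocks = size_blocks  # advertised (virtual) size
        self.blocks = {}                # only written blocks are stored
        self.snapshots = {}             # snapshot name -> frozen block map

    def write(self, index, data):
        if not 0 <= index < self.size_blocks:
            raise IndexError("block outside image")
        # Rebinding the entry leaves any snapshot's reference untouched.
        self.blocks[index] = data

    def read(self, index, snapshot=None):
        source = self.snapshots[snapshot] if snapshot else self.blocks
        return source.get(index, b"\0")  # unwritten blocks read as zeros

    def snapshot(self, name):
        # Copies the block map (references only), never the block data.
        self.snapshots[name] = dict(self.blocks)

    def allocated(self):
        return len(self.blocks)


image = ThinImage(size_blocks=1_000_000)  # large virtual size
image.write(0, b"v1")
image.snapshot("nightly")
image.write(0, b"v2")                     # snapshot still sees b"v1"
```

The image advertises a million blocks but stores only the one that was
written, and the snapshot keeps the old contents of block 0 without
duplicating any unchanged data.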
The big downside to this approach is that high availability in the storage
services needs to be provided through traditional failover techniques.
This is one of the places where CephFS would be a huge improvement.
Perhaps most importantly, regardless of which storage layer you are using,
Ceph is extremely flexible in how data is ultimately stored. Intelligent
data placement with respect to failure domains (using the CRUSH algorithm)
is an integral part of Ceph. You can ensure that different replicas of each
object are stored on different servers/racks/switches/etc. You can use
erasure coding instead of replication to boost storage efficiency, essentially
providing distributed parity RAID. There are even erasure codes that allow
single drive errors to be fixed within a single server/rack/etc to reduce
network load. You can use a small pool of fast drives to cache a large pool
of slow drives. This flexibility is really the most interesting aspect of
Ceph to me.
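A rough sketch of two of these ideas in Python, under loud assumptions: the
placement function below is a simple hash-ranking scheme chosen for
illustration and is NOT the CRUSH algorithm, and the rack/host names and
function names are all made up. It only shows the general idea of
deterministic placement with one replica per failure domain, plus the
space-efficiency arithmetic for replication versus a k+m erasure code.

```python
import hashlib


def usable_fraction_replicated(copies):
    """Fraction of raw capacity that holds unique data with N-way copies."""
    return 1 / copies


def usable_fraction_erasure(k, m):
    """Fraction of raw capacity that holds data with k data + m coding chunks."""
    return k / (k + m)


def _score(*parts):
    # Deterministic pseudo-random score derived from the given strings.
    digest = hashlib.sha256("/".join(parts).encode()).hexdigest()
    return int(digest, 16)


def place_replicas(object_name, racks, copies):
    """Pick `copies` (rack, host) pairs, each in a different rack.

    Illustrative hash ranking only; this is not CRUSH.
    """
    if copies > len(racks):
        raise ValueError("not enough racks for the requested copies")
    # Rank racks by a hash of (object, rack) and keep the top `copies`.
    chosen = sorted(racks, key=lambda r: _score(object_name, r),
                    reverse=True)[:copies]
    # Within each chosen rack, pick the highest-scoring host.
    return [(rack, max(racks[rack],
                       key=lambda h: _score(object_name, rack, h)))
            for rack in chosen]


# Hypothetical cluster layout: rack name -> hosts in that rack.
racks = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
    "rack-d": ["d1", "d2"],
}
placement = place_replicas("some-object", racks, copies=3)
```

With 3-way replication only one third of the raw capacity holds unique data,
while a 4+2 erasure code keeps two thirds usable and still tolerates the loss
of any two chunks. Any client can recompute the placement from the object
name and the cluster layout alone, with no central lookup table.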
On Wed, Apr 15, 2015 at 6:56 AM, Olli-Pekka Lehto <olli-pekka.lehto at csc.fi> wrote:
> On 15 Apr 2015, at 06:50, Mark Hahn <hahn at mcmaster.ca> wrote:
> >> In an environment that needs to adapt to evolving user needs, trading
> >> performance for the flexibility that Ceph offers does not seem like a
> >> bad deal.
> > it would be appreciated if you could be a bit more specific. what kind
> > of performance, what kind of flexibility?
> > thanks, mark hahn.
> To give some background, we have two types of environments with different
> granularity of funding and customership:
> 1. HPC environment:
> We get a big chunk of funding every few years that needs to be invested
> within a limited time. The need is for fast parallel storage. Thus big,
> enterprise class storage boxes with Lustre. The system and SLA will remain
> fairly static for several years. Growth is fairly predictable.
> 2. Cloud environment:
> Ongoing streams of small-medium funding from various customers. Some of
> these can be sold services and some need to show an investment for the
> research-granting organization. The needs of
> price-performance-resilience-capacity might be different for different
> customers. Growth is unpredictable.
> For the first case the Lustre model works fine but for the latter it can
> be a bit more constrained: For this we should be able to grow our compute
> and storage capacity smoothly even for cases where the funding is
> fine-grained, while keeping the architecture simple. Also the workload
> profiles and resiliency requirements are not completely clear for future
> With Ceph we can scale storage in a way that’s more akin to the one that
> we scale compute nodes: We can throw more nodes at it to make it grow in a
> fairly linear fashion and with a fine granularity. We can also adjust
> resiliency parameters in software instead of having a large part of it
> fixed in the hardware design.
> I don’t see Lustre going away, at least in our environments, anytime soon
> and we have not done any real apples-to-apples comparisons yet on
> performance. Initially we’re not targeting huge scalability or performance.
> Basically something that is better than NFS is good enough initially.
> It’s also interesting to see how the resiliency will compare. Having
> experienced multiple generations of expensive “invincible” arrays having
> issues that baffle us (and often the vendors) time after time, something
> with cheaper but more decoupled HW might turn out to be better.
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing