[Beowulf] GPFS and failed metadata NSD

Kilian Cavalotti kilian.cavalotti.work at gmail.com
Thu May 25 13:10:43 PDT 2017


On Thu, May 25, 2017 at 8:58 AM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
> I’d be interested to hear what people are doing, generally, about backing up very large volumes of data (that probably seem smaller to more established centers), like 500TB to 1PB. It sounds to me like a combination of replication and filesystem snapshots (those replicated or not) do protect against hardware failure and user failure, depending on the frequency and whether or not you have any other hidden weaknesses.

At Stanford, we (Research Computing) have developed a PoC using Lustre
HSM and a Google Drive backend to back up our /scratch filesystem,
mostly because Google Drive is free and unlimited for .edu accounts
(^_^). We didn't announce anything to our users, so they wouldn't start
relying on it, and we use it more as insurance against user
"creativity" than as a real disaster-recovery mechanism.

We found that this works quite well for backing up large files, but
not so well for smaller ones, because Google enforces secret
file-operation rate limits (I say secret because they're not the ones
that are documented, and support doesn't want to talk about them),
which I guess is fair for a free and unlimited service. But it means
that for a filesystem with hundreds of millions of files, this
approach is not really appropriate.
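
In practice we had to wrap every Drive call in a retry loop: when you
hit those limits the API starts returning 403/429 errors, and the only
thing you can do is back off. Something along these lines (a simplified
sketch, not the production code):

    # Hypothetical retry wrapper for Drive requests that hit rate limits:
    # back off exponentially (with jitter) on 403/429 responses.
    import random
    import time
    from googleapiclient.errors import HttpError

    def with_backoff(request, max_tries=8):
        for attempt in range(max_tries):
            try:
                return request.execute()
            except HttpError as err:
                if err.resp.status not in (403, 429):
                    raise                      # real error, don't retry
                time.sleep(min(2 ** attempt + random.random(), 64))
        raise RuntimeError("still rate-limited after %d tries" % max_tries)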

We also did some tests restoring data from the Google Drive backend,
and another limitation of the current Lustre HSM implementation is
that the HSM coordinator doesn't prioritize restore operations: if
you have thousands of "archive" operations in the queue, the
coordinator will need to work through all of them before processing
your "restore" ops, which, in real life, might be a deal-breaker for
disaster recovery.
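
For reference, this is roughly how we scripted those restore tests from
a client node, using the standard "lfs hsm_restore" and "lfs hsm_state"
commands (sketch only; the poll interval is arbitrary):

    # Queue an explicit restore and wait until the file is no longer
    # "released", i.e. its data has been copied back from Google Drive.
    import subprocess
    import time

    def restore_and_wait(path, poll=30):
        subprocess.run(["lfs", "hsm_restore", path], check=True)
        while "released" in subprocess.run(["lfs", "hsm_state", path],
                                           capture_output=True,
                                           text=True).stdout:
            time.sleep(poll)   # still waiting on the coordinator

With a long backlog of queued archive requests, that wait can stretch
far beyond what you'd want during an actual recovery.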

Anyway, we had quite some fun doing it, including some nice chats
with the Networking people on campus (which actually led to a new
100G data link being deployed). We've released the open-source Lustre
HSM to Google Drive copytool that we developed on GitHub
(https://github.com/stanford-rc/ct_gdrive), and we're now the proud
users of about 3.3 PB on Google Drive (screenshot attached, because
it happened).

Cheers,
-- 
Kilian
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bigdata_gdrive.png
Type: image/png
Size: 178116 bytes
Desc: not available
URL: <http://www.beowulf.org/pipermail/beowulf/attachments/20170525/76c85249/attachment-0001.png>

