[Beowulf] GPFS and failed metadata NSD

Thu May 25 13:26:33 PDT 2017

> On May 25, 2017, at 4:10 PM, Kilian Cavalotti <kilian.cavalotti.work at gmail.com> wrote:
> 
> On Thu, May 25, 2017 at 8:58 AM, Ryan Novosielski <novosirj at rutgers.edu> wrote:
>> I’d be interested to hear what people are doing, generally, about backing up very large volumes of data (that probably seem smaller to more established centers), like 500TB to 1PB. It sounds to me like a combination of replication and filesystem snapshots (whether those are replicated or not) does protect against hardware failure and user error, depending on the frequency and whether or not you have any other hidden weaknesses.
> 
> At Stanford, we (Research Computing) have developed a PoC using Lustre
> HSM and a Google Drive backend to back our /scratch filesystem up,
> mostly because Google Drive is free and unlimited for .edu accounts
> (^_^). We didn't announce anything to our users, so they don't start
> relying on it, and we use this more as insurance against user
> "creativity" than as a real disaster-recovery mechanism.
> 
> We found out that this was working quite well for backing up large
> files, but not so well for smaller ones because Google enforces secret
> file operation rate limits (I say secret because they're not the ones
> that are documented, and support doesn't want to talk about them),
> which I guess is fair for a free and unlimited service. But that means
> that for a filesystem with hundreds of millions of files, this is not
> really appropriate.
> 
> We did some tests for restoring data from the Google Drive backend,
> and another limitation with the current Lustre HSM implementation is
> that the HSM coordinator doesn't prioritize restore operations.
> Meaning that if you have thousands of "archive" operations in queue,
> the coordinator will need to go through all of them before processing
> your "restore" ops. Which again, in real life, might be a deal-breaker
> for disaster recovery.
> 
> Anyway, we had quite some fun doing it, including some nice chats with
> the Networking people on campus (which actually led to a new 100G
> data link being deployed). We've released the open source Lustre HSM
> to Google Drive copytool that we developed on GitHub
> (https://github.com/stanford-rc/ct_gdrive). And we're now the proud
> users of about 3.3 PB on Google Drive (screenshot attached,
> because it happened).

Boy, that’s great, Kilian, thanks! I’m already glad I asked.

--
____
|| \\UTGERS,  	 |---------------------------*O*---------------------------
||_// the State	 |         Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ	 | Office of Advanced Research Computing - MSB C630, Newark
     `'
