[Beowulf] Good post-mortem of a Lustre outage at CSC

Olli-Pekka Lehto olli-pekka.lehto at csc.fi
Sun Apr 3 07:50:33 PDT 2016


Thanks Adam! :)

I was planning to post it here as well but hadn't gotten to it yet. 

I'm happy to answer any questions and hear comments. We'll try to put the parallel ramdisk recipe on our GitHub soon.

Best regards,
O-P
-- 
Olli-Pekka Lehto
Development Manager
Computing Platforms
CSC - IT Center for Science Ltd.
E-Mail: olli-pekka.lehto at csc.fi
Tel: +358 50 381 8604
skype: oplehto // twitter: ople

----- Original Message -----
> From: "Adam DeConinck" <ajdecon at ajdecon.org>
> To: beowulf at beowulf.org
> Sent: Friday, 1 April, 2016 18:53:08
> Subject: [Beowulf] Good post-mortem of a Lustre outage at CSC

> In case some of the folks on this list haven't seen this particular
> horror story yet :)
> 
> https://csc.fi/web/blog/post/-/blogs/the-largest-unplanned-outage-in-years-and-how-we-survived-it
> 
> "The DDN controller replacement went quite smoothly and around 10 a.m.
> we were ready to bring the system back online. However, when
> restarting the Lustre filesystem, the metadata server reported
> anomalies in its filesystem and requested to do a filesystem check
> (fsck). Typically these are fairly routine operations, especially when
> the filesystem has been up for a long time. Any problems that the
> check finds are typically fixed automatically with no impact.
> 
> In this case, however, the tool could not fix all the problems it
> identified. A faulty inode persisted. Trying to bring the Lustre up
> resulted in a system crash (kernel panic) with this inode a very
> likely cause."
> 
> -Adam
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

