[Beowulf] GPFS and failed metadata NSD

John Hearns hearnsj at googlemail.com
Fri May 19 00:48:13 PDT 2017


Regarding failed disks, you have of course done the correct thing in
sending it to a data recovery company.

However, I would like to heartily recommend sysrescuecd as a first line
tool in these situations:
http://www.system-rescue-cd.org/

This is a live CD which includes a utility for scanning for and rebuilding
'lost' partitions.
In your case this probably would not have helped.
However, I carry a sysrescue image around with me on a USB stick on my
lanyard, so I am poised and ready to leap into the nearest telephone box
and emerge as a superhero, should anyone ever cry out "Help - my Linux
system has gone down!"
Hasn't happened yet, but I can dream.
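
On the triage side, here is a minimal sketch in Python of how one might
pick the unhealthy disks out of mmlsdisk output like that quoted further
down. The function name parse_mmlsdisk and the SAMPLE text are made up for
illustration and are not part of GPFS. It assumes each disk row sits on a
single line (not wrapped by a mail client), that the driver type column
reads "nsd", and that the last two columns are availability and storage
pool, as in the quoted output.

```python
# Hypothetical helper, not part of GPFS: parse `mmlsdisk` text output.
# Assumes one disk per line, driver type "nsd", and that the final two
# columns are availability and storage pool.
SAMPLE = """\
disk         driver   sector     failure holds    holds                            storage
name         type       size       group metadata data  status        availability pool
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
"""


def parse_mmlsdisk(text):
    """Return a list of dicts, one per disk row found in the text."""
    disks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip header lines, the dashed rule, and anything too short.
        if len(fields) < 9 or fields[1] != "nsd":
            continue
        # The status can contain spaces ("to be emptied"), so read the
        # fixed columns from the left and the last two from the right.
        disks.append({
            "name": fields[0],
            "sector_size": int(fields[2]),
            "failure_group": int(fields[3]),
            "holds_metadata": fields[4] == "Yes",
            "holds_data": fields[5] == "Yes",
            "status": " ".join(fields[6:-2]),
            "availability": fields[-2],
            "pool": fields[-1],
        })
    return disks


if __name__ == "__main__":
    for disk in parse_mmlsdisk(SAMPLE):
        if disk["availability"] != "up" or disk["status"] != "ready":
            print(disk["name"], disk["status"], disk["availability"])
```

Run against the sample above, this flags only SSD_NSD_25. It is read-only
and changes nothing on the filesystem; it is just a quick way to see which
NSDs need attention.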

On 19 May 2017 at 09:39, John Hanks <griznog at gmail.com> wrote:

> Thanks Arif, I'm signed up there now.
>
> As a general update, the most recently failed disk of the pair is at a
> data recovery company who thinks they can recover a workable image from it.
> We should have that back in two or three weeks and will try to use it to
> recover the filesystem.
>
> jbh
>
> On Thu, May 18, 2017 at 5:21 PM Arif Ali <mail at arif-ali.co.uk> wrote:
>
>> Hi John,
>>
>> I would recommend joining the spectrumscale.org mailing list, where
>> you will find very good experts from the HPC industry who know GPFS well,
>> including vendors, users and integrators. More specifically, you'll
>> find GPFS developers on there. Maybe someone on that list can help out.
>>
>> A more direct link to the mailing list is here:
>> https://www.spectrumscale.org:10000/virtualmin-mailman/unauthenticated/listinfo.cgi/gpfsug-discuss/
>>
>>
>> On 29/04/2017 08:00, John Hanks wrote:
>>
>> Hi,
>>
>> I'm not getting much useful vendor information so I thought I'd ask here
>> in the hopes that a GPFS expert can offer some advice. We have a GPFS
>> system which has the following disk config:
>>
>> [root at grsnas01 ~]# mmlsdisk grsnas_data
>> disk         driver   sector     failure holds    holds                            storage
>> name         type       size       group metadata data  status        availability pool
>> ------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
>> SAS_NSD_00   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_01   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_02   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_03   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_04   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_05   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_06   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_07   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_08   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_09   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_10   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_11   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_12   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_13   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_14   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_15   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_16   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_17   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_18   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_19   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_20   nsd         512         100 No       Yes   ready         up           system
>> SAS_NSD_21   nsd         512         100 No       Yes   ready         up           system
>> SSD_NSD_23   nsd         512         200 Yes      No    ready         up           system
>> SSD_NSD_24   nsd         512         200 Yes      No    ready         up           system
>> SSD_NSD_25   nsd         512         200 Yes      No    to be emptied down         system
>> SSD_NSD_26   nsd         512         200 Yes      No    ready         up           system
>>
>> SSD_NSD_25 is a mirror in which both drives have failed due to a series
>> of unfortunate events and will not be coming back. From the GPFS
>> troubleshooting guide it appears that my only alternative is to run
>>
>> mmdeldisk grsnas_data  SSD_NSD_25 -p
>>
>> which the documentation also warns is irreversible, the sky is
>> likely to fall, dogs and cats sleeping together, etc. But at this point I'm
>> already in an irreversible situation. Of course this is a scratch
>> filesystem, of course people were warned repeatedly about the risk of using
>> a scratch filesystem that is not backed up and of course many ignored that.
>> I'd like to recover as much as possible here. Can anyone confirm/reject
>> that deleting this disk is the best way forward or if there are other
>> alternatives to recovering data from GPFS in this situation?
>>
>> Any input is appreciated. Adding salt to the wound is that until a few
>> months ago I had a complete copy of this filesystem that I had made onto
>> some new storage as a burn-in test but then removed as that storage was
>> consumed... As they say, sometimes you eat the bear, and sometimes, well,
>> the bear eats you.
>>
>> Thanks,
>>
>> jbh
>>
>> (Naively calculated probability of these two disks failing close together
>> in this array: 0.00001758. I never get this lucky when buying lottery
>> tickets.)
>> --
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>
>> --
>> regards,
>>
>> Arif Ali
>> Mob: +447970148122
>>