To crash or not to crash

Eray Ozkural erayo at cs.bilkent.edu.tr
Thu May 9 20:12:57 PDT 2002


On Friday 10 May 2002 00:58, W Bauske wrote:
> Eray Ozkural wrote:
> > It's very easy to crash a node with a suitable code, so I shouldn't have
> > to re-install it or manually fsck it every time it fails to reboot after
> > such a crash...
>
> How do you "easily" crash a node. Are you exceeding some resource
> limit or??
>
> I run quite large problems and don't see problems. Perhaps you mean
> performance grinds to a halt because of paging or something like that
> which makes the node un-responsive so you power cycle it.
>

Well. :) I think it depends on the application, and I certainly can't give 
you a minimal piece of code that is going to freeze any system for good. It 
does happen from time to time, though, more so on certain kernel version / 
hardware combinations. It's hard to say when and how those things happen, but 
exhausting system resources is a good way to disrupt normal operation, as you 
say. But by crash I mean crash, not a temporary inflation of the working set.

Let me try to give an example of what happens. I sometimes run a large 
program, i.e. one that uses lots of CPU/disk/network, and a node simply goes 
down. I'm sure almost everybody has seen that kind of thing; for instance, 
some GL programs used to crash XFree86 and the whole system rather easily. The 
system would lock up or reboot right away... If you write algorithms that use 
a lot of system resources or do unusual things, you may have done it with your 
own user-space code, too.

I have never used a system that cannot be crashed :) If you've used such a 
system, feel free to advertise it, but Linux is certainly not like that :) 
(Maybe the *BSD people would want to praise their systems right now :) )
 
After all, these kinds of things are to be expected because *nobody* can give 
a formal proof that the system cannot crash, if you know what I mean.

Unless, of course, the whole system was built upon such an invariant, which is 
not the case.

I'm hoping that this gives a little justification for why you would want a 
filesystem that will not lose precious files/dirs on an unexpected crash; 
well, all crashes are unexpected...

Now if the computer that crashes is your home PC, and you are the only user, 
it may be possible to predict what might crash your system, like when you're 
testing your uber-kernel-module or superb-ai-algorithm. The problem is that 
even then you can't guarantee that it won't crash. My claim here is that you 
can crash a system with appropriate user-space code.

On a cluster the probability that one of the nodes might crash is high.
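
To put a rough number on that: assuming nodes fail independently and each has 
the same probability p of crashing during a job (both simplifying assumptions, 
real clusters are not that tidy), the chance that at least one of n nodes goes 
down is 1 - (1 - p)^n. A minimal Python sketch, where the function name and 
the figures are made up purely for illustration:

```python
def prob_any_node_crashes(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes crashes.

    Assumes (hypothetically) that nodes crash independently and that each
    has the same crash probability p_node over the period of interest.
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    if n_nodes < 0:
        raise ValueError("n_nodes must be non-negative")
    # P(no node crashes) = (1 - p)^n, so P(at least one crashes) is its complement.
    return 1.0 - (1.0 - p_node) ** n_nodes


if __name__ == "__main__":
    # Illustrative figures only: a 1% per-node chance, for growing cluster sizes.
    for n in (1, 16, 64, 256):
        print(f"{n:4d} nodes: {prob_any_node_crashes(0.01, n):.3f}")
```

With these made-up figures, a 1% per-node chance already becomes close to an 
even bet on a 64-node cluster.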

Of course, I would like to have a system that is wholly immune to crashes, 
but I think it is a little naive to claim that Linux cannot crash. The uptime 
of some Linux boxen does not show that Linux is incapable of crashing; it's 
simply that the whole system there was in a stable region. Try changing the 
system components frequently, and you will get a crash. [*] 

Now I won't ever say "crash" again. :) Some people here might want to hear me 
say that "Linux cannot crash, and ext2 is the best filesystem ever written", 
but I won't say it even if Linus Torvalds and gang join this thread :) I 
doubt they would make such an over-confident statement :)

And I still think that ext3 is not the only filesystem that is better than 
ext2.

You could surely say that Linux is more stable than, say, any version of 
Windows, which I would wholeheartedly agree with.

Cheers,

[*] Or maybe it might be said that I haven't configured my systems well 
enough. True, but what's the point of an OS if I have to configure it to 
prevent it from crashing? :)

-- 
Eray Ozkural (exa) <erayo at cs.bilkent.edu.tr>
Comp. Sci. Dept., Bilkent University, Ankara
www: http://www.cs.bilkent.edu.tr/~erayo  Malfunction: http://mp3.com/ariza
GPG public key fingerprint: 360C 852F 88B0 A745 F31B  EA0F 7C07 AE16 874D 539C



