couple of questions

Ted Sariyski tsariysk at
Tue Feb 18 09:09:23 PST 2003


I have a couple of questions:
1. I'm running a cluster of 100 i386 CPUs with channel-bonded 100BaseT
Ethernet. I use Red Hat 7.2 and an NFS-mounted file server for user
storage. Occasionally some processes enter the "D" state forever, and
the only way I have found to get rid of them is to reboot the node. What
may cause this problem, and is there a more intelligent solution to it?

2. I'm planning to rebuild the cluster and boot the nodes over the
network. I cannot afford Scyld or PBSPro, so I am looking for a solution
for a diskless cluster with MPICH and OpenPBS. I believe that if I keep
a local hard disk I should be able to provide the 'local space' that
OpenPBS requires to run. Any comments?

3. I'm shopping for a parallel debugger and an accurate parallel
profiler with minimal performance overhead. Jumpshot seems to be
inappropriate for profiling a 100-CPU job. Any recommendations?
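
Regarding question 1, the "D" state is the kernel's uninterruptible
sleep, and the affected processes can be listed by reading the state
field of /proc/<pid>/stat. What follows is only a minimal sketch of that
idea, assuming a Linux /proc filesystem and a reasonably recent Python 3
on the node; the function names parse_stat_line and find_d_state are
made up for illustration. It identifies the stuck processes but does not
explain or clear the hang.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the LAST ')'.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """Return (pid, command) for every process currently in state D."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry unreadable while scanning
        if state == "D":
            stuck.append((pid, comm))
    return stuck


if __name__ == "__main__":
    for pid, comm in find_d_state():
        print(pid, comm)
```

Run on a node, this prints one line per process in the "D" state; a
process that shows up in repeated runs is a candidate for the hang
described in question 1.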

Thanks in advance,

Ted Sariyski

More information about the Beowulf mailing list