couple of questions
tsariysk at craft-tech.com
Tue Feb 18 09:09:23 PST 2003
I have a couple of questions:
1. I'm running a cluster of 100 i386 CPUs with channel-bonded 100BaseT
Ethernet. I use Red Hat 7.2 and an NFS-mounted file server as user
storage. Occasionally some processes enter the "D" (uninterruptible
sleep) state forever, and the only way I have found to get rid of them
is to reboot the node. What may cause this problem, and is there a more
intelligent solution to it?
2. I'm planning to rebuild the cluster and to boot the nodes over the
network. I cannot afford Scyld or PBSPro, so I am looking for a solution
for a diskless cluster with MPICH and OpenPBS. I believe that if I keep
a local hard disk in each node I should be able to provide the 'local
space' required for OpenPBS to run. Any comments?
3. I'm shopping for a parallel debugger and an accurate parallel
profiler with minimal performance overhead. Jumpshot seems to be
inappropriate for profiling a 100-CPU job. Any recommendations?
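
Regarding question 1, here is a minimal sketch of how the stuck
processes could be listed on a node. It is an illustration only, not
something taken from the cluster: it assumes a Python 3 interpreter is
available, that /proc/<pid>/stat has the usual "pid (comm) state ..."
layout, and that the kernel exposes /proc/<pid>/wchan (older kernels
may not, in which case "?" is reported). The names parse_stat and
find_d_state are made up for this example.

```python
import os


def parse_stat(text):
    """Split a /proc/<pid>/stat line into (pid, command name, state letter)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first '(' and the last ')'.
    open_paren = text.index("(")
    close_paren = text.rindex(")")
    pid = int(text[:open_paren])
    comm = text[open_paren + 1:close_paren]
    state = text[close_paren + 1:].split()[0]
    return pid, comm, state


def find_d_state(proc_root="/proc"):
    """Return (pid, command, wchan) for every process currently in state D."""
    stuck = []
    for entry in sorted(os.listdir(proc_root)):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited during the scan, or unexpected format
        if state != "D":
            continue
        try:
            # Assumption: wchan names the kernel function the process sleeps in.
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip()
        except OSError:
            wchan = "?"
        stuck.append((pid, comm, wchan))
    return stuck


if __name__ == "__main__":
    for pid, comm, wchan in find_d_state():
        print(pid, comm, wchan)
```

If the reported wait channel points into the NFS or RPC code, that
would suggest the processes are blocked on the file server rather than
on local hardware, but this is a guess to be checked, not a diagnosis.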
Thanks in advance,