Cluster Question (fwd)
Gerardo Andres Cisneros andres at chem.duke.edu
Tue Mar 20 08:48:27 PST 2001
Hello All,

As I have said below, I have built a very small cluster (8 nodes)
running a slightly modified version of RedHat Linux 6.2, and I'm trying
to run a parallel version of a computational chemistry program (g98).
This program uses Linda for the parallelization, but I'm having problems
with it.

As stated below, I'm having problems with either g98 or Linda killing
the processes on the slave nodes once they're done. We've looked into a
bunch of things, including hardware malfunction, but everything seems OK.

We have checked almost everything Dr. Brown suggested from his
experience with PVM (included below), but we can find no problems in the
Linda conf file, no processes whose UIDs belong to a different user, and
no daemons that are not running.

I was wondering if anyone out there is using Linda and/or g98 and has
encountered similar problems? Any help is greatly appreciated.

I would also very much appreciate it if you could reply directly to me,
since I'm not subscribed to this list.

Thank you very much in advance,

Best Regards,

Andres

--
G. Andres Cisneros
Department of Chemistry
Duke University
andres at chem.duke.edu

---------- Forwarded message ----------
Date: Tue, 20 Mar 2001 11:11:28 -0500 (EST)
From: Robert G. Brown <rgb at phy.duke.edu>
To: Gerardo Andres Cisneros <andres at chem.duke.edu>
Subject: Re: Cluster Question

On Tue, 20 Mar 2001, Gerardo Andres Cisneros wrote:

>
> Dear Prof. Brown,
>
> I'm a grad student working for Dr. W. Yang at the Chemistry Dept.
>
> We have built a beowulf cluster using 8 Dell PCs donated by Intel. I
> installed Dulug Linux 6.2 on all of them and I am now trying to run some
> programs in parallel.
>
> Specifically I'm trying to run Gaussian98 on it so I had to download Linda
> which is basically software based shared memory (virtual shared memory).
>
> I was wondering if you had ever used this software and if so if I could
> get some pointers.

Unfortunately I've never used either G98 or Linda, so I don't know how
helpful I can be.
I'd recommend posting the problem to the beowulf list though, as there
are probably folks out there who have used the two together.

> My problem is that every time I try to run a big job on more than one node
> the program crashes before finishing. The program is supposed to kill
> the processes on the slave nodes but it doesn't do it so they just sit on
> the slave nodes occupying memory until eventually one of the nodes just
> runs out of memory and the process dies.
>
> If I do a run with a veryverbose flag for linda I get a bunch of "Killed
> by signal 15" messages stating that it killed the remote processes when
> they're done but it doesn't actually do it.
>
> A message to CCL produced a bunch of replies telling me to upgrade the
> kernel which I did (from 2.2.16-3 to 2.2.17-4) but still no go.
>
> Somebody else told me that he once had a similar problem but it was
> caused by bad grounding of his network cards so static electricity was
> building up and crashing his machines but I doubt that is the case here
> since the network card is integrated on the motherboard. We have 8 Dell
> Optiplex (I'm sorry I didn't mention that before).
>
> I would very much appreciate any suggestions you might have on this.

I doubt very much that it is static electricity, and our Dells (probably
from the same batch as yours) are rock stable under load and running a
nearly identical setup. Besides, I can only assume that all the chassis
are plugged into properly grounded three-prong outlets and sit on a rack
of some sort as well. I've never had any instabilities of any systems
anywhere that I could identify with static electricity, although perhaps
you might if you had some sort of active source of high voltage nearby
(a Van de Graaff accelerator, a Tesla coil, or some such). Ordinarily
the ground wire of the power cable is connected to the chassis and
absolutely prevents the buildup of static on connected components.
Besides, this would be more likely to kill your whole computer than to
just shut down one particular process. You haven't had any problems
running e.g. NFS, have you? Or connecting and transferring large files
via scp? Why would a hardware problem pick on G98 with this whole raft
of things to choose from?

A problem in Linda seems much, much more likely, especially given that
it is failing to successfully kill the remote processes when it claims
that it is doing so. I've encountered the identical problem in recent
versions of PVM -- the pvm_kill command is there, but I'll be damned if
I could ever make it actually kill off the slaves in a master-slave
calculation. Curiously, they could be killed off from the daemon command
interface, so PVM had the capability -- there was just some sort of bug
in the command implementation.

I wish I could be of some help to you as you try to figure this out, but
there isn't a lot I can think of trying without any hands-on experience
with Linda/G98. One thing might be permissions -- perhaps the remote
slaves are being spawned but end up belonging to a UID that doesn't
correspond with the source of the kill signal, so that the kill signal
is ignored, for example.

If you can, look in /var/log/messages on the slave nodes and see what
kinds of things are being logged at the time of a kill. Look in the
slave sources and see what the signal handler is doing. Snoop the net
and verify that there are packets being sent that actually contain the
kill signal. Run a remote host monitor tool (e.g. procstatd and watchman
from the brahma site in physics) on the nodes and watch e.g. their
memory consumption and network and CPU load -- is the problem a simple
memory leak somewhere?

Still, I think your best bet is the beowulf list itself. Surely somebody
on it can help you better than I am able to.

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
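
Two of the checks suggested above (whether the sender is even permitted
to signal the slave process, and whether the process really exits after
signal 15) can be tried out locally with a rough Python sketch. This is
only an illustration under stated assumptions: it has nothing to do with
Linda's or G98's actual internals, the "stubborn child" is a hypothetical
stand-in for a slave whose handler ignores SIGTERM, and the function
names are made up for this example.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for a slave process: it ignores SIGTERM
# (signal 15), so a "killed by signal 15" report would be misleading.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def spawn_stubborn_child():
    """Start the stand-in child and wait until its handler is installed."""
    proc = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()  # blocks until the child prints 'ready'
    return proc


def may_signal(pid):
    """Return True if pid exists and the current UID may signal it."""
    try:
        os.kill(pid, 0)  # signal 0 performs the checks, delivers nothing
        return True
    except (PermissionError, ProcessLookupError):
        return False


def terminate_and_verify(proc, timeout=1.0):
    """Send SIGTERM, then report whether the process actually exited."""
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        return False


if __name__ == "__main__":
    child = spawn_stubborn_child()
    print("allowed to signal child:", may_signal(child.pid))
    print("child exited after SIGTERM:", terminate_and_verify(child, 0.5))
    child.kill()  # SIGKILL cannot be caught or ignored
    child.wait()
    child.stdout.close()
```

If may_signal() returned False for a real slave PID, that would point
toward the UID mismatch mentioned above; if it returned True but the
process outlived SIGTERM, the slave's signal handling would be the next
thing to look at.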
More information about the Beowulf mailing list
