From wade.hampton at nsc1.net Tue Sep 2 14:13:01 2003
From: wade.hampton at nsc1.net (Wade Hampton)
Date: Tue Nov 9 01:14:27 2010
Subject: [scyld-users] e1000 and Scyld 28cz
Message-ID: <3F54DBA3.8060608@nsc1.net>

G'day,

I'm upgrading my first test cluster to 1G Ethernet using the onboard
e1000 parts on my nodes (Tyan motherboards with eepro100 and e1000).
I have not gotten the nodes to boot and recognize the e1000 cards
using a beoboot floppy.

When I tried the latest driver from Intel (5.1.13), I could not get an
e1000 module for my node boot floppy until I did the following:

    CFLAGS=$(KCFLAGS) make KCFLAGS="-D__BOOT_KERNEL_H_ -D__module__beoboot"
    mv e1000.o /lib/modules/2.2.19-14.beobeoboot/net

I also added the following to /etc/beowulf/config:

    pci 0x8086 0x1008 e1000

(reference:
http://www.beowulf.org/pipermail/beowulf/2002-September/004575.html)

When I tried to boot this floppy, it stopped after loading the e1000
driver and appeared to hang. I then removed the e1000.o module and
tried the Scyld RPM for 4.4.12. This seemed to hang at the same
location.

Any help in this matter would be appreciated.
--
Wade Hampton

From anandbedekar at yahoo.com Mon Sep 29 18:06:02 2003
From: anandbedekar at yahoo.com (Anand Bedekar)
Date: Tue Nov 9 01:14:27 2010
Subject: [scyld-users] bpsh in background: defunct processes
Message-ID: <20030929153053.14353.qmail@web12204.mail.yahoo.com>

Hi,

I'm trying to run bpsh in a script that calls bpsh in a loop, like
this:

    for i in 1 2 3
    do
        bpsh -n $i run.sh &
    done

run.sh is another script that runs a simulation program (it does not
issue further calls to bpsh, just runs an executable and then some
awk/perl for post-processing). The intention is to run an instance of
run.sh on each of the nodes that 'for i in ...' loops over.

What happens is that all the processes called within run.sh seem to go
into a "defunct" state without finishing cleanly.
This is making the process table fill up, so that no more processes
can be run.

Is this the usual behaviour when calling bpsh to run a shell script
the way I am ('bpsh -n $i run.sh &')? Is there some other way to run
it?

Unfortunately, all the nodes in the cluster are currently out of
action because the process table is full on all of them, due to the
above. So I can't report on which version of Scyld has been installed
until the sysadmin reboots the whole thing. I do know the machines are
P3s running RedHat 7.0, kernel version 2.2.19.

Any help would be appreciated.

Thanks,
Anand

__________________________________
Do you Yahoo!?
The New Yahoo! Shopping - with improved product search
http://shopping.yahoo.com

From becker at scyld.com Mon Sep 29 18:32:10 2003
From: becker at scyld.com (Donald Becker)
Date: Tue Nov 9 01:14:27 2010
Subject: [scyld-users] bpsh in background: defunct processes
In-Reply-To: <20030929153053.14353.qmail@web12204.mail.yahoo.com>
Message-ID:

On Mon, 29 Sep 2003, Anand Bedekar wrote:

> I'm trying to run bpsh in a script that calls bpsh in a loop, like
> this:
>
>     for i in 1 2 3
>     do
>         bpsh -n $i run.sh &
>     done

Suggestion: you should be using 'beomap' to get a dynamic schedule:

    for i in `beomap --np 3`; do ...

> What happens is that all the processes called within run.sh seem to
> go into a "defunct" state without finishing cleanly. This is making
> the process table fill up, so that no more processes can be run.

This sounds like a long-fixed bug in BProc: the status and termination
messages were being processed in reverse order.

> Is this usual behaviour when calling bpsh to run a shell script,
> given the way I am calling 'bpsh -n $i run.sh &'? Is there some
> other way to run it?
With our new release there is a command named 'beorun' that
automatically combines a scheduler mapping with efficient control of
the resulting processes:

    beorun --np 3 command

> Unfortunately all the nodes in the cluster are currently out of
> action because the process table is full on all of them, due to the
> above.

You should be able to restart the cluster nodes in about a second...

> So I can't report on which version of scyld has been installed,
> until the sysadmin reboots the whole thing. I do know the machines
> are P3 running RedHat 7.0, kernel version 2.2.19.

That doesn't sound like a Scyld release.

--
Donald Becker                           becker@scyld.com
Scyld Computing Corporation             http://www.scyld.com
914 Bay Ridge Road, Suite 220           Scyld Beowulf cluster system
Annapolis MD 21403                      410-990-9993

From anandbedekar at yahoo.com Tue Sep 30 16:47:02 2003
From: anandbedekar at yahoo.com (Anand Bedekar)
Date: Tue Nov 9 01:14:27 2010
Subject: [scyld-users] bpsh in background: defunct processes
In-Reply-To:
Message-ID: <20030930200051.32272.qmail@web12203.mail.yahoo.com>

Hello,

Thanks for the reply. Some more questions:

-- The cluster is running "Scyld Beowulf Basic Edition 27bz-8" (not
RedHat 7: my mistake). Could you tell me whether the bug with
processing status and termination messages was fixed before or after
this release?

-- Does this release have the beomap and beorun dynamic-scheduling
functionality you described?

-- If the answer to either of the above is "no", is there any
alternative way (without upgrading) to make sure that the processes
don't go defunct, e.g. by somehow sending a signal to the run.sh
script called by bpsh? Our sysadmin appears reluctant to upgrade.
Thanks,
Anand

From becker at scyld.com Tue Sep 30 17:08:04 2003
From: becker at scyld.com (Donald Becker)
Date: Tue Nov 9 01:14:27 2010
Subject: [scyld-users] bpsh in background: defunct processes
In-Reply-To: <20030930200051.32272.qmail@web12203.mail.yahoo.com>
Message-ID:

On Tue, 30 Sep 2003, Anand Bedekar wrote:

> -- The cluster is running "Scyld Beowulf Basic Edition 27bz-8" (not
> RedHat 7: my mistake). Could you tell me if the bug with processing
> status and termination messages was fixed prior to or after this
> release?

It was fixed well after that release.

> -- Does this release have the beomap and beorun dynamic scheduling
> functionality you described?

No, the beomap subsystem was added for the 28 series, and beorun is
"previewed" in the current release, with the full feature set in the
upcoming 29 series release.

> -- If the answer to either of the above is "no", is there any
> alternative way (without upgrading) to make sure that the processes
> don't go defunct, e.g. by somehow sending a signal to the run.sh
> script called by bpsh or something? Our sysadmin appears reluctant
> to upgrade.

I'll look into that. I know one site implemented a fix as a module,
specifically so they wouldn't need to reboot the master node.
Presumably they had a long-running job...

--
Donald Becker                           becker@scyld.com
Scyld Computing Corporation             http://www.scyld.com
914 Bay Ridge Road, Suite 220           Scyld Beowulf cluster system
Annapolis MD 21403                      410-990-9993
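[Editor's note: the reaping pattern behind this thread can be sketched in plain Bourne shell. The idea is the workaround Anand asks about — keep the driver script alive and let the `wait` builtin collect every backgrounded child's exit status, so none of them lingers as a defunct (zombie) entry in the process table. This is a minimal, hedged sketch, not Scyld-specific: `run_job` below is a hypothetical stand-in for `bpsh -n $i run.sh`, since no Scyld tools are assumed here.]

```shell
#!/bin/sh
# Launch one job per node in the background, then reap them all.
# run_job is a placeholder for `bpsh -n $1 run.sh` (hypothetical;
# no bpsh is assumed on the machine running this sketch).
run_job() {
    echo "node $1: run.sh finished"
}

for i in 1 2 3
do
    run_job "$i" &
done

wait          # block until every child exits and is reaped
echo "all jobs reaped"
```

On a real cluster the `run_job` call would be the actual `bpsh` invocation; the point is only that the parent stays alive until `wait` returns, so each child's exit status is collected rather than left in the process table.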