[Beowulf] Questions about a large job

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Apr 18 12:35:03 PDT 2006


On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:

> The MPI used was LAM-MPI. I have run some tests with 10 nodes and it
> runs well. But, when I tried to run with 2296 CPUs, the job won't start.

Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process file descriptor limit, as each process will try
to open a TCP connection to every other process. In that case you
should still be able to run a job on something like 500 nodes (= 1000
processes, slightly less than the maximum of 1024 descriptors per
process).
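
To make the arithmetic concrete, here is a rough sketch in Python. It
assumes one socket per peer plus three descriptors for stdin, stdout
and stderr; the helper names and the overhead of 3 are my own
assumptions for illustration, and the real overhead of LAM-MPI may well
be higher.

```python
import resource


def descriptors_needed(nprocs, overhead=3):
    """Estimate the descriptors one process needs in a fully connected job.

    Assumption: nprocs - 1 TCP sockets (one per peer) plus `overhead`
    descriptors for stdin, stdout and stderr. Hypothetical helper, not
    part of LAM-MPI.
    """
    return (nprocs - 1) + overhead


def fits_within_limit(nprocs, limit):
    """Return True if the estimate stays within the descriptor limit."""
    return descriptors_needed(nprocs) <= limit


if __name__ == "__main__":
    # Soft limit is what the current process is actually held to.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    for nprocs in (1000, 2296):
        ok = soft == resource.RLIM_INFINITY or fits_within_limit(nprocs, soft)
        status = "ok" if ok else "over the limit"
        print(nprocs, "processes need about", descriptors_needed(nprocs),
              "descriptors each:", status)
```

With a limit of 1024, the estimate for 1000 processes fits and the one
for 2296 processes does not, which matches what you are seeing.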

> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.

This just confirms the suspicion I expressed above: 1000 processes stay
below the limit of 1024 descriptors, while 2296 do not.

To change the limits on a Red Hat-like system, add a line like:

*	-	nofile	4096

to /etc/security/limits.conf.
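
For reference, the four whitespace-separated fields on that line are
domain, type, item and value: "*" matches all users, "-" sets both the
soft and the hard limit, and "nofile" is the maximum number of open
file descriptors. A small Python sketch that splits such a line into
its fields (the names LimitEntry and parse_limits_line are made up for
this example, not part of any PAM tool):

```python
from typing import NamedTuple, Optional


class LimitEntry(NamedTuple):
    domain: str  # user, @group, or "*" for everyone
    kind: str    # "soft", "hard", or "-" for both
    item: str    # e.g. "nofile"
    value: str   # kept as text, since values like "unlimited" are allowed


def parse_limits_line(line: str) -> Optional[LimitEntry]:
    """Split one limits.conf line into its four fields.

    Returns None for blank lines and comment-only lines.
    """
    text = line.split("#", 1)[0].strip()  # drop trailing comments
    if not text:
        return None
    fields = text.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    return LimitEntry(*fields)


if __name__ == "__main__":
    entry = parse_limits_line("*\t-\tnofile\t4096")
    print(entry)
```

The value of 4096 leaves comfortable room above the 2296 connections
each process would need for the full job.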

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De



