[Beowulf] Questions about a large job
Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Tue Apr 18 12:35:03 PDT 2006
On Tue, 18 Apr 2006, Leandro Tavares Carneiro wrote:

> The MPI used was LAM-MPI. I have run some tests with 10 nodes and it
> runs well. But, when I tried to run with 2296 CPUs, the job won't start.

Are you able to run a simple "hello world" test? If not, you might be
hitting the per-process descriptor limit, as each process will try to
open a TCP connection to every other process. In that case you should
still be able to run a job on something like 500 nodes (= 1000
processes, slightly less than the 1024 maximum descriptors per process).

> Various errors happened, one for each try. The Torque version installed
> is 2.0.0p8 and is working fine with other larger jobs, with 1000 CPUs.

This just confirms my suspicion expressed above. To change the limits on
a Red Hat-like system, add a line like:

    * - nofile 4096

to /etc/security/limits.conf.

-- 
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De
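
The descriptor arithmetic in the reply above can be checked on a compute node with a short script. This is only a rough sketch: it assumes one TCP socket per peer process, as described in the message, plus a hypothetical allowance of three descriptors for stdin, stdout and stderr. The function name and the `reserved` parameter are made up for illustration, and a real LAM-MPI process may hold more descriptors than this estimate.

```python
import resource


def needs_higher_limit(nprocs, soft_limit, reserved=3):
    """Return True if a full mesh of TCP connections would not fit.

    Assumes each process opens one socket to every other process
    (nprocs - 1 sockets) and already holds `reserved` descriptors.
    `reserved` is an illustrative guess, not a measured value.
    """
    needed = (nprocs - 1) + reserved
    return needed > soft_limit


if __name__ == "__main__":
    # Soft and hard limits on open files for the current process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    for nprocs in (1000, 2296):
        verdict = "too many" if needs_higher_limit(nprocs, soft) else "fits"
        print(nprocs, "processes:", verdict)
```

With a soft limit of 1024, this estimate says 1000 processes fit and 2296 do not, which matches the behaviour reported in the thread. The limit should be checked on the nodes themselves and from within a batch job, since it may differ from the one seen in an interactive login shell.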
