[Beowulf] Questions about a large job

Tue Apr 18 07:28:46 PDT 2006

Hi,

This weekend I tried to run HPL on our largest cluster, 1172 dual Opteron
nodes. The network is Gigabit Ethernet, as our applications don't need and
don't use much interprocess communication.

I have 1148 dual nodes available, 2296 CPUs, and configured HPL.dat to
run on that. I have already tested the parameters, so I know they are good
for this cluster.
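
For reference, the process grid in HPL.dat (the Ps and Qs lines) has to
multiply out to the number of MPI processes. The sketch below only
illustrates that arithmetic for 2296 processes; it is not necessarily the
grid used here. The function name squarest_grid is made up for the
example, and the "P <= Q, as close to square as possible" rule is just the
common rule of thumb. On Ethernet a flatter grid may well behave better.

```python
import math

def squarest_grid(nprocs):
    """Return (P, Q) with P * Q == nprocs and P <= Q, choosing the
    largest P that divides nprocs (the most nearly square grid)."""
    # Walk down from floor(sqrt(nprocs)); the first divisor found is
    # the largest P that does not exceed Q.
    for p in range(math.isqrt(nprocs), 0, -1):
        if nprocs % p == 0:
            return p, nprocs // p

# 2296 = 2^3 * 7 * 41
print(squarest_grid(2296))
```

Any other factor pair of 2296 (for example 28 x 82) is also a valid grid;
which one performs best has to be measured.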

So, I compiled HPL with PathScale using the ACML math library.
The MPI used was LAM-MPI. I ran some tests with 10 nodes and it
ran well. But when I tried to run with 2296 CPUs, the job wouldn't start.

Various errors happened, a different one on each try. The Torque version
installed is 2.0.0p8 and it works fine with other large jobs, with 1000 CPUs.

I must admit, I have never tried to run a job of this size. I know I
may have made some mistake, but what I would like to know about is
timeouts. The processes take a long time to start, or don't start at all.
When the job does start running (I can tell because HPL.out is created),
it dies.

Do you guys have jobs larger than that running OK with Torque and
LAM-MPI? Is there something I can do to speed up the start of the job?

I know I lost the list, but any help would be great! Thanks a lot.

-- 

Leandro Tavares Carneiro
Petrobras TI/TI-E&P/STEP Suporte Tecnico de E&P
Av Chile, 65 sala 1501 EDISE - Rio de Janeiro / RJ
Tel: (0xx21) 3224-1427