Running two MPI jobs simultaneously
Miska Le Louarn
lelouarn at eso.org
Tue Dec 10 05:56:19 PST 2002
I am facing a strange problem (or "feature") related to either Linux or
MPI or maybe their interaction.
I have written two programs, in C, which both use MPI.
I run these programs on a 6 node cluster of PCs, each PC running Linux.
More precise hardware / software description at the end of this mail.
When I run one program (either of the two, with the command mpirun),
everything goes fine: the program doesn't crash and gives the right
result. All PCs work happily and everything seems to be OK.
I should say the two programs are completely independent (different
executables and so on, and there is no communication between the two).
BUT when I try to run these two programs at the same time, one of them
hangs. It just stops doing anything and sits there without crashing
until the other program is completed. Then it starts to work again.
I am surprised by this behavior. I would have expected both programs
to run independently, slower (because they share resources like
network and CPU) but still running. Instead, one program hogs all the
resources and the other one just sits there doing nothing.
I have also tried to run two copies of the first ("hog") process. Now
one of the copies also freezes completely (but doesn't seem to restart
once the hog process is finished).
What I am doing now to avoid the problem is to run the programs
sequentially. It would just be convenient sometimes to have the programs
run at the same time, even if slower.
I haven't tried running the two programs as two different users. I
should maybe try that.
So does anybody have any idea why this happens? Is it a Linux scheduler
"feature" related to the network communication between the nodes (if I
launch 2 non-MPI jobs, I get the standard slow-down)? Or maybe
interference inside MPI between the two processes?
Are there any tests I could do to see what is going on?
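One test that might help is to tell a process that is blocked (sleeping,
using almost no CPU) from one that is spinning without making progress
(using CPU the whole time). Below is a minimal sketch in plain C, using
only time() and clock() from the standard library. The names phase_timer,
phase_begin, phase_wall, phase_cpu and phase_report are hypothetical
helpers made up for this illustration; they are not part of MPI or of the
programs described above.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper type: start marks for one phase of a program. */
typedef struct {
    time_t  wall_start;  /* wall-clock time, one-second resolution */
    clock_t cpu_start;   /* CPU time used by this process so far   */
} phase_timer;

/* Record the current wall-clock and CPU time. */
static phase_timer phase_begin(void)
{
    phase_timer t;
    t.wall_start = time(NULL);
    t.cpu_start  = clock();
    return t;
}

/* Wall-clock seconds elapsed since phase_begin(). */
static double phase_wall(const phase_timer *t)
{
    return difftime(time(NULL), t->wall_start);
}

/* CPU seconds consumed by this process since phase_begin(). */
static double phase_cpu(const phase_timer *t)
{
    return (double)(clock() - t->cpu_start) / CLOCKS_PER_SEC;
}

/* Print both figures to stderr so they show up even if stdout is buffered. */
static void phase_report(const char *label, const phase_timer *t)
{
    fprintf(stderr, "%s: wall %.0f s, cpu %.2f s\n",
            label, phase_wall(t), phase_cpu(t));
}
```

The idea would be to call phase_begin() before a communication phase and
phase_report() after it, in each process. If the stalled program reports
a long wall time with almost no CPU time, it was asleep waiting for
something; if CPU time is close to wall time, it was busy the whole time.
Note that clock() can wrap around on long runs where clock_t is 32 bits,
so this is only meant for short phases.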
Thanks in advance,
5 nodes: Pentium IV, 1.8 GHz, 1 GB of RAM, running Linux RH 7.3
Master: Pentium Xeon, 2 CPUs, 1 GB of RAM, RH 7.3 (stock SMP kernel)
All machines have a Gigabit network card, and we have a gigabit switch.
Programs written in C, compiled with gcc 2.96-112 (have also tried gcc
3.2 without any change).
The programs perform quite a lot of different operations: various
computations, MPI communications, disk access on the local disk, etc.