[Beowulf] Cannot use more than two nodes on cluster

Gus Correa gus at ldeo.columbia.edu
Thu Sep 20 14:10:29 PDT 2012


If 6) works or worked before,
then the problem is not with your cluster setup
or with OpenMPI, but with the OpenFOAM setup,
and you may get better help in the OpenFOAM list.
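
Since the hang only shows up for some node combinations, test 6) could be
repeated for every pair of nodes. Below is a minimal sketch that only builds
the mpiexec command lines, one per pair. The host names, the ./connectivity_c
path and the eth0 interface are assumptions; adjust them to your cluster.

```python
from itertools import combinations


def pair_commands(hosts, exe="./connectivity_c", iface="eth0"):
    """Return one mpiexec argument list per unordered pair of hosts.

    The flags are the ones mentioned in this thread: a fixed set of
    transports, a single Ethernet interface, and "-v" after the
    executable name for a verbose log.
    """
    cmds = []
    for a, b in combinations(hosts, 2):
        cmds.append([
            "mpiexec", "-np", "2", "-host", f"{a},{b}",
            "--mca", "btl", "sm,self,tcp",
            "--mca", "btl_tcp_if_include", iface,
            exe, "-v",
        ])
    return cmds


if __name__ == "__main__":
    # Hypothetical host names; use the names from your own "machines" file.
    for cmd in pair_commands(["master", "slave1", "slave2", "slave3"]):
        print(" ".join(cmd))
```

Run each printed line by hand and redirect its output to a separate log file.
It is worth confirming (for example by launching "hostname" the same way)
that the two processes really land on two different nodes. A pair that hangs
with this small program points at the cluster or OpenMPI setup rather than
at OpenFOAM.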

If not, other people on the list have already asked you
to disclose more information: your /etc/hosts, and so on.
That is the way to encourage people to help you.
Otherwise it all becomes lengthy guesswork.
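
As an illustration of the kind of consistency check meant by question 4),
here is a minimal sketch that compares a "machines" file against a copy of
/etc/hosts from each node. It works on the file contents as strings, so how
you collect them is up to you. The function names are hypothetical, and
flagging loopback addresses is an extra check that nobody in this thread
asked for.

```python
def parse_hosts(text):
    """Map each name or alias in /etc/hosts-style text to its IP address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            table.setdefault(name, ip)  # keep the first entry for a name
    return table


def parse_machines(text):
    """Return the host names of a hostfile, ignoring suffixes like slots=N."""
    hosts = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.append(line.split()[0])
    return hosts


def check(machines_text, hosts_by_node):
    """Return a list of human-readable problems (empty if none found).

    hosts_by_node maps a node name to the contents of that node's /etc/hosts.
    """
    problems = []
    tables = {node: parse_hosts(t) for node, t in hosts_by_node.items()}
    for host in parse_machines(machines_text):
        ips = {node: tbl.get(host) for node, tbl in tables.items()}
        for node, ip in ips.items():
            if ip is None:
                problems.append(f"{node}: cannot resolve {host}")
            elif ip.startswith("127."):
                # Assumption: a loopback mapping is unusable from other nodes.
                problems.append(f"{node}: {host} maps to loopback {ip}")
        distinct = {ip for ip in ips.values() if ip is not None}
        if len(distinct) > 1:
            problems.append(f"{host}: nodes disagree: {sorted(distinct)}")
    return problems
```

An empty list from check() only means the files agree with each other; it
does not prove that the addresses are the right ones.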

Gus Correa

On 09/20/2012 04:58 PM, Antti Korhonen wrote:
> Hi Gus
>
> 1) It just hangs; nothing gets written to the log.
> 2) /mirror/ is an NFS share, mounted on each slave via fstab
> 3) Yes
> 4) Yes
> 5) Private subnet
> 6) I tested with openMPI with 2 nodes before installing OpenFOAM, will test again now.
> 7) I will test with this (and read more)
> 8) Did not but will now
>
> Thank you
>
>    Antti
>
> -----Original Message-----
> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Gus Correa
> Sent: Thursday, September 20, 2012 1:42 PM
> To: Beowulf
> Subject: Re: [Beowulf] Cannot use more than two nodes on cluster
>
> Hi Antti
>
> 1) Do you get any error message back or does it just hang?
>
> 2) Presumably /mirror or one of its subdirectories is shared via NFS or equivalent by all nodes, right?
> Or is it duplicated on all computers?
>
> 3) Did you set your PATH and LD_LIBRARY_PATH to point to the OpenMPI directory on all computers?
>
> 4) Is the name convention (and/or IP addresses) in your "machines" file the same as in /etc/hosts?
>
> 5) Are you using Internet IP addresses or in a private subnet?
>
> 6) Have you tried something simpler than OpenFOAM?
>
> It will be easier to debug this with very simple code.
> If you downloaded the OpenMPI source code, in the untarred main directory there is a subdirectory "examples".  You can use mpicc to compile a program called "connectivity_c.c" that is there.
> The program exchanges messages across
> every pair of processes, and writes a message for each exchange.
> Then you can launch it with mpiexec and add the option "-v" after the executable name, to get a verbose log, and redirect the output to a log file.
>
> OpenMPI has also flags [mca parameters] to increase the verbosity of error messages, which may help.
>
> 7) Also, OpenMPI tries to use any channels available on all nodes.
> You can select a specific Ethernet port, which may help nail down the problem, adding this to the mpiexec command line:
>
> --mca btl sm,self,tcp --mca btl_tcp_if_include eth0
>
> [I am assuming all nodes use the eth0 interface, but you should adjust to your reality.]
>
> For details see the OpenMPI FAQ:
> http://www.open-mpi.org/faq
> http://www.open-mpi.org/faq/?category=tcp
>
> 8) Did you ask questions in the OpenMPI
> or the OpenFOAM mailing lists?
>
> Gus Correa
>
> On 09/20/2012 03:54 PM, Antti Korhonen wrote:
>> Hi Gus
>>
>> The /etc/hosts files are identical.
>> All nodes are connected to a Cisco switch and are part of the same VLAN.
>> Each node has only one NIC enabled (as does the master).
>> No daisy-chaining involved.
>>
>> No job queuing system installed yet, doing manual starts.
>> Here is my command when trying to run on 3 nodes:
>>
>> Executing:
>> /mirror/OpenFOAM/ThirdParty-2.1.x/platforms/linux64Gcc/openmpi-1.5.3/bin/mpirun -np 12 -hostfile machines /mirror/OpenFOAM/OpenFOAM-2.1.x/bin/foamExec -prefix /mirror/OpenFOAM interFoam -parallel | tee log
>>
>> So we are trying to use OpenFOAM and have been starting jobs from master node.
>> MPI is openMPI 1.5.3.
>>
>>     Antti
>>
>> -----Original Message-----
>> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]
>> On Behalf Of Gus Correa
>> Sent: Thursday, September 20, 2012 9:23 AM
>> To: Beowulf
>> Subject: Re: [Beowulf] Cannot use more than two nodes on cluster
>>
>> Hi Antti
>>
>> If you're resolving host names through /etc/hosts, check if they're consistent [maybe the same] on the "master" and "slave" nodes.
>> Maybe right now only the master can resolve the slave nodes' names. Is this the problem?
>>
>> Also, how are the master and slave nodes connected to each other?
>> Do you have a switch where all the nodes connect to?
>> Or are the slaves connected to various NICs on the master?
>> Or are the connections daisy-chained?
>> Or something else?
>>
>> Depending on the MPI you're using, and whether you're using a job queuing system/resource manager [Torque, SGE, Slurm, etc], and whether your MPI is integrated with the job queuing system, you may or may not need to provide a "machinefile" or "hostfile" in the mpiexec/mpirun command line.
>> This file needs to bear some consistency with the /etc/hosts as well.
>>
>>
>> You didn't say anything about
>> A) which MPI you're using,
>> B) if you're launching jobs in a job queuing system or directly on the
>> master,
>> C) what your mpiexec command line is.
>>
>> You may get more focused help if you disclose this type of information.
>>
>> I hope this helps,
>> Gus Correa
>>
>> On 09/20/2012 10:37 AM, Antti Korhonen wrote:
>>> I tested ssh with all combinations and that part is working as designed.
>>>
>>> I can start job manually on any single node.
>>> I can start jobs on any two nodes, as long as one of them is the master.
>>> All other combinations hang and jobs do not start.
>>>
>>> I read through a few install guides and did not find any steps I missed.
>>> I am using Ubuntu 12.04, in case that makes any difference.
>>>
>>>      Antti
>>>
>>> -----Original Message-----
>>> From: beowulf-bounces at beowulf.org
>>> [mailto:beowulf-bounces at beowulf.org]
>>> On Behalf Of Jörg Saßmannshausen
>>> Sent: Thursday, September 20, 2012 1:42 AM
>>> To: beowulf at beowulf.org
>>> Subject: Re: [Beowulf] Cannot use more than two nodes on cluster
>>>
>>> Hi all,
>>>
>>> have you tried the following: ssh master -> node1 -> node2, i.e. ssh from the master to node1 and from there to node2?
>>> You do not have a situation where the remote host-key is not in the database and hence you get asked about adding that key to the local database?
>>>
>>> If that is working with all permutations, another possibility is that your host list is somehow messed up when you are submitting parallel jobs. Can you start the jobs manually by providing a host list to the MPI program you are using? Does that work or do you have problems here as well?
>>>
>>> My two pennies
>>>
>>> Jörg
>>>
>>>
>>> On Thursday 20 September 2012 07:40:56 Antti Korhonen wrote:
>>>> Passwordless SSH works between all nodes.
>>>> Firewalls are disabled.
>>>>
>>>>
>>>> From: greg at r-hpc.com [mailto:greg at r-hpc.com] On Behalf Of Greg
>>>> Keller
>>>> Sent: Wednesday, September 19, 2012 8:43 PM
>>>> To: beowulf at beowulf.org; Antti Korhonen
>>>> Subject: Re: [Beowulf] Cannot use more than two nodes on cluster
>>>>
>>>> I am going to bet $0.25 that SSH or TCP/IP is configured to allow
>>>> the master to get to the nodes without a password, but not from one
>>>> Compute to the other Compute.
>>>>
>>>> Test by sshing to Compute1, then from Compute1 to Compute2.
>>>> Depending on how you built the cluster, it's also possible that
>>>> iptables is running on the compute nodes, but my money is on the ssh keys needing reconfiguring.
>>>> Let us know what you find.
>>>>
>>>> Cheers!
>>>> Greg
>>>>
>>>> Date: Wed, 19 Sep 2012 16:11:21 +0000
>>>> From: Antti Korhonen <akorhonen at theranos.com>
>>>> Subject: [Beowulf] Cannot use more than two nodes on cluster
>>>> To: "beowulf at beowulf.org" <beowulf at beowulf.org>
>>>> Message-ID: <B9D51F953BEE5C42BC2B503D288542992DD935FE at SRW004PA.theranos.local>
>>>> Content-Type: text/plain; charset="us-ascii"
>>>>
>>>> Hello
>>>>
>>>> I have a small Beowulf cluster (master and 3 slaves).
>>>> I can run jobs on any single node.
>>>> Running on two nodes sort of works: jobs on master and 1 slave work
>>>> (all combos: master+slave1, master+slave2 or master+slave3).
>>>> Running jobs on two slaves hangs.
>>>> Running jobs on master + any two slaves hangs.
>>>>
>>>> Would anybody have any troubleshooting tips?
>>>
>>> --
>>> *************************************************************
>>> Jörg Saßmannshausen
>>> University College London
>>> Department of Chemistry
>>> Gordon Street
>>> London
>>> WC1H 0AJ
>>>
>>> email: j.sassmannshausen at ucl.ac.uk
>>> web: http://sassy.formativ.net
>>>
>>> Please avoid sending me Word or PowerPoint attachments.
>>> See http://www.gnu.org/philosophy/no-word-attachments.html
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>>> Computing To change your subscription (digest mode or unsubscribe)
>>> visit http://www.beowulf.org/mailman/listinfo/beowulf


