[Beowulf] Fwd: warewulf - cannot log into nodes

Duke Nguyen duke.lists at gmx.com
Thu Nov 29 03:35:00 PST 2012


On 11/29/12 5:52 PM, Duke Nguyen wrote:
> On 11/28/12 1:56 AM, Gus Correa wrote:
>> On 11/27/2012 01:52 PM, Gus Correa wrote:
>>> On 11/27/2012 02:14 AM, Duke Nguyen wrote:
>>>> On 11/27/12 1:44 PM, Christopher Samuel wrote:
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> On 27/11/12 15:51, Duke Nguyen wrote:
>>>>>
>>>>>> Thanks! Yes, I am trying to get the system work with
>>>>>> Torque/Maui/OpenMPI now.
>>>>> Make sure you build Open-MPI with support for Torque's TM interface;
>>>>> that will save you a lot of hassle, as it means mpiexec/mpirun will
>>>>> find out directly from Torque which nodes and processors have been
>>>>> allocated to the job.
>>>> Christopher, how would I check that? I got Torque/Maui/OpenMPI up and
>>>> working with root (not with a normal user yet :( !!!), tried mpirun,
>>>> and it worked fine:
>>>>
>> PS - Do 'qsub myjob' as a regular user, not as root.
>>
>>>> # /usr/lib64/openmpi/bin/mpirun -pernode --hostfile
>>>> /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
>>>> Hello world! I am process number: 3 on host node0118
>>>> Hello world! I am process number: 1 on host node0104
>>>> Hello world! I am process number: 0 on host node0103
>>>> Hello world! I am process number: 2 on host node0117
>>>>
>>>> Thanks,
>>>>
>>>> D.
>>> D.
>>>
>>> Try omitting the hostfile from your mpirun command line: put the
>>> mpirun command inside a Torque/PBS script and submit it with qsub,
>>> like this:
>>>
>>> *********************************
>>> myPBSScript.tcsh
>>> *********************************
>>> #! /bin/tcsh
>>> #PBS -l nodes=2:ppn=8
>>> #PBS -q batch@mycluster.mydomain
>>> #PBS -N hello
>>> # [Assuming your Torque 'nodes' file has np=8]
>>> @ NP = `cat $PBS_NODEFILE | wc -l`
>>> mpirun -np ${NP} ./mpihello
>>> *********************************
>>>
>>> $ qsub myPBSScript.tcsh
>>>
>>>
>>> If OpenMPI was built with Torque support,
>>> the job will run on the nodes/processors allocated by Torque.
>>> [The nodes/processors are listed in $PBS_NODEFILE,
>>> but you don't need to refer to it in the mpirun line if
>>> OpenMPI was built with Torque support. If OpenMPI lacks
>>> Torque support, then you can use $PBS_NODEFILE as your hostfile:
>>> mpirun -hostfile $PBS_NODEFILE.]
>>>
>>> If Torque was installed in a standard place, say under /usr,
>>> then OpenMPI configure will pick it up automatically.
>>> If not in a standard location, then add
>>> --with-tm=/torque/directory
>>> to the OpenMPI configure line.
>>> [./configure --help is your friend!]
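>>>
>>> For example, a configure line with an explicit Torque prefix could
>>> look like this (/opt/torque is just a placeholder for wherever
>>> Torque is actually installed):
>>>
>>> $ ./configure --prefix=/usr/local --with-tm=/opt/torque
>>> $ make all install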
>>>
>>> Another check:
>>>
>>> $ ompi_info [tons of output that you can grep for "tm" to see
>>> if Torque was picked up.]
>>>
>>>
>
> OK, after a huge headache with torque/maui, I finally found out that 
> my master node's system was a mess :D. There were multiple versions of 
> torque (installed via yum, from source, etc.), which caused confusion 
> for different users logging in (root vs. normal users) - mainly 
> because I had followed different guides from the net. So I decided to 
> delete everything related to pbs (torque, maui, openmpi) and start 
> from scratch: I built torque rpms for the master/nodes and installed 
> them, then built a maui rpm with support for torque, then built an 
> openmpi rpm with support for torque too. This time I think I got 
> almost everything:
>
> [mpiwulf@biobos:~]$ ompi_info | grep tm
>                  MCA ras: tm (MCA v2.0, API v2.0, Component v1.6.3)
>                  MCA plm: tm (MCA v2.0, API v2.0, Component v1.6.3)
>                  MCA ess: tm (MCA v2.0, API v2.0, Component v1.6.3)
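>
> (For the openmpi rpm, the key is getting --with-tm onto the configure
> line. One way - assuming your spec file forwards a configure_options
> define to ./configure; check the .spec if unsure - is something like:
>
> $ rpmbuild --rebuild \
>       --define 'configure_options --with-tm=/usr/local' \
>       openmpi-1.6.3-1.src.rpm
>
> with /usr/local replaced by wherever Torque is installed.)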
>
> openmpi now works with infiniband:
>
> [mpiwulf@biobos:~]$ /usr/local/bin/mpirun -mca btl ^tcp -pernode 
> --hostfile /home/mpiwulf/.openmpihostfile /home/mpiwulf/test/mpihello
> Hello world!  I am process number: 3 on host node0118
> Hello world!  I am process number: 1 on host node0104
> Hello world!  I am process number: 2 on host node0117
> Hello world!  I am process number: 0 on host node0103
>
> openmpi also works with torque:
>
> ----------------
> [mpiwulf@biobos:~]$ cat test/KCBATCH
> #!/bin/bash
> #
> #PBS -l nodes=6:ppn=1
> #PBS -N kcTEST
> #PBS -m be
> #PBS -e qsub.er.log
> #PBS -o qsub.ou.log
> #
> { time {
> /usr/local/bin/mpirun /home/mpiwulf/test/mpihello
> } } &>output.log
>
> [mpiwulf@biobos:~]$ qsub test/KCBATCH
> 21.biobos
>
> [mpiwulf@biobos:~]$ cat output.log
> -------------------------------------------------------------------------- 
>
> The OpenFabrics (openib) BTL failed to initialize while trying to
> allocate some locked memory.  This typically can indicate that the
> memlock limits are set too low.  For most HPC installations, the
> memlock limits should be set to "unlimited".  The failure occured
> here:
>
>   Local host:    node0103
>   OMPI source:   btl_openib_component.c:1200
>   Function:      ompi_free_list_init_ex_new()
>   Device:        mthca0
>   Memlock limit: 65536
>
> You may need to consult with your system administrator to get this
> problem fixed.  This FAQ entry on the Open MPI web site may also be
> helpful:
>
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
> -------------------------------------------------------------------------- 
>
> -------------------------------------------------------------------------- 
>
> WARNING: There was an error initializing an OpenFabrics device.
>
>   Local host:   node0103
>   Local device: mthca0
> -------------------------------------------------------------------------- 
>
> Hello world!  I am process number: 5 on host node0103
> Hello world!  I am process number: 0 on host node0104
> Hello world!  I am process number: 2 on host node0110
> Hello world!  I am process number: 4 on host node0118
> Hello world!  I am process number: 1 on host node0109
> Hello world!  I am process number: 3 on host node0117
> [node0104:02221] 5 more processes have sent help message 
> help-mpi-btl-openib.txt / init-fail-no-mem
> [node0104:02221] Set MCA parameter "orte_base_help_aggregate" to 0 to 
> see all help / error messages
> [node0104:02221] 5 more processes have sent help message 
> help-mpi-btl-openib.txt / error in device init
>
> real    0m0.291s
> user    0m0.034s
> sys     0m0.043s
> ----------------
>
> Unfortunately, I still get the "error registering openib memory" 
> problem with non-interactive jobs. Any experience with this would be 
> great.

Got it now, though I *do not* really like the solution. I had to edit 
the pbs_mom init script:

# vi /etc/rc.d/init.d/pbs_mom

and make sure it has:

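# Raise the locked-memory limit so pbs_mom, and the MPI processes it
# spawns, can register enough memory for the openib BTL: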
ulimit -l unlimited
#ulimit -n 32768

and now openib works fine :).
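
(For reference: the FAQ entry quoted above suggests raising the memlock 
limit in /etc/security/limits.conf, along the lines of

# /etc/security/limits.conf - applies to logins that go through PAM (ssh etc.)
*   soft   memlock   unlimited
*   hard   memlock   unlimited

but that only affects processes started through PAM. pbs_mom is started 
at boot by its init script and never sees limits.conf, which is why the 
ulimit has to go into the init script itself.)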

D.






