From stuartb at 4gh.net Sat Aug 1 15:24:18 2015 From: stuartb at 4gh.net (Stuart Barkley) Date: Sat, 1 Aug 2015 18:24:18 -0400 (EDT) Subject: [Beowulf] Scheduler question -- non-uniform memory allocation to MPI In-Reply-To: <55BA4414.7050601@harvill.net> References: <55A91292.1030303@rutgers.edu> <55BA4414.7050601@harvill.net> Message-ID: On Thu, 30 Jul 2015 at 11:34 -0000, Tom Harvill wrote: > We run SLURM with cgroups for memory containment of jobs. When > users request resources on our cluster many times they will specify > the number of (MPI) tasks and memory per task. The reality of much > of the software that runs is that most of the memory is used by MPI > rank 0 and much less on slave processes. This is wasteful and > sometimes causes bad outcomes (OOMs and worse) during job runs. I'll note that this problem also can occur in Grid Engine and OpenMPI. We would get user reports of random job failures. Sometimes the job would run and other times it would fail. We normally run allowing shared node access and the cases I've seen with problems were with a highly fragmented cluster with tasks spread 1-2 per node. Having the job request exclusive nodes (8 cores) was generally enough to consolidate the qrsh processes from ~200 to ~50 which provided enough headroom on the master process. The times I've observed have been due to the MPI startup process which spawns a qrsh/ssh login from the master node to each of the slave nodes (multiple MPI ranks on a slave share the same qrsh connection). The memory for all of these qrsh processes on the master node can eventually add up to be enough to cause out of memory conditions. This "solution" (workaround) has been good enough for our impacted users so far. Eventually without other changes this problem will return and not have as simple a solution. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From mikky_m at mail.ru Mon Aug 3 02:06:27 2015 From: mikky_m at mail.ru (=?UTF-8?B?TWlraGFpbCBLdXptaW5za3k=?=) Date: Mon, 03 Aug 2015 12:06:27 +0300 Subject: [Beowulf] =?utf-8?q?Haswell_as_supercomputer_microprocessors?= In-Reply-To: References: Message-ID: <1438592787.955484558@f398.i.mail.ru> New special supercomputer microprocessors (like IBM Power BQC and Fujitsu SPARC64 XIfx) have 2**N +2 cores (N=4 for 1st, N=5 for 2nd), where 2 last cores are redundant, not for computations, but only for other work w/Linux or even for replacing of failed computational core. Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores.? Is there some sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), and use only 16 Haswell cores for parallel computations ? If the answer is "yes", then how to use this way under Linux ? Mikhail Kuzminsky, Zelinsky Institute of Organic Chemistry RAS, Moscow -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.sassmannshausen at ucl.ac.uk Mon Aug 3 03:59:00 2015 From: j.sassmannshausen at ucl.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Mon, 3 Aug 2015 11:59:00 +0100 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438592787.955484558@f398.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> Message-ID: <201508031159.03015.j.sassmannshausen@ucl.ac.uk> Hi Mikhail, I would guess your queueing system could take care of that. With SGE you can define how many cores each node has. Thus, if you only want to use 16 out of the 18 cores you simply define that. 
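Something along these lines (just a sketch from memory, assuming a stock SGE/Grid Engine setup; "all.q" and "node001" are placeholder names, not your real queue or host):

    # limit the queue to 16 slots on that host
    qconf -aattr queue slots "[node001=16]" all.q

    # or cap the slots complex on the execution host itself
    qconf -mattr exechost complex_values slots=16 node001
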
Alternatively, at least OpenMPI allows you to underpopulate the nodes as well. Having said that, is there a good reason why you want to purchase 18 cores and then only use 16? The only thing I can think of why one needs to / wants to do that is if your job requires more memory which you got on the node. For memory intensive work I am still thinking that less cores and more nodes are beneficial here. My 2 cents from a sunny London J?rg On Monday 03 Aug 2015 10:06:27 Mikhail Kuzminsky wrote: > New special supercomputer microprocessors (like IBM Power BQC and Fujitsu > SPARC64 XIfx) have 2**N +2 cores (N=4 for 1st, N=5 for 2nd), where 2 last > cores are redundant, not for computations, but only for other work w/Linux > or even for replacing of failed computational core. > > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores. Is there > some sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), and use > only 16 Haswell cores for parallel computations ? If the answer is "yes", > then how to use this way under Linux ? > > Mikhail Kuzminsky, > Zelinsky Institute of Organic Chemistry RAS, > Moscow -- ************************************************************* Dr. J?rg Sa?mannshausen, MRSC University College London Department of Chemistry Gordon Street London WC1H 0AJ email: j.sassmannshausen at ucl.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 230 bytes Desc: This is a digitally signed message part. URL: From samuel at unimelb.edu.au Mon Aug 3 06:12:09 2015 From: samuel at unimelb.edu.au (Chris Samuel) Date: Mon, 03 Aug 2015 23:12:09 +1000 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438592787.955484558@f398.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> Message-ID: <1672352.DTB073LGgN@quad> On Mon, 3 Aug 2015 12:06:27 PM Mikhail Kuzminsky wrote: > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores. Is there some > sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), and use only 16 > Haswell cores for parallel computations ? If the answer is "yes", then how > to use this way under Linux ? Doing this with Linux predates BGQ for instance - the whole cpuset idea came from SGI and was used on their Itanic Altix systems to provide a boot CPU set that would have all system processes confined into and then the rest of the cores were available for jobs. When we used to use Torque I agitated for cpuset support, and for it to be done in a way that would allow this. We use Slurm now, but I've never looked at how easy to make it work in the boot cpuset type mode - it's probably just a matter of telling it there are N-1 cores per node and ensuring that it doesn't try and claim the same core you're using as the boot cpuset. :-) Best of luck! 
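PS: for the simple "advertise fewer cores" variant, the node line in slurm.conf would be roughly (untested sketch; node names and memory figure are placeholders):

    NodeName=node[01-10] CPUs=16 RealMemory=64000 State=UNKNOWN

i.e. tell Slurm about 16 of the 18 cores and leave the rest to the OS. Recent Slurm versions can also reserve cores explicitly with CoreSpecCount=2 on the node line rather than hiding them.
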
Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From John.Hearns at xma.co.uk Mon Aug 3 06:28:17 2015 From: John.Hearns at xma.co.uk (John Hearns) Date: Mon, 3 Aug 2015 13:28:17 +0000 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1672352.DTB073LGgN@quad> References: <1438592787.955484558@f398.i.mail.ru> <1672352.DTB073LGgN@quad> Message-ID: <3004B1DE9C157E4585DD4B35D316EDFDACD27C@ALXEXCHMB01.xma.co.uk> On Mon, 3 Aug 2015 12:06:27 PM Mikhail Kuzminsky wrote: > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores. Is > there some sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), > and use only 16 Haswell cores for parallel computations ? If the > answer is "yes", then how to use this way under Linux ? Doing this with Linux predates BGQ for instance - the whole cpuset idea came from SGI and was used on their Itanic Altix systems to provide a boot CPU set that would have all system processes confined into and then the rest of the cores were available for jobs. When we used to use Torque I agitated for cpuset support, and for it to be done in a way that would allow this. We use Slurm now, but I've never looked at how easy to make it work in the boot cpuset type mode - it's probably just a matter of telling it there are N-1 cores per node and ensuring that it doesn't try and claim the same core you're using as the boot cpuset. :-) Plus one to Chris with cpusets. Cpusets not only on Itanium - I used them on a large memory UV system. I Can see more and more people speccing high memory x86 systems these days, and they certainly should be looking at using cpusets. I have often though we should have 'donkey engine' CPUs for HPC. I thought these were the small enginers which powered up very large shipboard engines. I may have that wrong! https://en.wikipedia.org/wiki/Steam_donkey As Mikhail says, you run the OS and the batch system daemons on there, leaving the rest of the CPUs for 100% flat out HPC work. ##################################################################################### Scanned by MailMarshal - M86 Security's comprehensive email content security solution. ##################################################################################### Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of XMA Ltd are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). 
Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP From landman at scalableinformatics.com Mon Aug 3 06:37:19 2015 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 3 Aug 2015 09:37:19 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438592787.955484558@f398.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> Message-ID: <55BF6E8F.9040709@scalableinformatics.com> On 08/03/2015 05:06 AM, Mikhail Kuzminsky wrote: > New special supercomputer microprocessors (like IBM Power BQC and > Fujitsu SPARC64 XIfx) have 2**N +2 cores (N=4 for 1st, N=5 for 2nd), > where 2 last cores are redundant, not for computations, but only for > other work w/Linux or even for replacing of failed computational core. > > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores. Is there > some sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), and use > only 16 Haswell cores for parallel computations ? If the answer is > "yes", then how to use this way under Linux ? Its possible to do this with some taskset incantation with cpuset filesystem bits (burnt offerings generally not needed). I don't think there are "redundant" cores in the Intel product. Its left as an exercise to the reader to implement though ... More seriously, you can do some of this also with cgroups https://en.wikipedia.org/wiki/Cgroups which is actually what Docker et al. do (in part). There are many ways to attack this problem. If you are trying to isolate the OS from the computation, say to reduce OS jitter impacts upon processes, you might also like work on setting interrupt affinity, as well as start working with memory placement directly (to minimize QPI usage). The issue you will encounter is that most of the HPC systems with a single HCA/NIC will require IO to/from a remote (in a NUMA sense) node. Which means going over QPI. Unless you have the Intel Infinipath (or Omnipath ... I am not as up on the new naming as I should be) or a multi-rail config set up specifically to put one NIC/HCA on each socket. The point I am trying (subtly) to make here is that you can possibly spend more time and effort on optimization here. The question is (and for the above) the relative value of this. For various codes, OS jitter is very important, and you should seek to eliminate it. For others ... not so much. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. e: landman at scalableinformatics.com w: http://scalableinformatics.com t: @scalableinfo p: +1 734 786 8423 x121 c: +1 734 612 4615 From prentice.bisbal at rutgers.edu Mon Aug 3 08:10:43 2015 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Mon, 03 Aug 2015 11:10:43 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438592787.955484558@f398.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> Message-ID: <55BF8473.3050802@rutgers.edu> The processor in the IBM BG/Q is actually a POWER A2.[1] I never understood why Top500 listed them as BQC. The POWER A2 processor actually has 18 cores: 16 for computations, 1 for the OS itself, and 1 'spare'. I believe the spare is not a hot spare, but is there to increase the yield in chip manufacturing. If there are 18 usable cores on the chip, one is disabled. If one core is not usable, well, they still have the 17 they were hoping for. (This is what I heard, but I don't remember who the source was or how credible it was. If this is wrong, someone please correct me!). 
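(As an aside, on a plain Linux box the cpuset/cgroup approach mentioned earlier in the thread boils down to something like the following - the paths and core numbers are purely illustrative, assuming a cgroup-v1 style mount:

    # create a "system" cpuset confined to the last two cores
    mkdir /sys/fs/cgroup/cpuset/system
    echo 16-17 > /sys/fs/cgroup/cpuset/system/cpuset.cpus
    echo 0     > /sys/fs/cgroup/cpuset/system/cpuset.mems
    # move existing processes into it; kernel threads that refuse to move are harmless
    for p in $(ps -eo pid=); do echo $p > /sys/fs/cgroup/cpuset/system/cgroup.procs; done 2>/dev/null

which is the manual version of what SGI's boot cpuset and the batch schedulers' cgroup plugins automate.)
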
I wouldn't call the core reserved for the OS redundant. It actually improves the performance of the total system, as documented by the well-known 'ASCI Q' paper [2]. Now to answer your question, the answer is yes. I highly recommend you read [2] for a good explanation of why (the authors did a better job explaining it than I can in a quick e-mail). However, the improvement in performance increases with the size of the cluster, so it probably won't be noticeable on small clusters.

In addition to dedicating a single core to the OS, you also want to reduce OS 'noise' (also called 'jitter') as much as possible by reducing services on the compute nodes. You can do this by turning off or uninstalling unnecessary services and building a custom kernel that has only the services and hardware support needed by your cluster. This is the idea behind the very minimal compute-node kernel (CNK) of the Blue Gene nodes. This is an active area of research with many different groups working in this area:

https://en.wikipedia.org/wiki/Lightweight_Kernel_Operating_System
https://en.wikipedia.org/wiki/Compute_Node_Linux
http://www.mcs.anl.gov/research/projects/zeptoos/
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=323279

[1] http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?subtype=SP&infotype=PM&appname=STGE_DC_DC_USEN&htmlfid=DCD12345USEN&attachment=DCD12345USEN.PDF
[2] http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1592958

Prentice Bisbal
Systems Programmer/Administrator
Office of Instructional and Research Technology
Rutgers University
http://oirt.rutgers.edu

On 08/03/2015 05:06 AM, Mikhail Kuzminsky wrote: > New special supercomputer microprocessors (like IBM Power BQC and > Fujitsu SPARC64 XIfx) have 2**N +2 cores (N=4 for 1st, N=5 for 2nd), > where 2 last cores are redundant, not for computations, but only for > other work w/Linux or even for replacing of failed computational core. > > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores. Is > there some sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), > and use > only 16 Haswell cores for parallel computations ? If the > answer is > "yes", then how to use this way under Linux ? > > Mikhail Kuzminsky, > Zelinsky Institute of Organic Chemistry RAS, > Moscow > > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From kilian.cavalotti.work at gmail.com Mon Aug 3 09:18:09 2015 From: kilian.cavalotti.work at gmail.com (Kilian Cavalotti) Date: Mon, 3 Aug 2015 09:18:09 -0700 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438592787.955484558@f398.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> Message-ID:

Hi Mikhail,

That's something you can achieve with Slurm, using what they call "Core Specialization". See http://slurm.schedmd.com/core_spec.html for details.

Cheers, -- Kilian

On Mon, Aug 3, 2015 at 2:06 AM, Mikhail Kuzminsky wrote: > New special supercomputer microprocessors (like IBM Power BQC and Fujitsu > SPARC64 XIfx) have 2**N +2 cores (N=4 for 1st, N=5 for 2nd), where 2 last > cores are redundant, not for computations, but only for other work w/Linux > or even for replacing of failed computational core. > > Current Intel Haswell E5 v3 may also have 18 = 2**4 +2 cores.
Is there some > sense to try POWER BQC or SPARC64 XIfx ideas (not exactly), and use only 16 > Haswell cores for parallel computations ? If the answer is "yes", then how > to use this way under Linux ? > > Mikhail Kuzminsky, > Zelinsky Institute of Organic Chemistry RAS, > Moscow > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Kilian

From mikky_m at mail.ru Tue Aug 4 03:38:48 2015 From: mikky_m at mail.ru (=?UTF-8?B?TWlraGFpbCBLdXptaW5za3k=?=) Date: Tue, 04 Aug 2015 13:38:48 +0300 Subject: [Beowulf] =?utf-8?q?Haswell_as_supercomputer_microprocessors?= In-Reply-To: References: <1438592787.955484558@f398.i.mail.ru> Message-ID: <1438684728.117628822@f300.i.mail.ru>

In my opinion, PowerPC A2 should more exactly be used as the name of the *core*, not of the IBM BlueGene/Q *processor chip*. The "Power BQC" name is used in TOP500, GREEN500, in a lot of Internet data, and in the IBM journal - see:

Sugavanam K. et al. Design for low power and power management in IBM Blue Gene/Q // IBM Journal of Research and Development, 2013, v. 57, no. 1/2, p. 3:1-3:11.

PowerPC A2 is the core, see
//en.wikipedia.org/wiki/Blue_Gene
//en.wikipedia.org/wiki/PowerPC A2

Mikhail

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From John.Hearns at xma.co.uk Tue Aug 4 06:34:19 2015 From: John.Hearns at xma.co.uk (John Hearns) Date: Tue, 4 Aug 2015 13:34:19 +0000 Subject: [Beowulf] CAP Theorem Message-ID: <3004B1DE9C157E4585DD4B35D316EDFDACD7ED@ALXEXCHMB01.xma.co.uk>

I have been reading two interesting articles on Docker:

http://blog.circleci.com/its-the-future/
http://blog.circleci.com/it-really-is-the-future/

The first one is a good laugh and was meant as a parody.

I guess there may have been discussions on CAP Theorem with relevance to HPC, and especially exascale systems. However the term is new to me. https://en.wikipedia.org/wiki/CAP_theorem

I realise that it is relevant to distributed databases, but how about distributed computation?

________________________________ Scanned by MailMarshal - M86 Security's comprehensive email content security solution. ________________________________ Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of XMA Ltd are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mndoci at gmail.com Tue Aug 4 06:52:25 2015 From: mndoci at gmail.com (Deepak Singh) Date: Tue, 4 Aug 2015 06:52:25 -0700 Subject: [Beowulf] CAP Theorem In-Reply-To: <3004B1DE9C157E4585DD4B35D316EDFDACD7ED@ALXEXCHMB01.xma.co.uk> References: <3004B1DE9C157E4585DD4B35D316EDFDACD7ED@ALXEXCHMB01.xma.co.uk> Message-ID: If you ever want to dive deep into how well various systems handle partitions look no further than Aphyr's Jepsen series https://aphyr.com/tags/jepsen > On Aug 4, 2015, at 06:34, John Hearns wrote: > > I have been reading two interesting articles on Docker: > > http://blog.circleci.com/its-the-future/ > http://blog.circleci.com/it-really-is-the-future/ > > The first one is a good laugh and was meant as a parody. > > I guess there may have have been discussions on CAP Theorem with relevance to HPC, and especially exascale systems. > However the term is new to me. https://en.wikipedia.org/wiki/CAP_theorem > > I realise that it is relevant to distributed databases, but how about distributed computation? > > Scanned by MailMarshal - M86 Security's comprehensive email content security solution. > > Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of XMA Ltd are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From prentice.bisbal at rutgers.edu Tue Aug 4 07:52:00 2015 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Tue, 04 Aug 2015 10:52:00 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <1438684728.117628822@f300.i.mail.ru> References: <1438592787.955484558@f398.i.mail.ru> <1438684728.117628822@f300.i.mail.ru> Message-ID: <55C0D190.3010105@rutgers.edu> Seriously? Why does IBM have to make everything so difficult? Take GPFS. It was originally called MMFS for Multimedia filesystem, then GPFS for General Parallel Filesystem. A couple of years ago they decided to market it as a hardware/software solution called GPFS Storage Server, or GSS. Apparently, that didn't have enough buzzwordiness to it, so they changed it to ESS, for Elastic Storage Server. As if that wasn't enough, then they had to confuse their current and future customers by changing the name yet again to Spectra-scale. And yes, I am annoyed by all this! What's really ironic is that IBM is one of the leaders brand management/corporate identity, so you'd think they'd see the value in sticking with a name. Rant over. Prentice On 08/04/2015 06:38 AM, Mikhail Kuzminsky wrote: > By my opinion, PowerPC A2 more exactly should be used as name for > *core*, not for IBM BlueGene/Q *processor chip*. 
> "Power BQC" name is used in TOP500, GREEN500, in a lot of Internet > data, in IBM journal - see: > > Sugavanam K. et al. Design for low power and power management in IBM > Blue Gene/Q //IBM Journal of Research and > Development. ? 2013. ?v. 57. ? ?. 1/2. ? p. 3: 1-3: 11. > > PowerPC A2 is the core, see //en.wikipedia.org/wiki/Blue_Gene > //en.wikipedia.org/wiki/PowerPC A2 > > Mikhail > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Tue Aug 4 17:50:19 2015 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Wed, 05 Aug 2015 10:50:19 +1000 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <55C0D190.3010105@rutgers.edu> References: <1438592787.955484558@f398.i.mail.ru> <1438684728.117628822@f300.i.mail.ru> <55C0D190.3010105@rutgers.edu> Message-ID: <55C15DCB.9040003@unimelb.edu.au> On 05/08/15 00:52, Prentice Bisbal wrote: > Seriously? Why does IBM have to make everything so difficult? As I understand it Power BQC is the SoC (so CPU, networking, etc), whereas A2 is the CPU core & instruction set. So it's a fair distinction (though one that is often glossed over in practice). -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From prentice.bisbal at rutgers.edu Wed Aug 5 07:38:09 2015 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Wed, 05 Aug 2015 10:38:09 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <55C15DCB.9040003@unimelb.edu.au> References: <1438592787.955484558@f398.i.mail.ru> <1438684728.117628822@f300.i.mail.ru> <55C0D190.3010105@rutgers.edu> <55C15DCB.9040003@unimelb.edu.au> Message-ID: <55C21FD1.5010704@rutgers.edu> On 08/04/2015 08:50 PM, Christopher Samuel wrote: > On 05/08/15 00:52, Prentice Bisbal wrote: > >> Seriously? Why does IBM have to make everything so difficult? > As I understand it Power BQC is the SoC (so CPU, networking, etc), > whereas A2 is the CPU core & instruction set. So it's a fair > distinction (though one that is often glossed over in practice). Okay. That makes perfect sense, but I will still argue that if that correct, using that terminology in the Top500 list doesn't make sense. An SoC is equivalent to a motherboard, but for regular Intel/AMD systems, they list the processor model, not the motherboard, so to list BQC instead of POWER A2 for the BG systems is inconsistent. As an example, compare the Sequoia description from the Top500 list to Stampede, which is just a regular x86 system: *Sequoia*: BlueGene/Q, Power BQC 16C 1.60 GHz, Custom, IBM *Stampede*: PowerEdge C8220, Xeon E5-2680 8C 2.700GHz, Infiniband FDR, Intel Xeon Phi SE10P, Dell Prentice -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jason at lovesgoodfood.com Fri Aug 7 07:38:59 2015 From: jason at lovesgoodfood.com (Jason Riedy) Date: Fri, 07 Aug 2015 10:38:59 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors References: <1438592787.955484558@f398.i.mail.ru> <1438684728.117628822@f300.i.mail.ru> <55C0D190.3010105@rutgers.edu> <55C15DCB.9040003@unimelb.edu.au> <55C21FD1.5010704@rutgers.edu> Message-ID: <871tffjh8c.fsf@qNaN.sparse.dyndns.org> And Prentice Bisbal writes: > Okay. That makes perfect sense, but I will still argue that if that > correct, using that terminology in the Top500 list doesn't make > sense. I less care about the terminology than that the linpack results are cut and paste between identical configurations rather than actually run on them. But I'm sure no large system has a poor cable or connection in the mix that would be detected... From prentice.bisbal at rutgers.edu Fri Aug 7 13:34:02 2015 From: prentice.bisbal at rutgers.edu (Prentice Bisbal) Date: Fri, 07 Aug 2015 16:34:02 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <871tffjh8c.fsf@qNaN.sparse.dyndns.org> References: <1438592787.955484558@f398.i.mail.ru> <1438684728.117628822@f300.i.mail.ru> <55C0D190.3010105@rutgers.edu> <55C15DCB.9040003@unimelb.edu.au> <55C21FD1.5010704@rutgers.edu> <871tffjh8c.fsf@qNaN.sparse.dyndns.org> Message-ID: <55C5163A.3010300@rutgers.edu> On 08/07/2015 10:38 AM, Jason Riedy wrote: > And Prentice Bisbal writes: >> Okay. That makes perfect sense, but I will still argue that if that >> correct, using that terminology in the Top500 list doesn't make >> sense. > I less care about the terminology than that the linpack results > are cut and paste between identical configurations rather than > actually run on them. But I'm sure no large system has a poor > cable or connection in the mix that would be detected... > > Well, the terminology helps us to make sure we're comparing apples to apples! -- Prentice From hakon.bugge at gmail.com Sat Aug 8 06:08:09 2015 From: hakon.bugge at gmail.com (=?utf-8?B?SMOla29uIEJ1Z2dl?=) Date: Sat, 08 Aug 2015 15:08:09 +0200 Subject: [Beowulf] =?utf-8?q?Haswell_as_supercomputer_microprocessors?= Message-ID: <55c5ff39.a181700a.5edf9.6934@mx.google.com> Sorry for top posting. Jason has more than a valid point. At least in former times, I do know that cut&paste from not only _identical_ configurations happened. For example, the system submitted to the list was equipped with eth NICs, whereas the performance quoted was from a similar system, but with a proprietary HPC interconnect. So much for the apples-to-apples. I favour the SPEC suites when it comes to comparing systems, but with the caveat that vendors or customers of the larger system show little interrest. H?kon Sendt fra min HTC ----- Reply message ----- Fra: "Prentice Bisbal" Til: Emne: [Beowulf] Haswell as supercomputer microprocessors Dato: fre., aug. 7, 2015 22:34 On 08/07/2015 10:38 AM, Jason Riedy wrote: > And Prentice Bisbal writes: >> Okay. That makes perfect sense, but I will still argue that if that >> correct, using that terminology in the Top500 list doesn't make >> sense. > I less care about the terminology than that the linpack results > are cut and paste between identical configurations rather than > actually run on them. But I'm sure no large system has a poor > cable or connection in the mix that would be detected... > > Well, the terminology helps us to make sure we're comparing apples to apples! 
-- Prentice _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Sat Aug 8 06:41:07 2015 From: samuel at unimelb.edu.au (Chris Samuel) Date: Sat, 08 Aug 2015 23:41:07 +1000 Subject: [Beowulf] Haswell as supercomputer microprocessors In-Reply-To: <871tffjh8c.fsf@qNaN.sparse.dyndns.org> References: <55C21FD1.5010704@rutgers.edu> <871tffjh8c.fsf@qNaN.sparse.dyndns.org> Message-ID: <3080805.ZaGZyPBPp7@quad> On Fri, 7 Aug 2015 10:38:59 AM Jason Riedy wrote: > I less care about the terminology than that the linpack results > are cut and paste between identical configurations rather than > actually run on them. But I'm sure no large system has a poor > cable or connection in the mix that would be detected... IIRC (not at work to check) HPL is actually part of the BGQ diagnostics; BGQ also has some very useful cable diagnostics that it monitors and flags broken wires up proactively (and has spares to work around them). But not really a beowulf system.. :-) -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From jason at lovesgoodfood.com Sat Aug 8 14:04:18 2015 From: jason at lovesgoodfood.com (Jason Riedy) Date: Sat, 08 Aug 2015 17:04:18 -0400 Subject: [Beowulf] Haswell as supercomputer microprocessors References: <55C21FD1.5010704@rutgers.edu> <871tffjh8c.fsf@qNaN.sparse.dyndns.org> <3080805.ZaGZyPBPp7@quad> Message-ID: <87h9o9mqzx.fsf@qNaN.sparse.dyndns.org> And Chris Samuel writes: > IIRC (not at work to check) HPL is actually part of the BGQ diagnostics; BGQ > also has some very useful cable diagnostics that it monitors and flags broken > wires up proactively (and has spares to work around them). And part of most acceptance tests, but those aren't the results reported on the list. The variance in commercial systems' results could be a useful reliability-like metric. From John.Hearns at xma.co.uk Thu Aug 20 06:24:14 2015 From: John.Hearns at xma.co.uk (John Hearns) Date: Thu, 20 Aug 2015 13:24:14 +0000 Subject: [Beowulf] Accelio Message-ID: <3004B1DE9C157E4585DD4B35D316EDFDAD251D@ALXEXCHMB01.xma.co.uk> I saw this mentioned on the Mellanox site. Has anyone come across it: http://www.accelio.org/ Looks interesting. Dr. John Hearns Principal HPC Engineer Product Development T: M: F: 01727 201 800 07432 647 511 01727 201 814 Visit us at www.xma.co.uk Follow us @WeareXMA XMA 7 Handley Page Way Old Parkbury Lane Colney Street St. Albans Hertfordshire AL2 2DQ [We are XMA.] [XMA] ________________________________ Scanned by MailMarshal - M86 Security's comprehensive email content security solution. ________________________________ Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of the company. Employees of XMA Ltd are expressly required not to make defamatory statements and not to infringe or authorise any infringement of copyright or any other legal right by email communications. Any such communication is contrary to company policy and outside the scope of the employment of the individual concerned. 
The company will not accept any liability in respect of such communication, and the employee responsible will be personally liable for any damages or other liability arising. XMA Limited is registered in England and Wales (registered no. 2051703). Registered Office: Wilford Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 4814 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 2187 bytes Desc: image002.png URL: From e.scott.atchley at gmail.com Thu Aug 20 11:22:06 2015 From: e.scott.atchley at gmail.com (Scott Atchley) Date: Thu, 20 Aug 2015 14:22:06 -0400 Subject: [Beowulf] Accelio In-Reply-To: <3004B1DE9C157E4585DD4B35D316EDFDAD251D@ALXEXCHMB01.xma.co.uk> References: <3004B1DE9C157E4585DD4B35D316EDFDAD251D@ALXEXCHMB01.xma.co.uk> Message-ID: They are using this as a basis for the XioMessenger within Ceph to get RDMA support. On Thu, Aug 20, 2015 at 9:24 AM, John Hearns wrote: > I saw this mentioned on the Mellanox site. Has anyone come across it: > > > > http://www.accelio.org/ > > > > Looks interesting. > > > > > > > > Dr. John Hearns > Principal HPC Engineer > Product Development > > T: > M: > F: > > > 01727 201 800 > 07432 647 511 > 01727 201 814 > > > Visit us at www.xma.co.uk > Follow us @WeareXMA > > > *XMA* > 7 Handley Page Way > Old Parkbury Lane > Colney Street > St. Albans > Hertfordshire > AL2 2DQ > > > > [image: We are XMA.] > > > > [image: XMA] > > > > > > ------------------------------ > > Scanned by *MailMarshal* - M86 Security's comprehensive email content > security solution. > > ------------------------------ > Any views or opinions presented in this email are solely those of the > author and do not necessarily represent those of the company. Employees of > XMA Ltd are expressly required not to make defamatory statements and not to > infringe or authorise any infringement of copyright or any other legal > right by email communications. Any such communication is contrary to > company policy and outside the scope of the employment of the individual > concerned. The company will not accept any liability in respect of such > communication, and the employee responsible will be personally liable for > any damages or other liability arising. XMA Limited is registered in > England and Wales (registered no. 2051703). Registered Office: Wilford > Industrial Estate, Ruddington Lane, Wilford, Nottingham, NG11 7EP > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 4814 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image002.png Type: image/png Size: 2187 bytes Desc: not available URL: From jason at lovesgoodfood.com Thu Aug 20 13:32:46 2015 From: jason at lovesgoodfood.com (Jason Riedy) Date: Thu, 20 Aug 2015 16:32:46 -0400 Subject: [Beowulf] Accelio References: <3004B1DE9C157E4585DD4B35D316EDFDAD251D@ALXEXCHMB01.xma.co.uk> Message-ID: <87zj1l1z0x.fsf@qNaN.sparse.dyndns.org> And John Hearns writes: > I saw this mentioned on the Mellanox site. Has anyone come across it: > http://www.accelio.org/ Why have one when you can have many? http://www.openucx.org/ From jcownie at gmail.com Thu Aug 20 14:13:45 2015 From: jcownie at gmail.com (James Cownie) Date: Thu, 20 Aug 2015 22:13:45 +0100 Subject: [Beowulf] Accelio In-Reply-To: <87zj1l1z0x.fsf@qNaN.sparse.dyndns.org> References: <3004B1DE9C157E4585DD4B35D316EDFDAD251D@ALXEXCHMB01.xma.co.uk> <87zj1l1z0x.fsf@qNaN.sparse.dyndns.org> Message-ID: > On 20 Aug 2015, at 21:32, Jason Riedy wrote: > > And John Hearns writes: >> I saw this mentioned on the Mellanox site. Has anyone come across it: >> http://www.accelio.org/ > > Why have one when you can have many? http://www.openucx.org/ Indeed, though maybe at a slightly lower level : http://ofiwg.github.io/libfabric/ -- Jim James Cownie Mob: +44 780 637 7146 http://skiingjim.blogspot.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From samuel at unimelb.edu.au Sat Aug 29 04:07:00 2015 From: samuel at unimelb.edu.au (Chris Samuel) Date: Sat, 29 Aug 2015 21:07 +1000 Subject: [Beowulf] glibc 2.22 includes a vector math library (x86_64 initially) Message-ID: <4314980.u8PKEVSjLZ@quad> Hi all, Don't know if many people noticed this, but this looks like a handy new feature for glibc to get (from the release announcement): https://www.sourceware.org/ml/libc-alpha/2015-08/msg00609.html #* Added vector math library named libmvec with the following vectorized # x86_64 implementations: cos, cosf, sin, sinf, sincos, sincosf, log, logf, # exp, expf, pow, powf. More info on the glibc website: https://sourceware.org/glibc/wiki/libmvec # Libmvec is vector math library added in Glibc 2.22. # # Vector math library was added to support SIMD constructs of OpenMP4.0 # (#2.8 in http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf) by # adding vector implementations of vector math functions. # # Vector math functions are vector variants of corresponding scalar math # operations implemented using SIMD ISA extensions (e.g. SSE or AVX for # x86_64). They take packed vector arguments, perform the operation on # each element of the packed vector argument, and return a packed vector # result. Using vector math functions is faster than repeatedly calling the # scalar math routines. All the best, Chris -- Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci From nick.c.evans at gmail.com Mon Aug 31 19:54:19 2015 From: nick.c.evans at gmail.com (Nick Evans) Date: Tue, 1 Sep 2015 12:54:19 +1000 Subject: [Beowulf] Diagnosing Discovery issue xCat Message-ID: Hi All, I am sure i am just doing something silly as i haven't had an issue in the past getting nodes discovered via the switch port lookup method. Currently the newly booting node goes through the following steps Get IP Get the "xcat/xnba.kpxe" file Download the Genisis discovery environment and boot into it re-request IP get certificate initiate discovery This then loops never actually discovering. 
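As I understand it, the switch-based discovery boils down to xCAT walking the BRIDGE-MIB / Q-BRIDGE-MIB forwarding table on the switch, so I guess a rough by-hand check of that lookup from the management node would be something like

    snmpwalk -v 1 -c public ms-h25-mgtobm-1g-40 1.3.6.1.2.1.17.4.3.1.2

where "public" is only a placeholder for the real community string; the numeric OID there is dot1dTpFdbPort, so no vendor MIB files should be needed just for this.
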
I have also attached the output of the messages file from the management node. Hardware is IBM dx360m4 node attached to Cisco WS-C3750G-48PS-S switch Any pointers on where to look for anything that might shed some light on this issue will be helpful. Also do i need to specifically get the MIBS file for the switch as i don't recall needing to to this in the past? Thanks in advance Nick -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- Sep 1 12:45:51 mgt dhcpd: DHCPDISCOVER from 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:52 mgt dhcpd: DHCPOFFER on 10.10.200.1 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:53 mgt dhcpd: DHCPREQUEST for 10.10.200.1 (10.10.100.79) from 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:53 mgt dhcpd: DHCPACK on 10.10.200.1 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:53 mgt in.tftpd[28421]: RRQ from 10.10.200.1 filename xcat/xnba.kpxe Sep 1 12:45:53 mgt in.tftpd[28421]: tftp: client does not accept options Sep 1 12:45:53 mgt in.tftpd[28422]: RRQ from 10.10.200.1 filename xcat/xnba.kpxe Sep 1 12:45:53 mgt dhcpd: DHCPDISCOVER from 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:54 mgt dhcpd: DHCPOFFER on 10.10.200.2 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:54 mgt dhcpd: DHCPREQUEST for 10.10.200.2 (10.10.100.79) from 40:f2:e9:04:79:e2 via em3 Sep 1 12:45:54 mgt dhcpd: DHCPACK on 10.10.200.2 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:47:07 mgt dhcpd: DHCPDISCOVER from 40:f2:e9:04:79:e2 via em3 Sep 1 12:47:08 mgt dhcpd: DHCPOFFER on 10.10.200.3 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:47:08 mgt dhcpd: DHCPREQUEST for 10.10.200.3 (10.10.100.79) from 40:f2:e9:04:79:e2 via em3 Sep 1 12:47:08 mgt dhcpd: DHCPACK on 10.10.200.3 to 40:f2:e9:04:79:e2 via em3 Sep 1 12:47:10 mgt xcat[33070]: xCAT: Allowing getcredentials x509cert Sep 1 12:47:46 mgt xcat[25330]: xcatd: Processing discovery request from 10.10.200.3 Sep 1 12:47:47 mgt xcat[39277]: Error communicating with ms-h25-mgtobm-1g-40: Unable to get MAC entries via either BRIDGE or Q-BRIDE MIB Sep 1 12:47:53 mgt xcat[25330]: xcatd: Processing discovery request from 10.10.200.3 Sep 1 12:47:59 mgt xcat[25330]: xcatd: Processing discovery request from 10.10.200.3 Sep 1 12:48:05 mgt xcat[25330]: xcatd: Processing discovery request from 10.10.200.3 Sep 1 12:48:11 mgt xcat[25330]: xcatd: Processing discovery request from 10.10.200.3 Sep 1 12:48:11 mgt xcat[39294]: Error communicating with ms-h25-mgtobm-1g-40: Unable to get MAC entries via either BRIDGE or Q-BRIDE MIB From samuel at unimelb.edu.au Mon Aug 31 20:43:11 2015 From: samuel at unimelb.edu.au (Christopher Samuel) Date: Tue, 1 Sep 2015 13:43:11 +1000 Subject: [Beowulf] Diagnosing Discovery issue xCat In-Reply-To: References: Message-ID: <55E51ECF.50604@unimelb.edu.au> Hi Nick, On 01/09/15 12:54, Nick Evans wrote: > Any pointers on where to look for anything that might shed some light > on this issue will be helpful. Also do i need to specifically get the > MIBS file for the switch as i don't recall needing to to this in the > past? I'm just bringing up a new cluster with xCAT and found that I was having issues with xCAT talking to the switches for discovery of the blade chassis. It turned out that whilst the documentation said that xCAT defaults to using SNMPv1 by default it actually takes the default of the underlying library and that now is SNMPv3. 
So we did:

# tabdump switches
#switch,snmpversion,username,password,privacy,auth,linkports,sshusername,sshpassword,protocol,switchtype,comments,disable
"sw18","SNMPv1",,,,,,,,,"BNT",,

You can tell for certain with wireshark or tcpdump.

If that is the case for you, you can just set it as above (of course you'll want "Cisco" instead of "BNT" for yours).

Best of luck!
Chris
--
Christopher Samuel Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci

From nick.c.evans at gmail.com Mon Aug 31 21:21:21 2015 From: nick.c.evans at gmail.com (Nick Evans) Date: Tue, 1 Sep 2015 14:21:21 +1000 Subject: [Beowulf] Diagnosing Discovery issue xCat In-Reply-To: <55E51ECF.50604@unimelb.edu.au> References: <55E51ECF.50604@unimelb.edu.au> Message-ID:

Hi Chris,

Thanks for the insight. My switch table is as follows:

#switch,snmpversion,username,password,privacy,auth,linkports,sshusername,sshpassword,protocol,switchtype,comments,disable
"ms-h25-data-10g-42","SNMPv1",,,,,,,,,"Cisco",,
"ms-h25-mgtobm-1g-40","SNMPv1",,,,,,,,,"Cisco",,

I originally had just "2c" for the snmpversion and have now tried 1, 2c, SNMPv1, SNMPv2c... all with no luck. Will have to get Wireshark onto it and find out what is happening.

Thanks
Nick

On 1 September 2015 at 13:43, Christopher Samuel wrote: > Hi Nick, > > On 01/09/15 12:54, Nick Evans wrote: > > > Any pointers on where to look for anything that might shed some light > > on this issue will be helpful. Also do i need to specifically get the > > MIBS file for the switch as i don't recall needing to to this in the > > past? > > I'm just bringing up a new cluster with xCAT and found that I was having > issues with xCAT talking to the switches for discovery of the blade > chassis. > > It turned out that whilst the documentation said that xCAT defaults to > using SNMPv1 by default it actually takes the default of the underlying > library and that now is SNMPv3. > > So we did: > > # tabdump switches > > #switch,snmpversion,username,password,privacy,auth,linkports,sshusername,sshpassword,protocol,switchtype,comments,disable > "sw18","SNMPv1",,,,,,,,,"BNT",, > > You can tell for certain with wireshark or tcpdump. > > If that is the case for you, you can just set it as above (of course > you'll want "Cisco" instead of "BNT" for yours). > > Best of luck! > Chris > -- > Christopher Samuel Senior Systems Administrator > VLSCI - Victorian Life Sciences Computation Initiative > Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545 > http://www.vlsci.org.au/ http://twitter.com/vlsci > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: