[Beowulf] [External] Re: Intel Cluster Checker

Black, Brady P brady.p.black at intel.com
Thu Apr 30 12:15:08 PDT 2020


You can launch clck either under Slurm (in which case you do not need to provide a nodefile; clck will use whatever nodes you are allocated) or with a nodefile, which is typically more of a standalone operation when the system is in maintenance mode.

To help solve the mpirun issue, it would be good to know the exact error. It could be that mpirun is not in the $PATH on that node or, if it lives on a shared drive, that the drive is not mounted there. When using Slurm, we suggest having the batch script source psxevars.sh as appropriate; if running from the command line, make sure your users' .bashrc has a similar line, so that 'which mpirun' leads you to the Intel MPI runtime command, usually in /opt/intel/....
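
As a rough illustration of the 'which mpirun' check described above, here is a minimal shell sketch. The function name check_mpirun is made up for this example and is not part of Cluster Checker or Intel MPI; it only reports where mpirun resolves in the current environment.

```shell
#!/bin/sh
# Hypothetical helper (not part of clck or Intel MPI): report where mpirun
# resolves in the current environment, or say that it is missing.
check_mpirun() {
    if mpirun_path=$(command -v mpirun 2>/dev/null); then
        echo "mpirun found: $mpirun_path"
        return 0
    fi
    # Print the search path so a missing mount or an unsourced
    # environment script is easier to spot.
    echo "mpirun not found in PATH=$PATH" >&2
    return 1
}
```

Calling check_mpirun at the top of a batch script, right after the environment script is sourced, would show whether the job sees the same mpirun as an interactive login does.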

There are also two ways in which Cluster Checker can perform multinode communication: through pdsh or through Intel MPI. If you made changes to the clck.xml file to enable the use of mpi.so as the communication library, then we would need to dig into why it was not completing the run correctly. The output of 'which mpirun' could help in this case.
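
To collect the 'which mpirun' output from every node in one go, something like the sketch below might help. It assumes a nodefile with one hostname per line, optionally followed by a '#' comment, and that pdsh is installed; the helper name nodefile_hosts is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: turn a nodefile into a comma-separated host list.
# Assumes one hostname per line; '#' comments and blank lines are dropped.
nodefile_hosts() {
    sed -e 's/#.*//' -e 's/[[:space:]]//g' -e '/^$/d' "$1" | paste -sd, -
}
```

With that in place, pdsh -w "$(nodefile_hosts nodefile)" 'which mpirun' should print one line per node, which makes it easy to see the node where mpirun is missing or resolves somewhere unexpected.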

Cheers
_Brady

> -----Original Message-----
> From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Prentice
> Bisbal via Beowulf
> Sent: Thursday, April 30, 2020 13:50
> To: beowulf at beowulf.org
> Subject: Re: [Beowulf] [External] Re: Intel Cluster Checker
> 
> When you launch your clck jobs, do you launch them with slurm, or do you
> use a nodefile? When I use a nodefile, I get an error that it can't call mpirun
> on one of the nodes, or something like that. I'd provide the exact error
> message, but I don't have access to it at the moment.
> 
> Prentice
> 
> On 4/30/20 11:49 AM, Black, Brady P wrote:
> > Hi - Intel Cluster Checker person chiming in.
> >
> > To answer your question, Prentice, about the runtime of Cluster Checker
> > (CLCK): this will depend on which set of tests, or framework definition
> > (FWD), you use and on the number of servers. The default FWD is
> > health_base, which should run in a matter of seconds. It was designed to
> > run quickly and be a sanity check before running jobs. Other FWDs are
> > designed for cluster hand-off and validation, so these will take much
> > longer as they run a multitude of different benchmarks on individual nodes
> > (stream/dgemm/sgemm/...) and across the cluster (hpcg/hpl/pairwise
> > imb/...) looking for outliers. These can take from 90+ minutes to multiple
> > hours depending on the system configuration and size. Of course there are
> > also in-between tests such as health_extended_user or mpi_prereq_user.
> >
> > Couple of tips - clck -X list is a great way to see what framework
> > definitions exist. clck -X <name_of_fwd> will give you more details on
> > what is being checked for the specific fwd.
> >
> > Thanks for using Cluster Checker and providing feedback. Happy to help
> > further.
> >
> > -Brady
> >
> >> -----Original Message-----
> >> From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of Michael Di Domenico
> >> Sent: Thursday, April 30, 2020 10:23
> >> Cc: Beowulf Mailing List <beowulf at beowulf.org>
> >> Subject: Re: [Beowulf] Intel Cluster Checker
> >>
> >> i played with it about a year ago since i get it as part of the intel
> >> compiler bundle we pay for.  it was overly complicated to install and
> >> run and didn't seem worth while.  kind of like getting a piece of
> >> ikea furniture but then trying to use a phillips screw driver to build it
> >> instead of the little wrench.
> >> otherwise when i dug into what it was actually doing, it didn't seem
> >> to be doing anything magical.  it was just doing it 'the intel way',
> >> which in my experience is generally very strange
> >>
> >>
> >>
> >> On Wed, Apr 29, 2020 at 4:07 PM Prentice Bisbal via Beowulf
> >> <beowulf at beowulf.org> wrote:
> >>> Beowulfers,
> >>>
> >>> Have any of you used the Intel Cluster Checker? I've been tasked
> >>> with using it, and I think I have it running, but the documentation
> >>> isn't very good. I was wondering how long a typical run on some
> >>> cluster nodes should take.
> >>>
> >>> Prentice
> >>>
> >>> _______________________________________________
> >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> >>> Computing To change your subscription (digest mode or unsubscribe)
> >>> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf