[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Jonathan Aquilina jaquilina at eagleeyet.net
Thu May 2 09:10:37 PDT 2019


Hi John,

I think there is a slight inaccuracy in your mention of HP. What I have learned from working with a local HP and HPE distributor is that for servers and related hardware you want to deal with HPE (HP Enterprise), whereas standard consumer hardware is bought from HP; they are two distinct companies focused on different market segments.

For clusters built on HP servers, has anyone spoken to or dealt with HPE support for this kind of thing?

Regards,
Jonathan Aquilina
Owner EagleEyeT

________________________________
From: Beowulf <beowulf-bounces at beowulf.org> on behalf of John Hearns via Beowulf <beowulf at beowulf.org>
Sent: Thursday, May 2, 2019 6:03 PM
To: Faraz Hussain
Cc: Beowulf Mailing List; Christopher Samuel
Subject: Re: [Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

You ask some damned good questions there.
I will try to answer them from the point of view of someone who has worked as an HPC systems integrator and supported HPC systems,
both for systems integrators and within companies.

We will start with HP. Did you buy those systems direct from HP as servers, or did you buy a configured HPC system,
complete with Infiniband networking and with a software stack?
If you bought bare metal servers then you are out of luck regarding support, other than hardware failures.
HP now incorporate SGI, and their support is fantastic. Great people work for HP and SGI. But they aren't responsible for your install.

If however you bought an integrated HPC system this will normally be integrated by a smaller company, usually in your country.
Is this the case here?  Then yes the integrator should be providing support.
HOWEVER you have elected to remove their installed OS and upgrade by yourself. If I was the integrator I would give advice,
but refuse to support the upgrade unless it was recommended by us, and you have a continuing support contract.

You are using CentOS. The CentOS team are great guys - I know the founder quite well, and know people who work for Red Hat.
You have chosen CentOS, a community-supported operating system. Perhaps join the CentOS HPC SIG and ask for help.
But you don't get support from Red Hat, as you are not using Red Hat Enterprise Linux.

Now we come to Mellanox. Mellanox support is fantastic. Formally, to open a support ticket with them you will need a support agreement
on your switch. You HAVE got a support agreement - right?
If not I have found that informal requests for support are often answered by Mellanox support.

Failing all of those you could hire me!
(I am being semi-serious here - I am a permanent employee at the moment, but I have worked as an HPC contractor in the past,
and if I could justify it I would prefer to do HPC support on a contract basis).

On Thu, 2 May 2019 at 16:45, Faraz Hussain <info at feacluster.com> wrote:
Thanks. Before I go down the path of installing things willy-nilly, is
there some guide I should be following instead? I obviously have a
problem with my Mellanox drivers combined with "user error".

So should I be paying Mellanox to help? Or is it a RedHat issue? Or is
it our hardware vendor, HP, who should be involved?

Looks like I need support on how to get support :-)


Quoting Christopher Samuel <chris at csamuel.org>:

>> root at lustwzb34:/root # systemctl status rdma
>> Unit rdma.service could not be found.
>
> You're missing this RPM then, which might explain a lot:
>
> $ rpm -qi rdma-core
> Name        : rdma-core
> Version     : 17.2
> Release     : 3.el7
> Architecture: x86_64
> Install Date: Tue 04 Dec 2018 03:58:16 PM AEDT
> Group       : Unspecified
> Size        : 107924
> License     : GPLv2 or BSD
> Signature   : RSA/SHA256, Tue 13 Nov 2018 01:45:22 AM AEDT, Key ID 24c6a8a7f4a80eb5
> Source RPM  : rdma-core-17.2-3.el7.src.rpm
> Build Date  : Wed 31 Oct 2018 07:10:24 AM AEDT
> Build Host  : x86-01.bsys.centos.org
> Relocations : (not relocatable)
> Packager    : CentOS BuildSystem <http://bugs.centos.org>
> Vendor      : CentOS
> URL         : https://github.com/linux-rdma/rdma-core
> Summary     : RDMA core userspace libraries and daemons
> Description :
> RDMA core userspace infrastructure and documentation, including initscripts,
> kernel driver-specific modprobe override configs, IPoIB network scripts,
> dracut rules, and the rdma-ndd utility.
>
> --
>   Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


