[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Thu May 2 09:38:27 PDT 2019

Warming to my subject now. I really don't want to be specific about any
vendor, or cluster management package.
As I say, I have had experience ranging from national contracts, and currently
at a company with tens of thousands of CPUs worldwide,
down to installing half-rack HPC clusters for customers, and informally
supporting half-rack sized clusters where the users did not have formal
support.

When systems are bought the shiny bit is the hardware - much is made of the
latest generation CPUs, GPUs etc.
Buyers try to get as much hardware as they can for the price - which usually
comes down to CPU core count or HPL performance.
They will swallow support contracts as they don't want to have a big failure
and have their management (academic or industrial)
asking what the heck just happened and why the heck you are running without
support.
The hardware support is provided by the vendors, and their regional
distributors.
So from the point of view of a systems vendor hardware support is the
responsibility of the distributor or hardware vendor.

What DOES get squeezed is the HPC software stack support and the
applications level support.
After all - how hard can it be?
The sales guys told me that Intel now has 256 core processors with built-in
AI which will run any software faster
than you can type 'run'.
The new guy with the beard has a laptop which uses this Ubuntu operating
system - and it's all free.
Why do we need to pay $$$ for this cluster OS?

On Thu, 2 May 2019 at 17:18, John Hearns <hearnsj at googlemail.com> wrote:

> Chris, I have to say this. I have worked for smaller companies, and have
> worked for cluster integrators.
> For big university-sized sites and national labs the procurement exercise will
> end up with a well-defined support arrangement.
>
> I have seen, in one company I worked at, an HPC system arrive which I was
> not responsible for.
> This system was purchased by the IT department, and was intended to run
> Finite Element software.
> The hardware came from a Tier 1 vendor, but it was integrated by a small
> systems integrator.
> Yes, they installed a software stack and demonstrated that it would run
> Abaqus.
> But beyond that there was no support for getting other applications
> running. And no training that I could see in diagnosing faults.
>
> I am not going to name names, but I suspect experiences like that are
> common.
> Companies want to procure kit for as little as possible. Tier 1 vendors
> and white box vendors want to make the sales.
> But no-one wants to pay for Bright Cluster Manager, for example.
> So the end user gets at best a freeware solution like Rocks, or at worst
> some Kickstarted setup which installs an OS,
> the CentOS supplied IB drivers and MPI, and Gridengine slapped on top of
> that.
>
> This leads to an unsatisfying experience on the part of the end users, and
> also for the engineers of the integrating company.
>
> Which leads me to say that we see the rise of HPC in the cloud services:
> AWS, OnScale, Rescale, Verne Global, etc.
> And no wonder - you should be getting a much more polished and ready-to-go
> infrastructure, even though you can't physically touch it.
>
> On Thu, 2 May 2019 at 17:08, Christopher Samuel <chris at csamuel.org> wrote:
>
>> On 5/2/19 8:40 AM, Faraz Hussain wrote:
>>
>> > So should I be paying Mellanox to help? Or is it a RedHat issue? Or is
>> > it our hardware vendor, HP, who should be involved?
>>
>> I suspect that would be set out in the contract for the HP system.
>>
>> The clusters I've been involved in purchasing in the past have always
>> required support requests to go via the immediate vendor and they then
>> arrange to put you in contact with others where required.
>>
>> All the best,
>> Chris
>> --
>>    Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>
>