[Beowulf] [External] anyone have modern interconnect metrics?

Scott Atchley e.scott.atchley at gmail.com
Sat Jan 20 17:10:40 UTC 2024


On Fri, Jan 19, 2024 at 9:40 PM Prentice Bisbal via Beowulf <
beowulf at beowulf.org> wrote:

> > Yes, someone is sure to say "don't try characterizing all that stuff -
> > it's your application's performance that matters!"  Alas, we're a generic
> > "any kind of research computing" organization, so there are thousands
> > of apps
> > across all possible domains.
>
> <rant>
>
> I agree with you. I've always hated the "it depends on your application"
> stock response in HPC. I think it's BS. Very few of us work in an
> environment where we support only a handful of applications with very
> similar characteristics. I say use standardized benchmarks that test
> specific performance metrics (mem bandwidth or mem latency, etc.),
> first, and then use a few applications to confirm what you're seeing
> with those benchmarks.
>
> </rant>
>

It does depend on the application(s). At OLCF, we have hundreds of
applications. Some pound the network and some do not. Because we are a
Leadership Computing Facility, a user cannot get any time on the machine
unless they can scale to at least 20%, and ideally 100%, of the system. We have
several apps with FFTs, which become all-to-alls in MPI. Because of this,
ideally we want a non-blocking fat-tree (i.e., Clos) topology. Every other
topology is a compromise. That said, a full Clos is 2x or more in cost
compared to other common topologies (e.g., dragonfly or a 2:1
oversubscribed fat-tree). If your workload is small jobs that can fit in a
rack, for example, then by all means save some money and get an
oversubscribed fat-tree, dragonfly, etc. If your jobs need to use the full
machine and they have large-message collectives, then you have to bite the
bullet and spend more on network and less on compute and/or storage.
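
Something like the back-of-the-envelope sketch below shows where that cost
difference comes from. The 48-port radix and the 2:1 split are just example
numbers, not any real system, and a three-level Clos at leadership scale
widens the gap further:

/* Toy two-level fat-tree (leaf/spine) sizing arithmetic -- an illustration of
 * why non-blocking costs more per host, not a model of any OLCF machine.
 * The switch radix (48) and the 2:1 split are arbitrary example choices. */
#include <stdio.h>

static void size_fabric(const char *name, int radix, int down, int up)
{
    int spines = up;                 /* each leaf runs one uplink to each spine */
    int leaves = radix;              /* spine radix caps the number of leaves   */
    int hosts  = leaves * down;      /* hosts hang off the leaf downlinks       */
    int switches = leaves + spines;
    printf("%-14s %5d hosts, %3d leaves + %2d spines = %3d switches "
           "(%.1f switches per 1000 hosts)\n",
           name, hosts, leaves, spines, switches, 1000.0 * switches / hosts);
}

int main(void)
{
    const int k = 48;                                   /* example switch radix */
    size_fabric("non-blocking:", k, k / 2, k / 2);      /* 24 down, 24 up */
    size_fabric("2:1 oversub:",  k, 2 * k / 3, k / 3);  /* 32 down, 16 up */
    return 0;
}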

To assess the usage of our parallel file systems, we run with Darshan
installed, which captures data from each MPI job (each job step within a
job). We do not have similar tools to determine how the network is being
used (e.g., how much bandwidth we need, which communication patterns dominate).
When I was at Myricom and we were releasing Myri-10G, I benchmarked several
ISV codes on 2G versus 10G. If I remember correctly, Fluent did not benefit
from the extra bandwidth, but PowerFlow benefited significantly.
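
Absent a Darshan-like tool for the fabric, the closest thing is a quick
hand-rolled probe. A sketch like the one below (message size and iteration
count are arbitrary choices) gives a rough per-rank all-to-all number to
compare against what the network should deliver:

/* Minimal MPI_Alltoall timing sketch -- the sort of quick probe one might use
 * to see how much bandwidth an all-to-all-heavy (e.g., FFT) job actually pulls.
 * The 1 MiB-per-peer message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int msg = 1 << 20;                        /* bytes sent to each peer */
    const int iters = 20;
    char *sendbuf = malloc((size_t)msg * nprocs);   /* contents don't matter   */
    char *recvbuf = malloc((size_t)msg * nprocs);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Alltoall(sendbuf, msg, MPI_CHAR,
                     recvbuf, msg, MPI_CHAR, MPI_COMM_WORLD);
    double t = (MPI_Wtime() - t0) / iters;

    if (rank == 0) {
        /* each rank ships roughly (nprocs - 1) * msg bytes per iteration */
        double gb = (double)msg * (nprocs - 1) / 1e9;
        printf("%d ranks, %d B/peer: %.4f s per alltoall, ~%.2f GB/s per rank\n",
               nprocs, msg, t, gb / t);
    }

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}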

My point is that "It depends" may not be a satisfying answer, but it is
realistic.


> > Another interesting topic is that nodes are becoming many-core - any
> > thoughts?
>
> Core counts are getting too high to be of use in HPC. High core-count
> processors sound great until you realize that all those cores are now
> competing for the same memory bandwidth and network bandwidth, neither of
> which increases with core count.
>
> Last April we were evaluating test systems from different vendors for a
> cluster purchase. One of our test users does a lot of CFD simulations
> that are very sensitive to memory bandwidth. While he was getting a 50%
> speedup on AMD compared to Intel (which makes sense, since the AMDs require
> 12 DIMM slots to be filled instead of Intel's 8), he asked us to consider
> servers with FEWER cores. Even with the AMDs, he was saturating the
> memory bandwidth before scaling to all the cores, causing his
> performance to plateau. Buying cheaper processors with lower core counts
> was better for him, since the savings would allow us to buy additional
> nodes, which would benefit him more.
>

We see this as well in DOE, especially when GPUs are doing a significant
amount of the work.
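
A quick way to see that plateau is a STREAM-style triad loop run with an
increasing thread count. The sketch below is illustrative only (the array
size is arbitrary, and it is not the official STREAM benchmark or the quoted
user's CFD code):

/* STREAM-style triad sketch. Re-run with increasing OMP_NUM_THREADS to see
 * where measured bandwidth stops scaling with core count. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t n = 1UL << 26;                 /* 64M doubles, 512 MiB/array */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    const double scalar = 3.0;

    #pragma omp parallel for
    for (size_t i = 0; i < n; i++) {            /* first-touch initialization */
        a[i] = 0.0; b[i] = 1.0; c[i] = 2.0;
    }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];            /* triad touches three arrays */
    double t = omp_get_wtime() - t0;

    printf("%d threads: ~%.1f GB/s\n", omp_get_max_threads(),
           3.0 * n * sizeof(double) / t / 1e9);

    free(a); free(b); free(c);
    return 0;
}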

Scott


> <snip>
> --
> Prentice
>
>
> On 1/16/24 5:19 PM, Mark Hahn wrote:
> > Hi all,
> > Just wondering if any of you have numbers (or experience) with
> > modern high-speed COTS ethernet.
> >
> > Latency mainly, but perhaps also message rate.  Also ease of use
> > with open-source products like OpenMPI, maybe Lustre?
> > Flexibility in configuring clusters in the >= 1k node range?
> >
> > We have a good idea of what to expect from Infiniband offerings,
> > and are familiar with scalable network topologies.
> > But vendors seem to think that high-end ethernet (100-400Gb) is
> > competitive...
> >
> > For instance, here's an excellent study of Cray/HPE Slingshot (non-COTS):
> > https://arxiv.org/pdf/2008.08886.pdf
> > (half rtt around 2 us, but this paper has great stuff about
> > congestion, etc)
> >
> > Yes, someone is sure to say "don't try characterizing all that stuff -
> > it's your application's performance that matters!"  Alas, we're a generic
> > "any kind of research computing" organization, so there are thousands
> > of apps
> > across all possible domains.
> >
> > Another interesting topic is that nodes are becoming many-core - any
> > thoughts?
> >
> > Alternatively, are there other places to ask? Reddit or something less
> > "greybeard"?
> >
> > thanks, mark hahn
> > McMaster U / SharcNET / ComputeOntario / DRI Alliance Canada
> >
> > PS: the snarky name "NVidiband" just occurred to me; too soon?
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>