From chris at csamuel.org  Mon Feb  3 01:26:03 2025
From: chris at csamuel.org (Chris Samuel)
Date: Sun, 2 Feb 2025 17:26:03 -0800
Subject: [Beowulf] Monitoring communication + processing time
In-Reply-To:
References:
Message-ID: <51f7eb99-c7ff-481d-989e-7aafcd41b2d6@csamuel.org>

On 15/1/25 5:04 pm, Alexandre Ferreira Ramos via Beowulf wrote:

> Does anyone have a hint about how we should proceed for this monitoring?

LLNL also has an MPI profiling library: https://github.com/LLNL/mpiP

I've not tried it myself, but I like the idea of it.

All the best,
Chris

From mdidomenico4 at gmail.com  Wed Feb  5 14:33:07 2025
From: mdidomenico4 at gmail.com (Michael DiDomenico)
Date: Wed, 5 Feb 2025 09:33:07 -0500
Subject: [Beowulf] malloc on filesystem
Message-ID:

this might sound like a bit of an oddity, but does anyone know if
there's a library out there that will let me override malloc calls to
memory and direct them to a filesystem instead? i.e. using the
filesystem as memory instead of ram for a program. ideally something
i can LD_PRELOAD on top of a static binary. understandably this is
generally a silly thing to do, but you know, users...

google is failing me, my search terms likely aren't right. i'm
looking through some of the older checkpointing codes at the moment.
maybe someone can shortcut my search

From stewart at serissa.com  Wed Feb  5 14:46:24 2025
From: stewart at serissa.com (Serissa)
Date: Wed, 5 Feb 2025 09:46:24 -0500
Subject: [Beowulf] malloc on filesystem
In-Reply-To:
References:
Message-ID: <509D751B-DEF0-4600-892A-4D96C7848080@serissa.com>

If you are willing to mmap the whole file, then dlmalloc can do this.
The issue is that it expects the storage pool to be accessible with
pointers. I am not aware of an allocator that uses function calls to
read and write its own metadata (so that you can abstract the
metadata), but if you find one I'd like the reference! In my previous
job we wanted such a thing so that host code could run an allocator
for GPU memory without load/store access to the memory being
allocated, and we couldn't find one. There were too many places in
dlmalloc that did pointer access, so it was too hard to figure out
which ones were accessing the storage pool and which were local.

-Larry

> On Feb 5, 2025, at 9:33 AM, Michael DiDomenico wrote:
>
> this might sound like a bit of an oddity, but does anyone know if
> there's a library out there that will let me override malloc calls to
> memory and direct them to a filesystem instead? i.e. using the
> filesystem as memory instead of ram for a program. ideally something
> i can LD_PRELOAD on top of a static binary. understandably this is
> generally a silly thing to do, but you know, users...
>
> google is failing me, my search terms likely aren't right. i'm
> looking through some of the older checkpointing codes at the moment.
> maybe someone can shortcut my search
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
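As a concrete illustration of what Michael and Larry are discussing, here is a
minimal sketch of an LD_PRELOAD shim that backs malloc() with an mmap()'d file
and hands out memory with a trivial bump allocator (free() is a no-op, so it
only suits short-lived or checkpoint-style runs). The file name, pool size and
FILEMALLOC_PATH environment variable are made up for the example; note also
that LD_PRELOAD only takes effect on dynamically linked binaries, so a truly
static binary would have to be relinked against something like this instead.

    /* filemalloc.c - sketch of an LD_PRELOAD shim that serves malloc()
     * from an mmap()'d file instead of anonymous RAM.  Trivial bump
     * allocator: free() is a no-op and realloc() always copies, so this
     * is a demonstration, not a production allocator.
     *
     * Build: gcc -shared -fPIC -o filemalloc.so filemalloc.c -lpthread
     * Use:   FILEMALLOC_PATH=/scratch/heap.img LD_PRELOAD=./filemalloc.so ./app
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define POOL_SIZE (1UL << 32)        /* 4 GiB backing file; adjust to taste */

    static char  *pool;                  /* start of the file-backed region */
    static size_t offset;                /* bump pointer into the pool      */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void pool_init(void)
    {
        const char *path = getenv("FILEMALLOC_PATH");
        if (!path) path = "/tmp/filemalloc.pool";   /* placeholder default */
        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) _exit(111);
        if (ftruncate(fd, POOL_SIZE) != 0) _exit(112);
        pool = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (pool == MAP_FAILED) _exit(113);
        close(fd);                       /* the mapping survives the close  */
    }

    void *malloc(size_t size)
    {
        void *p = NULL;
        pthread_mutex_lock(&lock);
        if (!pool) pool_init();
        size = (size + 15) & ~(size_t)15;        /* 16-byte alignment */
        if (offset + size <= POOL_SIZE) {
            p = pool + offset;
            offset += size;
        }
        pthread_mutex_unlock(&lock);
        return p;
    }

    void free(void *ptr) { (void)ptr; }  /* bump allocator never frees */

    void *calloc(size_t n, size_t sz)
    {
        void *p = malloc(n * sz);        /* no overflow check in this sketch */
        if (p) memset(p, 0, n * sz);
        return p;
    }

    void *realloc(void *ptr, size_t size)
    {
        void *p = malloc(size);
        if (p && ptr) memcpy(p, ptr, size);  /* may over-copy; demo only */
        return p;
    }

A more serious version would layer a real allocator such as dlmalloc over the
same file mapping, which is essentially the approach Larry describes.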
From hearnsj at gmail.com  Wed Feb  5 16:17:08 2025
From: hearnsj at gmail.com (John Hearns)
Date: Wed, 5 Feb 2025 16:17:08 +0000
Subject: [Beowulf] malloc on filesystem
In-Reply-To:
References:
Message-ID:

Is this any use? https://en.wikipedia.org/wiki/Zram

On Wed, 5 Feb 2025 at 15:50, Michael DiDomenico wrote:

> this might sound like a bit of an oddity, but does anyone know if
> there's a library out there that will let me override malloc calls to
> memory and direct them to a filesystem instead? i.e. using the
> filesystem as memory instead of ram for a program. ideally something
> i can LD_PRELOAD on top of a static binary. understandably this is
> generally a silly thing to do, but you know, users...
>
> google is failing me, my search terms likely aren't right. i'm
> looking through some of the older checkpointing codes at the moment.
> maybe someone can shortcut my search
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From jcownie at gmail.com  Thu Feb  6 11:17:59 2025
From: jcownie at gmail.com (Jim Cownie)
Date: Thu, 6 Feb 2025 11:17:59 +0000
Subject: [Beowulf] Monitoring communication + processing time
In-Reply-To: <51f7eb99-c7ff-481d-989e-7aafcd41b2d6@csamuel.org>
References: <51f7eb99-c7ff-481d-989e-7aafcd41b2d6@csamuel.org>
Message-ID:

There are a number of open-source MPI profiling libraries which Google
can no doubt find for you; as recommended below, mpiP looks sane
(though I haven't tried it myself).

Or, you can use the MPI Profiling interface to intercept MPI calls and
time them yourself, though this is in effect writing your own MPI
profiler, so it seems somewhat unnecessary. If you do go this route,
you should be able to do it as a separate add-on that doesn't require
any application source code changes.

MPI has (at my insistence :-)) had a profiling interface since MPI-1,
so this is not new technology.

-- Jim
James Cownie
Mob: +44 780 637 7146

> On 3 Feb 2025, at 01:26, Chris Samuel wrote:
>
> On 15/1/25 5:04 pm, Alexandre Ferreira Ramos via Beowulf wrote:
>
>> Does anyone have a hint about how we should proceed for this monitoring?
>
> LLNL also has an MPI profiling library: https://github.com/LLNL/mpiP
>
> I've not tried it myself, but I like the idea of it.
>
> All the best,
> Chris
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
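For reference, the interception Jim describes relies only on the standard PMPI
shadow entry points that every MPI implementation provides. A minimal sketch,
wrapping just MPI_Send and MPI_Recv and printing per-rank totals at finalize,
could look like the following; a real tool such as mpiP wraps the whole API
and records per-callsite statistics.

    /* mpitime.c - sketch of a PMPI interposer that accumulates the time a
     * rank spends in MPI_Send and MPI_Recv and reports it at finalize.
     *
     * Build: mpicc -shared -fPIC -o libmpitime.so mpitime.c
     */
    #include <mpi.h>
    #include <stdio.h>

    static double send_time, recv_time;          /* seconds, per rank */

    int MPI_Send(const void *buf, int count, MPI_Datatype dt, int dest,
                 int tag, MPI_Comm comm)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, dt, dest, tag, comm);
        send_time += MPI_Wtime() - t0;
        return rc;
    }

    int MPI_Recv(void *buf, int count, MPI_Datatype dt, int src,
                 int tag, MPI_Comm comm, MPI_Status *status)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Recv(buf, count, dt, src, tag, comm, status);
        recv_time += MPI_Wtime() - t0;
        return rc;
    }

    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: %.3f s in MPI_Send, %.3f s in MPI_Recv\n",
               rank, send_time, recv_time);
        return PMPI_Finalize();
    }

Built as a shared object with mpicc, it can be placed ahead of a dynamically
linked application via LD_PRELOAD, or linked in explicitly; either way no
application source changes are needed, just as Jim says.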
From Josef.Weidendorfer at in.tum.de  Thu Feb  6 14:56:12 2025
From: Josef.Weidendorfer at in.tum.de (Weidendorfer, Josef)
Date: Thu, 6 Feb 2025 15:56:12 +0100
Subject: [Beowulf] Monitoring communication + processing time
In-Reply-To:
References: <51f7eb99-c7ff-481d-989e-7aafcd41b2d6@csamuel.org>
Message-ID:

Have a look at the tools page of VI-HPS:

https://www.vi-hps.org/tools/tools.html

Most is open source, some is commercial. It includes mpiP and
OpenSpeedShop, but there are also Scalasca, TAU, Vampir, ...

Josef

> On 06.02.2025, at 12:17, Jim Cownie wrote:
>
> There are a number of open-source MPI profiling libraries which Google
> can no doubt find for you; as recommended below, mpiP looks sane
> (though I haven't tried it myself).
>
> Or, you can use the MPI Profiling interface to intercept MPI calls and
> time them yourself, though this is in effect writing your own MPI
> profiler, so it seems somewhat unnecessary. If you do go this route,
> you should be able to do it as a separate add-on that doesn't require
> any application source code changes.
>
> MPI has (at my insistence :-)) had a profiling interface since MPI-1,
> so this is not new technology.
>
> -- Jim
> James Cownie
> Mob: +44 780 637 7146
>
>> On 3 Feb 2025, at 01:26, Chris Samuel wrote:
>>
>> On 15/1/25 5:04 pm, Alexandre Ferreira Ramos via Beowulf wrote:
>>
>>> Does anyone have a hint about how we should proceed for this monitoring?
>>
>> LLNL also has an MPI profiling library: https://github.com/LLNL/mpiP
>>
>> I've not tried it myself, but I like the idea of it.
>>
>> All the best,
>> Chris
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf

From brice.goglin at gmail.com  Thu Feb 27 08:19:41 2025
From: brice.goglin at gmail.com (Brice Goglin)
Date: Thu, 27 Feb 2025 09:19:41 +0100
Subject: [Beowulf] MPI over RoCE?
Message-ID: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>

Hello

While meeting vendors to buy our next cluster, we got different
recommendations about the network for MPI. The cluster will likely be
about 100 nodes. Some vendors claim RoCE is enough to get <2us latency
and good bandwidth for such low numbers of nodes. Others say RoCE is
far behind IB for both latency and bandwidth and we likely need to get
IB if we care about network performance.

If anybody has tried MPI over RoCE on such a "small" cluster, what NICs
and switches did you use?

Also, is the configuration easy from the admin (installation) and users
(MPI options) points of view?

Thanks

Brice

From mattw at madmonks.org  Thu Feb 27 09:10:04 2025
From: mattw at madmonks.org (Matt Wallis)
Date: Thu, 27 Feb 2025 20:10:04 +1100
Subject: [Beowulf] MPI over RoCE?
In-Reply-To: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>
References: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>
Message-ID: <86958AFE-2D04-4E53-AAC3-AE54373F9C69@madmonks.org>

In my experience, RoCE will be just as fast as IB, if not faster,
inside a single Ethernet switch; it's when you go outside the switch
that you lose out. The trick has been finding NICs that are supported
natively by OFED.
I tend to still find the Mellanox NICs the most reliable and best
supported. Then the question is: if you're buying Mellanox NICs anyway,
why not go the whole hog, particularly as you may grow beyond a single
switch?

Matt.

> On 27 Feb 2025, at 19:19, Brice Goglin wrote:
>
> Hello
>
> While meeting vendors to buy our next cluster, we got different
> recommendations about the network for MPI. The cluster will likely be
> about 100 nodes. Some vendors claim RoCE is enough to get <2us latency
> and good bandwidth for such low numbers of nodes. Others say RoCE is
> far behind IB for both latency and bandwidth and we likely need to get
> IB if we care about network performance.
>
> If anybody has tried MPI over RoCE on such a "small" cluster, what NICs
> and switches did you use?
>
> Also, is the configuration easy from the admin (installation) and users
> (MPI options) points of view?
>
> Thanks
>
> Brice
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>

From prentice at ucar.edu  Thu Feb 27 17:28:11 2025
From: prentice at ucar.edu (Prentice Bisbal)
Date: Thu, 27 Feb 2025 12:28:11 -0500
Subject: [Beowulf] MPI over RoCE?
In-Reply-To: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>
References: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>
Message-ID: <8a32f2f1-9cb1-4940-b3d5-c2e16615755c@ucar.edu>

On 2/27/25 3:19 AM, Brice Goglin wrote:
> Hello
>
> While meeting vendors to buy our next cluster, we got different
> recommendations about the network for MPI. The cluster will likely be
> about 100 nodes. Some vendors claim RoCE is enough to get <2us latency
> and good bandwidth for such low numbers of nodes. Others say RoCE is
> far behind IB for both latency and bandwidth and we likely need to get
> IB if we care about network performance.
>
> If anybody has tried MPI over RoCE on such a "small" cluster, what NICs
> and switches did you use?
>
> Also, is the configuration easy from the admin (installation) and
> users (MPI options) points of view?
>

I hope this isn't a dumb question: do the Ethernet switches you're
looking at have crossbar switches inside them? I believe crossbar
switches are a requirement for IB, but are only found in
"higher-performance" Ethernet switches. IB isn't just about latency:
the crossbar switches allow for high bisection bandwidth, non-blocking
communication, etc.

--
Prentice

From brice.goglin at gmail.com  Fri Feb 28 09:01:29 2025
From: brice.goglin at gmail.com (Brice Goglin)
Date: Fri, 28 Feb 2025 10:01:29 +0100
Subject: [Beowulf] MPI over RoCE?
In-Reply-To: <8a32f2f1-9cb1-4940-b3d5-c2e16615755c@ucar.edu>
References: <475073c1-2d7a-45f8-bcfb-994e66a651bc@gmail.com>
 <8a32f2f1-9cb1-4940-b3d5-c2e16615755c@ucar.edu>
Message-ID:

On 27/02/2025 at 18:28, Prentice Bisbal wrote:
> On 2/27/25 3:19 AM, Brice Goglin wrote:
>> Hello
>>
>> While meeting vendors to buy our next cluster, we got different
>> recommendations about the network for MPI. The cluster will likely be
>> about 100 nodes. Some vendors claim RoCE is enough to get <2us
>> latency and good bandwidth for such low numbers of nodes. Others
>> say RoCE is far behind IB for both latency and bandwidth and we
>> likely need to get IB if we care about network performance.
>>
>> If anybody has tried MPI over RoCE on such a "small" cluster, what NICs
>> and switches did you use?
>>
>> Also, is the configuration easy from the admin (installation) and
>> users (MPI options) points of view?
>>
>
> I hope this isn't a dumb question: do the Ethernet switches you're
> looking at have crossbar switches inside them? I believe crossbar
> switches are a requirement for IB, but are only found in
> "higher-performance" Ethernet switches. IB isn't just about latency:
> the crossbar switches allow for high bisection bandwidth, non-blocking
> communication, etc.
>

I don't know, but that's a good question. I will ask the vendors.

Brice
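One way to settle the latency question independently of the vendors is to run
a small-message ping-pong between two nodes over each candidate fabric and
compare the numbers against the quoted <2us. The OSU micro-benchmarks
(osu_latency, osu_bw) do this carefully; the sketch below is a minimal
stand-in, with arbitrary iteration counts and an 8-byte payload.

    /* pingpong.c - minimal two-rank ping-pong to measure small-message
     * latency on whatever fabric the MPI library is using (RoCE, IB, TCP).
     *
     * Build: mpicc -O2 -o pingpong pingpong.c
     * Run:   one rank on each of two nodes.
     */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) {
            if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        enum { WARMUP = 1000, ITERS = 10000 };
        char buf[8] = {0};                       /* 8-byte payload */
        MPI_Status st;
        double t0 = 0.0;

        for (int i = 0; i < WARMUP + ITERS; i++) {
            if (i == WARMUP) {                   /* start timing after warm-up */
                MPI_Barrier(MPI_COMM_WORLD);
                t0 = MPI_Wtime();
            }
            if (rank == 0) {
                MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
            } else {
                MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
                MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - t0;

        if (rank == 0)                           /* half round-trip = one-way latency */
            printf("8-byte latency: %.2f us\n", elapsed / (2.0 * ITERS) * 1e6);

        MPI_Finalize();
        return 0;
    }

Run with one rank on each of two nodes, the half round-trip time it prints is
directly comparable between a RoCE and an IB configuration of the same cluster.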