[Beowulf] DMA Memory Mapping Question
patrick at myri.com
Wed Feb 21 20:06:50 PST 2007
Chris Samuel wrote:
> We occasionally get users who manage to use up all the DMA memory that is
> addressable by the Myrinet card through the Power5 hypervisor.
The IOMMU limit set by the hypervisor varies depending on the machine,
the hypervisor version and the phase of the moon. Sometimes, it's a
limit per PCI slot (i.e. per device), sometimes it is a limit for the
whole machine (which can be a virtual machine, that's one of the reasons
behind the hypervisor) and it's shared by all the devices. Sometimes, it's
reasonably large (1 or 2 GB), sometimes it is ridiculously small (256 MB).
The hypervisor does not make a lot of sense in an HPC environment, but it
would be non-trivial work to remove it on PPC.
> Through various firmware and driver tweaks (thanks to both IBM and Myrinet)
> we've gotten that limit up to almost 1GB and then we use an undocumented
> environment variable (GMPI_MAX_LOCKED_MBYTE) to say only use 248MB of that
> per process (as we've got 4 cores in each box), which we enforce through
> The problems went away. Or at least it did until just now. :-(
> The characteristic error we get is:
> : alloc_failed, not enough memory (Fatal Error)
> Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
> Now Myrinet can handle running out of DMA memory once a process is running,
> but when it starts it must be able to allocate a (fairly trivial) amount of
> DMA memory otherwise you get that fatal error.
GM does pipeline large messages with chunks of 1 MB, so you can progress
as long as you can register 1 MB at a time (you can think of
pathological deadlocking situations, but it's not the common case).
However, GM registers some buffers for Eager messages at init time. From
memory, it's on the order of 32 MB per process (constant, does not
depend on the size of the job). If you can't register that, there is
nothing you can do, so aborting is a good idea.
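
To illustrate the pipelining described above, here is a minimal Python
sketch (not GM code) of how a large message can be cut into 1 MB chunks so
that only one chunk needs to be registered at any time. The helper name and
the constant are made up for illustration; only the 1 MB chunk size comes
from the description above.

```python
CHUNK_BYTES = 1 << 20  # 1 MB pipeline chunk, as described above


def pipeline_chunks(message_bytes, chunk_bytes=CHUNK_BYTES):
    """Yield (offset, length) pairs covering a message in fixed-size chunks.

    Hypothetical helper: it only models how a large message is split up,
    not how GM actually registers or sends memory.
    """
    offset = 0
    while offset < message_bytes:
        length = min(chunk_bytes, message_bytes - offset)
        yield offset, length
        offset += length


# A message slightly larger than 5 MB is sent as six chunks; at no point
# does more than one chunk (at most 1 MB) need to be registered.
chunks = list(pipeline_chunks(5 * CHUNK_BYTES + 123))
```

Under this model the registration needed to make progress is bounded by the
chunk size, not by the message size, which is why a running process can
survive a tight DMA budget.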
If you limit registration per process, then I can think of one situation
that will hit the IOMMU limit: if a process dies abnormally
(segfault, killed, whatever), the GM port will be "shutting down" while
the outstanding messages are dropped. During this time, the memory is
still registered. If you start another process at that time, you will
effectively have more than 4 processes with registered memory, and it
may exceed the limit. A quick workaround would be to modify the MPICH-GM
init code to only try to open the first 4 GM ports. That will in effect
guarantee that only 4 processes can register memory at one time (the
latest release of GM provides 13 ports).
I see from your next post that it's not what happened. It could have :-)
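
The arithmetic behind that scenario can be sketched in a few lines of
Python. The 248 MB cap is the value quoted above; the 1024 MB limit is an
assumption that rounds "almost 1GB" up, so the real headroom is smaller.
The function names are invented for this example.

```python
IOMMU_LIMIT_MB = 1024     # assumption: "almost 1GB", rounded up
PER_PROCESS_CAP_MB = 248  # GMPI_MAX_LOCKED_MBYTE value from the quoted post


def worst_case_registered_mb(ports_with_memory, cap_mb=PER_PROCESS_CAP_MB):
    """Worst-case registered memory if every open port reaches its cap."""
    return ports_with_memory * cap_mb


def fits(ports_with_memory, limit_mb=IOMMU_LIMIT_MB):
    """True if the worst case for this many ports stays within the limit."""
    return worst_case_registered_mb(ports_with_memory) <= limit_mb


# Four live processes stay under the assumed limit; a fifth port that is
# still shutting down (memory still registered) can push past it.
four_ok = fits(4)
five_ok = fits(5)
```

Capping the number of ports that can be opened at 4 keeps the worst case at
the four-port figure regardless of how slowly a dead process's port drains.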
> Looking at the node I can confirm that there are only 3 user processes
> running, so what I am after is a way of determining how much of that DMA
> memory a process has allocated.
There is no handy way, but it would not be hard to add this info to the
output of gm_board_info. There are not many releases of GM these days.
Nevertheless, I will add it to the queue; it's simple enough to not be
considered a new feature.
> Oh - switching to the Myrinet MX drivers (which doesn't have this problem) is
> not an option, we have an awful lot of users, mostly (non-computer)
Actually, MX would not behave well in your environment: MX does not
pipeline large messages, it registers the whole message at once (MX
registration is much faster, and pipelining prevents overlap of
communication with computation). With 250 MB of DMA-able memory per
process, that would be the maximum message size you can send or receive.
We have plans to do something about that, but it's not at the top of the
queue. The right thing would be to get rid of the hypervisor (by the
way, the hypervisor makes the memory registration overhead much more
expensive), but it probably will never happen.
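
The difference between the two strategies can be modelled roughly as
follows. This is a simplified Python sketch, not GM or MX code: the function
names are hypothetical, the 1 MB chunk size is the GM figure given earlier,
and the buffers registered at init time are ignored.

```python
def peak_registration_mb(message_mb, pipelined, chunk_mb=1):
    """Peak memory registered for one message under each strategy.

    pipelined=True  models chunked registration (one chunk at a time).
    pipelined=False models registering the whole message at once.
    """
    if pipelined:
        return min(message_mb, chunk_mb)
    return message_mb


def can_transfer(message_mb, budget_mb, pipelined):
    """True if the message's peak registration fits in the DMA budget."""
    return peak_registration_mb(message_mb, pipelined) <= budget_mb


# With a 250 MB per-process budget, a 400 MB message fits only when
# registration is pipelined.
gm_style = can_transfer(400, 250, pipelined=True)
mx_style = can_transfer(400, 250, pipelined=False)
```

In this model the budget caps the message size only for whole-message
registration, which matches the 250 MB ceiling mentioned above.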
> scientists, who have their own codes and trying to persuade them to recompile
> would be very hard - which would be necessary as we've not been able to
> convince MPICH-GM to build shared libraries on Linux on Power with the IBM
> compilers. :-(
Time for dreaming about an MPI ABI :-)