[Beowulf] DMA Memory Mapping Question

Scott Atchley atchley at myri.com
Wed Feb 21 18:47:45 PST 2007


On Feb 21, 2007, at 7:45 PM, Chris Samuel wrote:

> Hi folks,
>
> We've got an IBM Power5 cluster running SLES9 and using the GM drivers.
>
> We occasionally get users who manage to use up all the DMA memory that is
> addressable by the Myrinet card through the Power5 hypervisor.
>
> Through various firmware and driver tweaks (thanks to both IBM and Myrinet)
> we've gotten that limit up to almost 1GB. We then use an undocumented
> environment variable (GMPI_MAX_LOCKED_MBYTE) to cap each process at 248MB
> of that (as we've got 4 cores in each box), which we enforce through Torque.
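As a quick check of the numbers above: 4 processes at 248 MB each pin 4 x 248 = 992 MB in total, just under the card's limit of almost 1GB. The small Python sketch below does that arithmetic. The helper name `fits_dma_budget` is hypothetical, and the 1024 MB limit is an assumption, since the exact ceiling is only described as "almost 1GB"; substitute the real figure for your firmware and driver setup.

```python
# Hypothetical helper: check that a per-process GMPI_MAX_LOCKED_MBYTE cap,
# multiplied by the number of MPI processes per node, fits under the
# DMA-addressable limit of the card. The limit is an input; 1024 MB below
# is an assumption, since the post only says "almost 1GB".

def fits_dma_budget(cap_mb: int, procs_per_node: int, limit_mb: int) -> tuple[bool, int]:
    """Return (fits, headroom_mb) for procs_per_node processes each pinning cap_mb."""
    total_mb = cap_mb * procs_per_node
    return total_mb <= limit_mb, limit_mb - total_mb

ok, headroom = fits_dma_budget(cap_mb=248, procs_per_node=4, limit_mb=1024)
print(ok, headroom)  # True 32
```

The headroom is whatever remains for anything else on the node that pins memory through the card; the cap only protects that headroom if every process actually honours it.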
>
> The problems went away. Or at least they did until just now. :-(
>
> The characteristic error we get is:
>
> [13]: alloc_failed, not enough memory (Fatal Error)
>         Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
>
> Now Myrinet can handle running out of DMA memory once a process is running,
> but when it starts it must be able to allocate a (fairly trivial) amount of
> DMA memory; otherwise you get that fatal error.
>
> Looking at the node I can confirm that there are only 3 user processes
> running, so what I am after is a way of determining how much of that DMA
> memory a process has allocated.
>
> I looked at /proc/${PID}/maps and saw this:
>
> 40028000-40029000 r--s 00002000 00:0c 8483                               /dev/gm0
>
> which looks like a memory mapping to me, but one of just 1,000 bytes.
>
> Does anyone have any ideas at all?

Isn't this in hex? If so, 0x40029000 - 0x40028000 = 0x1000, which is 4096
bytes. I do not use GM much and I do not know what this mapping is. I just
loaded GM on one node and, with no GM processes running except the mapper,
I have a similar entry (at a different address, but also 0x1000 bytes long).
I would guess this is to allow GM and the mapper to communicate. I will
check internally.
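To make the hex arithmetic concrete, here is a small Python sketch that totals the sizes of all mappings of a given device in a `/proc/<pid>/maps` listing, using the standard field order (address range, perms, offset, dev, inode, pathname). The function names are hypothetical. Note the caveat: this only measures what appears in the maps file; whether that reflects the DMA memory GM has registered for the process is an assumption not established in this thread.

```python
# Sketch: sum the sizes of mappings of one device (e.g. /dev/gm0) as listed
# in /proc/<pid>/maps. Addresses are hexadecimal, so a range
# 40028000-40029000 is 0x1000 = 4096 bytes.
# Caveat (assumption): this counts only mmap'd regions shown in the maps
# file, not necessarily the DMA memory GM has pinned for the process.

def mapped_bytes(maps_text: str, device: str = "/dev/gm0") -> int:
    total = 0
    for line in maps_text.splitlines():
        # Fields: address perms offset dev inode [pathname]; keep pathname whole.
        fields = line.split(None, 5)
        if len(fields) < 6 or fields[5].strip() != device:
            continue
        start, end = (int(x, 16) for x in fields[0].split("-"))
        total += end - start
    return total

def mapped_bytes_for_pid(pid: int, device: str = "/dev/gm0") -> int:
    # Reads the live maps file for a running process (Linux only).
    with open(f"/proc/{pid}/maps") as f:
        return mapped_bytes(f.read(), device)

sample = "40028000-40029000 r--s 00002000 00:0c 8483                               /dev/gm0\n"
print(mapped_bytes(sample))  # 4096
```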

> Oh, switching to the Myrinet MX drivers (which don't have this problem) is
> not an option. We have an awful lot of users, mostly (non-computer)
> scientists, who have their own codes, and persuading them to recompile would
> be very hard. Recompiling would be necessary because we've not been able to
> convince MPICH-GM to build shared libraries on Linux on Power with the IBM
> compilers. :-(
>
> cheers,
> Chris

I am sorry you have not had success getting MPICH-GM to build dynamic
libraries. Have you emailed Myricom help?

Regards,

Scott


