[Beowulf] DMA Memory Mapping Question


Scott Atchley atchley at myri.com
Wed Feb 21 18:47:45 PST 2007


On Feb 21, 2007, at 7:45 PM, Chris Samuel wrote:

> Hi folks,
>
> We've got an IBM Power5 cluster running SLES9 and using the GM drivers.
>
> We occasionally get users who manage to use up all the DMA memory that is
> addressable by the Myrinet card through the Power5 hypervisor.
>
> Through various firmware and driver tweaks (thanks to both IBM and Myrinet)
> we've gotten that limit up to almost 1GB and then we use an undocumented
> environment variable (GMPI_MAX_LOCKED_MBYTE) to say only use 248MB of that
> per process (as we've got 4 cores in each box), which we enforce through
> Torque.
>
> The problems went away.  Or at least they did until just now. :-(
>
> The characteristic error we get is:
>
> [13]: alloc_failed, not enough memory (Fatal Error)
>         Context: <(gmpi_init) gmpi_dma_alloc: dma_recv buffers>
>
> Now Myrinet can handle running out of DMA memory once a process is running,
> but when it starts it must be able to allocate a (fairly trivial) amount of
> DMA memory otherwise you get that fatal error.
>
> Looking at the node I can confirm that there are only 3 user processes
> running, so what I am after is a way of determining how much of that DMA
> memory a process has allocated.
>
> I looked at /proc/${PID}/maps and saw this:
>
> 40028000-40029000 r--s 00002000 00:0c 8483                               /dev/gm0
>
> which to me looks like a memory mapping, but to my eyes that looks like just
> 1,000 bytes..
>
> Does anyone have any ideas at all?

Isn't this in hex? If so, the mapping is 0x40029000 - 0x40028000 =
0x1000, which is 4096 bytes rather than 1,000. I do not use GM much
and I do not know what this mapping is. I just loaded GM on one node
and, with no GM processes running except the mapper, I see a similar
entry (at a different address, but also 0x1000 bytes long). I would
guess it is there to allow GM and the mapper to communicate. I will
check internally.
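
For anyone who wants to repeat the arithmetic across a whole process, here
is a rough Python sketch that parses /proc/<pid>/maps and adds up the sizes
of the entries backed by /dev/gm0. The function names (parse_maps_line,
device_mapped_bytes) are made up for this example and are not part of GM.
Note the assumption: this only reports what is mapped from the device into
the process's address space. I do not know that this corresponds to the DMA
memory GM has registered for the process, so treat the number as a hint
rather than an answer.

```python
import re
import sys

# One line of /proc/<pid>/maps has the form:
#   start-end perms offset dev inode [path]
# with start, end and offset written in hexadecimal.
MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+([0-9a-f]+)\s+(\S+)\s+(\d+)\s*(.*)$"
)


def parse_maps_line(line):
    """Return size (bytes), permissions and path for one maps line, or None."""
    m = MAPS_LINE.match(line.strip())
    if m is None:
        return None
    start = int(m.group(1), 16)
    end = int(m.group(2), 16)
    return {"size": end - start, "perms": m.group(3), "path": m.group(7)}


def device_mapped_bytes(lines, device="/dev/gm0"):
    """Sum the sizes of all mappings whose backing path equals `device`."""
    total = 0
    for line in lines:
        entry = parse_maps_line(line)
        if entry is not None and entry["path"] == device:
            total += entry["size"]
    return total


if __name__ == "__main__":
    # Usage: python gm_maps.py <pid>   (defaults to the current process)
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    with open(f"/proc/{pid}/maps") as fh:
        print(device_mapped_bytes(fh))
```

Fed the line quoted above, parse_maps_line gives a size of 0x1000, that is,
4096 bytes.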

> Oh - switching to the Myrinet MX drivers (which don't have this problem) is
> not an option, we have an awful lot of users, mostly (non-computer)
> scientists, who have their own codes and trying to persuade them to recompile
> would be very hard - which would be necessary as we've not been able to
> convince MPICH-GM to build shared libraries on Linux on Power with the IBM
> compilers. :-(
>
> cheers,
> Chris

I am sorry you have not had success getting MPICH-GM to build shared
libraries. Have you sent email to Myricom help?

Regards,

Scott


