[Beowulf] DMA Memory Mapping Question
csamuel at vpac.org
Wed Feb 21 22:16:27 PST 2007
On Thu, 22 Feb 2007, Patrick Geoffray wrote:
> Hi Chris,
> Chris Samuel wrote:
> > We occasionally get users who manage to use up all the DMA memory that is
> > addressable by the Myrinet card through the Power5 hypervisor.
> The IOMMU limit set by the hypervisor varies depending on the machine,
> the hypervisor version and the phase of the moon.
> Sometimes, it's a limit per PCI slot (i.e. per device), sometimes it is a
> limit for the whole machine (can be a virtual machine, that's one of the
> reasons behind the hypervisor) and it's shared by all the devices.
Fortunately on our systems we run just a single LPAR per physical compute
node, so it comes down to the firmware revision, whether or not the
"superslot" option is enabled and whether or not your Myrinet card is in
that superslot. Of course ours weren't, so we had to move them all!
> Sometimes, it's reasonably large (1 or 2 GB), sometimes it is ridiculously
> small (256 MB).
We started off at 256MB; the superslot option knocked that up to about 1GB, but
the driver reserves a percentage, so that was down to around 700-800MB.
Tweaking that percentage in the driver improves it to around 900MB for MPI.
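
The effect of the driver's reservation can be sketched numerically. This is
only a minimal illustration, assuming the reservation is a flat percentage of
the DMA window. The function name and the 25% and 12% values are hypothetical,
picked because they land near the 700-800MB and 900MB figures above; they are
not actual GM driver settings.

```python
def usable_dma_mb(window_mb: float, reserve_percent: float) -> float:
    """Return the DMA-able memory (MB) left after the driver's reservation.

    Assumes the driver holds back a flat percentage of the window.
    """
    if not 0 <= reserve_percent <= 100:
        raise ValueError("reserve_percent must be between 0 and 100")
    return window_mb * (1 - reserve_percent / 100)


window_mb = 1024  # "about 1GB" with the superslot option enabled

# Hypothetical reservation percentages, for illustration only.
for reserve_percent in (25, 12):
    usable = usable_dma_mb(window_mb, reserve_percent)
    print(f"{reserve_percent}% reserved: {usable:.0f} MB usable for MPI")
```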
> The hypervisor does not make a lot of sense in an HPC environment, but it
> would be non-trivial work to remove it on PPC.
IBM say that Power5 cannot run without a hypervisor. I happen to know,
however, that the first O/S that was brought up on Power5 was Linux and that
was because they could get it to run on the bare metal, whereas AIX wouldn't
work until they got the hypervisor running. The folks at IBM's LTC in
Canberra have argued on our side, but didn't win.
Power4 can be run without a hypervisor though.
> I see from your next post that it's not what happened. It could have :-)
Useful details though, thanks!
> > Looking at the node I can confirm that there are only 3 user processes
> > running, so what I am after is a way of determining how much of that DMA
> > memory a process has allocated.
> There is no handy way, but it would not be hard to add this info to the
> output of gm_board_info. There are not many releases of GM these days.
> Nevertheless, I will add it to the queue, it's simple enough to not be
> considered a new feature.
Oh, OK, we had been told that it wasn't appropriate to go into GM. Thanks!
> > Oh - switching to the Myrinet MX drivers (which don't have this
> > problem) is not an option, we have an awful lot of users, mostly
> > (non-computer)
> Actually, MX would not behave well in your environment: MX does not
> pipeline large messages, it registers the whole message at once (MX
> registration is much faster, and pipelining prevents overlap of
> communication with computation). With 250 MB of DMA-able memory per
> process, that would be the maximum message size you can send or receive.
Very useful to know, thanks!
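
As a rough check on that 250 MB figure, here is a minimal sketch, assuming the
usable DMA memory is split evenly between the processes on a node (an
assumption; the thread does not say how it is actually divided). The function
names are hypothetical, and the 750 MB input is simply the midpoint of the
700-800MB range mentioned earlier.

```python
def per_process_dma_mb(usable_mb: float, processes: int) -> float:
    """Return each process's share of DMA-able memory, assuming an even split."""
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return usable_mb / processes


def fits_whole_message(message_mb: float, budget_mb: float) -> bool:
    """True if a message registered in one piece fits in the DMA budget."""
    return message_mb <= budget_mb


# 3 user processes per node, roughly 750 MB usable (midpoint of 700-800MB).
budget_mb = per_process_dma_mb(750, 3)
print(f"Per-process DMA budget: {budget_mb:.0f} MB")

# Without pipelining, the whole message must be registered at once.
for message_mb in (100, 300):
    verdict = "fits" if fits_whole_message(message_mb, budget_mb) else "too large"
    print(f"{message_mb} MB message: {verdict}")
```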
> We have plans to do something about that, but it's not at the top of the
> queue. The right thing would be to get rid of the hypervisor (by the
> way, the hypervisor makes the memory registration overhead much more
> expensive), but it probably will never happen.
No, IBM won't do that. Power4 is the most recent platform that will
(apparently) run without the Hypervisor. :-(
> > scientists, who have their own codes and trying to persuade them to
> > recompile would be very hard - which would be necessary as we've not been
> > able to convince MPICH-GM to build shared libraries on Linux on Power
> > with the IBM compilers. :-(
> Time for dreaming about an MPI ABI :-)
Just don't get too attached to it.
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia