[Beowulf] getting a phi, dammit.

James Cownie jcownie at cantab.net
Thu Mar 7 11:55:13 PST 2013

On 7 Mar 2013, at 17:29, Vincent Diepeveen wrote:

> On Mar 6, 2013, at 9:42 PM, James Cownie wrote:
>> On 6 Mar 2013, at 06:00, Mark Hahn wrote:
>>>> The issue here is that because we offer 8GB of memory on the cards, some
>>>> BIOSes are unable to map all of it through the PCI either due to bugs or
>>>> failure to support so much memory. This is not the only people suffering
>>> interesting.  but it seems like there are quite a few cards out there
>>> with 4-6GB (admittedly, mostly higher-end workstation/gp-gpu cards.)
>>> is this issue a bigger deal for Phi than the Nvidia family?
>>> is it more critical for using Phi in offload mode?
>> I think this was answered by Brice in another message. We map all of the memory
>> through the PCI, whereas many other people only map a smaller buffer, and therefore
>> have to do additional copies.
> James, not really following exactly what you mean by 'through the PCI'.
I mean that all of the memory on the card can be seen from the host via the PCI, and,
therefore, that PCI transfers to the memory on the card can go directly to the final destination.

The alternative would be to map a smaller window of memory on the card for PCI transfers
(thus avoiding the large PCI aperture issue, which is where we came into this), but that would
require that, to move data to/from an arbitrary memory location on the card, you'd have to DMA it
across the PCI to the buffer space, then copy it to the final destination (or the reverse for
transferring to the host, of course).
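
To make the difference concrete, here is a toy model of the two schemes in Python. It is only a
sketch: the card is a plain bytearray, a "DMA" is a slice assignment, and the function names, the
window offset and the window size are all made up for illustration. It says nothing about how
any real driver is written; it only counts how many bytes get written into card memory.

```python
def copy_direct(host, card, dst):
    """All card memory is PCI-visible: one transfer lands at the final address."""
    card[dst:dst + len(host)] = host          # models a single DMA across the PCI
    return len(host)                          # bytes written into card memory


def copy_via_window(host, card, dst, win_off, win_size):
    """Only a small window is PCI-visible: transfer into it, then copy on-card."""
    written = 0
    for i in range(0, len(host), win_size):
        chunk = host[i:i + win_size]
        n = len(chunk)
        card[win_off:win_off + n] = chunk                     # DMA into the window
        card[dst + i:dst + i + n] = card[win_off:win_off + n]  # the extra on-card copy
        written += 2 * n
    return written


if __name__ == "__main__":
    data = bytes(range(256)) * 4              # 1024 bytes of host data
    card_a = bytearray(4096)
    card_b = bytearray(4096)
    print("direct:", copy_direct(data, card_a, 1024), "bytes written on card")
    print("window:", copy_via_window(data, card_b, 1024, 0, 256), "bytes written on card")
```

Both calls leave the same data at the same place on the card; the windowed version just writes
every byte twice (and reads it once more) to get it there.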

> If you do memory through the pci, isn't that factor 10+ worse in bandwidth than when using device RAM?
Yes, of course. I'm not advocating doing this at user level; rather, we're discussing the underlying
mechanisms used for the PCI copying that supports the transfers of data when the programmer has
requested them via whatever mechanism you choose to use.

> What matters is how much RAM you can allocate on the device for your threads of course.
Absolutely, which is why we put as much memory as we can on the card.

> Anything you ship through that PCI is going to be that slow in terms of bandwidth,
> that you just do not want to do that and really want to limit it.
And again, absolutely!

> If you transfer data from HOST (the cpu's) to the GPU, then AMD and Nvidia gpgpu cards can do that
> without stopping the gpu cores from calculation. So it happens in background. In this manner you need of
> course a limited buffer.
I am not an NVIDIA or AMD expert, but it seems to me that you must have to copy the data from the
buffer on the card that was PCI mapped and accessible to the host to where it finally wants to reside.
And that is the "extra copy" I mentioned originally. Even if you do that with a block-copy engine, it still
has to eat bandwidth while you're doing it, even if it is asynchronous to the FPUs.

You can, of course, do asynchronous transfers on the MIC too.

> A problem some report with OpenCL is that if they by accident overallocate the amount of RAM they want to
> use on the gpu, that it is allocating Host memory, which as said before is pretty slow. Really more than factor 10.

Right, I think we're (perhaps surprisingly :-) ) in violent agreement.

-- Jim
James Cownie <jcownie at cantab.net>
