[Beowulf] What class of PDEs/numerical schemes suitable for GPU clusters
hahn at mcmaster.ca
Thu Nov 20 08:23:31 PST 2008
> [shameless plug]
> A project I have spent some time with is showing 117x on a 3-GPU machine over
> a single core of a host machine (3.0 GHz Opteron 2222). The code is
> mpihmmer, and the GPU version of it. See http://www.mpihmmer.org for more
> details. Ping me offline if you need more info.
> [/shameless plug]
I'm happy for you, but to me, you're stacking the deck by comparing to a
quite old CPU. you could break out the prices directly, but comparing 3x
GPU (modern? sounds like pci-express at least) to a current entry-level
cluster node (8 core2/shanghai cores at 2.4-3.4 GHz) would be more appropriate.
at the VERY least, honesty requires comparing one GPU against all the cores
in a current CPU chip. with your numbers, I expect that would change the
speedup from 117 to around 15. still very respectable.
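a rough sketch of that kind of renormalization, in Python. it assumes perfectly linear scaling across GPUs and across CPU cores, and it ignores any per-core difference between the Opteron 2222 and current chips; both are idealizations, and the function name and the two comparison cases are only illustrative:

```python
def normalized_speedup(reported, gpus_used, gpus_compared, cpu_cores):
    """Rescale a speedup measured with `gpus_used` GPUs against ONE CPU core
    to `gpus_compared` GPUs against `cpu_cores` cores.

    Assumption: linear scaling on both the GPU and the CPU side.
    """
    per_gpu_per_core = reported / gpus_used
    return per_gpu_per_core * gpus_compared / cpu_cores


reported = 117.0  # 3 GPUs vs. a single core, as quoted above

# all 3 GPUs against a whole 8-core node
node = normalized_speedup(reported, gpus_used=3, gpus_compared=3, cpu_cores=8)

# one GPU against one quad-core chip
chip = normalized_speedup(reported, gpus_used=3, gpus_compared=1, cpu_cores=4)

print(f"3 GPUs vs 8 cores: {node:.1f}x")
print(f"1 GPU  vs 4 cores: {chip:.1f}x")
```

either way the headline number drops by roughly an order of magnitude once the CPU side is allowed to use all of its cores.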
I apologize for not RTFcode, but does the host version of hmmer you're
comparing with vectorize using SSE?
>> or more generally: fairly small data, accessed data-parallel or with very
>> regular and limited sharing, with high work-per-data.
> ... not small data. You can stream data.
can you sustain your 117x speedup if your data is in host memory?
by small, I mainly meant small enough to fit in the on-card GPU memory,
which is fast but often more limited than host memory.
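to make the question concrete, here is a back-of-envelope model in Python of what host-to-card transfers do to a kernel-only speedup. every number in it (CPU run time, kernel speedup, link bandwidth, data sizes) is a made-up placeholder, not a measurement of mpihmmer or of any particular card, and the model ignores latency and any host-side overhead:

```python
def effective_speedup(cpu_time, kernel_speedup, bytes_moved,
                      link_bytes_per_s, overlap=False):
    """Speedup seen end to end when data must cross the host<->GPU link.

    cpu_time         -- seconds the CPU-only run takes
    kernel_speedup   -- speedup when the data is already on the card
    bytes_moved      -- bytes transferred over the link
    link_bytes_per_s -- sustained link bandwidth (an assumed figure)
    overlap          -- True if transfers are fully hidden behind compute
                        (streaming); False if they are serialized
    """
    gpu_compute = cpu_time / kernel_speedup
    transfer = bytes_moved / link_bytes_per_s
    if overlap:
        gpu_total = max(gpu_compute, transfer)
    else:
        gpu_total = gpu_compute + transfer
    return cpu_time / gpu_total


# illustrative placeholders only
cpu_time = 100.0        # seconds
kernel_speedup = 50.0   # with data resident on the card
link = 2e9              # bytes/s, assumed sustained bandwidth

for gigabytes in (0.1, 1, 4, 16):
    nbytes = gigabytes * 1e9
    serial = effective_speedup(cpu_time, kernel_speedup, nbytes, link)
    streamed = effective_speedup(cpu_time, kernel_speedup, nbytes, link,
                                 overlap=True)
    print(f"{gigabytes:>5} GB: serialized {serial:5.1f}x, "
          f"streamed {streamed:5.1f}x")
```

in this model streaming keeps the full speedup only while the transfer time stays below the compute time, which is just the "high work-per-data" condition again.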
sidebar: it's interesting that ram is incredibly cheap these days,
and we typically spec a middle-of-the-road machine at 2GB/core.
even 4GB/core is not much more expensive, though to be honest,
the number of users who need that much is fairly small.
>> GP-GPU tools are currently immature, and IMO the hardware probably needs a
>> generation of generalization before it becomes really widely used.
> Hrmm... Cuda is pretty good. Still needs some polish, but people can use
> it, and are generating real apps from it. We are seeing pretty wide use ...
> I guess the issue is what one defines as "wide".
Cuda is NV-only, and forces the programmer to face a lot of limits and
weaknesses. at least I'm told so by our Cuda users - things like having
to re-jigger code to avoid running out of registers. from my perspective,
a random science prof is going to be fairly put off by that sort of thing
unless the workload is really impossible to do otherwise. (compared to
the traditional cluster+MPI approach, which is portable, scalable and
at least short-term future-proof.)