[Beowulf] GPU boards and cluster servers

Vincent Diepeveen diep at xs4all.nl
Wed Sep 10 16:27:35 PDT 2008


John,

I'd go for the AMD option.

Think about it: more than 4x more cache per stream processor. AMD has 64
of them, each doing 5 instructions a cycle or so (on the order of
64 x 5 = 320 operations per cycle), versus Nvidia's 240 simpler ones.

Seymour Cray's old rule of balancing compute against memory favors the
AMD design over Nvidia's here.

Additionally, it will be easier to find documentation and information
about AMD: as a processor manufacturer they are used to giving out
information about their hardware. Nvidia still has to learn that.

As for speed: of course today Nvidia's new GPU is faster, by the end of
this year it may be AMD, and next year who knows; usually it's a coin
toss each generation.

You can assume each newer GPU to be faster. Adding cores is relatively
easy for the GPU guys, in contrast to CPUs.

In case you plan to implement an algorithm that's not embarrassingly
parallel, Nvidia has a problem that AMD doesn't: it has two layers of
parallelism versus AMD's single one. AFAIK on AMD/ATI you've got 64
stream processors that each get the same instruction stream, just like
one block on Nvidia; but Nvidia additionally has a grid of blocks on top
of that. That means you also have to design a separate parallel
algorithm between the blocks, which is different from the parallelism of
just 64 stream processors executing instructions 5 units at a time. A
sketch of that two-level launch follows below.
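Roughly like this in CUDA (a minimal sketch, not code from any real
application; the kernel, sizes and names are made up for illustration).
Every thread combines its block index (grid level) with its thread index
(block level) into a global index:

// Minimal sketch of Nvidia's two-level parallelism (illustrative only).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // grid level: which block; block level: which thread in the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;  // independent work, so the split is harmless here
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    int threadsPerBlock = 256;                                 // one "block"
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // the grid
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    return 0;
}

Within one block the 256 threads can cooperate through shared memory and
__syncthreads(); between the blocks of the grid you get nothing of the
sort inside a kernel, so non-embarrassingly-parallel work needs that
second layer of algorithm design.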

Additionally, debugging a grid of blocks is going to be tougher than
debugging one block. If you have one block where everything executes the
same code at the same time, that's reasonably deterministic (though
memory writes to the same address aren't deterministic, in case you plan
to do those; see the sketch below).
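To show what I mean with those writes (again a made-up sketch; atomicAdd
is the real CUDA intrinsic, the rest is illustration): 256 threads
storing to one address race, so the surviving value differs per run,
while the atomic version always comes out the same:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void racy(int *out)
{
    *out = threadIdx.x;   // last writer wins; which one varies per run
}

__global__ void atomic_sum(int *out)
{
    atomicAdd(out, 1);    // serialized updates: always adds blockDim.x total
}

int main()
{
    int *d, h;
    cudaMalloc(&d, sizeof(int));

    cudaMemset(d, 0, sizeof(int));
    racy<<<1, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("racy result: %d (may differ between runs)\n", h);

    cudaMemset(d, 0, sizeof(int));
    atomic_sum<<<1, 256>>>(d);
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("atomic result: %d (always 256)\n", h);

    cudaFree(d);
    return 0;
}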

Think about it: 4x more cache per stream processor (assuming the cards
have the same total amount of cache and potential, which averaged over a
few years of time will be the case). That is crucial for FFT-type
workloads, which keep reusing the same data within each stage.
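Why the on-chip memory matters there, sketched on the Nvidia side with
one radix-2 butterfly stage done out of shared memory (the twiddle
multiplication is left out to keep it short; names and sizes are made
up, this is not a full FFT): every value fetched once from slow device
memory gets reused instead of refetched.

#include <cuda_runtime.h>

#define N 512   // points per block; must fit in shared memory

__global__ void butterfly_stage(float2 *data)
{
    __shared__ float2 buf[N];
    int t = threadIdx.x;                 // N/2 threads, one butterfly each
    buf[t]         = data[blockIdx.x * N + t];
    buf[t + N / 2] = data[blockIdx.x * N + t + N / 2];
    __syncthreads();

    // one stride-N/2 butterfly per thread, entirely out of on-chip memory
    float2 a = buf[t];
    float2 b = buf[t + N / 2];
    buf[t]         = make_float2(a.x + b.x, a.y + b.y);
    buf[t + N / 2] = make_float2(a.x - b.x, a.y - b.y);
    __syncthreads();

    data[blockIdx.x * N + t]         = buf[t];
    data[blockIdx.x * N + t + N / 2] = buf[t + N / 2];
}

int main()
{
    const int blocks = 4;
    float2 *d;
    cudaMalloc(&d, blocks * N * sizeof(float2));
    cudaMemset(d, 0, blocks * N * sizeof(float2));
    butterfly_stage<<<blocks, N / 2>>>(d);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}

The more on-chip memory per processor, the bigger the transform you can
keep resident per stage before you're forced back out to device memory.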

Vincent



