[Beowulf] GPU based cluster for machine learning purposes

Thu Apr 10 07:07:40 PDT 2014

On 4/10/14 5:28 AM, "Piotr Król" <pietrushnic at gmail.com> wrote:

>On Thu, Apr 10, 2014 at 01:44:30AM -0400, Mark Hahn wrote:
>> >I'm considering proof of concept Beowulf cluster build for machine
>> >learning purposes.
>> 
>> you can't go wrong using cheap/PC/commodity parts.  you'll also get the
>> easiest access to tools/distros/etc.
>> 
>
>I'm concerned about cluster size; I would like to keep it as small as
>possible. Probably some Mini/Nano-ITX board would be good enough to beat
>Jetson TK1. I wonder about the price of the whole setup and how it
>compares with Jetson.

I buy a lot of Mini-ITX boards these days as embedded controllers.
They're no ball of fire speed-wise (dual Atom 1.9GHz is typical).  That
said, they're cheap and easy to use.  One thing you can get caught on is
that the very lowest cost ones have limited conventional connectors for
peripherals (e.g. USB and serial ports). The ports come out to headers on
the board, and you're expected to provide a suitable jumper to the actual
end-user connector (e.g. a submini D for serial ports or a USB A female).
Things like power switches are not included.

By the time you buy a power supply and add memory, I/O shields, and a
vestigial chassis, your little $50 motherboard is now a $300 computer.

For a prototyping cluster, small size isn't often a real driver (unless
you're trying to pack it into a small box for some other reason: lunch box
beowulf clusters that fit under a plane seat).  Going to a more
conventional (slightly larger) consumer oriented motherboard and an
inexpensive consumer oriented power supply might actually give you better
bang for the buck. 

Using the "double sided foam tape it down to baking sheets" approach makes
pretty much any mobo and power supply easy to mount. It's not dense, but
it is cheap and fast.

>
>> >In short I need as good as possible double precision matrix
>> >multiplication performance with small power consumption and size.
>> 
>> TK1 appears to be SP-oriented (not surprisingly).  it's a little unclear
>> what its power dissipation is - I'd guess something in the 20W range for
>> linpack.
>> 
>> >Taking matrix multiplication into consideration I thought that GPU is
>> >natural choice.
>> 
>> well, maybe.  you always save power by operating more units at lower
>> clock, and GPU tends to embrace this approach.  it's not like GPUs have
>> some magically more efficient circuits otherwise.  but it's probably
>> worth looking at the gpu-linpack performance/watt from AMD's APU
>> options.  (though they contain higher-performance CPU and memory support
>> than TK1.)
>> 
>> 
>Very good point! Following your AMD APU advice I found this article:
>http://www.anandtech.com/show/7711/floating-point-peak-performance-of-kaveri-and-other-recent-amd-and-intel-chips
>I will try to rethink my configuration using AMD APU + Mini/Nano-ITX
>board and will see if I can get better result considering
>performance/price ratio.

For virtually ALL computational applications, the most MIPS per unit of
currency will come from a single big multicore computer in consumer trim.

Mini-ITX is definitely not good on a MIPS/$ basis.  It is ok on a
MIPS/Watt basis, but as others have pointed out, the CPU power consumption
is actually a small part of the overall picture.  Memory and external
interfaces also consume significant power, and they're essentially the
same, regardless of form factor.

What Mini-ITX is great for is "physically small size", "very low standby
power", "onboard 12V DC/DC converter", "low power dissipation so no fan is
needed" (the last of these comes at the cost of "low MIPS").

OTOH, if the goal is "low cost experimental cluster" then Mini-ITX can be a
good way to go.  You can buy 5 motherboards and all the stuff you need for
<$1000, maybe <$500.

>
>> >I'm open to any suggestions, even if it means changing everything in
>> >this build :)
>> 
>> IMO, you can learn everything you need to learn from 4-8 low-end PCs.
>> there are certainly power differences versus an arm+low-end-gpu board
>> like this, but since this device delivers pretty much token gflops,
>> you might consider just using a raspberry pi or beaglebone if you have
>> your heart set on avoiding the PC market.
>
>I considered RPi and BeagleBone. I measured performance on RPi and got 68
>DP MFLOPS after overclocking.

Overclocking and cluster computing don't go together very well.  Clusters
are sufficiently complex beasts that you don't need the additional
failure/flakiness/thermal management hassles that come from overclocking.
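
For reference, DP MFLOPS figures like the one quoted above are usually derived from a timed double precision matrix multiply, counting 2*n^3 floating point operations for an n x n product. Here is a minimal sketch of that bookkeeping in plain Python. The function names are hypothetical, and because the loop runs in the interpreter the number it reports says more about Python than about the hardware; a real measurement would use an optimized BLAS such as ATLAS.

```python
import time


def naive_matmul(a, b):
    """Multiply two square matrices stored as lists of rows.

    Python floats are IEEE 754 double precision, so this is a DP multiply.
    """
    # Transpose b once so each inner product walks two plain sequences.
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def matmul_flops(n):
    """Conventional operation count for an n x n multiply: n^3 mults + n^3 adds."""
    return 2 * n ** 3


def measure_mflops(n=64):
    """Time one n x n multiply and convert the result to MFLOPS."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    naive_matmul(a, b)
    elapsed = time.perf_counter() - start
    return matmul_flops(n) / elapsed / 1e6


if __name__ == "__main__":
    print(f"{measure_mflops():.1f} DP MFLOPS (interpreted Python, not a hardware peak)")
```

Running the same script unchanged on each candidate board gives a crude relative comparison, which is about all a pure Python loop is good for.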

> There is untapped performance in the
>VideoCore IV GPU (24 SP GFLOPS) but there is no C compiler for it
>(only reverse engineered assembly).

Unless you really enjoy hacking at a very low level, you want to pick
hardware for which YOU aren't responsible for making the OS and tools
work.  You want to spend your time on
A) hardware assembly
B) learning how to effectively use multiple nodes and a communications
fabric

Not on discovering that tool A needs glibc version 3 and can't work with
version 4; that tool B needs glibc version 4; and, oh, that essential tool
C is on sourceforge and seemed to work on the original author's system when
she was doing her master's thesis in 2005 using an old copy of Debian Woody
on an early Pentium, but hasn't seen much use since then.

(I have a cluster down in a closet of my lab with 4 Mini-ITX Via Eden
processors; it boots off CF or netboots Debian Woody, over WiFi, no less.
I can't give the nodes away.... Someone says "I need a little embedded PC
for %some small task%" and I say, hey, you want one of these cool 12V
powered fanless PCs with FreeDOS or Linux?  And as soon as they start to
figure out how much time they will spend getting it to do what they want,
they say, "I'll buy a brand new Mini-ITX with Windows 7 or CentOS for
$400".  Such is life when labor hours cost money.)

> BeagleBone MX seems to have about
>50-60 MFLOPS according to this:
>http://www.vesperix.com/arm/atlas-arm/bench/gcc-a8/index.html
>
>So these boards are not comparable with Jetson. I will take a look at
>Mini/Nano-ITX PC market.

Hah. If you want real low power/high performance, consider the
Teensy 3.1, a sort of super Arduino using the Freescale K20 processor based
on the ARM Cortex architecture. It draws 30mA, runs at a 72 MHz clock rate,
and does a 1024 point complex fixed point FFT in a few milliseconds.

If we assume a 1024 point FFT has 1024*10 butterflies, each with a complex
multiply (4 integer multiplies + 2 adds) and 2 complex adds (4 integer
adds), that's 10240*(4+6) or about 100 kiloops. In, say, 5 milliseconds,
that's 20 MIPS.  (Actually it's more, because I was just counting
arithmetic operations, and there are all the other instructions to move
arguments around, etc.) Let's just call it 72 MIPS, i.e. roughly one
instruction per clock at 72 MHz.

30mA at 3.3V is about 100 mW, so we have 72 MIPS/0.1 W, which is 720 MIPS/Watt.
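
The back-of-envelope numbers above can be rechecked with a few lines of Python. This is only a sketch of the arithmetic in this post: the function names are made up for illustration, the 5 ms FFT time and the 72 MIPS figure are the assumptions stated above, and the butterfly count follows the post's 1024*10 estimate (a textbook radix-2 FFT has n/2 butterflies per stage, so this overcounts by about 2x).

```python
import math


def fft_ops_estimate(n_points, ops_per_butterfly=10):
    """Arithmetic-op estimate using the post's assumption of n * log2(n) butterflies.

    ops_per_butterfly = 6 (complex multiply: 4 mults + 2 adds)
                      + 4 (two complex adds: 4 integer adds).
    """
    stages = int(math.log2(n_points))
    butterflies = n_points * stages  # the post's count; radix-2 would be n/2 per stage
    return butterflies * ops_per_butterfly


def mips(ops, seconds):
    """Millions of operations per second."""
    return ops / seconds / 1e6


def mips_per_watt(mips_value, current_a, voltage_v):
    """Efficiency given supply current (amps) and voltage (volts)."""
    return mips_value / (current_a * voltage_v)


if __name__ == "__main__":
    ops = fft_ops_estimate(1024)
    print(f"ops per FFT:      {ops}")
    print(f"arithmetic MIPS:  {mips(ops, 5e-3):.1f}")
    print(f"power (W):        {0.030 * 3.3:.3f}")
    print(f"MIPS/Watt at 72:  {mips_per_watt(72, 0.030, 3.3):.0f}")
```

Using the unrounded 99 mW instead of 100 mW moves the efficiency figure slightly above 720 MIPS/Watt, which does not change the conclusion.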

Your nodes will cost <$20 each and are about the size of a USB thumb drive.

Sorry, no MPI library yet.

>