[Beowulf] Intel Phi musings

Dr Stuart Midgley sdm900 at gmail.com
Mon Feb 25 06:09:26 PST 2013


The Kepler K10 is faster at single precision floating point, which is what our code uses.

The biggest issue we have at the moment is the amount of memory the systems have.

--
Dr Stuart Midgley
sdm900 at sdm900.com




On 23/02/2013, at 1:13 AM, Richard Walsh <rbwcnslt at gmail.com> wrote:

> 
> Hey Stuart,
> 
> Mmm ... interesting.
> 
> As I understand it the name K10 corresponds to the GK104, which is really
> a graphics-oriented chip.  It is the K20 or GK110 that is the HPC (GPGPU)
> version of the Kepler and the right one to make the comparison to.
> 
> Here is the white paper:
> 
> http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
> 
> One wonders if you are running in single or double precision (maybe you
> told me) because the GK110 has 192 single precision cores per SMX unit
> while only 64 double precision cores (3 to 1 ratio rather than the typical
> 2 to 1).  It would be interesting to see data from this comparison.  Doing
> some math.
> 
> Single precision:
> 
> GK110  15 SMX units x 192 SP cores  == 2880 SP ops/clock x 0.738 GHz == 2125 SP GFLOPS
> 
> Phi    60 PHI cores x 16 SP vectors ==  960 SP ops/clock x 1.100 GHz == 1056 SP GFLOPS
> 
> Double precision:
> 
> GK110  15 SMX units x 64 DP cores   ==  960 DP ops/clock x 0.738 GHz ==  708 DP GFLOPS
> 
> Phi    60 PHI cores x 8 DP vectors  ==  480 DP ops/clock x 1.100 GHz ==  528 DP GFLOPS
> 
> These peak numbers (assuming I got the math right) of course do not dictate real code
> performance outcomes where effective memory bandwidth will make a large contribution.
> Still it looks like the GK110 should have the performance edge (if not the productivity edge).
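
The peak figures quoted above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic (units x lanes x clock), using the unit counts and clock rates given in the message. It assumes one operation per lane per clock, as the quoted numbers do, so fused multiply-add is not counted as two operations. The names peak_gflops and configs are illustrative.

```python
# Peak-rate arithmetic from the quoted message: units x lanes x clock (GHz).
# All inputs are the figures given in the thread. Assumption: one operation
# per lane per clock (FMA is not counted as two operations).

def peak_gflops(units, lanes_per_unit, clock_ghz):
    """Return peak GFLOPS as units * lanes_per_unit * clock_ghz."""
    return units * lanes_per_unit * clock_ghz

# (chip, precision) -> (units, lanes per unit, clock in GHz)
configs = {
    ("GK110", "SP"): (15, 192, 0.738),
    ("Phi", "SP"): (60, 16, 1.100),
    ("GK110", "DP"): (15, 64, 0.738),
    ("Phi", "DP"): (60, 8, 1.100),
}

peaks = {key: peak_gflops(*cfg) for key, cfg in configs.items()}

for (chip, precision), gflops in peaks.items():
    print(f"{chip:6s} {precision}: {gflops:7.1f} GFLOPS")

# Ratio of GK110 peak to Phi peak for each precision.
sp_ratio = peaks[("GK110", "SP")] / peaks[("Phi", "SP")]
dp_ratio = peaks[("GK110", "DP")] / peaks[("Phi", "DP")]
print(f"GK110/Phi peak ratio: SP {sp_ratio:.2f}, DP {dp_ratio:.2f}")
```

Under these assumptions the GK110 peak is roughly 2.0 times the Phi peak in single precision and roughly 1.3 times in double precision. These are peak ratios only and say nothing about memory bandwidth.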
> 
> rbw
> 
> 
> 
> On Fri, Feb 22, 2013 at 7:44 AM, Dr Stuart Midgley <sdm900 at gmail.com> wrote:
> We have a code written for both the Phi and the K10, and the two versions give about the same performance (both are highly optimised finite difference codes).
> 
> 
> 
> 
> --
> Dr Stuart Midgley
> sdm900 at sdm900.com
> 
> 
> 
> 
> On 15/02/2013, at 4:53 AM, Richard Walsh <rbwcnslt at gmail.com> wrote:
> 
> >
> > Hey Stuart,
> >
> > Thanks much for the detail.
> >
> > So, if I am reading you correctly your test was on a single
> > physical PHI (you will later expand to multiple PHIs).  This
> > was a highly parallel single precision application which showed
> > the expected linear speed up to 60 cores ... then a kink as you
> > cross into hyper-threaded operation, with a slope 1/2 as steep,
> > up to a further factor of two (120 core-equivalents) at a 4 to 1
> > oversubscription of hyper-threads.  This was all done with the Intel
> > compilers on an unmodified pthreaded code that is well-vectorized.
> >
> > A good result ... on an application that is a perfect candidate
> > for PHI.  To run elsewhere with CUDA, OpenMP, or OpenACC
> > directives would require quite a bit of recoding which you were
> > happy to avoid.  My guess is if you had a CUDA implementation
> > you would see better performance on a FERMI or KEPLER,
> > but that is a programming path you do not wish to take.
> >
> > This is an interesting case to hear about.  The flack (technical
> > marketing) from NVIDIA is to focus on the difficulty of using
> > the 'offload' model and Intel extensions to OpenMP, Cilk, etc.,
> > articulate their hardware's performance advantages, and talk
> > about OpenACC. These arguments are not unreasonable, but
> > clearly not universally decisive.
> >
> > Thanks much ... and good luck getting all your other codes
> > to scale just as well.
> >
> > rbw
> >
> > On Thu, Feb 14, 2013 at 10:18 AM, Dr Stuart Midgley <sdm900 at gmail.com> wrote:
> > Evening
> >
> > Sorry for the slow response.
> >
> > Most of our codes are pthreads; we have avoided MPI and OpenMP as much as possible.  Our current cluster consists of Nehalem, Westmere, Sandy Bridge and Interlagos of various flavours.  Our Phi cards are in Sandy Bridge systems (host machine has 16 cores with 128GB RAM).  We run the Intel compilers.
> >
> > Our fastest systems are the 64-core Interlagos systems (256GB RAM) running at 2.6GHz.  For a few of our most important kernels, a single Phi had greater throughput than a whole node, which is expected if you count the flops.  The Phis have a massive amount of single precision floating point performance (our codes are single precision).
> >
> > Our kernels vectorise very well (lots of hand coded SSE3) and are expected to run very well on the Phi (we haven't tested these codes yet).  The codes we have tested are trivially parallel and very FP heavy - they ported easily to the Phi and run very well.
> >
> > The codes I tested (in about 2 hours) saw linear speedup to 60 cores, then a "kink" in performance, and then continued performance gains right up to 240 threads.  Essentially these codes are single-CPU codes with a trivial wrapper around them to hand out work.  This is exactly what hyper-threading was designed to help.  So at 240 threads, we were about 120 times faster than a single thread of this code.  At 60 threads, we were 60 times faster :)
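
The scaling numbers reported above (60x at 60 threads, about 120x at 240 threads) can be restated as parallel efficiency per thread and throughput gain per physical core. The sketch below only rearranges those two reported figures; the 120x value is approximate in the original message, and the helper name parallel_efficiency is illustrative.

```python
# Restating the reported Phi scaling as efficiency figures.
# Inputs are the two data points from the message; 120x is approximate.

def parallel_efficiency(speedup, workers):
    """Speedup divided by the number of workers (1.0 means perfect scaling)."""
    return speedup / workers

PHYSICAL_CORES = 60  # cores on the Phi card, as stated in the thread

# threads -> speedup relative to a single thread
measurements = {60: 60.0, 240: 120.0}

for threads, speedup in measurements.items():
    per_thread = parallel_efficiency(speedup, threads)
    per_core = parallel_efficiency(speedup, PHYSICAL_CORES)
    print(f"{threads:3d} threads: speedup {speedup:5.1f}, "
          f"efficiency per thread {per_thread:.2f}, "
          f"throughput per core {per_core:.2f}x")
```

Read this way, running four threads per core halves the per-thread efficiency but doubles the work each physical core gets through.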
> >
> > Again, since the codes I tested were small data in, small data out, heavy compute and trivially parallel, running over multiple Phis is trivial and provides linear performance gains.  As we start porting more of our complex codes, I expect to see similar gains.  Our codes already run very very well on 64 cores…
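
The "trivial wrapper ... to hand out work" structure described above can be sketched as a shared queue of independent work items drained by a fixed number of worker threads. This is not the code discussed in the thread, which is pthreads-based; it is a Python stand-in that shows the structure only. The kernel function is a hypothetical placeholder, and Python threads will not give the compute speedup that native threads do.

```python
import queue
import threading

def kernel(item):
    """Hypothetical stand-in for a compute kernel: small data in, small data out."""
    return sum(i * i for i in range(item))

def run_workers(items, n_threads):
    """Hand out independent work items to n_threads workers via a shared queue."""
    work = queue.Queue()
    for index, item in enumerate(items):
        work.put((index, item))

    results = [None] * len(items)

    def worker():
        # Each worker pulls items until the queue is empty. The queue is
        # fully populated before any worker starts, so Empty means done.
        while True:
            try:
                index, item = work.get_nowait()
            except queue.Empty:
                return
            results[index] = kernel(item)  # each slot is written by one worker

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

print(run_workers([10, 100, 1000], n_threads=4))
```

Because the items share no state, the same hand-out loop works unchanged with more workers, which is the property the message relies on when it calls running over multiple cards trivial.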
> >
> > The Phis are separate cards, in separate PCIe slots.  I have not delved into the programming APIs fully, but I suspect you can utilise the one Phi card for your threaded codes.  The way I've been running is with a native Phi application (basically using the Phi as a separate Linux cluster node)… using it in offload mode is very different and you may well be able to get your kernel running across both with the right pragmas.
> >
> > To be 100% honest, we took the boots-and-all approach.  If we had only purchased 1 Phi to test on, we would never have expended the energy to port all our codes.  Purchasing hundreds of them gives you a lot of impetus to port your codes quickly :)
> >
> >
> > --
> > Dr Stuart Midgley
> > sdm900 at sdm900.com
> >
> >
> >
> >
> > On 13/02/2013, at 12:38 AM, Richard Walsh <rbwcnslt at gmail.com> wrote:
> >
> > >
> > > Hey Stuart,
> > >
> > > Thanks for your answer ...
> > >
> > > That sounds compelling.  May I ask a few more questions?
> > >
> > > So should I assume that this was a threaded SMP type application
> > > (OpenMP, pthreads) or is it MPI based? Is the supporting CPU of the
> > > multi-core Sandy Bridge vintage? Have you been able to compare
> > > the hyper-threaded, multi-core scaling on that Sandy Bridge side of the
> > > system with that on the Phi (fewer cores to compare of course)?  Using the
> > > Intel compilers I assume ... how well do your kernels vectorize?  Curious
> > > about the observed benefits of hyper-threading, which generally offers
> > > little to floating-point intensive HPC computations where functional unit
> > > collision is an issue.  You said you have 2 Phis per node.  Were you
> > > running a single job across both?  Were the Phis in separate PCIE
> > > slots or on the same card (sorry I should know this, but I have just
> > > started looking at Phi)?  If they are on separate cards in separate
> > > slots can I assume that I am limited to MPI parallel implementations
> > > when using both?
> > >
> > > Maybe that is more than a few questions ... ;-) ...
> > >
> > > Regards,
> >
> >
> 
> 



