<html><head></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">Up front: I work for Intel, and I even write software for the Intel(r) Xeon Phi(tm) coprocessor.<div>

<br><div><div>On 12 Feb 2013, at 16:38, Richard Walsh wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div>Curious about the observed benefits of hyper-threading, which generally offers</div><div>

<div>little to floating-point intensive HPC computations where functional unit</div><div>collision is an issue.  </div></div></blockquote><br></div></div><div>There's a big difference between the processors in the Phi and those in current Xeons.</div><div>The Phi CPUs are in-order processors, whereas the Xeons are out-of-order. On the </div><div>Xeons, hyper-threading is intended to let the out-of-order CPU schedule operations from either</div><div>hardware thread when there are spare functional units that aren't being used. If a single thread</div><div>can max out a functional unit (for instance, the floating-point ALU), then enabling another hardware</div><div>thread is unlikely to improve performance significantly (as you observe!).</div><div><br></div><div>However, the intent in the in-order processor is different: here the aim is to provide extra</div><div>latency tolerance when one thread is stalled waiting for a cache or memory access. In the out-of-order</div><div>core, this latency is hidden by the out-of-order mechanism.</div><div><br></div><div>So the benefits of running more hardware threads in the Phi can be much larger than in the </div><div>big, out-of-order core, and I would certainly recommend running at least two threads/core</div><div>unless you are seriously memory-bandwidth bound.</div><div><br></div><div>When investigating scalability on the Phi, my preference is to plot cores along the x-axis and treat</div><div>1 thread/core, 2 threads/core, ... 4 threads/core as separate series.
I find this easier to understand than</div><div>a plot with threads on the x-axis, because on such a plot it's hard to distinguish 60 threads (15 cores x 4 threads) from </div><div>60 threads (20 cores x 3 threads), 60 threads (30 cores x 2 threads) and 60 threads (60 cores x 1 thread).</div><div><br></div><div>If you're using OpenMP, then the KMP_PLACE_THREADS environment variable makes it easy to play with </div><div>allocations of that sort.</div><div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; "><br></font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">--</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">-- Jim</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">--</font></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px; "><font face="Helvetica" size="3" style="font: normal normal normal 12px/normal Helvetica; ">James Cownie &lt;<a href="mailto:jcownie@cantab.net">jcownie@cantab.net</a>&gt;</font></div><br></div><div><br></div><div><br></div><div><br></div></body></html>