<div dir="ltr">If I recall correctly, IBM did just what you're describing with the BlueGene CPUs. I believe those were 18-core parts, with 2 of the cores being reserved to run the OS and as a buffer against jitter. That left a nice, neat power-of-2 number of cores for compute tasks.<div><br></div><div>Re: having a specialized, low-power core, this is clearly something that's already been successful in the mobile device space. The <a href="https://en.wikipedia.org/wiki/ARM_big.LITTLE">big.LITTLE</a> ARM architecture is designed for this kind of thing and has been quite successful. Certainly, now that Intel and AMD are really designing modular SoC-like products, it wouldn't be terribly difficult to bake in a couple of low-power x86 cores (e.g. Atom or Xeon-D + larger Skylake die in Intel's case; Jaguar + Zen in AMD's case). I'm not an expert in fab economics, but I don't believe it would significantly add to production costs.</div><div><br></div><div>A similar approach to IBM's (with BlueGene) is what the major public Cloud providers often do these days. AWS' standard approach is to buy CPUs with 1-2 more cores per socket than they actually intend to expose to users, and to use those extra cores for managing the hypervisor layer. As an example, the CPUs in the C4.8xlarge instances are, in reality, custom 10-core Xeon (Haswell) parts. Yet, AWS only exposes 8 of the cores per socket to the end user in order to ensure consistent performance and reduce the chance of a compute-intensive workload interfering with AWS' management of the physical node via the hypervisor. Microsoft Azure and Google Cloud Platform often (but not always) do the same thing, so it's something of a "best practice" among the Cloud providers these days. 
Anecdotally, I can report that, in our (Cycle Computing's) work with customers doing HPC and "Big Compute" on public Clouds, performance consistency has improved a lot over time, and the Cloud folks have told us that reserving a few cores per node was a helpful step in that process.</div><div><br></div><div>Hope this helps!</div><div><br></div><div><br></div><div>Best,</div><div><br></div><div>Evan</div><div><br></div><div><br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Sat, Jul 22, 2017 at 6:13 AM, Scott Atchley <span dir="ltr"><<a href="mailto:e.scott.atchley@gmail.com" target="_blank">e.scott.atchley@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">I would imagine the answer is "It depends". If the application uses the per-CPU caches effectively, then performance may drop when HT shares the cache between the two processes.<div><br></div><div>We are looking at reserving a couple of cores per node on Summit to run system daemons if the user requests it. If the user can effectively use the GPUs, the CPUs should be idle much of the time anyway. 
We will see.</div><div><br></div><div>I like your idea of a low-power core to run OS tasks.</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Sat, Jul 22, 2017 at 6:11 AM, John Hearns via Beowulf <span dir="ltr"><<a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>></span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr"><div>Several times in the past I have jokingly asked if there should be another, lower-powered CPU core in a system to run OS tasks (hello Intel - are you listening?)</div><div>Also, in the past there was advice that, to get the best possible throughput on AMD Bulldozer CPUs, you should run only on every second core (as they share FPUs).</div><div>When I managed a large NUMA system we used cpusets, and the OS ran in a small 'boot cpuset' which was physically near the OS disks and IO cards.</div><div><br></div><div>I had a thought about hyperthreading though. A few months ago we did a quick study with Blender rendering, and got 30% more throughput with HT switched on. Also, someone I am working with now would like to assess the effect on their codes of HT on/HT off.</div><div>I know that HT has normally not had any advantages with HPC-type codes - as the core should be 100% flat out.</div><div><br></div><div>I am thinking though - what would be the effect of enabling HT, and using a cgroup to constrain user codes to run on all the odd-numbered CPU cores, with the OS tasks on the even-numbered ones?</div><div>I would hope this would be at least performance neutral? Your thoughts please! Also thoughts on candidate benchmark programs to test this idea.</div><div><br></div><div><br></div><div>John Hearns</div></div>
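<div><br></div><div>A minimal sketch of the odd/even split proposed above, assuming Linux and Python 3. It uses per-process CPU affinity (<code>os.sched_setaffinity</code>) rather than a cgroup or cpuset, so it only illustrates the idea for a single job. The helper names <code>split_by_parity</code> and <code>pin_to_odd_cpus</code> are hypothetical. It also assumes that the two hyperthreads of a core are numbered as an adjacent even/odd pair, which is not true on every machine; the actual pairing can be checked in <code>/sys/devices/system/cpu/cpuN/topology/thread_siblings_list</code>.</div>

```python
import os


def split_by_parity(cpu_ids):
    """Return (even_ids, odd_ids) from an iterable of logical CPU numbers."""
    ordered = sorted(cpu_ids)
    even = [c for c in ordered if c % 2 == 0]
    odd = [c for c in ordered if c % 2 == 1]
    return even, odd


def pin_to_odd_cpus(pid=0):
    """Restrict a process (0 means the caller) to the odd-numbered CPUs.

    Only CPUs the process is already allowed to use are considered, so an
    existing cpuset or batch-scheduler binding is respected.
    Assumption: even-numbered CPUs are left for OS tasks, as in the proposal.
    """
    allowed = os.sched_getaffinity(pid)  # Linux-only call
    _, odd = split_by_parity(allowed)
    if not odd:
        raise RuntimeError("no odd-numbered CPUs available to this process")
    os.sched_setaffinity(pid, odd)
    return odd
```

<div>A job wrapper could call <code>pin_to_odd_cpus()</code> before launching the user code, since child processes inherit the affinity mask. This does not move OS tasks onto the even-numbered CPUs; it only keeps the user code off them.</div>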

<br></div></div>______________________________<wbr>_________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman<wbr>/listinfo/beowulf</a><br>

<br></blockquote></div><br></div>


<br></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr"><div>Evan Burness</div><div>Director, HPC Solutions</div><div>Cycle Computing</div><div><a href="mailto:evan.burness@cyclecomputing.com" target="_blank">evan.burness@cyclecomputing.com</a></div><div>(919) 724-9338</div></div></div>

</div>