<p>Hi all,<br></p><p>I've been trying to get the best performance out of a small cluster we have here at the University of Aveiro, Portugal, but I have not been able to get most software to scale beyond one node.</p>

<p>Our specs are as follows:</p><p>- HP c-7000 Blade enclosure,</p><div>- 8 Blades BL460c:</div><div></div><div><ul><li>Dual Xeon Quad-core E5430, 2.66 GHz </li><li>8GiB FB-DIMM DDR2-667 </li><li>Dual Gigabit Ethernet </li>

<li>Dual 146GB 10K RPM SAS HDD in RAID1</li></ul><p>The dual GbE ports are based on Broadcom's NetXtreme II, using the driver from the 2.6.26-gentoo-sources kernel, and they are connected to the enclosure's internal switch, which appears to be a Nortel unit rebranded by HP.</p>

<p>The problem with this setup is that even calculations that take more than 15 days don't scale beyond 8 cores, that is, one node. Performance is usually lower with 12 or 16 cores than with just 8. From what I've been reading, I should be able to scale fine to at least 16 cores, and to 32 for some software.</p>

<p>With Gromacs, I tried running on two nodes using one processor each, to check whether 8 cores were stressing the GbE too much, and performance dropped considerably compared with running on two CPUs in the same node. This was unexpected, since the benchmarks I've seen on the Gromacs website suggest I should get 100% scaling in this case, sometimes more.</p>

<p>2 cores, 1 node, 1500 steps ---> 361s  (2.6.26-r4 and 2.6.24-r6, no IPv6, icc)<br>
2 cores, 2 nodes, 1500 steps ---> 499s  (2.6.26-r4, no GROUP SCHEDULER, icc)</p><p>To be more precise, this particular benchmark is the only one heavy enough to benefit from going from 8 to 16 cores:</p>

<p>8 cores, 1 node, 1500 steps ---> 101s (2.6.26-r4, no GROUP SCHEDULER, icc)<br>
16 cores, 2 nodes, 1500 steps ---> 65s, 65s (no IPv6, 2.6.26, 2.6.26)</p><p>The rest of the typical calculations done here are lighter and perform worse on 16 cores than on 8. I also don't know whether this is just a case of the calculations being too "easy" or really a hardware problem, which I find strange, since calculations that take 15 days on 8 cores aren't able to run faster on 16.</p>
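<p>To put the timings above in numbers, here is a minimal sketch that computes speedup and parallel efficiency from them. The function names are just illustrative, and the only data used are the four wall-clock times quoted in this post.</p>

```python
# Wall-clock times in seconds, copied from the 1500-step Gromacs runs above.
# Keys are (cores, nodes).
timings = {
    (2, 1): 361.0,
    (2, 2): 499.0,
    (8, 1): 101.0,
    (16, 2): 65.0,
}


def speedup(t_ref, t_new):
    """How many times faster the new run is than the reference run."""
    return t_ref / t_new


def parallel_efficiency(t_ref, cores_ref, t_new, cores_new):
    """Speedup divided by the factor by which the core count grew."""
    return speedup(t_ref, t_new) / (cores_new / cores_ref)


if __name__ == "__main__":
    # Same core count, but spread over two nodes instead of one.
    s = speedup(timings[(2, 1)], timings[(2, 2)])
    print(f"2 cores, 1 node vs 2 nodes: speedup {s:.2f}")

    # Doubling the core count by adding a second node.
    s = speedup(timings[(8, 1)], timings[(16, 2)])
    e = parallel_efficiency(timings[(8, 1)], 8, timings[(16, 2)], 16)
    print(f"8 cores vs 16 cores: speedup {s:.2f}, efficiency {e:.0%}")
```

<p>With these numbers, moving the 2-core job across two nodes gives a speedup of about 0.72 (a slowdown), while going from 8 to 16 cores gives about 1.55, or roughly 78% parallel efficiency.</p>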

<p>From what I could dig up, it seems that some switches add too much latency and prevent decent performance over GbE. If that is my case, are there any benchmarks I could use to test that theory?</p><p>One particular package, VASP, doesn't scale beyond 6 cores, which seems to be a bandwidth problem due to the FSB used in these Xeons, but the other software behaves quite well.</p>
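<p>As a rough idea of what such a latency test could look like, below is a minimal TCP ping-pong sketch using only the Python standard library. It is an assumption-laden illustration, not an established benchmark: the helper names are made up for this example, the demo runs over loopback only, and it measures TCP round-trip time rather than MPI latency. On the cluster, the echo server would run on one blade and the client on another, with the real hostname and port substituted. Dedicated tools such as the ping-pong tests in the OSU Micro-Benchmarks or the Intel MPI Benchmarks are probably the more meaningful choice for MPI workloads.</p>

```python
import socket
import statistics
import threading
import time


def recv_exact(conn, n):
    """Read exactly n bytes from a connected socket."""
    buf = b""
    while len(buf) != n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def serve_echo(listener):
    """Accept one client and echo back everything it sends."""
    conn, _ = listener.accept()
    with conn:
        # Disable Nagle's algorithm so small messages are sent immediately.
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        while True:
            data = conn.recv(4096)
            if not data:
                break
            conn.sendall(data)
    listener.close()


def start_echo_server(host="127.0.0.1", port=0):
    """Start a one-shot echo server in a background thread; return its port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    threading.Thread(target=serve_echo, args=(listener,), daemon=True).start()
    return listener.getsockname()[1]


def measure_rtt(host, port, rounds=1000, payload=b"x"):
    """Return a list of round-trip times in seconds for small messages."""
    samples = []
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(rounds):
            t0 = time.perf_counter()
            s.sendall(payload)
            recv_exact(s, len(payload))
            samples.append(time.perf_counter() - t0)
    return samples


if __name__ == "__main__":
    # Loopback demo only; replace host and port with a second node's values.
    port = start_echo_server()
    rtts = measure_rtt("127.0.0.1", port, rounds=1000)
    print(f"min RTT:    {min(rtts) * 1e6:.1f} us")
    print(f"median RTT: {statistics.median(rtts) * 1e6:.1f} us")
```

<p>Comparing the minimum and median round-trip times between two blades against the loopback figures would give a first indication of how much the switch and NICs add.</p>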

<p>As for software, I'm using Gentoo Linux, ICC/IFC/GotoBLAS (I tried ScaLAPACK with no benefit), OpenMPI and Torque, all running in x86-64 mode.</p><p>Any help will be very welcome.</p><p>Best regards,</p><p>Tiago Marques<br>

<br></p></div><div></div>