<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"><head><title></title><meta http-equiv="Content-type" content="text/html; charset=UTF-8" /><style type="text/css">p { margin:0px; padding:0px; }</style></head><body style='background-color:rgb(255, 255, 255);background-image:none;background-repeat:repeat;background-position:0% 0%;font-family:Verdana,Geneva,Arial,Helvetica,sans-serif;font-size:12px;margin-top:0px;margin-bottom:0px;margin-left:0px;margin-right:0px;padding-top:5px;padding-bottom:5px;padding-left:5px;padding-right:5px;'>Hi everyone, <br /><br />I have built and played with my home Beowulf cluster running Rocks for a while and <br /><p>now I would like to construct a bigger one. I have bought a book called Building Clustered<span></span></p><p id="__paragraph__1271510321000">Linux Systems (being a noob at HPC). The book is excellent. The cluster is to be <span></span></p><p id="__paragraph__1271510539000">used for CFD and FEM computations. A side note: I have an MSc in computational</p><p id="__paragraph__1271510539000">science, so I know my applications in detail, and the numerics behind the calculations, but<span></span></p><p id="__paragraph__1271510883000">on the HPC side, I'm kind of a noob (I have been learning it for only one year of the required</p><p id="__paragraph__1271510883000">25 years ;) ).<br /></p><p id="__paragraph__1271510372000"><br /></p><p id="__paragraph__1271510372000">I have some questions regarding the choice of hardware architecture for the compute nodes.<span></span></p><p id="__paragraph__1271510394000"><br /></p><p id="__paragraph__1271510394000">Since I am on a low budget, I would like to build a cluster of 16 compute nodes from COTS <span></span></p><p id="__paragraph__1271510423000">electronics (for my budget, Xeons and Opterons are definitely not COTS), and I have trouble</p><p id="__paragraph__1271510423000">deciding between dual-core and quad-core, and between Intel and AMD processors for the cluster.<span></span></p><p 
id="__paragraph__1271511192000"><br /></p><p id="__paragraph__1271511192000">gcc is used for compilation and AMD hardware is cheaper, so I'm leaning toward AMD, <span></span></p><p id="__paragraph__1271511775000">but I would appreciate any advice on this: I have been told before that icc</p><p id="__paragraph__1271511775000">can give me a speed increase of up to 20% on Intel processors. <span></span></p><p id="__paragraph__1271512485000"><br /></p><p id="__paragraph__1271512485000">Another reason is that, since I'm running coarse-grained CFD/FEM simulations, <span></span></p><p id="__paragraph__1271512501000">the bigger cache of server-type processors<span></span></p><p id="__paragraph__1271512526000">like Opteron and Xeon is not of much use, or am I wrong? The data sets are huge, so little<span></span></p><p id="__paragraph__1271512548000">of the data stays in the cache long enough to be reused. <br /></p><p id="__paragraph__1271511234000"><br /></p><p id="__paragraph__1271510423000">I have read that on multi-core processors the system bus can get saturated, so I <span></span></p><p id="__paragraph__1271510572000">am running benchmarks on two single machines: one with a 2-core processor and the other<span></span></p><p id="__paragraph__1271510611000">with 4 cores. <span></span></p><p id="__paragraph__1271510924000"><br /></p><p id="__paragraph__1271510924000">The idea is to run a benchmark case of my choice (transient multiphase incompressible<span></span></p><p id="__paragraph__1271510954000">fluid flow) and increase the case size and the number of processors to see when the <span></span></p><p id="__paragraph__1271510994000">parallelization is affected by traffic on the system bus, and to estimate the largest<span></span></p><p id="__paragraph__1271511035000">simulation case size for a single compute node. 
<span></span></p><p id="__paragraph__1271512425000"><br /></p><p id="__paragraph__1271510687000">How can I keep I/O speed from affecting my IPC estimate for a single slice? I will set up <span></span></p> <p id="__paragraph__1271511543000">a RAID 0 on the machine, and with no networking involved I am not sure that there is anything<span></span></p> <p id="__paragraph__1271511570000">else I could do to take the I/O impact out of the benchmark for the single slice. </p> <p id="__paragraph__1271512423000"><br /></p><p id="__paragraph__1271511570000">I am describing this in detail because I really want to avoid spending the budget the <span></span></p><p id="__paragraph__1271512666000">wrong way. I would appreciate any advice that you can spare. <span></span></p><p id="__paragraph__1271512686000"><br /></p><p id="__paragraph__1271512686000">Tomislav<br /></p><p id="__paragraph__1271512429000"><br /></p><p id="__paragraph__1271512405000"><br /></p><p id="__paragraph__1271512405000"><br /></p><p id="__paragraph__1271511590000"><br /></p><p id="__paragraph__1271511590000"><br /></p></body></html>