
Sandia National Laboratories' Combustion Research Facility (CRF) studies energy conversion processes for applications such as power plants, internal combustion engines, and industrial furnaces. As a Department of Energy facility, it collaborates with scientific and commercial partners in the aerospace, automotive, transportation, and power generation industries. The collective goal is to find high-efficiency, low-emission solutions to complex combustion problems.
CRF simulates reacting flows, running two classes of jobs:
Researchers simulate particular problems, building on proven algorithms. Using advanced laboratory facilities, colleagues conduct real-time experiments. Results from both are analyzed jointly, each complementing the other.
Because the applications are CPU-intensive, CRF competes for resources at supercomputing centers. Demand always exceeds supply, and CRF had difficulty obtaining timely access for all the simulations it wished to run. A typical baseline case has a run time of a week or two. More complex jobs can take 6-8 weeks.
Joe Oefelein, Senior Member of Technical Staff, and Jackie Chen, Distinguished Member of Technical Staff, realized HPC technology had evolved to the point where department-scale clusters could be obtained at a reasonable cost. They determined a Linux cluster could perform their routine calculations. Supercomputer time could then be reserved for larger simulations that require substantial system support.
Joe and Jackie, anticipating the trend towards distributed computing, had already parallelized their codes for these environments. This allowed them to quickly benchmark the performance of their specific applications on the clusters available. Joe easily made the purchase argument to his department manager, who endorsed the engineers' recommendation and took it to the Deputy Director and Director of the CRF for approval.
Three Principal Investigators access the cluster from their workstations, using an informal queue to manage shared use of the cluster, which is in continual operation. "It's been clear as we've emerged from the shake-down phase that it's performing as promised. The time I need to manage the cluster is truly minimal," says Joe.
The Penguin Computing AMD Opteron cluster consists of 72 nodes with 144 processors, centrally located in a laboratory. The Altus 3200 master node, backup battery system, Gigabit Ethernet switch, and InfiniBand interconnect are on one rack. Two racks house the Altus 1000E Opteron 246 compute nodes, with motherboards and communication hardware. CRF invested in InfiniBand in anticipation of adding another 72 nodes.
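The sizing figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the `ClusterSpec` class and its methods are hypothetical names introduced here, and the projected processor count assumes the additional 72 nodes would carry the same number of processors per node as the existing ones, which the article does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical record of cluster size; not part of any Scyld or Penguin tool."""
    nodes: int
    processors: int

    @property
    def processors_per_node(self) -> int:
        # Integer division: 144 processors over 72 nodes gives 2 per node.
        return self.processors // self.nodes

    def expanded(self, extra_nodes: int) -> "ClusterSpec":
        # Assumption: new nodes have the same processor count as existing ones.
        return ClusterSpec(
            nodes=self.nodes + extra_nodes,
            processors=self.processors + extra_nodes * self.processors_per_node,
        )


# Figures quoted in the article.
current = ClusterSpec(nodes=72, processors=144)

# The anticipated expansion by another 72 nodes.
planned = current.expanded(72)

print(f"current: {current.nodes} nodes, {current.processors} processors "
      f"({current.processors_per_node} per node)")
print(f"planned: {planned.nodes} nodes, {planned.processors} processors")
```

Under that assumption, the expansion would double the cluster to 144 nodes and 288 processors.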
Using the Scyld BeoMaster interface to emulate a workstation, any staff member with the technical competency to administer a single Linux machine can handle the entire cluster. The minimal intervention required to manage a Scyld cluster eliminated the need to budget $150,000 a year for resources from Sandia's support staff.
Joe and Jackie's advice to those considering clusters