[Beowulf] $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
deadline at eadline.org
Mon Oct 3 11:17:33 PDT 2011
I think everyone has similar thoughts, but the presentation
provides some real data and experience.
BTW, for those interested, I have a new poll on ClusterMonkey asking
about clouds and HPC. (http://www.clustermonkey.net/)
The last poll was on GP-GPU use.
> Thanks for posting that video. It confirmed what I always suspected
> about clouds for HPC.
> On 10/03/2011 08:25 AM, Douglas Eadline wrote:
>> Interesting and pragmatic HPC cloud presentation, worth watching
>> (25 minutes)
>>> $1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud
>>> By Jon Brodkin | Published September 20, 2011 10:49 AM
>>> Amazon EC2 and other cloud services are expanding the market for
>>> high-performance computing. Without access to a national lab or a
>>> supercomputer in your own data center, cloud computing lets businesses set
>>> up temporary clusters at will and stop paying for them as soon as the
>>> computing needs are met.
>>> A vendor called Cycle Computing is on a mission to demonstrate the power
>>> of Amazon's cloud by building increasingly large clusters on the Elastic
>>> Compute Cloud. Even with Amazon, building a cluster takes some work, so
>>> Cycle combines several technologies to ease the process and recently used
>>> them to create a 30,000-core cluster running CentOS Linux.
>>> The cluster, announced publicly this week, was created for an unnamed
>>> "Top 5 Pharma" customer, and ran for about seven hours at the end of July at a
>>> cost of $1,279 per hour, including the fees to Amazon and Cycle Computing.
>>> The details are impressive: 3,809 compute instances, each with eight cores
>>> and 7GB of RAM, for a total of 30,472 cores, 26.7TB of RAM and 2PB
>>> (petabytes) of disk space. Security was ensured with HTTPS, SSH and
>>> AES encryption, and the cluster ran across data centers in three Amazon
>>> regions in the United States and Europe. The cluster was dubbed "Nekomata."
>>> Spreading the cluster across multiple continents was done partly for
>>> disaster recovery purposes, and also to guarantee that 30,000 cores could be
>>> provisioned. "We thought it would improve our probability of success to
>>> spread it out," Cycle Computing's Dave Powers, manager of product
>>> engineering, told Ars. "Nobody really knows how many instances you can
>>> get at any one time from any one [Amazon] region."
>>> Amazon offers its own special cluster compute instances, at a higher price
>>> than regular-sized virtual machines. These cluster instances provide 10
>>> Gigabit Ethernet networking along with greater CPU and memory, but they
>>> weren't necessary to build the Cycle Computing cluster.
>>> The pharmaceutical company's job, related to molecular modeling, was
>>> "embarrassingly parallel," so a fast interconnect wasn't crucial. To
>>> reduce costs, Cycle took advantage of Amazon's low-price "spot
>>> instances." To
>>> manage the cluster, Cycle Computing used its own management software as well
>>> as the Condor High-Throughput Computing software and Chef, an open source
>>> systems integration framework.
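The "embarrassingly parallel" property mentioned above is what lets a job like this run across three regions without a fast interconnect: every task is independent, so workers never exchange data. A minimal sketch of that structure (the `score_molecule` function is hypothetical, standing in for one molecular-modeling task):

```python
from multiprocessing import Pool

def score_molecule(mol_id):
    """Stand-in for one independent molecular-modeling task.

    In an embarrassingly parallel workload each task reads its own
    input and produces its own result; tasks never communicate with
    each other, so interconnect latency between workers is irrelevant.
    """
    return (mol_id, mol_id * 0.5)  # placeholder "score"

if __name__ == "__main__":
    # Fan independent tasks out across local cores; a scheduler such
    # as Condor plays the same role across thousands of machines.
    with Pool(processes=8) as pool:
        results = pool.map(score_molecule, range(100))
    print(len(results))  # one result per task, no coordination needed
```

The same pattern scales from one machine to 3,809 instances precisely because nothing in it depends on where any other task runs.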
>>> Cycle demonstrated the power of the Amazon cloud earlier this year with a
>>> 10,000-core cluster built for a smaller pharma firm called Genentech. But
>>> 10,000 cores is a relatively easy task now, says Powers. "We think we've mastered
>>> the small-scale environments," he said. 30,000 cores isn't the end of the road,
>>> either. Going forward, Cycle plans bigger, more complicated clusters, including
>>> ones that will require Amazon's special cluster compute instances.
>>> The 30,000-core cluster may or may not be the biggest one run on EC2. Amazon
>>> isn't saying.
>>> "I can't share specific customer details, but can tell you that we have
>>> businesses of all sizes running large-scale, high-performance computing
>>> workloads on AWS [Amazon Web Services], including distributed clusters like
>>> the Cycle Computing 30,000-core cluster to tightly-coupled clusters
>>> used for science and engineering applications such as computational fluid
>>> dynamics and molecular dynamics simulation," an Amazon spokesperson said.
>>> Amazon itself actually built a supercomputer on its own cloud that made it
>>> onto the list of the world's Top 500 supercomputers. With 7,000 cores, the
>>> Amazon cluster ranked number 232 in the world last November with speeds of
>>> 41.82 teraflops, falling to number 451 in June of this year. So far, Cycle
>>> Computing hasn't run the Linpack benchmark to determine the speed of its
>>> clusters relative to Top 500 sites.
>>> But Cycle's work is impressive no matter how you measure it. The job it
>>> performed for the unnamed pharma company "would take well over a week for
>>> them to run internally," Powers says. In the end, the cluster performed the
>>> equivalent of 10.9 "compute years of work."
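A quick back-of-envelope check of the figures quoted in the article (the "about seven hours" run time is approximate, so the totals are too):

```python
# Rough arithmetic on the numbers reported above.
cores = 30472   # total cores in the cluster
hours = 7       # approximate run time
rate = 1279     # dollars per hour, Amazon and Cycle fees included

total_cost = rate * hours                     # roughly $8,953 for the run
core_hours = cores * hours                    # roughly 213,304 core-hours
cost_per_core_hour = total_cost / core_hours  # about $0.042

print(f"total: ${total_cost}, per core-hour: ${cost_per_core_hour:.3f}")
```

Put another way, the whole seven-hour, 30,000-core run cost less than a single mid-range server, at around four cents per core-hour.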
>>> The task of managing such large cloud-based clusters forced Cycle to up
>>> its own game, with a new plug-in for Chef the company calls Grill.
>>> "There is no way that any mere human could keep track of all of the moving
>>> parts on a cluster of this scale," Cycle wrote in a blog post. "At Cycle,
>>> we've always been fans of extreme IT automation, but we needed to take it
>>> to the next level in order to monitor and manage every instance,
>>> daemon, job, and so on in order for Nekomata to be an efficient 30,000-core
>>> tool instead of a big shiny on-demand paperweight."
>>> But problems did arise during the 30,000-core run.
>>> "You can be sure that when you run at massive scale, you are bound to run
>>> into some unexpected gotchas," Cycle notes. "In our case, one of these
>>> included such things as running out of file descriptors on the license
>>> server. In hindsight, we should have anticipated this would be an issue, but
>>> we didn't find that in our prelaunch testing, because we didn't test at full
>>> scale. We were able to quickly recover from this bump and keep moving forward
>>> with the workload with minimal impact. The license server was able to scale
>>> very nicely with this workload once we increased the number of file
>>> descriptors."
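The file-descriptor exhaustion Cycle describes is a classic scale gotcha: every client connection to a server like their license server holds one descriptor, and the default per-process soft limit (often 1024 on Linux) vanishes quickly with tens of thousands of clients. A sketch of checking and raising the limit from Python, using the Unix-only `resource` module:

```python
import resource

# Inspect this process's open-file limits (soft limit is enforced;
# hard limit is the ceiling the soft limit can be raised to).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit toward the hard limit. Going above the hard
# limit needs root (e.g. ulimit -n / limits.conf on Linux).
target = hard if hard != resource.RLIM_INFINITY else 4096
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Cycle's fix was the same idea applied on the license server host; testing at full scale is the only reliable way to find which limit gives out first.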
>>> Cycle also hit a speed bump related to volume and byte limits on Amazon's
>>> Elastic Block Store volumes. But the company is already planning bigger and
>>> better things.
>>> "We already have our next use-case identified and will be turning up the
>>> scale a bit more with the next run," the company says. "But it's
>>> not about core counts or terabytes of RAM or petabytes of data. Rather, it's
>>> about how we are helping to transform how science is done."
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> This message has been scanned for viruses and
>>> dangerous content by MailScanner, and is
>>> believed to be clean.
More information about the Beowulf mailing list