[Beowulf] 10 kCore cluster in Amazon cloud, costs ~1 kUSD/hour

Thu Apr 7 01:56:33 PDT 2011

http://www.networkworld.com/news/2011/040611-linux-supercomputer.html

10,000-core Linux supercomputer built in Amazon cloud

Cycle Computing builds cloud-based supercomputing cluster to boost scientific
research.

By Jon Brodkin, Network World

April 06, 2011 03:15 PM ET

High-performance computing expert Jason Stowe recently asked two of his
engineers a simple question: Can you build a 10,000-core cluster in the
cloud?

"It's a really nice round number," says Stowe, the CEO and founder of Cycle
Computing, a vendor that helps customers gain fast and efficient access to
the kind of supercomputing power usually reserved for universities and large
research organizations.

SUPERCOMPUTERS: Microsoft breaks petaflop barrier, loses Top 500 spot to
Linux

Cycle Computing had already built a few clusters on Amazon's Elastic Compute
Cloud that scaled up to several thousand cores. But Stowe wanted to take it
to the next level. Provisioning 10,000 cores on Amazon has probably been done
numerous times, but Stowe says he's not aware of anyone else achieving that
number in an HPC cluster, meaning one that uses a batch scheduling technology
and runs an HPC-optimized application.

"We haven't found references to anything larger," Stowe says. Had it been
tested for speed, the Linux-based cluster Stowe ran on Amazon might have been
big enough to make the Top 500 list of the world's fastest supercomputers.

One of the first steps was finding a customer that would benefit from such a
large cluster. There's no sense in spinning up such a large environment
unless it's devoted to some real work.

The customer that opted for the 10,000-core cloud cluster was biotech company
Genentech in San Francisco, where scientist Jacob Corn needed computing power
to examine how proteins bind to each other, in research that might eventually
lead to medical treatments. Compared to the 10,000-core cluster, "we're a
tenth the size internally," Corn says.

Cycle Computing and Genentech spun up the cluster on March 1 a little after
midnight, based on Amazon's advice regarding the optimal time to request
10,000 cores. While Amazon offers virtual machine instances optimized for
high-performance computing, Cycle and Genentech instead opted for a "standard
vanilla CentOS" Linux cluster to save money, according to Stowe. CentOS is a
version of Linux based on Red Hat's Linux.

The 10,000 cores were composed of 1,250 instances with eight cores each, as
well as 8.75TB of RAM and 2PB of disk space. Scaling up a couple of thousand
cores at a time, it took 45 minutes to provision the whole cluster. There
were no problems. "When we requested the 10,000th core, we got it," Stowe
said.
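
As a quick check on the figures above, the following sketch works out the
per-instance resources. It is only arithmetic on the numbers quoted in the
article; the decimal unit conversion (1 TB = 1,000 GB) and the variable names
are assumptions, not anything Cycle Computing published.

```python
# Figures quoted in the article.
INSTANCES = 1_250
CORES_PER_INSTANCE = 8
TOTAL_RAM_TB = 8.75
TOTAL_DISK_PB = 2

# Assumption: decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB).
total_cores = INSTANCES * CORES_PER_INSTANCE
ram_per_instance_gb = TOTAL_RAM_TB * 1000 / INSTANCES
ram_per_core_gb = ram_per_instance_gb / CORES_PER_INSTANCE
disk_per_instance_tb = TOTAL_DISK_PB * 1000 / INSTANCES

print(f"total cores:       {total_cores}")
print(f"RAM per instance:  {ram_per_instance_gb} GB")
print(f"RAM per core:      {ram_per_core_gb} GB")
print(f"disk per instance: {disk_per_instance_tb} TB")
```

Under those assumptions each instance comes out at about 7 GB of RAM and
1.6 TB of disk.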

The cluster ran for eight hours at a cost of $8,500, including all the fees
to Amazon and Cycle Computing.
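
The "~1 kUSD/hour" in the subject line follows directly from these numbers.
A minimal sketch of the arithmetic, assuming the $8,500 covers exactly eight
hours on all 10,000 cores (the article does not break the bill down further):

```python
# Figures quoted in the article.
TOTAL_COST_USD = 8_500
RUN_HOURS = 8
CORES = 10_000

# Assumption: the full cost is spread evenly over the run and the cores.
cost_per_hour = TOTAL_COST_USD / RUN_HOURS
cost_per_core_hour = cost_per_hour / CORES

print(f"cost per cluster-hour: ${cost_per_hour:,.2f}")
print(f"cost per core-hour:    ${cost_per_core_hour:.5f}")
```

That works out to roughly $1,062 per hour for the whole cluster, or a little
under 11 cents per core-hour, fees included.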

For Genentech, this was cheap and easy compared to the alternative of buying
10,000 cores for its own data center and having them idle away with no work
for most of their lives, Corn says. Using Genentech's existing resources to
perform the simulations would take weeks or months instead of the eight hours
it took on Amazon, he says. Genentech benefited from the high number of cores
because its calculations were "embarrassingly parallel," with no
communication between nodes, so performance stats "scaled linearly with the
number of cores," Corn said.
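
To illustrate what "scaled linearly" means for an embarrassingly parallel
job, here is a rough model in which wall-clock time is simply total work
divided by core count. The function name and the assumption that every core
was busy for the full eight hours are illustrative, and the 1,000-core case
is a hypothetical dedicated cluster "a tenth the size," not a figure from
Genentech.

```python
def ideal_wall_hours(core_hours: float, cores: int) -> float:
    """Wall-clock hours under ideal linear scaling (no communication cost)."""
    return core_hours / cores

# Assumption: all 10,000 cores were fully used for the whole 8-hour run.
CORE_HOURS = 10_000 * 8

for cores in (10_000, 1_000):
    hours = ideal_wall_hours(CORE_HOURS, cores)
    print(f"{cores:>6} cores: {hours:.0f} hours")
```

Under this model the same work would occupy 1,000 fully dedicated cores for
80 hours. The model says nothing about queueing or sharing on an existing
in-house cluster.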

To provision the cluster, Cycle used its own CycleCloud software, the Condor
scheduling system and Chef, an open source configuration management
framework.

Cycle also used some of its own software to detect errors and restart nodes
when necessary, a shared file system, and a few extra nodes on top of the
10,000 to handle some of the legwork. To ensure security, the cluster was
engineered with secure-HTTP and 128/256-bit Advanced Encryption Standard
encryption, according to Cycle.

Cycle Computing boasted that the cluster was roughly equivalent to the 114th
fastest supercomputer in the world on the Top 500 list, which hit about 66
teraflops. In reality, they didn't run the speed benchmark required to submit
a cluster to the Top 500 list, but nearly all of the systems listed below No.
114 in the ranking contain fewer than 10,000 cores.

Genentech is still waiting to see whether the simulations lead to anything
useful in the real world, but Corn says the data "looks fantastic." He says
Genentech is "very open" to building out more Amazon clusters, and Cycle
Computing is looking ahead as well.

"We're already working on scaling up larger," Stowe says. All Cycle needs is
a customer with "a use case to take advantage of it."