[Beowulf] Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K NVIDIA GPUs

Eugen Leitl eugen at leitl.org
Wed Oct 31 06:00:47 PDT 2012


http://www.anandtech.com/print/6421

Inside the Titan Supercomputer: 299K AMD x86 Cores and 18.6K NVIDIA GPUs

by Anand Lal Shimpi on 10/31/2012 1:28:00 AM

Introduction

Earlier this month I drove out to Oak Ridge, Tennessee to pay a visit to the
Oak Ridge National Laboratory (ORNL). I'd never been to a national lab
before, but my ORNL visit was for a very specific purpose: to witness the
final installation of the Titan supercomputer.

ORNL is a US Department of Energy laboratory that's managed by UT-Battelle.
Oak Ridge has a core competency in computational science, making it not only
unique among all DoE labs but also a perfect home for a big supercomputer.

Titan is the latest supercomputer to be deployed at Oak Ridge, although it's
technically a significant upgrade rather than a brand new installation.
Jaguar, the supercomputer being upgraded, featured 18,688 compute nodes -
each with a 12-core AMD Opteron CPU. Titan takes the Jaguar base, maintaining
the same number of compute nodes, but moves to 16-core Opteron CPUs paired
with an NVIDIA Kepler K20 GPU per node. The result is 18,688 CPUs and 18,688
GPUs, all networked together to make a supercomputer that should be capable
of landing at or near the top of the TOP500 list.

We won't know Titan's final position on the list until the SC12 conference in
the middle of November (position is determined by the system's performance in
Linpack), but the recipe for performance is all there. At this point, its
position on the TOP500 is dependent on software tuning and how reliable the
newly deployed system has been.

Rows upon rows of cabinets make up the Titan supercomputer

Over the course of a day in Oak Ridge I got a look at everything from how
Titan was built to the types of applications that are run on the
supercomputer. Having seen a lot of impressive technology demonstrations over
the years, I have to say that my experience at Oak Ridge with Titan is
probably one of the best. Normally I cover compute as it applies to making
things look cooler or faster on consumer devices. I may even dabble in
talking about how better computers enable more efficient datacenters (though
that's more Johan's beat). But it's very rare that I get to look at the
application of computing to better understanding life, the world and universe
around us. It's meaningful, impactful compute.

The Hardware

In the 15+ years I've been writing about technology, I've never actually
covered a supercomputer. I'd never actually seen one until my ORNL visit. I
have to say, the first time you see a supercomputer it's a bit anticlimactic.
If you've ever toured a modern datacenter, it doesn't look all that
different.

A portion of Titan

More Titan, the metal pipes carry coolant

Titan in particular is built from 200 custom 19-inch cabinets. These cabinets
may look like standard 19-inch x 42RU datacenter racks, but what's inside is
quite custom. All of the cabinets that make up Titan require a room that's
about the size of a basketball court.

The hardware comes from Cray. The Titan installation uses Cray's new XK7
cabinets, but it's up to the customer to connect together however many they
want.

ORNL is actually no different than any other compute consumer: its
supercomputers are upgraded on a regular basis to keep them from becoming
obsolete. The pressure to stay up to date is even greater for supercomputers;
after a certain point it actually costs more to run an older machine than it
would to upgrade it. Like modern datacenters, supercomputers are entirely
power limited. Titan in particular will consume around 9 megawatts of power
when fully loaded.

The upgrade cycle for a modern supercomputer is around 4 years. Titan's
predecessor, Jaguar, was first installed back in 2005 but regularly upgraded
over the years. Whenever these supercomputers are upgraded, old hardware is
traded back in to Cray and a credit is issued. Although Titan reuses much of
the same cabinetry and interconnects as Jaguar, the name change felt
appropriate given the significant departure in architecture. The Titan
supercomputer makes use of both CPUs and GPUs for compute. Whereas the latest
version of Jaguar featured 18,688 12-core AMD Opteron processors, Titan keeps
the total number of compute nodes the same (18,688) but moves to 16-core AMD
Opteron 6274 CPUs. What makes the Titan move so significant however is that
each 16-core Opteron is paired with an NVIDIA K20 (Kepler GK110) GPU.

A Titan compute board: 4 AMD Opteron (16-core CPUs) + 4 NVIDIA Tesla K20 GPUs

The transistor count alone is staggering. Each 16-core Opteron is made up of
two 8-core die on a single chip, totaling 2.4B transistors built using
GlobalFoundries' 32nm process. Just in CPU transistors alone, that works out
to be 44.85 trillion transistors for Titan. Now let's talk GPUs.

NVIDIA's K20 is the server/HPC version of GK110, a part that never had a need
to go to battle in the consumer space. The K20 features 2688 CUDA cores,
totaling 7.1 billion transistors per GPU built using TSMC's 28nm process.
With a 1:1 ratio of CPUs and GPUs, Titan adds another 132.68 trillion
transistors to the bucket bringing the total transistor count up to over 177
trillion transistors - for a single supercomputer.

I often use Moore's Law to give me a rough idea of when desktop compute
performance will make its way into notebooks and then tablets and
smartphones. With Titan, I can't even begin to connect the dots. There's just
a ton of computing horsepower available in this installation.

Transistor counts are impressive enough, but when you do the math on the
number of cores it's even more insane. Titan has a total of 299,008 AMD
Opteron cores. ORNL doesn't break down the number of GPU cores but if I did
the math correctly we're talking about over 50 million FP32 CUDA cores. The
total computational power of Titan is expected to be north of 20 petaflops.

Each compute node (CPU + GPU) features 32GB of DDR3 memory for the CPU and a
dedicated 6GB of GDDR5 (ECC enabled) for the K20 GPU. Do the math and that
works out to be 710TB of memory.
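
If you want to check the scale math, a quick back-of-the-envelope sketch
(using only the figures quoted above: 18,688 nodes, 16 Opteron cores and
2,688 CUDA cores per node, 2.4B/7.1B transistors per CPU/GPU, and 32GB + 6GB
of memory per node) reproduces the totals:

    // Back-of-the-envelope totals for Titan, from the per-node figures above.
    #include <cstdio>

    int main() {
        const long long nodes = 18688;
        printf("Opteron cores:   %lld\n", nodes * 16);                    // 299,008
        printf("CUDA cores:      %lld\n", nodes * 2688);                  // 50,233,344
        printf("CPU transistors: %.2f trillion\n", nodes * 2.4e9 / 1e12); // ~44.85
        printf("GPU transistors: %.2f trillion\n", nodes * 7.1e9 / 1e12); // ~132.68
        printf("Memory:          %lld TB\n", nodes * (32 + 6) / 1000);    // ~710
        return 0;
    }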

Titan's storage array

System storage is equally impressive: there's a total of 10 petabytes of
storage in Titan. The underlying storage hardware isn't all that interesting
- ORNL uses 10,000 standard 1TB 7200 RPM 2.5" hard drives. The IO subsystem
is capable of pushing around 240GB/s of data. ORNL is considering including
some elements of solid state storage in future upgrades to Titan, but for its
present needs there is no more cost-effective solution for IO than a bunch of
hard drives. The next round of upgrades will take Titan to around 20 - 30PB
of storage, at peak transfer speeds of 1TB/s.
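
The storage math is just as easy to check. A small sketch (assuming the
~240GB/s is spread evenly across the 10,000 drives, which the article doesn't
state) shows how modest the per-drive demands actually are:

    // Storage figures from above: 10,000 x 1TB drives, ~240GB/s aggregate IO.
    #include <cstdio>

    int main() {
        const double drives = 10000.0;
        printf("Capacity: %.0f PB\n", drives * 1.0 / 1000.0);                     // 10 PB
        printf("Per-drive share: ~%.0f MB/s of 240GB/s\n", 240e9 / drives / 1e6); // ~24 MB/s
        return 0;
    }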

Most workloads on Titan will be run remotely, so network connectivity is just
as important as compute. There are dozens of 10GbE links inbound to the
machine. Titan is also linked to the DoE's Energy Sciences Network (ESNET)
100Gbps backbone.

Physical Architecture

The physical architecture of Titan is just as interesting as the high level
core and transistor counts. I mentioned earlier that Titan is built from 200
cabinets. Inside each cabinet are Cray XK7 boards, each of which has four
AMD G34 sockets and four PCIe slots. These aren't standard desktop PCIe
slots, but rather much smaller SXM slots. The K20s NVIDIA sells to Cray come
on little SXM cards without frivolous features like display outputs. The SXM
form factor is similar to the MXM form factor used in some notebooks.

There's no way around it. ORNL techs had to install 18,688 CPUs and GPUs over
the past few weeks in order to get Titan up and running. Around 10 of the
formerly-Jaguar cabinets had these new XK boards but were using Fermi GPUs. I
got to witness one of the older boards get upgraded to K20. The process isn't
all that different from what you'd see in a desktop: remove screws, remove
old card, install new card, replace screws. The form factor and scale of
installation are obviously very different, but the basic premise remains.

As with all computer components, there's no guarantee that every single chip
and card is going to work. When you're dealing with over 18,000 computers as
a part of a single entity, there are bound to be failures. All of the compute
nodes go through testing, and faulty hardware is swapped out, before the upgrade
is technically complete.

OS & Software

Titan runs the Cray Linux Environment, which is based on SUSE 11. The OS has
to be hardened and modified for operation on such a large scale. In order to
prevent serialization caused by interrupts, Cray takes some of the cores and
uses them to run all of the OS tasks so that applications running elsewhere
aren't interrupted by the OS.
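
Cray's mechanism lives inside CLE itself, but the general idea is easy to
illustrate on any Linux box: confine the application to a subset of cores and
leave the rest to the OS. A rough, hypothetical sketch using generic Linux
affinity calls (nothing Cray-specific):

    // Illustration only, not Cray's implementation: pin this process to
    // cores 1..N-1 and leave core 0 free for OS and interrupt handling.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        cpu_set_t set;
        CPU_ZERO(&set);
        for (long c = 1; c < ncores; ++c) CPU_SET(c, &set);  // skip core 0
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("Application confined to cores 1-%ld; core 0 left to the OS\n",
               ncores - 1);
        return 0;
    }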

Jobs are batch scheduled on Titan using Moab and Torque.

AMD CPUs and NVIDIA GPUs

If you're curious about why Titan uses Opterons, the explanation is actually
pretty simple. Titan is a large installation of Cray XK7 cabinets, so CPU
support is actually defined by Cray. Back in 2005 when Jaguar made its debut,
AMD's Opterons were superior to the Intel Xeon alternative. The evolution of
Cray's XT/XK lines simply stemmed from that point, with Opteron being the
supported CPU of choice.

The GPU decision was just as simple. NVIDIA has been focusing on non-gaming
compute applications for its GPUs for years now. The decision to partner with
NVIDIA on the Titan project was made around 3 years ago. At the time, AMD
didn't have a competitive GPU compute roadmap. If you remember back to our
first Fermi architecture article from back in 2009, I wrote the following:

"By adding support for ECC, enabling C++ and easier Visual Studio
integration, NVIDIA believes that Fermi will open its Tesla business up to a
group of clients that would previously not so much as speak to NVIDIA. ECC is
the killer feature there."

At the time I didn't know it, but ORNL was one of those clients. With almost
19,000 GPUs, errors are bound to happen. Having ECC support was a must-have
for GPU-enabled Jaguar and Titan compute nodes. The ORNL folks tell me that
CUDA was also a big selling point for NVIDIA.

Finally, some of the new features specific to K20/GK110 (e.g. Hyper-Q and
GPUDirect) made Kepler the right point to go all-in with GPU compute.
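
As a rough illustration (a generic CUDA runtime sketch, not ORNL's actual
tooling), an application can confirm that ECC is enabled and that it's
running on a GK110-class part before leaning on those features:

    // Minimal CUDA runtime sketch: report ECC state and compute capability
    // for each visible GPU. GK110-based K20 parts report SM 3.5.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("GPU %d: %s, SM %d.%d, ECC %s\n", i, prop.name,
                   prop.major, prop.minor, prop.ECCEnabled ? "on" : "off");
        }
        return 0;
    }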

Power Delivery & Cooling

Titan's cabinets take 480V input rather than the more common 208V; delivering
the same power at a higher voltage means less current, which keeps the cabling
thinner. Total power consumption for Titan should be around 9 megawatts under
full load and around 7 megawatts during typical use. The building that Titan
is housed in has over 25 megawatts of power delivered to it.
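
The voltage choice is easy to sanity check. Assuming roughly 45kW per cabinet
(9MW spread over 200 cabinets), a three-phase feed and a 0.95 power factor -
all assumptions on my part, not ORNL's figures - the current per cabinet
drops by more than half at 480V:

    // Rough per-cabinet current estimate at 208V vs. 480V (three-phase).
    // Assumed: ~45kW per cabinet (9MW / 200 cabinets), power factor 0.95.
    #include <cstdio>
    #include <cmath>

    int main() {
        const double watts = 9.0e6 / 200.0;
        const double pf = 0.95;
        const double volts[] = {208.0, 480.0};
        for (double v : volts) {
            double amps = watts / (std::sqrt(3.0) * v * pf);
            printf("%3.0f V feed: ~%.0f A per cabinet\n", v, amps);  // ~131 A vs. ~57 A
        }
        return 0;
    }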

In the event of a power failure there's no cost-effective way to keep the
compute portion of Titan up and running (remember, 9 megawatts), but you
still want IO and networking operational. Flywheel-based UPSes kick in in the
event of a power interruption, powering Titan's network and IO long enough to
give the diesel generators time to come online.

The cabinets themselves are air cooled; however, the air is chilled by liquid
cooling before it enters the cabinet. ORNL has over 6,600 tons of cooling
capacity just to keep the recirculated air going into these cabinets cool.

Applying for Time on Titan

The point of building supercomputers like Titan is to give scientists and
researchers access to hardware they wouldn't otherwise have. In order to
actually book time on Titan, you have to apply for it through a proposal
process.

There's an annual call for proposals, based on which time on Titan will be
allocated. The machine is available to anyone who wants to use it, although
the problem you're trying to solve needs to be approved by Oak Ridge.

If you want to get time on Titan you write a proposal through a program
called INCITE. In the proposal you ask to use either Titan or the
supercomputer at Argonne National Lab (or both). You also outline the problem
you're trying to solve and why it's important. Researchers have to describe
their process and algorithms as well as their readiness to use such a monster
machine. Almost any program will run on a simple computer, but to justify a
supercomputer with hundreds of thousands of cores the requirements are far
stricter. As a part of the proposal process you'll have to show that you've
already run your code on machines that are smaller but similar in nature
(e.g. 1/3 the scale of Titan).

Each proposal is then reviewed twice - once for computational readiness (can
it actually run on Titan?) and once for scientific merit via peer review. The
review boards
rank all of the proposals received, and based on those rankings time is
awarded on the supercomputers.

The number of requests outweighs the available compute time by around 3x. The
proposal process is thus highly competitive. The call for proposals goes out
once a year in April, with proposals due in by the end of June. Time on the
supercomputers is awarded at the end of October with the accounts going live
on the first of January. Proposals can be for 1 - 3 years, although multiyear
proposals need to be renewed each year (demonstrating that the time has been
useful, sharing results, and so on).

Programs that run on Titan are typically required to run on at least 1/5 of
the machine. There are smaller supercomputers available that can be used for
less demanding tasks. Given how competitive the proposal process is, ORNL
wants to ensure that those using Titan actually have a need for it.

Once time is booked, jobs are scheduled in batch and researchers get their
results whenever their turn comes up.

The end user costs for using Titan depend on what you're going to do with the
data. If you're a research scientist and will publish your findings, the time
is awarded free of charge. All ORNL asks is that you provide quarterly
updates and that you credit the lab and the Department of Energy for
providing the resource.

If, on the other hand, you're a private company wanting to do proprietary
work, you have to pay for your time on the machine. On Jaguar the rate was
$0.05 per core-hour; with Titan, ORNL will be moving to a node-hour billing
rate, since the addition of GPUs throws a wrench into the whole core-hour
metric.
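
To put the outgoing scheme in perspective, here's what Jaguar's published
$0.05 per core-hour works out to at Titan's CPU core counts (Titan's actual
node-hour rate isn't given, so none is assumed here):

    // Illustration of core-hour billing using the $0.05/core-hour figure.
    #include <cstdio>

    int main() {
        const double rate = 0.05;              // dollars per core-hour (Jaguar)
        const int cores_per_node = 16;
        const long long total_cpu_cores = 299008;
        printf("One CPU node for an hour:    $%.2f\n", cores_per_node * rate);
        printf("The whole machine, one hour: $%.2f\n", total_cpu_cores * rate);
        return 0;
    }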

Supercomputing Applications

In the gaming space we use additional compute to model more accurate physics
and graphics. In supercomputing, the situation isn't very different. Many of
ORNL's supercomputing projects model the physically untestable (either for
scale or safety reasons). Instead of getting greater accuracy for the impact
of an explosion on an enemy, the types of workloads run at ORNL use advances
in compute to better model the atmosphere, a nuclear reactor or a decaying
star.

I never really had a good idea of specifically what sort of research was done
on supercomputers. Luckily I had the opportunity to sit down with Dr. Bronson
Messer, an astrophysicist looking forward to spending some time on Titan. Dr.
Messer's work focuses specifically on stellar decay, or what happens
immediately following a supernova. His work is particularly important as many
of the elements we take for granted weren't present in the early universe.
Understanding supernova explosions gives us unique insight into where we came
from. 

For Dr. Messer's studies there's a lot of CUDA Fortran in use, although the
total amount of code that actually runs on the GPUs is pretty small. The
codebase may be 20K - 1M lines, but within it you're only looking at tens of
lines of CUDA code for GPU acceleration. Porting those small segments to the
GPU yields huge speedups (the GPU code is small because it sits inside a loop
whose iterations get pushed out to the GPU in parallel rather than executing
serially). Dr. Messer tells me that the actual process of porting his code to
CUDA isn't all that difficult - after all, there aren't that many lines to
worry about - but rearranging all of the data to make the code more
GPU-friendly is what takes the time. It's also easy to screw up.
Interestingly enough, in making his code more GPU-friendly, a lot of the
changes improved CPU performance as well thanks to improved cache locality.
Dr. Messer saw a 2x improvement in his CPU code simply by making his data
structures more GPU-friendly.
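
To make that concrete, here's a minimal sketch of the pattern being described
- not Dr. Messer's actual code, which is CUDA Fortran, but a CUDA C++ analog
with stand-in physics. The handful of lines inside the loop become a kernel,
every zone is processed in parallel, and the data lives in separate arrays
(structure-of-arrays) rather than an array of zone structs, which is the kind
of layout change that also helped his CPU code:

    // Hypothetical example: a per-zone update moved from a serial CPU loop
    // to a CUDA kernel. The structure-of-arrays layout (separate density[]
    // and energy[] arrays) keeps GPU memory accesses coalesced and, as a
    // side effect, improves CPU cache behavior too.
    #include <cuda_runtime.h>

    __global__ void update_zones(const double* density, double* energy,
                                 double dt, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Stand-in for the few lines of real physics per zone.
            energy[i] += dt * 0.5 * density[i];
        }
    }

    void step(const double* d_density, double* d_energy, double dt, int n) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        update_zones<<<blocks, threads>>>(d_density, d_energy, dt, n);
        cudaDeviceSynchronize();
    }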

Many of the applications that will run on Titan are similar in nature to Dr.
Messer's work. At ORNL what the researchers really care about are covers of
Nature and Science. There are researchers focused on how different types of
fuels combust at a molecular level. I met another group of folks focused on
extracting more efficiency out of nuclear reactors. These are all extremely
complex problems that can't easily be experimented on (e.g. hey let's just
try not replacing uranium rods for a little while longer and see what happens
to our nuclear reactor).

Scientists at ORNL and around the world working on Titan are fundamentally
looking to model reality, as accurately as possible, so that they can
experiment on it. If you think about simulating every quark, atom, molecule
in whatever system you're trying to model (e.g. fuel in a combustion engine),
there's a ton of data that you have to keep track of. You have to look at how
each one of these elementary constituents impacts one another when exposed to
whatever is happening in the system at the time.

It's these large scale problems that are fundamentally driving supercomputer
performance forward, and there's simply no letting up. Even at two orders of
magnitude better performance than what Titan can deliver with ~300K CPU cores
and 50M+ GPU cores, there's not enough compute power to simulate most of the
applications that run on Titan in their entirety. Researchers are still
limited by the systems they run on and thus have to limit the scope of their
simulations. Maybe they only look at one slice of a star, or one slice of the
Earth's atmosphere and work on simulating that fraction of the whole. Go too
narrow and you'll lose important understanding of the system as a whole. Go
too broad and you'll lose fidelity that helps give you accurate results.

Given infinite time you'll be able to run anything regardless of hardware,
but for researchers (who happen to be human) time isn't infinite. Having
faster hardware can help shorten run times to more manageable amounts. For
example, reducing a 6 month runtime (which isn't unheard of for many of these
projects) to something that can execute to completion in a single month can
have a dramatic impact on productivity. Dr. Messer put it best when he told
me that keeping human beings engaged for a month is a much different
proposition than keeping them engaged for half a year.

There are other types of applications that will run on Titan without needing
enormous runtimes; instead they need lots of repetitions. Hurricane
simulation is one of those problems. ORNL was in between generations of
supercomputers at one point and donated some compute time to the National
Tornado Center in Oklahoma during that transition. During the time they had
access to the ORNL supercomputer, their forecasts improved tremendously.

ORNL also has a neat visualization room where you can plot, in 3D, the output
from work you've run on Titan. The problem with running workloads on a
supercomputer is that the output can be terabytes of data - which tends to be
difficult to analyze in a spreadsheet. Through 3D visualization you're able
to get a better idea of general trends. It's similar to the motivation behind
us making lots of bar charts in our reviews vs. just publishing a giant
spreadsheet, but on a much, much, much larger scale.

The image above is actually showing some data run on Titan simulating a
pressurized water nuclear reactor. The video below explains a bit more about
the data and what it means.

Final Words

At a high level, the Titan supercomputer delivers an order of magnitude
increase in performance over the outgoing Jaguar system at roughly the same
energy price. Using over 200,000 AMD Opteron cores, Jaguar could deliver
roughly 2.3 petaflops of performance at around 7MW of power consumption.
Titan approaches 300,000 AMD Opteron cores but adds nearly 19,000 NVIDIA K20
GPUs, delivering over 20 petaflops of performance at "only" 9MW. The question
remains: how can it be done again?
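
Using the article's round numbers, the efficiency jump is easy to quantify:

    // Performance-per-watt comparison from the figures above.
    #include <cstdio>

    int main() {
        const double jaguar = 2.3e6 / 7.0e6;   // 2.3 PF @ 7 MW -> ~0.33 GFLOPS/W
        const double titan  = 20.0e6 / 9.0e6;  // 20 PF @ 9 MW  -> ~2.2 GFLOPS/W
        printf("Jaguar: ~%.2f GFLOPS/W\n", jaguar);
        printf("Titan:  ~%.2f GFLOPS/W\n", titan);
        printf("Improvement: ~%.1fx\n", titan / jaguar);   // ~6.8x
        return 0;
    }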

In 4 years, Titan will be obsolete and another set of upgrades will have to
happen to increase performance in the same power envelope. By 2016 ORNL hopes
to be able to build a supercomputer capable of 10x the performance of Titan
but within a similar power envelope. The trick is that the one-time efficiency
gain from first adopting GPUs for compute can't be had a second time.
ORNL will have to rely on process node shrinks and improvements in
architectural efficiency, on both CPU and GPU fronts, to deliver the next 10x
performance increase. Over the next few years we'll see more integration
between the CPU and GPU with an on-die communication fabric. The march
towards integration will help improve usable performance in supercomputers
just as it will in client machines.

Increasing performance by 10x in 4 years doesn't seem so far-fetched, but
breaking the 1 Exaflop barrier by 2020 - 2022 will require something much
more exotic. One possibility is to move from big beefy x86 CPU cores to
billions of simpler cores. Given ORNL's close relationship with NVIDIA, it's
likely that the smartphone core approach is being advocated internally.
Everyone involved has a different definition of what counts as a simple core (by 2020
Haswell will look pretty darn simple), but it's clear that whatever comes
after Titan's replacement won't just look like a bigger, faster Titan. There
will have to be more fundamental shifts in order to increase performance by 2
orders of magnitude over the next decade. Luckily there are many research
projects that have yet to come to fruition. Die stacking and silicon
photonics both come to mind, even though we'll need more than just that.

It's incredible to think that the most recent increase in supercomputer
performance has its roots in PC gaming. These multi-billion transistor GPUs
first came about to improve performance and visual fidelity in 3D games. The
first consumer GPUs were built to better simulate reality so we could have
more realistic games. It's not too surprising then to think that in the
research space the same demands apply, although in pursuit of a different
goal: to create realistic models of the world and universe around us. It's
honestly one of the best uses of compute that I've ever seen.



More information about the Beowulf mailing list