[Beowulf] Google Throws Open Doors to Its Top-Secret Data Center
j.sassmannshausen at ucl.ac.uk
Wed Oct 17 06:53:35 PDT 2012
Bummer, Eugen was quicker than me. Here are some pictures though:
On Wednesday 17 October 2012 14:40:12 Eugen Leitl wrote:
> (PUE 1.2 doesn't sound so otherworldly to me)
> Google Throws Open Doors to Its Top-Secret Data Center
> By Steven Levy 10.17.12 7:30 AM
> Follow @StevenLevy
> Photo: Google/Connie Zhou
> If you’re looking for the beating heart of the digital age — a physical
> location where the scope, grandeur, and geekiness of the kingdom of bits
> become manifest—you could do a lot worse than Lenoir, North Carolina. This
> rural city of 18,000 was once rife with furniture factories. Now it’s the
> home of a Google data center.
> Engineering prowess famously catapulted the 14-year-old search giant into
> its place as one of the world’s most successful, influential, and
> frighteningly powerful companies. Its constantly refined search algorithm
> changed the way we all access and even think about information. Its
> equally complex ad-auction platform is a perpetual money-minting machine.
> But other, less well-known engineering and strategic breakthroughs are
> arguably just as crucial to Google’s success: its ability to build,
> organize, and operate a huge network of servers and fiber-optic cables
> with an efficiency and speed that rocks physics on its heels. Google has
> spread its infrastructure across a global archipelago of massive
> buildings—a dozen or so information palaces in locales as diverse as
> Council Bluffs, Iowa; St. Ghislain, Belgium; and soon Hong Kong and
> Singapore—where an unspecified but huge number of machines process and
> deliver the continuing chronicle of human experience.
> This is what makes Google Google: its physical network, its thousands of
> fiber miles, and those many thousands of servers that, in aggregate, add up
> to the mother of all clouds. This multibillion-dollar infrastructure allows
> the company to index 20 billion web pages a day. To handle more than 3
> billion daily search queries. To conduct millions of ad auctions in real
> time. To offer free email storage to 425 million Gmail users. To zip
> millions of YouTube videos to users every day. To deliver search results
> before the user has finished typing the query. In the near future, when
> Google releases the wearable computing platform called Glass, this
> infrastructure will power its visual search results.
> The problem for would-be bards attempting to sing of these data centers has
> been that, because Google sees its network as the ultimate competitive
> advantage, only critical employees have been permitted even a peek inside,
> a prohibition that has most certainly included bards. Until now.
> A server room in Council Bluffs, Iowa. Previous spread: A central cooling
> plant in Google’s Douglas County, Georgia, data center. Photo:
> Google/Connie Zhou
> Here I am, in a huge white building in Lenoir, standing near a reinforced
> door with a party of Googlers, ready to become that rarest of species: an
> outsider who has been inside one of the company’s data centers and seen the
> legendary server floor, referred to simply as “the floor.” My visit is the
> latest evidence that Google is relaxing its black-box policy. My hosts
> include Joe Kava, who’s in charge of building and maintaining Google’s data
> centers, and his colleague Vitaly Gudanets, who populates the facilities
> with computers and makes sure they run smoothly.
> A sign outside the floor dictates that no one can enter without hearing
> protection, either salmon-colored earplugs that dispensers spit out like
> trail mix or panda-bear earmuffs like the ones worn by airline ground
> crews. (The noise is a high-pitched thrum from fans that control airflow.)
> We grab the plugs. Kava holds his hand up to a security scanner and opens
> the heavy door. Then we slip into a thunderdome of data …
> Urs Hölzle had never stepped into a data center before he was hired by
> Sergey Brin and Larry Page. A hirsute, soft-spoken Swiss, Hölzle was on
> leave as a computer science professor at UC Santa Barbara in February 1999
> when his new employers took him to the Exodus server facility in Santa
> Clara. Exodus was a colocation site, or colo, where multiple companies
> rent floor space. Google’s “cage” sat next to servers from eBay and other
> blue-chip Internet companies. But the search company’s array was the most
> densely packed and chaotic. Brin and Page were looking to upgrade the
> system, which often took a full 3.5 seconds to deliver search results and
> tended to crash on Mondays. They brought Hölzle on to help drive the
> It wouldn’t be easy. Exodus was “a huge mess,” Hölzle later recalled. And
> the cramped hodgepodge would soon be strained even more. Google was not
> only processing millions of queries every week but also stepping up the
> frequency with which it indexed the web, gathering every bit of online
> information and putting it into a searchable format. AdWords—the service
> that invited advertisers to bid for placement alongside search results
> relevant to their wares—involved computation-heavy processes that were
> just as demanding as search. Page had also become obsessed with speed,
> with delivering search results so quickly that it gave the illusion of
> mind reading, a trick that required even more servers and connections. And
> the faster Google delivered results, the more popular it became, creating
> an even greater burden. Meanwhile, the company was adding other
> applications, including a mail service that would require instant access
> to many petabytes of storage. Worse yet, the tech downturn that left many
> data centers underpopulated in the late ’90s was ending, and Google’s
> future leasing deals would become much more costly.
> For Google to succeed, it would have to build and operate its own data
> centers—and figure out how to do it more cheaply and efficiently than
> anyone had before. The mission was codenamed Willpower. Its first
> built-from-scratch data center was in the Dalles, a city in Oregon near
> the Columbia River.
> Hölzle and his team designed the $600 million facility in light of a radical
> insight: Server rooms did not have to be kept so cold. The machines throw
> off prodigious amounts of heat. Traditionally, data centers cool them off
> with giant computer room air conditioners, or CRACs, typically jammed
> under raised floors and cranked up to arctic levels. That requires massive
> amounts of energy; data centers consume up to 1.5 percent of all the
> electricity in the world.
> Google realized that the so-called cold aisle in front of the machines
> could be kept at a relatively balmy 80 degrees or so—workers could wear
> shorts and T-shirts instead of the standard sweaters. And the “hot aisle,”
> a tightly enclosed space where the heat pours from the rear of the
> servers, could be allowed to hit around 120 degrees. That heat could be
> absorbed by coils filled with water, which would then be pumped out of the
> building and cooled before being circulated back inside. Add that to the
> long list of Google’s accomplishments: The company broke its CRAC habit.
> Google also figured out money-saving ways to cool that water. Many data
> centers relied on energy-gobbling chillers, but Google’s big data centers
> usually employ giant towers where the hot water trickles down through the
> equivalent of vast radiators, some of it evaporating and the remainder
> attaining room temperature or lower by the time it reaches the bottom. In
> its Belgium facility, Google uses recycled industrial canal water for the
> cooling; in Finland it uses seawater.
> The company’s analysis of electrical flow unearthed another source of
> waste: the bulky uninterrupted-power-supply systems that protected servers
> from power disruptions in most data centers. Not only did they leak
> electricity, they also required their own cooling systems. But because
> Google designed the racks on which it placed its machines, it could make
> space for backup batteries next to each server, doing away with the big
> UPS units altogether. According to Joe Kava, that scheme reduced
> electricity loss by about 15 percent.
> All of these innovations helped Google achieve unprecedented energy
> savings. The standard measurement of data center efficiency is called
> power usage effectiveness, or PUE. A perfect number is 1.0, meaning all
> the power drawn by the facility is put to use. Experts considered
> 2.0—indicating half the power is wasted—to be a reasonable number for a
> data center. Google was getting an unprecedented 1.2.
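
For anyone who wants to check the arithmetic behind those PUE figures, here is a
minimal sketch. It assumes the usual definition (total facility power divided by
the power that reaches the IT equipment); the function names and the kW inputs
are made up for illustration and are not from the article.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def overhead_fraction(pue_value: float) -> float:
    """Share of total facility power that never reaches the IT equipment."""
    if pue_value < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return 1.0 - 1.0 / pue_value


# Hypothetical facility: 1200 kW drawn in total, 1000 kW of it by servers.
example_pue = pue(1200.0, 1000.0)

for label, value in (("conventional", 2.0), ("Google", 1.2)):
    print(f"{label}: PUE {value} -> {overhead_fraction(value):.1%} overhead")
```

On those assumptions a PUE of 2.0 means half the power is overhead, as the
article says, while 1.2 means roughly a sixth of it is.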
> For years Google didn’t share what it was up to. “Our core advantage really
> was a massive computer network, more massive than probably anyone else’s in
> the world,” says Jim Reese, who helped set up the company’s servers. “We
> realized that it might not be in our best interest to let our competitors
> But stealth had its drawbacks. Google was on record as being an exemplar of
> green practices. In 2007 the company committed formally to carbon
> neutrality, meaning that every molecule of carbon produced by its
> activities—from operating its cooling units to running its diesel
> generators—had to be canceled by offsets. Maintaining secrecy about energy
> savings undercut that ideal: If competitors knew how much energy Google
> was saving, they’d try to match those results, and that could make a real
> environmental impact. Also, the stonewalling, particularly regarding the
> Dalles facility, was becoming almost comical. Google’s ownership had
> become a matter of public record, but the company still refused to
> acknowledge it.
> In 2009, at an event dubbed the Efficient Data Center Summit, Google
> announced its latest PUE results and hinted at some of its techniques. It
> marked a turning point for the industry, and now companies like Facebook
> and Yahoo report similar PUEs.
> Make no mistake, though: The green that motivates Google involves
> presidential portraiture. “Of course we love to save energy,” Hölzle says.
> “But take something like Gmail. We would lose a fair amount of money on
> Gmail if we did our data centers and servers the conventional way. Because
> of our efficiency, we can make the cost small enough that we can give it
> away for free.”
> Google’s breakthroughs extend well beyond energy. Indeed, while Google is
> still thought of as an Internet company, it has also grown into one of the
> world’s largest hardware manufacturers, thanks to the fact that it builds
> much of its own equipment. In 1999, Hölzle bought parts for 2,000
> stripped-down “breadboards” from “three guys who had an electronics shop.”
> By going homebrew and eliminating unneeded components, Google built a
> batch of servers for about $1,500 apiece, instead of the then-standard
> $5,000. Hölzle, Page, and a third engineer designed the rigs themselves.
> “It wasn’t really ‘designed,’” Hölzle says, gesturing with air quotes.
> More than a dozen generations of Google servers later, the company now
> takes a much more sophisticated approach. Google knows exactly what it
> needs inside its rigorously controlled data centers—speed, power, and good
> connections—and saves money by not buying unnecessary extras. (No graphics
> cards, for instance, since these machines never power a screen. And no
> enclosures, because the motherboards go straight into the racks.) The same
> principle applies to its networking equipment, some of which Google began
> building a few years ago.
> Outside the Council Bluffs data center, radiator-like cooling towers chill
> water from the server floor down to room temperature. Photo: Google/Connie Zhou
> So far, though, there’s one area where Google hasn’t ventured: designing
> its own chips. But the company’s VP of platforms, Bart Sano, implies that
> even that could change. “I’d never say never,” he says. “In fact, I get
> that question every year. From Larry.”
> Even if you reimagine the data center, the advantage won’t mean much if you
> can’t get all those bits out to customers speedily and reliably. And so
> Google has launched an attempt to wrap the world in fiber. In the early
> 2000s, taking advantage of the failure of some telecom operations, it began
> buying up abandoned fiber-optic networks, paying pennies on the dollar.
> Now, through acquisition, swaps, and actually laying down thousands of
> strands, the company has built a mighty empire of glass.
> But when you’ve got a property like YouTube, you’ve got to do even more. It
> would be slow and burdensome to have millions of people grabbing videos
> from Google’s few data centers. So Google installs its own server racks in
> various outposts of its network—mini data centers, sometimes connected
> directly to ISPs like Comcast or AT&T—and stuffs them with popular videos.
> That means that if you stream, say, a Carly Rae Jepsen video, you probably
> aren’t getting it from Lenoir or the Dalles but from some colo just a few
> miles from where you are.
> Over the years, Google has also built a software system that allows it to
> manage its countless servers as if they were one giant entity. Its in-house
> developers can act like puppet masters, dispatching thousands of computers
> to perform tasks as easily as running a single machine. In 2002 its
> scientists created Google File System, which smoothly distributes files
> across many machines. MapReduce, a Google system for writing cloud-based
> applications, was so successful that an open source version called Hadoop
> has become an industry standard. Google also created software to tackle a
> knotty issue facing all huge data operations: When tasks come pouring into
> the center, how do you determine instantly and most efficiently which
> machines can best afford to take on the work? Google has solved this
> “load-balancing” issue with an automated system called Borg.
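
For readers who have not met the MapReduce model mentioned above, a toy
single-process sketch of the idea follows. It is only an illustration of the
map, shuffle, and reduce steps on a word count; it is not Google's or Hadoop's
API, and all function names here are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three steps in order, all inside one process."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the floor", "the hot aisle", "the cold aisle"])
print(counts)
```

In a real cluster the map and reduce calls would run on many machines and the
shuffle would move data across the network; here everything stays in memory so
the structure is easy to see.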
> These innovations allow Google to fulfill an idea embodied in a 2009 paper
> written by Hölzle and one of his top lieutenants, computer scientist Luiz
> Barroso: “The computing platform of interest no longer resembles a pizza
> box or a refrigerator but a warehouse full of computers … We must treat
> the data center itself as one massive warehouse-scale computer.”
> This is tremendously empowering for the people who write Google code. Just
> as your computer is a single device that runs different programs
> simultaneously—and you don’t have to worry about which part is running
> which application—Google engineers can treat seas of servers like a single
> unit. They just write their production code, and the system distributes it
> across a server floor they will likely never be authorized to visit. “If
> you’re an average engineer here, you can be completely oblivious,” Hölzle
> says. “You can order x petabytes of storage or whatever, and you have no
> idea what actually happens.”
> But of course, none of this infrastructure is any good if it isn’t
> reliable. Google has innovated its own answer for that problem as well—one
> that involves a surprising ingredient for a company built on algorithms
> and automation: people.
> At 3 am on a chilly winter morning, a small cadre of engineers begin to
> attack Google. First they take down the internal corporate network that
> serves the company’s Mountain View, California, campus. Later the team
> attempts to disrupt various Google data centers by causing leaks in the
> water pipes and staging protests outside the gates—in hopes of distracting
> attention from intruders who try to steal data-packed disks from the
> servers. They mess with various services, including the company’s ad
> network. They take a data center in the Netherlands offline. Then comes
> the coup de grâce—cutting most of Google’s fiber connection to Asia.
> Turns out this is an inside job. The attackers, working from a conference
> room on the fringes of the campus, are actually Googlers, part of the
> company’s Site Reliability Engineering team, the people with ultimate
> responsibility for keeping Google and its services running. SREs are not
> merely troubleshooters but engineers who are also in charge of getting
> production code onto the “bare metal” of the servers; many are embedded in
> product groups for services like Gmail or search. Upon becoming an SRE,
> members of this geek SEAL team are presented with leather jackets bearing a
> military-style insignia patch. Every year, the SREs run this simulated
> war—called DiRT (disaster recovery testing)—on Google’s infrastructure. The
> attack may be fake, but it’s almost indistinguishable from reality:
> Incident managers must go through response procedures as if they were
> really happening. In some cases, actual functioning services are messed
> with. If the teams in charge can’t figure out fixes and patches to keep
> things running, the attacks must be aborted so real users won’t be
> affected. In classic Google fashion, the DiRT team always adds a goofy
> element to its dead-serious test—a loony narrative written by a member of
> the attack team. This year it involves a Twin Peaks-style supernatural
> phenomenon that supposedly caused the disturbances. Previous DiRTs were
> attributed to zombies or aliens.
> Some halls in Google’s Hamina, Finland, data center remain vacant—for now.
> Photo: Google/Connie Zhou
> As the first attack begins, Kripa Krishnan, an upbeat engineer who heads
> the annual exercise, explains the rules to about 20 SREs in a conference
> room already littered with junk food. “Do not attempt to fix anything,”
> she says. “As far as the people on the job are concerned, we do not exist.
> If we’re really lucky, we won’t break anything.” Then she pulls the
> plug—for real—on the campus network. The team monitors the phone lines and
> IRC channels to see when the Google incident managers on call around the
> world notice that something is wrong. It takes only five minutes for
> someone in Europe to discover the problem, and he immediately begins
> contacting others.
> “My role is to come up with big tests that really expose weaknesses,”
> Krishnan says. “Over the years, we’ve also become braver in how much we’re
> willing to disrupt in order to make sure everything works.” How did Google
> do this time? Pretty well. Despite the outages in the corporate network,
> executive chair Eric Schmidt was able to run a scheduled global all-hands
> meeting. The imaginary demonstrators were placated by imaginary pizza.
> Even shutting down three-fourths of Google’s Asia traffic capacity didn’t
> shut out the continent, thanks to extensive caching. “This is the best
> DiRT ever!” Krishnan exclaimed at one point.
> The SRE program began when Hölzle charged an engineer named Ben Treynor with
> making Google’s network fail-safe. This was especially tricky for a massive
> company like Google that is constantly tweaking its systems and
> services—after all, the easiest way to stabilize it would be to freeze all
> change. Treynor ended up rethinking the very concept of reliability.
> Instead of trying to build a system that never failed, he gave each
> service a budget—an amount of downtime it was permitted to have. Then he
> made sure that Google’s engineers used that time productively. “Let’s say
> we wanted Google+ to run 99.95 percent of the time,” Hölzle says. “We want
> to make sure we don’t get that downtime for stupid reasons, like we
> weren’t paying attention. We want that downtime because we push something
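
The downtime budget described here is simple arithmetic, so a small sketch may
help. It turns an availability target such as the 99.95 percent quoted above
into allowed minutes of downtime. The function name and the 30-day and 365-day
periods are my own assumptions, not anything Google has published.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Minutes of downtime allowed in a period at a given availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    minutes_in_period = period_days * 24 * 60
    return unavailable_fraction * minutes_in_period


# Assumed accounting periods: a 30-day month and a 365-day year.
monthly_budget = downtime_budget_minutes(99.95, 30)
yearly_budget = downtime_budget_minutes(99.95, 365)

print(f"30 days:  {monthly_budget:.1f} minutes")
print(f"365 days: {yearly_budget:.1f} minutes")
```

At 99.95 percent that works out to about 21.6 minutes per 30 days, or a little
under four and a half hours per year, which is the budget a team could spend on
risky changes rather than on avoidable mistakes.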
> Nevertheless, accidents do happen—as Sabrina Farmer learned on the morning
> of April 17, 2012. Farmer, who had been the lead SRE on the Gmail team for
> a little over a year, was attending a routine design review session.
> Suddenly an engineer burst into the room, blurting out, “Something big is
> happening!” Indeed: For 1.4 percent of users (a large number of people),
> Gmail was down. Soon reports of the outage were all over Twitter and tech
> sites. They were even bleeding into mainstream news.
> The conference room transformed into a war room. Collaborating with a peer
> group in Zurich, Farmer launched a forensic investigation. A breakthrough
> came when one of her Gmail SREs sheepishly admitted, “I pushed a change on
> Friday that might have affected this.” Those responsible for vetting the
> change hadn’t been meticulous, and when some Gmail users tried to access
> their mail, various replicas of their data across the system were no longer
> in sync. To keep the data safe, the system froze them out.
> The diagnosis had taken 20 minutes, designing the fix 25 minutes
> more—pretty good. But the event went down as a Google blunder. “It’s
> pretty painful when SREs trigger a response,” Farmer says. “But I’m happy
> no one lost data.” Nonetheless, she’ll be happier if her future crises are
> limited to DiRT-borne zombie attacks.
> One scenario that DiRT never envisioned was the presence of a reporter on a
> server floor. But here I am in Lenoir, earplugs in place, with Joe Kava
> motioning me inside.
> We have passed through the heavy gate outside the facility, with
> remote-control barriers evoking the Korean DMZ. We have walked through the
> business offices, decked out in Nascar regalia. (Every Google data center
> has a decorative theme.) We have toured the control room, where LCD
> dashboards monitor every conceivable metric. Later we will climb up to
> catwalks to examine the giant cooling towers and backup electric
> generators, which look like Beatle-esque submarines, only green. We will
> don hard hats and tour the construction site of a second data center just
> up the hill. And we will stare at a rugged chunk of land that one day will
> hold a third mammoth computational facility.
> But now we enter the floor. Big doesn’t begin to describe it. Row after row
> of server racks seem to stretch to eternity. Joe Montana in his prime could
> not throw a football the length of it.
> During my interviews with Googlers, the idea of hot aisles and cold aisles
> has been an abstraction, but on the floor everything becomes clear. The
> cold aisle refers to the general room temperature—which Kava confirms is
> 77 degrees. The hot aisle is the narrow space between the backsides of two
> rows of servers, tightly enclosed by sheet metal on the ends. A nest of
> copper coils absorbs the heat. Above are huge fans, which sound like jet
> engines jacked through Marshall amps.
> We walk between the server rows. All the cables and plugs are in front, so
> no one has to crack open the sheet metal and venture into the hot aisle,
> thereby becoming barbecue meat. (When someone does have to head back
> there, the servers are shut down.) Every server has a sticker with a code
> that identifies its exact address, useful if something goes wrong. The
> servers have thick black batteries alongside. Everything is uniform and in
> place—nothing like the spaghetti tangles of Google’s long-ago Exodus era.
> Blue lights twinkle, indicating … what? A web search? Someone’s Gmail
> message? A Glass calendar event floating in front of Sergey’s eyeball? It
> could be anything.
> Every so often a worker appears—a long-haired dude in shorts propelling
> himself by scooter, or a woman in a T-shirt who’s pushing a cart with a
> laptop on top and dispensing repair parts to servers like a psychiatric
> nurse handing out meds. (In fact, the area on the floor that holds the
> replacement gear is called the pharmacy.)
> How many servers does Google employ? It’s a question that has dogged
> observers since the company built its first data center. It has long stuck
> to “hundreds of thousands.” (There are 49,923 operating in the Lenoir
> facility on the day of my visit.) I will later come across a clue when I
> get a peek inside Google’s data center R&D facility in Mountain View. In a
> secure area, there’s a row of motherboards fixed to the wall, an honor
> roll of generations of Google’s homebrewed servers. One sits atop a tiny
> embossed plaque that reads july 9, 2008. google’s millionth server. But
> executives explain that this is a cumulative number, not necessarily an
> indication that Google has a million servers in operation at once.
> Wandering the cold aisles of Lenoir, I realize that the magic number, if it
> is even obtainable, is basically meaningless. Today’s machines, with
> multicore processors and other advances, have many times the power and
> utility of earlier versions. A single Google server circa 2012 may be the
> equivalent of 20 servers from a previous generation. In any case, Google
> thinks in terms of clusters—huge numbers of machines that act together to
> provide a service or run an application. “An individual server means
> nothing,” Hölzle says. “We track computer power as an abstract metric.”
> It’s the realization of a concept Hölzle and Barroso spelled out three
> years ago: the data center as a computer.
> As we leave the floor, I feel almost levitated by my peek inside Google’s
> inner sanctum. But a few weeks later, back at the Googleplex in Mountain
> View, I realize that my epiphanies have limited shelf life. Google’s
> intention is to render the data center I visited obsolete. “Once our people
> get used to our 2013 buildings and clusters,” Hölzle says, “they’re going to
> complain about the current ones.”
> Asked in what areas one might expect change, Hölzle mentions data center and
> cluster design, speed of deployment, and flexibility. Then he stops short.
> “This is one thing I can’t talk about,” he says, a smile cracking his
> bearded visage, “because we’ve spent our own blood, sweat, and tears. I
> want others to spend their own blood, sweat, and tears making the same
> discoveries.” Google may be dedicated to providing access to all the
> world’s data, but some information it’s still keeping to itself.
> Senior writer Steven Levy (steven_levy at wired.com) interviewed Mary Meeker
> in issue 20.10.
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
University College London
Department of Chemistry
email: j.sassmannshausen at ucl.ac.uk
Please avoid sending me Word or PowerPoint attachments.