[Beowulf] Re: failure trends in a large disk drive population (Google filing system)

Sun Feb 18 13:49:47 PST 2007

I've read somewhere that the Google File System keeps multiple copies of 
the data, often 4 copies on different nodes, and, as you say, runs the 
query against several of them. If one node fails there are still 3 
copies; if another fails there are still 2. I've also read elsewhere 
that when a copy is lost, the system can automatically recreate it from 
the remaining ones on a spare node, bringing the count back to 4. This 
approach is rather over the top, but it works, and works well.
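
To make the "lose one, rebuild on a spare" idea concrete, here is a rough 
sketch. It is only my own toy model, not how Google actually does it: the 
node names, the chunk names, the target of 4 copies and the functions 
fail_node and re_replicate are all made up for illustration.

```python
# Toy model of replica repair. Everything here is hypothetical:
# the names and the target of 4 copies are for illustration only.
TARGET_COPIES = 4


def fail_node(placement, dead):
    """Return a new placement with the failed node removed from every chunk."""
    return {chunk: nodes - {dead} for chunk, nodes in placement.items()}


def re_replicate(placement, spares, target=TARGET_COPIES):
    """Top each chunk back up to `target` copies using the spare nodes."""
    repaired = {}
    for chunk, nodes in placement.items():
        nodes = set(nodes)
        for spare in spares:
            if len(nodes) >= target:
                break
            # A surviving copy must exist to clone from, and a node
            # should not hold the same chunk twice.
            if nodes and spare not in nodes:
                nodes.add(spare)
        repaired[chunk] = nodes
    return repaired


placement = {
    "chunk-a": {"n1", "n2", "n3", "n4"},
    "chunk-b": {"n2", "n3", "n4", "n5"},
}
degraded = fail_node(placement, "n2")                  # each chunk drops to 3 copies
healed = re_replicate(degraded, spares=["n6", "n7"])   # back up to 4 copies
```

A real system would also have to copy the data and spread load across 
spares; this only tracks which node holds which chunk.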

I suspect this sort of thing could be done more cheaply by keeping just 
3 copies of each piece of data and hoping that you never lose 2 or more 
nodes at once.
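
As a back-of-the-envelope check on the 3-versus-4 trade-off, the 
arithmetic can be sketched as below. This assumes each node fails 
independently with the same probability within one repair window, which 
is a big simplification, and the 1% figure is purely illustrative, not a 
measured number. The function name p_at_least is my own.

```python
from math import comb


def p_at_least(failed, copies, p_node):
    """Probability that at least `failed` of `copies` replicas are lost,
    assuming independent failures with per-node probability `p_node`."""
    return sum(
        comb(copies, k) * p_node**k * (1 - p_node) ** (copies - k)
        for k in range(failed, copies + 1)
    )


p = 0.01  # illustrative per-node failure probability, not real data

down_to_one = p_at_least(2, 3, p)    # 3 copies, at least 2 lost at once
lose_all_of_3 = p_at_least(3, 3, p)  # 3 copies, all lost: p**3
lose_all_of_4 = p_at_least(4, 4, p)  # 4 copies, all lost: p**4
```

Under these assumptions each extra copy cuts the chance of losing 
everything by another factor of p, so the question is whether that is 
worth the extra disk.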

Essentially, this is a huge distributed file system with RAID-like 
redundancy integrated in software.

Chris Samuel wrote:

 > IIRC they also have figured out a way to be fault tolerant by sending
 > queries out to multiple systems for each part of the DB they are
 > querying, so if one of those fails others will respond anyway.
 >
 > Apparently they use more reliable hardware for things like the
 > advertising service
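
The fan-out that Chris describes could look something like the sketch 
below. Again, this is just a guess at the general pattern, not their 
actual code: query_replica stands in for a real network call, and the 
replica dictionaries with "name" and "up" keys are invented for the 
example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def query_replica(replica, request):
    """Hypothetical stand-in for a network call to one replica."""
    if not replica["up"]:
        raise ConnectionError(f"{replica['name']} is down")
    return f"{replica['name']} answered {request}"


def fan_out(replicas, request):
    """Send the same request to every replica and return the first
    successful answer; failed replicas are simply skipped."""
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(query_replica, r, request) for r in replicas]
        for future in as_completed(futures):
            try:
                return future.result()
            except ConnectionError:
                continue  # this replica failed, wait for another
    raise RuntimeError("no replica answered")


replicas = [{"name": "a", "up": False}, {"name": "b", "up": True}]
answer = fan_out(replicas, "q")
```

The caller never notices that replica "a" was down, which is the whole 
point: the failure is absorbed by the copies that are still up.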

-- 
matt.