cluster frustrations
Jim Phillips jim at ks.uiuc.edu
Wed Jan 16 08:31:06 PST 2002
Hi,

I've had Scyld running successfully for quite a while, and have even taught others (http://www.ks.uiuc.edu/Research/namd/tutorial/NCSA2001/). I know what I'm doing, and have even set up an older non-Scyld cluster, but I was tearing my hair out for several weeks at the beginning because of random crashes. These turned out to be hardware and BIOS related rather than software related, although different versions of the software exhibited the problems to varying degrees.

When you build a cluster, you are often taking consumer-class hardware and driving it much harder than a normal user would. You also have zero error tolerance across the entire cluster. While in theory this should all be worked out in testing, cluster users are the only people likely to see these errors in the real world.

In our case, the problem was that a BIOS setting of "optimal" for some PCI bus parameters was leading to occasional data corruption between the CPU and the network card. Since we had nice network cards, capable of doing their own checksumming, the errors were never caught. This was never an issue on the old cluster, which used cheap "tulip" cards and made the CPU do the checksumming.

A normal user would drive maybe 100 MB per day across that network card, probably at 10 Mbit, or 1/10 of its peak capacity. Almost all of the data would be incoming, probably web images. We were driving 100 MB across every 15 seconds, which is more than 5000x as many opportunities for error. Put 32 machines together and you have over 100,000x the error rate that a typical user would see. Add in a 10x lower tolerance for program failure and you could easily say that a cluster user is demanding one million times more hardware reliability than a normal desktop user. This is why server-class, error-correcting hardware exists.

-Jim
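
The back-of-envelope estimate in the message above can be checked with a short sketch. The traffic figures, node count, and 10x tolerance factor are taken from the message; the function and parameter names are hypothetical, and the model assumes (as the message does) that error opportunities scale linearly with bytes moved and with the number of nodes.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def reliability_factor(desktop_mb_per_day=100, cluster_mb=100,
                       cluster_interval_s=15, nodes=32, tolerance_factor=10):
    """Estimate how much more reliability a cluster demands than a desktop.

    Assumes error opportunities are proportional to data volume and to
    node count. All names here are illustrative, not from any library.
    """
    # Data moved per node per day at the cluster's sustained rate.
    cluster_mb_per_day = cluster_mb * SECONDS_PER_DAY / cluster_interval_s
    # Per-node traffic relative to a typical desktop user.
    traffic_ratio = cluster_mb_per_day / desktop_mb_per_day
    # A failure on any one node counts against the whole cluster.
    cluster_ratio = traffic_ratio * nodes
    # Fold in the lower tolerance for program failure.
    total = cluster_ratio * tolerance_factor
    return traffic_ratio, cluster_ratio, total


if __name__ == "__main__":
    per_node, whole_cluster, demanded = reliability_factor()
    print(f"per-node traffic ratio: {per_node:,.0f}x")
    print(f"32-node error exposure: {whole_cluster:,.0f}x")
    print(f"reliability demanded:   {demanded:,.0f}x")
```

With the figures from the message, the per-node ratio works out to 5,760x, the 32-node exposure to 184,320x, and the overall factor to about 1.8 million, which is consistent with the rounded numbers quoted above.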
More information about the Beowulf mailing list
