cluster frustrations (Suggestions for same)
Robert G. Brown rgb at phy.duke.edu
Thu Jan 17 08:29:05 PST 2002
- Previous message: cluster frustrations
- Next message: cluster frustrations (Suggestions for same)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Wed, 16 Jan 2002, Mark Hahn wrote:

> > This is why server-class, error-correcting hardware exists.
>
> uh, let's not go too far!  it's quite possible to drive
> well-chosen and carefully-configured commodity hardware
> 100%, 24/7, wire-speed, platter-level, etc.  but it definitely
> requires a certain amount of luck/study/experience.
>
> there are shortcuts, of course.  for instance, you can buy
> very nicely configured building blocks from compaq/dell/etc,
> usually from their "business desktop/workstation" lines
> and expect robustness under load, albeit often at a slightly
> lower performance and/or higher price than white-box,
> hand-picked-with-TLC parts...
>
> I still think that beowulf implies commodity parts, which
> in many cases rules out "server-class".

I totally agree with Mark here -- I can buy somewhere between 2 and 4 ECC-memory-equipped nodes over the counter from Intrex (my local vanilla PC supplier) for the cost of one server-class node of equivalent power and never experience a moment's difficulty. I can even get custom configured rackmount systems through them, although I then have to work a bit harder to ensure a HPWTLC fit.

Here are some simple suggestions for those wishing to build beowulfs or clusters with truly commodity nodes:

a) Prototype a single node before you buy 8, 16, 32 of them. This is easy (and cheap) enough. That way you can test the motherboard, memory configuration, hard disk setup, video, and ethernet controller before you buy a lot of them and blow your wad. If one component proves to be troublesome, either trade it in (with the sale of 31 more systems hanging in the balance, most vendors become remarkably cooperative about swapping things around and working with you to find something -- ANYTHING -- that they sell that you'll be happy with;-) or just throw it away and try something else in its place -- you're saving enough to throw a couple of SYSTEMS away at the end and come out WAY ahead.

b) Run configurations by the list before buying even a prototype. This happens all the time, and is a very reasonable thing to do. You won't ALWAYS find out that your hardware combo isn't right that way (you should still prototype) but you'll likely get some useful advice or reassurance. At least the motherboard, memory, NIC, and switch are excellent things to query if you are in doubt.

c) Use quality components. This is really a mix of caveat emptor and common sense. Find a vendor you can work with and trust who will make things right if they sell you substandard components, and who is unlikely to sell you substandard components in the first place. Commodity NIC prices range from $10 to $50, and (as one might expect) there is a bit of "you get what you pay for" in that range. There are (or have been in the past) decent NICs even at the middle of that range, but you've DEFINITELY got to work to find a good cheap one, and the more expensive ones (eepro100, 3c905) are more expensive in part because they have the best performance and stability and features. "Generic" memory is often fine, but sometimes is a source of endless trouble, so make sure your vendor is willing to get quality memory (e.g. Kingston) if you encounter trouble with their OTC brand.

d) It might well be the hardware. How many times have I experienced inexplicable problems getting the network to work? Getting an attached camera to work? Getting a system to boot? Getting a system to work for more than thirty minutes without crashing? -- and been tearing my hair out and cursing linux and device drivers and all the ancestors of the creators of same, only to find that my network cable was broken -- a loose wire that worked if you wiggled it just right and then broke when it felt like it. Or the card wasn't seated in the PCI bus properly; once it was, the bttv driver autoloaded like a charm and it worked perfectly, just like the hardware lists said it would.
The floppy cable was in upside down, or the power connector wasn't fully seated on the motherboard. A memory stick was in a dusty slot and not making a good electrical connection. Just shipping a system can cause cables to bounce loose. If a box is DOA, ALWAYS open it up and reseat the cables and connectors before cursing the vendor and sending it back. It might be that something really is broken, not the configuration or the software or you at all. It might even be trivial to fix, once you look for it.

e) Be patient. Work it out. One thing to remember is that problems with hardware can be like lightning or shark attacks -- rare and very local. The PARTICULAR combination of motherboard, memory, device, and case may not work for you, while each one of them works fine for other people combined with other hardware. Five of the motherboards in an order of twenty-five may have a different flash of the BIOS (one that doesn't work). The power supplies may be marginal, and work fine for systems with only three components or while idle but cause instability when run under load. Like it or not, this sort of thing happens, and getting server-class packages doesn't necessarily ameliorate the problem (depending on the vendor and whether or not you're getting them turnkey preconfigured).

Sure, you can ALWAYS pay somebody to do the work for you, but it is ALWAYS cheaper to do it yourself and, if you go about it sensibly, can be a fun and rewarding experience. Just >>expect<< to have to learn some things, to have to solve some problems, to get better over time.

Just because you weren't born knowing all about TCP/IP, account management, software installation and operation, programming technique, and all the other things that are at least useful if not essential to cluster operation doesn't make you an idiot, it makes you a student. The beowulf list is filled with teachers (literally -- myself, Walt and Rob Ross, and many, many more) and students on their way to being teachers.
As always, try it yourself, then look for help. The longer you've been doing it, the easier you will find it to solve the problems you encounter.

If it makes you feel any better, I've been doing Unix [systems administration and engineering and cluster computing and etc.] for about 15 years, and have been doing computers in general from punched paper tape on, and I still put connectors in backwards, fail to seat memory or a PCI card properly, install something that overwrites a key configuration file and have to do it all over again, and could make you cringe with stories of the REALLY dumb things I've done in the past (tried to copy files from a backup of /etc on another filesystem into the /etc it was running on at the time, for example -- had to reinstall the system from tape after that one as I rendered it totally unbootable).

Live and learn. Experiment and play. Have fun. You'll get better, and one day you too will be an "expert", even if you only do it a little at a time.

   rgb

--
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
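
The arithmetic behind suggestion (a), that commodity nodes leave enough slack to throw a couple of whole systems away, can be sketched roughly as below. This is only an illustration: the function name `commodity_savings`, the $4000 server-class price, and the 32-node cluster size are assumptions made up for the example; the only figure taken from the post is the ratio of 2 to 4 commodity nodes per server-class node.

```python
def commodity_savings(nodes, server_price, price_ratio, discarded=0):
    """Money saved by buying commodity nodes instead of server-class ones.

    nodes        -- number of working nodes wanted in the cluster
    server_price -- price of one server-class node (hypothetical figure)
    price_ratio  -- commodity nodes obtainable for the price of one
                    server-class node (2 to 4 in the post)
    discarded    -- extra commodity systems bought and thrown away
                    while prototyping
    """
    commodity_price = server_price / price_ratio
    server_total = nodes * server_price
    commodity_total = (nodes + discarded) * commodity_price
    return server_total - commodity_total


if __name__ == "__main__":
    # Hypothetical numbers: 32 nodes, $4000 per server-class node,
    # and two whole commodity systems discarded along the way.
    for ratio in (2, 4):
        saved = commodity_savings(32, 4000.0, ratio, discarded=2)
        print(f"ratio {ratio}: saved ${saved:,.0f}")
```

With these made-up inputs, discarding two complete systems still leaves the commodity cluster well under the server-class total at either end of the 2-to-4 range; plug in real quotes from your own vendor before drawing conclusions.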
