cluster frustrations (Suggestions for same)

Robert G. Brown rgb at phy.duke.edu
Thu Jan 17 08:29:05 PST 2002


On Wed, 16 Jan 2002, Mark Hahn wrote:

> > This is why server-class, error-correcting hardware exists.
> 
> uh, let's not go too far!  it's quite possible to drive
> well-chosen and carefully-configured commodity hardware
> 100%, 24/7, wire-speed, platter-level, etc.  but it definitely
> requires a certain amount of luck/study/experience.
> 
> there are shortcuts, of course.  for instance, you can buy
> very nicely configured building blocks from compaq/dell/etc,
> usually from their "business desktop/workstation" lines
> and expect robustness under load, albeit often at a slightly
> lower performance and/or higher price than white-box,
> hand-picked-with-TLC parts...
> 
> I still think that beowulf implies commodity parts, which 
> in many cases rules out "server-class".

I totally agree with Mark here -- I can buy somewhere between 2 and 4
ECC-memory-equipped nodes over the counter from Intrex (my local vanilla
PC supplier) for the cost of one server-class node of equivalent power
and never experience a moment's difficulty.  I can even get custom
configured rackmount systems through them, although I then have to work
a bit harder to ensure a hand-picked-with-TLC (HPWTLC) fit.

Here are some simple suggestions for those wishing to build beowulfs or
clusters with truly commodity nodes:

  a) Prototype a single node before you buy 8, 16, or 32 of them.

This is easy (and cheap) enough.  That way you can test the motherboard,
memory configuration, hard disk setup, video, and ethernet controller
before you buy a lot of them and blow your wad.  If one component proves
to be troublesome, either trade it in (with the sale of 31 more systems
hanging in the balance, most vendors become remarkably cooperative about
swapping things around and working with you to find something --
ANYTHING -- that they sell that you'll be happy with;-) or just throw it
away and try something else in its place -- you're saving enough to
throw a couple of SYSTEMS away at the end and come out WAY ahead.
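
As a rough illustration of the kind of check one might leave running on a
prototype node, here is a minimal sketch in Python.  It is not a substitute
for a real memory tester or a real workload; the function name, buffer
size, and round count are all made up for the example.  It just pushes
random data through memory repeatedly and confirms the checksums still
agree.

```python
import hashlib
import os


def burn_in_pass(size_bytes=1 << 20, rounds=3):
    """Copy random buffers through memory and count checksum mismatches.

    Hypothetical helper for illustration only: on healthy hardware the
    return value should be 0.  Any other value means a copy of the data
    did not match the original, which points at memory, CPU, or cooling.
    """
    mismatches = 0
    for _ in range(rounds):
        data = os.urandom(size_bytes)            # fresh random pattern
        expected = hashlib.sha256(data).hexdigest()
        copy = bytearray(data)                   # second trip through RAM
        if hashlib.sha256(copy).hexdigest() != expected:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Small run so it finishes quickly; scale both numbers up for a soak test.
    print("mismatches:", burn_in_pass(size_bytes=1 << 22, rounds=10))
```

Run it in a loop for hours with the case closed, alongside your actual
application, and watch for mismatches, lockups, or kernel complaints.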

  b) Run configurations by the list before buying even a prototype.

This happens all the time, and is a very reasonable thing to do.  You
won't ALWAYS find out that your hardware combo isn't right that way (you
should still prototype) but you'll likely get some useful advice or
reassurance.  At least the motherboard, memory, NIC, and switch are
excellent things to ask about if you are in doubt.

  c) Use quality components.  

This is really a mix of caveat emptor and common sense.  Find a vendor
you can work with and trust who will make things right if they sell you
substandard components, and who is unlikely to sell you substandard
components in the first place.  Commodity NIC prices range from $10 to
$50, and (as one might expect) there is a bit of "you get what you pay
for" in that range.  There are (or have been in the past) decent NICs
even at the middle of that range, but you've DEFINITELY got to work to
find a good cheap one, and the more expensive ones (eepro100, 3c905) are
more expensive in part because they have the best performance and
stability and features.  "Generic" memory is often fine, but sometimes
is a source of endless trouble, so make sure your vendor is willing to
get quality memory (e.g. Kingston) if you encounter trouble with their
OTC brand.

  d) It might well be the hardware.

How many times have I experienced inexplicable problems getting the
network to work?  Getting an attached camera to work?  Getting a system
to boot?  Getting a system to work for more than thirty minutes without
crashing? -- and been tearing my hair out and cursing linux and device
drivers and all the ancestors of the creators of same, only to find that
my network cable was broken -- loose wire, worked if you wiggled it
just right and then broke when it felt like it.  Or the card wasn't
seated in the PCI slot properly; once it was, the bttv driver autoloaded
charmlike and it worked perfectly, just like the hardware lists said it
would.  Or the floppy cable was in upside down, or the power connector
wasn't fully seated on the motherboard.  Or a memory stick was in a dusty
slot and not making a good electrical connection.

Just shipping a system can cause cables to bounce loose.  If a box is
DOA, ALWAYS open it up and reseat the cables and connectors before
cursing the vendor and sending it back.

It might be that something really is broken, not the configuration or
the software or you at all.  It might even be trivial to fix, once you
look for it.

  e) Be patient.  Work it out.

One thing to remember is that problems with hardware can be like
lightning or shark attacks -- rare and very local.  The PARTICULAR
combination of motherboard, memory, device, case may not work for you,
while each one of them works fine for other people combined with other
hardware.  Five of the motherboards in an order of twenty five may have
a different flash of the BIOS (one that doesn't work).  The power
supplies may be marginal, and work fine for systems with only three
components or while idle but cause instability when run under load.
Like it or not, this sort of thing happens, and getting server-class
packages doesn't necessarily ameliorate the problem (depending on the
vendor and whether or not you're getting them turnkey preconfigured).
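
The mixed-BIOS case is one of the easier ones to catch by comparison.  A
minimal sketch in Python, assuming you have already collected a BIOS
version string from each node by whatever means you like (the node names,
version strings, and function names below are invented for the example):

```python
from collections import defaultdict


def group_by_bios(versions):
    """Map each BIOS version string to the sorted list of nodes reporting it.

    `versions` is a hypothetical dict of node name -> version string.
    """
    groups = defaultdict(list)
    for node, version in sorted(versions.items()):
        groups[version.strip()].append(node)     # ignore stray whitespace
    return dict(groups)


def odd_ones_out(versions):
    """Return the nodes whose BIOS version differs from the most common one."""
    groups = group_by_bios(versions)
    majority = max(groups, key=lambda v: len(groups[v]))
    return sorted(
        node
        for version, nodes in groups.items()
        if version != majority
        for node in nodes
    )


if __name__ == "__main__":
    # Made-up data: three nodes agree, one came with a different flash.
    reported = {"n01": "1.02", "n02": "1.02", "n03": "1.00", "n04": "1.02"}
    print(group_by_bios(reported))
    print("check these first:", odd_ones_out(reported))
```

The same grouping trick works for anything else that should be identical
across nodes, and the minority group is usually where to start looking
when only some of the nodes misbehave.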

Sure, you can ALWAYS pay somebody to do the work for you, but it is
ALWAYS cheaper to do it yourself and, if you go about it sensibly, can
be a fun and rewarding experience.  

Just >>expect<< to have to learn some things, to have to solve some
problems, to get better over time.  Just because you weren't born
knowing all about TCP/IP, account management, software installation and
operation, programming technique, and all the other things that are at
least useful if not essential to cluster operation doesn't make you an
idiot, it makes you a student.  The beowulf list is filled with teachers
(literally -- myself, Walt and Rob Ross, and many, many more) and
students on their way to being teachers.  As always, try it yourself,
then look for help.  The longer you've been doing it, the easier you
will find it to solve the problems you encounter.

If it makes you feel any better, I've been doing Unix [systems
administration and engineering and cluster computing, etc.] for about
15 years, and have been doing computers in general from punched paper
tape on, and I still put connectors in backwards, fail to seat memory or
a PCI card properly, install something that overwrites a key
configuration file and have to do it all over again, and could make you
cringe with stories of the REALLY dumb things I've done in the past
(tried to copy files from a backup of /etc on another filesystem into
the /etc it was running on at the time, for example -- had to reinstall
the system from tape after that one as I rendered it totally
unbootable).

Live and learn.  Experiment and play.  Have fun.  You'll get better, and
one day you too will be an "expert", even if you only do it a little at
a time.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





