32 + 8 nodes beowulf cluster design.

Jared Hodge jared_hodge at iat.utexas.edu
Wed Feb 14 06:42:02 PST 2001


Paul Maragakis wrote:
> 
> Hi everyone,
> 
> I have just drafted a proposal listing the hardware components and their
> prices for a 40 nodes beowulf using 1.2 GHz Athlons, 8*1.5 GB + 32*512 MB
> memory, a double fast-ethernet network, and a front-end for storage and
> administration.  The draft is at:
> http://hdsc.deas.harvard.edu/~plm/beowulf.pdf
> All comments are highly appreciated, especially those regarding the
> network, the UPS and the price estimates.
> 
> Take care,
> 
> Paul
> 

Paul,

	As someone who has recently had to write a proposal for a Beowulf
cluster, I thought I'd check yours out, see if I could learn anything,
and offer a few suggestions from my own experience.  This is kind of
long, so if you're on the Beowulf list and not interested in this, you may
want to just skip this E-mail.
	Ok, first off, I know some of your prices are a little high.  Our local
vendor of choice listed 512MB memory (for PIII) at just over $500, and
you may be able to find it cheaper elsewhere.  It would definitely be
worth the time to shop around, since memory is the predominant cost of
your cluster.  Like you said, though, you want to leave a little room for
unexpected costs.
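	Just to show the back-of-the-envelope arithmetic behind calling memory
the predominant cost, here's a throwaway little program.  The numbers in
it are mine, not yours: I'm assuming roughly $500 per 512MB module (the
PIII price from our vendor, so the Athlon/PC133 price will differ) and
that each 1.5GB node takes three 512MB modules.

/* memcost.c: rough memory cost for 32 x 512MB + 8 x 1.5GB nodes.
 * The $500/module price and the three-modules-per-1.5GB-node layout
 * are assumptions, not figures from the proposal. */
#include <stdio.h>

int main(void)
{
    const double price_per_512mb = 500.0;  /* assumed price per 512MB module, USD */
    const int small_nodes = 32;            /* 512MB each  -> 1 module per node    */
    const int large_nodes = 8;             /* 1.5GB each  -> 3 modules per node   */

    int modules = small_nodes * 1 + large_nodes * 3;
    double cost = modules * price_per_512mb;

    printf("%d modules, roughly $%.0f in memory alone\n", modules, cost);
    return 0;
}

That works out to something like $28,000 just in RAM under those
assumptions, which is why shopping around on the module price is worth
the time.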
	The cluster I've been working with has been using Pentium (II
currently, III proposed) processors and Myrinet, so I don't know all the
details of your setup, but it seems to me that your networking plan is
... unique.  Are you just going to be using the hubs and bootable NICs
(I assume this means Wake-on-LAN) for booting?  If so, could you not just
use the Cisco switch and the bootable NICs instead?  If you are going to
use the hubs and extra NICs for channel bonding, this introduces some
interesting segmentation problems in the network (having two hubs) and
of course you'll have to deal with extra packet collisions on the hubs. 
If you are going for channel bonding, it might be worth getting a second
Cisco switch.  I'll agree with Joseph's $0.02 in that you should at
least check out switches with one or two gigabit ports to go to the
server, although the need for this would be dictated by the code you're
running.
	Speaking of the code, I'm not sure exactly how PGI's licensing
agreement works, but I think you would only need to buy their compilers
for the server (the lead node), since that's where you'll be doing your
compilation.  If you only buy a license for one user, only one user
can compile a program at a time, but if you don't have many users logged
in at once, this may not be a problem.
	I didn't understand quite all of your reasoning for choosing Debian
over other distributions of Linux (I like Red Hat, although I'd like to
try Scyld), but it sounds like you've got more experience than I do in
this area.  I'd say the exact distribution is mostly a matter of
preference anyway (i.e., go with what you know).
	Oh yeah, going back to booting the cluster, is there any particular
reason you want to boot your cluster from the network?  Your
proposal made it sound like the only other alternative was booting from a
floppy, but you do have those 20GB hard drives sitting there.  Maybe you
could spare a few megabytes and install your OS there (I think you can
still use WOL but boot from the local drive).  Also, I understand the desire for a
large scratch space on nodes, but I've found with Myrinet that it is
MUCH faster to use the network to get to another node's (or the
server's) memory than to just go to your own hard drive, although this
isn't always a reasonable option.  I don't think this is true with
ethernet, but I haven't tested it.
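	If you ever want to put a number on that disk-versus-network question
for your own hardware, below is the kind of quick-and-dirty test I mean.
It's my own sketch, not anything from your proposal: rank 0 times reading
64MB from a scratch file on its local disk, then times receiving the same
64MB out of rank 1's memory over MPI.  The file name "scratch.dat" and the
64MB size are placeholders, and the file has to exist and be at least that
big for the first number to mean anything.

/* disk_vs_net.c: compare local disk read bandwidth against pulling the
 * same amount of data out of another node's memory via MPI.
 * Build and run with two ranks on two different nodes, e.g.:
 *   mpicc -O2 disk_vs_net.c -o disk_vs_net
 *   mpirun -np 2 ./disk_vs_net */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

#define NBYTES (64 * 1024 * 1024)   /* 64MB per trial (arbitrary) */

int main(int argc, char **argv)
{
    int rank;
    char *buf;
    MPI_Status status;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(NBYTES);
    if (buf == NULL) {
        fprintf(stderr, "rank %d: malloc failed\n", rank);
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    memset(buf, 0, NBYTES);

    if (rank == 0) {
        /* Local disk: time reading 64MB from a file on this node's drive. */
        FILE *fp = fopen("scratch.dat", "rb");
        if (fp != NULL) {
            size_t got;
            t0 = MPI_Wtime();
            got = fread(buf, 1, NBYTES, fp);
            t1 = MPI_Wtime();
            fclose(fp);
            printf("local disk : %.1f MB/s\n", (got / 1e6) / (t1 - t0));
        } else {
            printf("local disk : couldn't open scratch.dat, skipping\n");
        }

        /* Remote memory: time receiving the same 64MB from rank 1. */
        t0 = MPI_Wtime();
        MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        t1 = MPI_Wtime();
        printf("remote mem : %.1f MB/s\n", (NBYTES / 1e6) / (t1 - t0));
    } else if (rank == 1) {
        /* Rank 1 just hands over the contents of its buffer. */
        MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

On Myrinet I'd expect the second number to win by a wide margin, which
matches what I've seen; over a single fast ethernet link I honestly don't
know which way it would go, which is why I said I haven't tested it.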
	On to the server... Do you really want a separate UPS for the server?
Personally, if my cluster is going to go down, I'd just as soon have the
whole thing go down instead of all the nodes and not the server (it makes
the "why isn't my parallel job running?" question have a much more obvious
answer: you can't even log in to the server...).  If the nodes go down
and not the server, you're equally in a "who cares, the cluster doesn't
work" situation.  I guess if you're at the max of the big UPS, or if the
server and cluster are in different places, this might make sense.
	Have you looked into just using EIDE drives on the server, maybe with
a RAID controller (there was a thread on RAID stuff on Monday, I think)?
From my limited experience, SCSI is much more expensive and much more
trouble than it's worth, though again, this depends on your code.  For
the $2400 you're spending on it, it might be worth looking into an
integrated RAID solution.  I would still do the tape backups even if you
do go with RAID, though; stick to your guns on that one.
	Also, on the server motherboard: I've heard that it's better to use the
exact same type of motherboard in your server as in the nodes.  I believe
this is partly because some compilers will optimize for the architecture
that they're compiling on.  I don't know if there will be enough
difference between these boards to cause a noticeable difference in
performance, and you can usually change a compiler option to target the
nodes instead, but that is one more thing to worry about.  Also, it
sometimes helps if your server is configured nearly identically to (at
least some of) your nodes.  That way, if someone runs a reasonably big
single job interactively (to test their code) and it does run, they can
be confident it will run correctly on the nodes.  For this reason, and
because of compiling and multi-user overhead, I'd go to 1.5GB of RAM on
your server, like your "large" nodes, if you can.
	Well, I know I've said a lot here, so I'd better quit before I think of
something else (oh, too late: don't forget to check the spelling in
your proposal by hand before you send it off to the Money People; there
"were" a few errors).  Good luck with your cluster, and let me know
what you think of the suggestions.  It would be good for me to hear
other people's input on these things so I can better plan my own
upgrades.
-- 
Jared Hodge
Institute for Advanced Technology
The University of Texas at Austin
3925 W. Braker Lane, Suite 400
Austin, Texas 78759

Phone: 512-232-4460
FAX: 512-471-9096
Email: Jared_Hodge at iat.utexas.edu



