Newbie who needs help!

Robert G. Brown rgb at phy.duke.edu
Fri Jul 6 07:39:26 PDT 2001


On Thu, 5 Jul 2001, Eric Linenberg wrote:

> Ok, this is going to be kind of long, but I figured there are people
> out there with more experience than me, and I don't have the option to
> mess up, as I have to be finished with this project by Aug. 7th!
>
> I am working as a research assistant, and my task is to build an 8
> node Beowulf cluster to run LS-DYNA (the world's most advanced
> general purpose nonlinear finite element program (from their page))
> (lstc.com)  My budget is $25,000 and I just want general help with

A pretty generous budget for an eight node operation based on Intel or
AMD, depending on what kind of networking the application needs.  I
spent only $15K on a 16 node beowulf equipped with 1.33 GHz CPUs
(including the Home Depot heavy duty shelf:-).  Duals are generally
cheaper on a per-CPU basis, although if you get large memory
systems the cost of memory goes up very quickly.

> where I should begin and what should be done to maximize the
> upgradability (would it be possible to just image the disk -- change
> the IP settings, maybe update a boot script to add another node to the
> cluster?) and to maximize the performance (what are the benefits of
> dual-processor machines -- what about gigabit network cards?)

Any of the 15 "slave" nodes (not the server node, which is of course
more complicated) can be reinstalled from scratch in five to six
minutes flat, simply by booting the node from a floppy that defaults
to a kickstart install (no keyboard, monitor, or mouse required at
any time).  I've
reinstalled nodes in just this way to demonstrate to visitors just how
easy it is to administer and maintain a decent beowulf design -- just
pop the floppy in, press the reset button, and by the time I'm finished
giving them a tour of the hardware layout the node reboots itself back
into the state it was in when I pressed reset, but with a brand new disk
image.

Then there is Scyld, which is even more transparently scalable, but
which requires that you adopt the Scyld view: build a "true beowulf"
architected cluster and think about the cluster a bit differently
than as just a "pile of PCs" running a distributed application
(dedicated or not).  Not being able to log in to nodes, NFS mount
from nodes, or use nodes like networked workstations (headless or
not) works fine for folks used to e.g. SP3s, but isn't as mentally
comfortable for folks used to using a departmental LAN as a
distributed computing resource.
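
To give a flavor of the difference: under Scyld you drive the nodes
from the master with the BProc tools rather than logging into them.
From memory (check the Scyld docs for the exact options), the idiom
is roughly:

  # show the nodes and their up/down state as the master sees them
  bpstat
  # run a command on node 3 -- no login, no per-node accounts
  bpsh 3 uptime

The nodes are essentially anonymous extensions of the master rather
than little standalone unix boxes.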

Finally, as has been discussed on the list a number of times, yes, you
can maintain "preinstalled" disk images and write post-install scripts
to transform a disk image into a node with a particular name and IP
number.  Although I've written at least four generations' worth of
scripts to do just this over the last 14 years or so (some dated back to
SunOS boxes) I have to say that I think that this is the worst possible
solution to this particular problem.  Perhaps it is >>because<< I've
invested so much energy in it for so long that I dislike this approach
-- I know from personal experience that although it scales better than a
one-at-a-time installation/maintenance approach, it sucks down immense
amounts of personal energy to write and/or tune the scripts, and it
is very difficult and clumsy to maintain.
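
For concreteness, the approach boils down to something like the
following sketch (the device names, the NFS mount, and the RH-style
config locations are illustrative assumptions, not a recipe):

  # On the new node, booted from a rescue floppy with the image
  # server NFS-mounted on /mnt/images: write the template image
  # onto the local disk...
  dd if=/mnt/images/node.img of=/dev/hda bs=1024k
  # ...then mount the new root partition and give the node its
  # own identity
  mount /dev/hda1 /mnt/root
  # e.g. change HOSTNAME=template to HOSTNAME=node7 in
  # /mnt/root/etc/sysconfig/network, fix the IP settings,
  # regenerate ssh host keys, and so forth
  umount /mnt/root ; reboot

Every one of those "and so forth" steps is a place where the next
distribution update can quietly break your post-install script.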

For example, if your cluster paradigm is a NOW/COW arrangement and the
nodes aren't on a fully protected private network (with a true
firewall/gateway/head node between them and the nasty old Internet with
all of its darkness and impurity and evil:-) then you will (if you are a
sane person, and you seem sensible enough) want to religiously and
regularly install updates on whatever distribution you put on the nodes.
After all, there have been wide open holes in every release of every
networked operating system (that claimed to have a security system in
the first place) ever made.  If you don't patch these holes promptly
as they are discovered, you are inviting some pimple-faced kid in
Arizona or some juvenile entrepreneur in Singapore to put an IRC server
or a SPAM-forwarder onto your systems.  If you use the image-based
approach, you will have to FIRST upgrade your image, THEN recopy it to
all of your nodes and run your post-install script.  If any of the
software packages in your upgrade/update that interact with your
post-install script have changed, you'll have to hand edit and test your
post-install script.  Even so, there is a good chance that you'll have
to reinstall all the nodes more than once to get it all right.

Sure, there are alternatives.  You can maintain a single node as a
template (you'll have to anyway) and then write a fairly detailed
rsync-based script to synchronize the images of your template and your
nodes, but not >>this<< file or >>that<< file, and even so if you do a
major distribution upgrade you'll simply have to reinstall from the bare
image as replacing e.g. glibc on a running system is probably not a good
idea.  No matter how you cut it, you'll end up doing a fair amount of
work to keep your systems sync'd and current and quite a lot of work for
a full upgrade.
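
A minimal sketch of such an rsync arrangement, run on the template
node (the exclude list is purely illustrative -- your set of
node-specific files will differ):

  #!/bin/sh
  # push the template's root filesystem out to each node, leaving
  # pseudo-filesystems and per-node identity files alone
  for node in node1 node2 node3 node4 node5 node6 node7 node8; do
    rsync -a --delete \
      --exclude /proc \
      --exclude /etc/sysconfig/network \
      --exclude /etc/fstab \
      --exclude /var/log \
      / ${node}:/
  done

Innocent looking, but every distribution upgrade is an opportunity
for the exclude list to be subtly wrong.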

Compare with the kickstart alternative above.  The ONLY work is in
building the kickstart file for "a node", which is mostly a matter of
selecting packages for the install and yes, writing a post-install
script to handle any site-specific customization.  The post-install
script will generally NOT have to mess with specific packages, though,
since as a general rule their RPMs already contain the post-install
instructions needed for seamless installation.  At most it
will have to install the right fstab, set up NIS or install the default
/etc/passwd and so forth -- the things that have to be done regardless
of the (non-Scyld) node install methodology.
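
For reference, the entire per-node "methodology" can fit in one short
file.  A skeletal RH-style kickstart file might look like the
following (the server name, partition sizes, and package groups are
illustrative assumptions, not a drop-in config):

  lang en_US
  keyboard us
  install
  nfs --server kickstart.my.dept.edu --dir /export/redhat
  network --bootproto dhcp
  rootpw --iscrypted XXXXXXXXXXXXX
  timezone US/Eastern
  zerombr yes
  clearpart --all
  part / --size 2048
  part swap --size 512
  lilo --location mbr

  %packages
  @ Base
  @ Networked Workstation

  %post
  # the site-specific steps you'd have to do under any install
  # method: fstab, NIS, and friends
  cp /misc/config/fstab.node /etc/fstab
  echo "NISDOMAIN=my.nis.domain" >> /etc/sysconfig/network

Once this file exists, "installing a node" is just the
floppy-and-reset procedure described earlier.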

Regarding the scaling to more nodes -- the work is truly not significant
and how much there is depends on how much you care that a particular
node retains its identity.  I tend to boot each new node twice -- once
to get its ethernet number for the dhcpd.conf table entry that assigns
it its own particular IP number, and once to do the actual install.
This is laziness on my part -- if I were more energetic (or had 256 nodes
to manage and were PROPERLY lazy:-) I'd invest the energy in developing
a schema whereby nodes were booted with an IP number from a pool during
the install while gleaning their ethernet addresses and then e.g. run a
secondary script on the dhcpd server to install all the gleaned
addresses with hard IP numbers and do a massive reboot of the newly
installed nodes.  Or something -- there are several other ways to
proceed.  However, with only 8-16 nodes it is hardly worth it to mess
with this as it takes only twenty seconds to do a block copy in the
dhcpd.conf and edit the ethernet numbers to correspond to what you pull
from the logs -- maybe five minutes total for 8 nodes, and even the
simplest script would take a few hours to write and test.  For 128 nodes
it is worth it, of course.
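
Concretely, the per-node bookkeeping is just one stanza per node in
dhcpd.conf (the hostnames and addresses below are illustrative):

  host node1 {
    hardware ethernet 00:50:56:xx:xx:xx;
    fixed-address 192.168.1.101;
  }
  host node2 {
    hardware ethernet 00:50:56:xx:xx:xx;
    fixed-address 192.168.1.102;
  }

and "pulling the ethernet numbers from the logs" is essentially:

  grep DHCPDISCOVER /var/log/messages | awk '{print $8}' | sort -u

(the field number depends on your syslog format -- eyeball a line
first).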

>
> Another concern here is actual floor space.  We have about a 6ft x 3ft
> area for the computers, so I think I am just going to be putting them
> onto a Home Depot industrial shelving system or something similar, so
> dual processor systems may be much better for me.  Cooling and
> electricity have both already been taken care of.

With only 8 nodes the space is adequate and they should fairly easily
run on a single 20 Amp circuit.  You're right at the margin where
cooling becomes an issue -- you'll be burning between one and two
kilowatts, sustained, with most node designs -- depending mostly on
whether they are single or dual processor nodes.
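
To put rough numbers on that margin: nodes of this class draw very
roughly 100-150 W apiece under load, so eight singles burn on the
order of 0.8-1.2 kW while eight duals push toward 2 kW.  A 20 Amp
circuit at 120 V supplies 2.4 kW, and conservative practice is to
load a circuit to only about 80% of that (call it 1.9 kW) for a
continuous draw -- so eight duals sit right at the edge.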

>
> I appreciate any help that is provided as I know someone out there has
> had similar experiences (possibly with this software package)

I'm afraid I cannot help you with the software package, but I can still
give you some generic advice -- in one sense you MAY be heavily
overbudgeted for only 8 nodes.  I actually favor answering all the
architectural questions before setting the budget and not afterwards,
but I'm also fully aware that this isn't always how things work in the
real world.

What you need to do (possibly by checking with the authors of the
package itself) is to figure out what is likely to bottleneck its
operation at the scales you wish to apply it.  Is it CPU bound (a good
thing, if so)?  Then use your budget to get as many CPU cycles as
possible per dollar and minimize your networking and memory expenses
(e.g. cheap switched 100BT and just get the smallest memory that will
comfortably hold your application, or at least 256 MB, whichever is
larger).  Is it memory I/O bound (lots of vector operations, stream-like
performance)?  Then investing in DDR-equipped Athlons or perhaps a P4
with a wider memory path may make sense.  Look carefully at stream or
cpu-rate benchmarks and optimize cost-benefit in the occupied memory
size regime you expect to run the application in.  Is it a "real
parallel application" with moderate-to-fine granularity, perhaps
synchronous (with barriers where >>all<< the nodes have to complete a
subtask before computation proceeds on >>any<< node)?  In that case
your budget may be about right for eight nodes, as you'll need to
invest in a high-end network like Myrinet or possibly gigabit
ethernet.  In either
case you may find yourself actually spending MORE on the networking per
node than you do on the nodes themselves.
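
A quick back-of-the-envelope shows why.  Suppose (hypothetical
numbers) each node computes for 10 ms per step and then has to
exchange 1 MB at a barrier.  Over switched 100BT delivering roughly
10 MB/sec, the exchange takes ~100 ms and the nodes spend about 90%
of their time waiting.  Over a network delivering ~100 MB/sec at much
lower latency, the same exchange takes ~10 ms and only half the time
is spent communicating.  In the first case adding nodes buys you
almost nothing; in the second the job may still scale usefully.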

Finally, you need to think carefully about the single vs dual node
alternatives.  If the package is memory I/O bound AT ALL on a single CPU
it is a BAD IDEA to go with duals, as you'll simply ensure that
one CPU is often waiting for the other CPU to finish using memory so
it can use memory itself.  You can easily end up paying for two
processors in one node and getting only 1.3-1.4x as much work done as
you would with two processors in two nodes.  You also have to think
carefully about duals if
you are network bound -- remember, both CPUs in a dual will be sharing a
single bus structure and quite possibly sharing a single NIC (or bonded
channel).  Again, if your computation involves lots of communication
between nodes, one CPU can often be waiting for the other to finish
using the sole IPC channel so it can proceed.  Waiting is "bad".  We
hate waiting.  Waiting wastes money and our time.

Generally, duals make good economic sense for strictly CPU bound tasks
and "can" make decent sense for certain parallel computation models
where the two CPUs can sanely share the communications resource or where
one CPU manages net traffic while the other does computations.  The
latter can often be accomplished just as well with better/higher end
communications channels, though -- you have to look at the economics and
scaling.

Given a choice between myrinet and gigabit ethernet, my impressions from
being on the list a long time and listening are that Myrinet is pretty
much "the best" IPC channel for parallel computations.  It is very low
latency, very high bandwidth, and puts a minimal burden on the CPU when
operating.  Good drivers exist for the major parallel computation
libraries, e.g. MPI.  Check to make sure your application supports its
use if it is a real parallel app.  It may be that gigabit ethernet is
finally coming into its own -- I personally have no direct experience
with either one as my own tasks are generally moderately coarse grained
to embarrassingly parallel and I don't need high speed networking.

Hope some of this helps.  If you are very fortunate and your task is CPU
bound (or only weakly memory bound) and coarse grained to EP and will
fit comfortably in 512-768 MB of memory, you can probably skip the
eight-node-cluster stage altogether.  If you build a "standard" beowulf
with switched 100BT and nodes with minimal gorp (a floppy and HD,
memory, a decent NIC, perhaps a cheap video card) you can get 512 MB
DDR-equipped bleeding edge (1.4 GHz) Athlon nodes for perhaps $850
apiece.  (Cheap) switched 100Base ports cost anywhere from $10 each to
perhaps $30 each in units from 8 to 40 ports.  You can easily do
something like:

  23 x $900 nodes                                  = $20,700
  1 x $2,000 "head node" (lotsa disk, maybe a Gbps
    ethernet NIC)                                  =  $2,000
  1 x 24 port 100BT switch with a gigabit
    port/uplink for your head node                 = <$1,000
  shelving etc.                                    =    $500
  ----------------------------------------------------------
  Total                                            ~ $24,200

That is, you could build a 24 node 'wulf comfortably within your
$25K budget.  Even if
you have to get Myrinet for each node (and hence spend $2000/node) you
can probably afford 12 nodes, one equipped as a head node.

Good luck.

    rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
