[Beowulf] Personal Introduction & First Beowulf Cluster Question
gus at ldeo.columbia.edu
Mon Dec 8 13:17:47 PST 2008
Hello Steve and list
Steve Herborn wrote:
>The hardware suite is actually quite sweet, but has been mismanaged rather
>badly. It has been left in a machine room that is too hot & on power that
>is more than flaky with no line conditioners. One of the very first things
>I had to do was replace almost two-dozen Power Supplies that were DOA.
Yes, 24 power supplies may cost as much as was saved by skipping the UPS,
plus the headache of replacing them, plus failing nodes.
>I think I have most of the hardware issues squared away right now and need
>to focus on getting her up & running, but even installing the OS on a
>head-Node is proving to be troublesome.
Besides my naive encouragement to use Rocks,
I remember some recent discussions here on the Beowulf list
about different techniques to set up a cluster.
See this thread, and check the postings by
Bogdan Costescu, from the University of Heidelberg.
He seems to administer a number of clusters, some of which have
constraints comparable to yours, and to use a variety of tools for this:
>I really wish I could get away with using ROCKS as there would be such a
>greater reach back for me over SUSE. Right now I am exploring AutoYast to
>push the OS out to the compute nodes,
Long ago I looked into System Imager, which was then part of Oscar,
but I don't know if it is current/maintained:
>but that is still going to leave me
>short on any management tools.
That is true.
Tell your bosses they are asking you to reinvent the Rocks wheel.
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
>Steven A. Herborn
>U.S. Naval Academy
>Advanced Research Computing
>From: Gus Correa [mailto:gus at ldeo.columbia.edu]
>Sent: Monday, December 08, 2008 1:45 PM
>Cc: Steve Herborn
>Subject: Re: [Beowulf] Personal Introduction & First Beowulf Cluster
>Hello Steve and list
>In the likely case that the original vendor will no longer support this
>5-year-old cluster,
>you can try installing the Rocks cluster suite, which is free from SDSC,
>and which you already came across:
>This would be a path of least resistance, and may get your cluster up and
>running again with relatively small effort.
>Of course there are many other solutions, but they may require more effort
>from the system administrator.
>Rocks is well supported and documented.
>It is based on CentOS (free version of RHEL).
>There is no support for SLES on Rocks,
>so if you must keep the current OS distribution, it won't work for you.
>I read your last paragraph, but you may argue with your bosses that the
>age of this
>machine doesn't justify being picky about the particular OS flavor.
>Bringing it back to life, making it a useful asset,
>with a free software stack, would be a great benefit.
>You would spend money only on application software (e.g. Fortran
>compiler, Matlab, etc).
>Other solutions (e.g. Moab) will cost money, and may not work with
>this old hardware.
>Sticking to SLES may be a catch-22, a shot in the foot.
>Rocks has a relatively large user base, and an active mailing list for help.
>Moreover, Rocks requires at least 1GB of RAM on every node,
>two Ethernet ports on the head node, and one Ethernet port on each
>compute node.
>Check the hardware you have.
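One way to run that check on each node is a small script. This is only a
sketch, assuming a Linux node that exposes /proc/meminfo and /sys/class/net.
The function names, the 10% slack on RAM (MemTotal reports a bit less than
the installed memory), and the "eth*/en*" name filter are my own
illustrative assumptions, not anything that ships with Rocks.

```python
import os

# 1 GB per node is the minimum quoted above. MemTotal is always somewhat
# below installed RAM, so allow 10% slack (an assumption, adjust to taste).
MIN_RAM_KB = int(0.9 * 1024 * 1024)


def mem_total_kb(meminfo_text):
    """Return MemTotal in kB parsed from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def ethernet_ports(iface_names):
    """Keep names that look like wired Ethernet (a naming heuristic only)."""
    return [n for n in iface_names if n.startswith(("eth", "en"))]


def meets_minimum(meminfo_text, iface_names, is_head_node):
    """Check RAM and Ethernet port count against the minimums above."""
    needed_ports = 2 if is_head_node else 1
    enough_ram = mem_total_kb(meminfo_text) >= MIN_RAM_KB
    enough_ports = len(ethernet_ports(iface_names)) >= needed_ports
    return enough_ram and enough_ports


def check_this_node(is_head_node=False):
    """Run the check against the machine this script is executed on."""
    with open("/proc/meminfo") as f:
        text = f.read()
    return meets_minimum(text, os.listdir("/sys/class/net"), is_head_node)
```

Calling check_this_node() on a compute node, or
check_this_node(is_head_node=True) on the head node, returns True or False.
It says nothing about PXE support, which still has to be checked in the BIOS.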
>Although PXE boot capability is not strictly required, it makes
>installation much easier.
>Check your motherboard and BIOS.
>I have a small cluster made of five salvaged Dell Precision 410 (dual
>running Rocks 4.3, and it works well.
>For old hardware Rocks is a very good solution, requiring a modest
>investment of time,
>and virtually no money.
>(In my case I only had to buy cheap SOHO switches and Ethernet cables,
>but you probably already have switches.)
>If you are going to run parallel programs with MPI,
>the cheapest thing would be to have GigE ports and switches.
>I wouldn't invest in a fancier interconnect on such an old machine.
>(Do you have any fancier interconnect already, say Myrinet?)
>However, you can buy cheap GigE NICs for $15-$20, and high end ones (say
>Intel Pro 1000) for $30 or less.
>This would be needed only if you don't have GigE ports on the nodes already.
>Your motherboards probably have dual GigE ports, but I don't know.
>MPI over 100T Ethernet is a real pain, don't do it, unless you are a
>A 64-port GigE switch to support MPI traffic would also be a worthwhile
>Keeping MPI on a separate network, distinct from the I/O and cluster
>control net, is a good thing.
>It avoids contention and improves performance.
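To put rough numbers on the 100T versus GigE point, here is a
back-of-the-envelope estimate. It is a sketch only: it uses the nominal
line rates (100 Mb/s and 1 Gb/s) and a simple latency-plus-size/bandwidth
model. The 50 microsecond latency is a value I assumed for illustration,
not a measurement, and real MPI throughput will be lower than line rate.

```python
FAST_ETHERNET_BPS = 100e6   # nominal 100 Mb/s line rate
GIGE_BPS = 1e9              # nominal 1 Gb/s line rate
ASSUMED_LATENCY_S = 50e-6   # assumed per-message latency, not measured


def transfer_time_s(message_bytes, bandwidth_bps, latency_s=ASSUMED_LATENCY_S):
    """Estimate the time to send one message: latency + size / bandwidth."""
    return latency_s + (message_bytes * 8) / bandwidth_bps


def gige_speedup(message_bytes, latency_s=ASSUMED_LATENCY_S):
    """Ratio of the Fast Ethernet estimate to the GigE estimate."""
    slow = transfer_time_s(message_bytes, FAST_ETHERNET_BPS, latency_s)
    fast = transfer_time_s(message_bytes, GIGE_BPS, latency_s)
    return slow / fast
```

For example, transfer_time_s(1_000_000, FAST_ETHERNET_BPS) versus
transfer_time_s(1_000_000, GIGE_BPS) compares a 1 MB message on the two
networks. With this model, large messages approach the 10x ratio of the
line rates, while small messages are dominated by latency and gain much less.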
>A natural precaution would be to back up all home directories before you
>and any precious data or filesystems.
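A backup like that can be as simple as one compressed tar archive per
directory tree. The following is a minimal sketch using Python's standard
tarfile module. The function name is my own, and the paths in the usage
note are hypothetical placeholders. A plain tar or rsync from the shell
would do the same job.

```python
import os
import tarfile


def backup_directory(source_dir, archive_path):
    """Write a gzip-compressed tar archive of source_dir.

    Returns the number of entries in the archive, found by reopening it,
    which also serves as a basic readability check.
    """
    # Store entries under the directory's own name, not its full path.
    base = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=base)
    with tarfile.open(archive_path, "r:gz") as tar:
        return len(tar.getnames())
```

A call such as backup_directory("/home", "/mnt/backup/home.tar.gz") would
archive the home directories, assuming those paths exist on your system.
Keep the archive on a disk outside the cluster, since the point is to
survive a failed reinstall or a dead RAID.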
>I suggest sorting out the hardware issues before anything else.
>It would be good to evaluate the status of your RAID,
>and perhaps use that particular node as a separate storage appliance.
>You can try just rebuilding the RAID, and see if it works, or perhaps
>replace the defective disk(s),
>if the RAID controller is still good.
>Another thing to look at is how functional your Ethernet (or GigE)
>switch or switches are,
>and if you have more than one switch how they are/can be connected to
>(One for the whole cluster? Two or more separate? Some specific topology
>connecting many switches?)
>I hope this helps,