[Beowulf] Personal Introduction & First Beowulf Cluster Question

Mon Dec 8 13:17:47 PST 2008

Hello Steve and list

Steve Herborn wrote:

>The hardware suite is actually quite sweet, but has been mismanaged rather
>badly.  It has been left in a machine room that is too hot & on power that
>is more than flaky with no line conditioners.  One of the very first things
>I had to do was replace almost two-dozen Power Supplies that were DOA.  
>  
>
Yes, replacing 24 power supplies may cost as much as was saved by not
buying a UPS, plus the headache of replacing them, plus the failing nodes.

>I think I have most of the hardware issues squared away right now and need
>to focus on getting her up & running, but even installing the OS on a
>head-Node is proving to be troublesome.
>  
>
Besides my naive encouragement to use Rocks,
I remember some recent discussions here on the Beowulf list
about different techniques to set up a cluster.
See the thread linked below, and check the postings by
Bogdan Costescu, from the University of Heidelberg.
He seems to administer a number of clusters, some of which have
constraints comparable to yours, and to use a variety of tools for this:

http://www.beowulf.org/archive/2008-October/023433.html
http://www.iwr.uni-heidelberg.de/services/equipment/parallel/

>I really wish I could get away with using ROCKS as there would be such a
>greater reach back for me over SUSE.  Right now I am exploring AutoYast to
>push the OS out to the compute nodes, 
>
Long ago I looked into SystemImager, which was then part of OSCAR,
but I don't know if it is still current and maintained:

http://wiki.systemimager.org/index.php/Main_Page

>but that is still going to leave me
>short on any management tools.
>
That is true.
Tell your bosses they are asking you to reinvent the Rocks wheel.

Good luck,
Gus Correa

-- 
---------------------------------------------------------------------
Gustavo J. Ponce Correa, PhD - Email: gus at ldeo.columbia.edu
Lamont-Doherty Earth Observatory - Columbia University
P.O. Box 1000 [61 Route 9W] - Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

>Steven A. Herborn
>U.S. Naval Academy
>Advanced Research Computing
>410-293-6480 (Desk)
>757-418-0505 (Cell)
>
>
>-----Original Message-----
>From: Gus Correa [mailto:gus at ldeo.columbia.edu] 
>Sent: Monday, December 08, 2008 1:45 PM
>To: Beowulf
>Cc: Steve Herborn
>Subject: Re: [Beowulf] Personal Introduction & First Beowulf Cluster
>Question
>
>Hello Steve and list
>
>In the likely case that the original vendor no longer supports this
>5-year-old cluster,
>you can try installing the Rocks cluster suite, which is free from SDSC
>and which you have already come across:
>
>http://www.rocksclusters.org/wordpress/
>
>This would be a path of least resistance, and may get your cluster up and
>running again with relatively small effort.
>Of course there are many other solutions, but they may require more effort
>from the system administrator.
>
>Rocks is well supported and documented.
>It is based on CentOS (free version of RHEL).
>
>There is no support for SLES on Rocks,
>so if you must keep the current OS distribution, it won't work for you.
>I read your last paragraph, but you may argue with your bosses that the
>age of this machine doesn't justify being picky about the particular
>OS flavor.
>Bringing it back to life, making it a useful asset,
>with a free software stack, would be a great benefit.
>You would spend money only on application software (e.g. Fortran
>compiler, Matlab, etc).
>Other solutions (e.g. Moab) will cost money, and may not work with
>this old hardware.
>Sticking to SLES may be a catch-22, a shot in the foot.
>
>Rocks has a relatively large user base, and an active mailing list for help.
>
>Moreover, Rocks requires at a minimum 1GB of RAM on every node,
>two Ethernet ports on the head node, and one Ethernet port on each
>compute node.
>Check the hardware you have.
>Although PXE boot capability is not strictly required, it makes 
>installation much easier.
>Check your motherboard and BIOS.
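
The minimums above could be checked against a hand-made inventory with a
short script. This is only a sketch in Python: the Node class, its field
names, and the example values are made up for illustration and are not
part of Rocks. The numbers have to be entered by hand from what you find
in each box.

```python
from dataclasses import dataclass

# Minimums as stated in the message above.
MIN_RAM_MB = 1024        # 1GB of RAM on every node
MIN_HEAD_NICS = 2        # two Ethernet ports on the head node
MIN_COMPUTE_NICS = 1     # one Ethernet port on each compute node


@dataclass
class Node:
    """Hypothetical inventory record, filled in by hand per machine."""
    name: str
    ram_mb: int
    nics: int
    is_head: bool = False
    pxe: bool = False    # PXE boot is optional but makes installation easier


def check_node(node):
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if node.ram_mb < MIN_RAM_MB:
        problems.append(f"{node.name}: {node.ram_mb} MB RAM, need {MIN_RAM_MB}")
    need = MIN_HEAD_NICS if node.is_head else MIN_COMPUTE_NICS
    if node.nics < need:
        problems.append(f"{node.name}: {node.nics} Ethernet port(s), need {need}")
    return problems


def nodes_without_pxe(nodes):
    """Names of nodes that would need a manual (non-PXE) install."""
    return [n.name for n in nodes if not n.pxe]


if __name__ == "__main__":
    # Example values only; replace with your real inventory.
    inventory = [
        Node("head", ram_mb=2048, nics=2, is_head=True, pxe=True),
        Node("compute-0-0", ram_mb=512, nics=1, pxe=False),
    ]
    for node in inventory:
        for problem in check_node(node):
            print("FAIL", problem)
    print("No PXE:", nodes_without_pxe(inventory))
```

A node with no PXE is not reported as a failure, since PXE is not
strictly required; it is only listed so you know which nodes need a
boot CD.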
>
>I have a small cluster made of five salvaged Dell Precision 410 (dual 
>Pentium III)
>running Rocks 4.3, and it works well.
>For old hardware Rocks is a very good solution, requiring a modest 
>investment of time,
>and virtually no money.
>(In my case I only had to buy cheap SOHO switches and Ethernet cables,
>but you probably already have switches.)
>
>If you are going to run parallel programs with MPI,
>the cheapest thing would be to have GigE ports and switches.
>I wouldn't invest in a fancier interconnect on such an old machine.
>(Do you have any fancier interconnect already, say Myrinet?)
>However, you can buy cheap GigE NICs for $15-$20, and high end ones (say 
>Intel Pro 1000) for $30 or less.
>This would be needed only if you don't have GigE ports on the nodes already.
>Probably your motherboards have dual GigE ports, I don't know.
>MPI over 100 Mbit/s (100BASE-T) Ethernet is a real pain. Don't do it,
>unless you are a masochist.
>A 64-port GigE switch to support MPI traffic would also be a worthwhile 
>investment.
>Keeping MPI on a separate network, distinct from the I/O and cluster 
>control net, is a good thing.
>It avoids contention and improves performance.
>
>A natural precaution would be to back up all home directories before you
>start,
>and any precious data or filesystems.
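
One simple way to do such a backup, sketched in Python with the standard
tarfile module. The function name and the paths in the usage line are
only examples, not anything specific to your cluster. The archive should
of course go to a disk that will not be touched by the reinstall.

```python
import tarfile
from pathlib import Path


def backup_tree(src, dest_dir):
    """Archive the directory src into dest_dir/<name>.tar.gz.

    Returns the archive path and the list of member names read back
    from the archive, as a basic sanity check that it is readable.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    archive = dest_dir / f"{src.name}.tar.gz"

    # Write the whole tree, stored under its own top-level directory name.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)

    # Re-open the archive and list its contents before trusting it.
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
    return archive, names


if __name__ == "__main__":
    # Example paths only (assumptions); run as root to read all homes.
    archive, names = backup_tree("/home", "/mnt/external/backup")
    print(f"{archive}: {len(names)} entries")
```

Listing the archive does not prove every file is intact, so for
really precious data a test restore to a scratch directory is safer.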
>
>I suggest sorting out the hardware issues before anything else.
>
>It would be good to evaluate the status of your RAID,
>and perhaps use that particular node as a separate storage appliance.
>You can try just rebuilding the RAID, and see if it works, or perhaps 
>replace the defective disk(s),
>if the RAID controller is still good.
>
>Another thing to look at is how functional your Ethernet (or GigE) 
>switch or switches are,
>and if you have more than one switch how they are/can be connected to 
>each other.
>(One for the whole cluster? Two or more separate? Some specific topology 
>connecting many switches?)
>
>I hope this helps,
>Gus Correa
>
>  
>