[Beowulf] lost in parallel computing

Robert G. Brown rgb at phy.duke.edu
Tue Dec 13 16:26:05 PST 2005


On Wed, 7 Dec 2005, CHEN, XIAOMING  wrote:

> Dear all,
>
> I've been practicing scientific parallel computing for 3~4 years, but as
> a remote user I never really touched the subjects on parallel computer
> management. Things work out if the remote computers I am working on are
> managed well. However, when they are not in good hands, they will go on
> 'strike' for a long time. This is what I am experiencing now. One remote
> cluster just relocated recently and it lost Myrinet. A new cluster
> purchased from Dell hasn't been working since it was installed 3 months
> ago. Another one has some strange behavior. For example, sometimes it
> writes data twice into a file in a random order; a user cannot kill his
> process unless he terminates the xwindow (i.e, exit). I guess during
> this holiday season nobody will stand out to solve the problem. But it
> seems such problems will continue to exist and evolve as computer
> technologies evolve themselves. I am wondering if an inexpensive but
> robust parallel executing environment is possible to build. If it is so
> difficult to maintain a parallel computer, how can we persuade people to
> invest money in parallel computers?

You're dealing with two issues here.

   a) People who DO invest money in parallel computers don't experience
the issues you report.  I promise.  They are all due to somebody NOT
investing money in e.g. good hardware, good software, good people.  Hire
any of a half dozen people or groups I know to build your cluster for
you, hire good experienced cluster managers, and you won't have these
problems.  Expect things to work with inadequate investment of money and
time and other resources, well, that's why I used the word "inadequate".

   b) Cluster parallel computers are often touted (correctly) as money
savers because they are "do it yourself".  That is, you can trade your
own energy and time (presumably "opportunity cost" time that is already
paid for and which can just be diverted into this where it makes sense
in terms of efficiency) for money, and build and run the clusters
yourself.  However, DIY or store bought, YOU GET WHAT YOU PAY FOR.  If
you buy cheap equipment, expect failures.  Expect failures anyway if you
have bad luck.  Expect to have to learn a lot and become expert.  Expect
to have to become a detective and run down and fix the causes of the
problems you describe above.  DIY can save you money (even a LOT of
money if you have good skills and some free time) but you work for it.

What you should NOT expect is for a cluster computer to just "happen".
As a matter of fact, one kind of cluster computer design DOES just
happen, in a manner of speaking, but only because it is already
installed and well-managed as an ordinary Linux-based Ethernet LAN of
workstations.

The final point is that cluster computers don't have to be perfect to be
cost-beneficial.  They just have to get more work done for less money
compared to the alternatives.  If you want to see REAL expense, buy a
"big iron" supercomputer design with any sort of proprietary interface,
custom hardware, and hot and cold running service contracts.  Your
service alone will cost as much as a well designed cluster might cost
(designed to accomplish the same work), and then there are the people
you need to run it.

    rgb

>
>
> This is the first time for me to post a message. Please kindly remind me
> if I do not follow the rules. I appreciate your response.
>
> Xiaoming Chen
> University of South Carolina
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




