[Beowulf] lost in parallel computing

CHEN, XIAOMING CHEN25 at engr.sc.edu
Wed Dec 7 12:53:16 PST 2005


Dear all,

I've been doing scientific parallel computing for 3 to 4 years, but as
a remote user I have never really dealt with parallel computer
management. Things work out if the remote computers I am working on are
managed well. However, when they are not in good hands, they will go on
'strike' for a long time. This is what I am experiencing now. One remote
cluster was relocated recently and lost its Myrinet interconnect. A new cluster
purchased from Dell hasn't been working since it was installed 3 months
ago. Another one behaves strangely. For example, it sometimes
writes data twice into a file in a random order, and a user cannot kill his
process unless he terminates the X Window session (i.e., exits). I guess that during
this holiday season nobody will step up to solve these problems. But it
seems such problems will continue to exist and evolve as computer
technologies themselves evolve. I am wondering whether an inexpensive but
robust parallel execution environment is possible to build. If it is so
difficult to maintain a parallel computer, how can we persuade people to
invest money in parallel computers? 


This is my first time posting a message. Please let me know
if I have not followed the list rules. I appreciate your response. 

Xiaoming Chen
University of South Carolina



