Dual-Athlon Cluster Problems

Craig Tierney ctierney at hpti.com
Mon Jan 27 10:09:36 PST 2003


On Sun, Jan 26, 2003 at 10:13:09PM -0800, Ben Ransom wrote:
...stuff deleted...
> 
> It is still curious to me that we can run other codes on Dolphin SCI,
> showing 100% CPU utilization (full power/heat), on a ring away from a
> suspect SCI card.  This implies the reliability is code dependent, as others
> have alluded.  I suppose this may be due to the amount of message
> passing?  Hopefully the problem will disappear once we get our full Dolphin
> set in working order.

Are you using ScaMPI or MPICH over SCI?  ScaMPI is fast, but it
implements MPI differently than MPICH does.  If the code runs correctly
with MPICH elsewhere, you might try using MPICH on your cluster as
well.  That might fix the problem if it is message-passing related, or at
least provide more data to help debug the real problem.

Craig

-- 
Craig Tierney (ctierney at hpti.com)


