[Beowulf] Cluster doesn't like being moved
Steve Herborn herborn at usna.edu
Tue Mar 10 11:35:39 PDT 2009
I have a small test cluster built on Novell SUSE Linux Enterprise Server (SLES) 10.2 that is giving me fits. It seems that every time the hardware is physically moved (I keep getting kicked out of the space I'm using), I end up with any number of different problems. Personally, I suspect some type of hardware issue (this equipment is about 5 years old), but one of my co-workers isn't so sure hardware is in play.

I was having problems with the RAID initializing after one earlier move, which I resolved a while back by reseating the RAID controller card. This time it appears that the file system and configuration databases became corrupted after moving the equipment. Several services aren't starting up (LDAP, DHCP, and PBS, to name a few), and YaST2 hangs any time an attempt is made to use it, for example when adding a printer or a software package.

My co-worker feels the issue may be related to the ReiserFS file system with AMD processors. ReiserFS was the default presented when I initially installed SLES, so I went with it.

Do you know of any issues with using ReiserFS on AMD-based systems, or have any other ideas about what I may be facing?

Steven A. Herborn
U.S. Naval Academy
Advanced Research Computing
410-293-6480 (Desk)
757-418-0505 (Cell)

_____

From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of gossips J
Sent: Monday, March 09, 2009 5:08 AM
To: beowulf at beowulf.org
Subject: [Beowulf] HPCC "intel_mpi" error

Hi,

We are using ICR validation. We are facing the following problem while running the command below:

    cluster-check --debug --include_only intel_mpi /root/sample.xml

The problem is that the output of Cluster Checker shows that "intel_mpi" FAILED, whereas the debug.out file shows that "Hello World" is returned from all nodes.

I have a 16-node configuration and we are running 8 proc/node. The same behavior is observed with 1 proc/node, 2 proc/node, and 4 proc/node as well.

I also tried "rdma" and "rdssm" as the DEVICE in the XML file, but no luck.

If anyone can shed some light on this issue, it would be a great help.

Another thing I would like to know: is there a way to specify the "-env RDMA_TRANSLATION_CACHE" option with Intel Cluster Checker?

Awaiting your kind response.

Thanks in advance,
Polk.
