Dual-Athlon Cluster Problems
Chris Steward chris at wehi.edu.au
Wed Jan 22 22:45:40 PST 2003
Hi,

We're in the process of setting up a new 32-node dual-Athlon cluster running Red Hat 7.3, kernel 2.4.18-19.7.xsmp. The configuration is listed below.

We're having problems with nodes hanging during calculations, sometimes only after several hours of runtime. We have a serial console connected to such nodes, but we are unable to interact with a node through it once the node hangs. Nothing is logged either. Running jobs on one CPU doesn't seem to present much of a problem, but when the machines are fully loaded (both CPUs at 100% utilization) errors start to occur and machines die, often up to 8 nodes within 24 hours. Temperature of nodes under full load is approximately 55C.

We have tried using the "noapic" option but the problems still persist. Using other software not requiring EnFuzion 6 also produces the same problems.

We seek feedback on the following:

1/ Are there issues using Red Hat 7.3 as opposed to 7.2 in such a setup?
2/ Are there known issues with 2.4.18 kernels and AMD chips? We suspect the problems are kernel related.
3/ Are there any problems with dual-Athlon clusters using the MSI K7D Master L motherboard?
4/ Are there any other outstanding issues with these machines under constant heavy load?

Any advice/help would be greatly appreciated.

Thanks in advance
Chris

--------------------------------------------------------------
Cluster configuration

node configuration:
  CPUs: Athlon MP2000+
  RAM: 1024 MB Kingston PC2100 DDR
  Operating system: Red Hat 7.3 (with updates)
  Kernel: 2.4.18-19.7.xsmp
  Motherboard: MSI K7D Master L (Award BIOS 1.5)
  Network: On-board PCI (Ethernet controller: Intel Corp. 82559ER (rev 09)), using latest Intel drivers, "no sleep" option set

head-node:
  CPU: single Athlon MP2000+

Dataserver:
  CPU: single Athlon MP2000+
  Network: PCI Gigabit NIC

Network Interconnect: Cisco 2950 (one GBIC installed)

Software:
  Cluster management: EnFuzion 6
  Computational: Dock V4.0.1
More information about the Beowulf mailing list
