Myrinet hardware reliability
Robert G. Brown rgb at phy.duke.edu
Fri Feb 7 09:56:54 PST 2003
On Fri, 7 Feb 2003, Victoria Pennington wrote:

> Hi,
>
> We have a 113 node IBM x330 cluster with Myrinet 2000. We're
> experiencing very high failure rates on Myrinet switch ports
> (average 3 per month) and on Myrinet NICs to a lesser extent
> (about 1 per month). Ports and NICs are fine one minute,
> then one or the other just dies (for good). Cables
> (fibre, not copper) seem fine - one or two failures only in
> nearly a year.
>
> There is no pattern in the failures, and they are entirely
> unrelated to usage levels; seldom used nodes are just as
> likely to have failures as heavily used nodes.

I'd at least suspect wiring and spikes. There was a long discussion on ways to MISwire computer rooms on the list a few months ago that you might look up in the list archives; within it are some URLs to sites that describe some of the problems that can occur when one plugs lots of nodes into multiphase circuits that e.g. share a common neutral.

In a nutshell, you can be building up a significant neutral line voltage and browning out your supply voltage during the critical middle third of each half-phase, when switching power supplies tend to draw all their power. This harmonic distortion of the supply voltage can cause all sorts of problems, premature and inexplicable component failure being one of them. Since systems experiencing it tend to run with their power supply capacitors inadequately charged, it can significantly reduce the ability of those power supplies to filter out spikes. Even surge protectors don't eliminate all of the problems that can be caused.

You might check up on just how the room was wired. In particular, look at the voltage between neutral and ground at the receptacles where all the nodes are plugged in. If it is as high as a few volts, you may have a problem, especially if it is high on the circuit where the systems are located and not high on the circuit where the switch is located.
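For a rough sense of scale, the neutral-to-ground voltage at a receptacle is approximately the neutral current times the resistance of the neutral conductor back to the panel. The sketch below is only an illustration of that arithmetic: the function name, the 20 A current, and the 30 m run are made-up example values, the copper resistivity is an approximate room-temperature figure, and harmonics and reactance are ignored entirely.

```python
# Rough estimate of the voltage developed along a neutral conductor.
# All inputs below are illustrative assumptions, not measurements.

COPPER_RESISTIVITY_OHM_M = 1.68e-8  # approximate value for copper near 20 C

# Nominal cross-sectional areas of a few AWG sizes, in mm^2.
AWG_AREA_MM2 = {14: 2.08, 12: 3.31, 10: 5.26, 8: 8.37}


def neutral_voltage(current_a, run_length_m, awg):
    """Return the resistive voltage drop (V) along a neutral run.

    Uses R = resistivity * length / area and V = I * R.
    """
    area_m2 = AWG_AREA_MM2[awg] * 1e-6
    resistance_ohm = COPPER_RESISTIVITY_OHM_M * run_length_m / area_m2
    return current_a * resistance_ohm


if __name__ == "__main__":
    # Hypothetical case: 20 A of neutral current over a 30 m run.
    for gauge in (12, 8):
        volts = neutral_voltage(20.0, 30.0, gauge)
        print(f"AWG {gauge}: {volts:.2f} V neutral-to-ground")
```

With these assumed numbers the drop comes out at roughly 3 V on AWG 12 and a bit over 1 V on AWG 8, which is the "few volts" range worth worrying about.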
If you have an oscilloscope, you can look at the actual supply line voltage at the receptacles loaded and unloaded, to see how badly distorted the wave is.

Solutions (if this turns out to be your problem):

a) NEVER share a neutral wire between three phases in a computer room. The load isn't resistive and doesn't have a power factor near one, and it is actually dangerous to do so (it overheats the neutral and the main supply transformer). Run a separate neutral for each phase.

b) Try to keep the runs as short as possible and use heavy gauge wire. The neutral line voltage depends on the current it carries and its resistance. Resistance increases with the length of the run, decreases with the cross-sectional area of the wire.

c) Use power factor corrected power supplies if possible, or a harmonic correction supply transformer for the entire space (there are companies that would love to sell you one).

d) Some people on the list suggested that a UPS would probably help. It seems like an expensive solution compared to running additional neutrals, and nearly as expensive as getting a harmonic correction transformer.

   rgb

Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
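As a numerical aside on why a shared neutral is risky with switching supplies: in a balanced three-phase system the fundamental currents cancel in the neutral, but third-harmonic currents are in phase on all three legs and add. The sketch below models each phase current as a fundamental plus a third harmonic and sums the three in a shared neutral. The amplitudes (10 A and 7 A peak) are arbitrary illustrative assumptions, and real power supply currents contain more harmonics than this toy model.

```python
import math


def rms(samples):
    """Root-mean-square of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def phase_current(theta, shift, fundamental_peak, third_peak):
    """Current on one phase: fundamental plus third harmonic."""
    angle = theta - shift
    return fundamental_peak * math.sin(angle) + third_peak * math.sin(3 * angle)


def neutral_and_phase_rms(fundamental_peak, third_peak, n=3600):
    """Return (neutral RMS, single-phase RMS) over one full cycle."""
    shifts = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]  # 120 degree spacing
    neutral, phase_a = [], []
    for k in range(n):
        theta = 2 * math.pi * k / n
        currents = [
            phase_current(theta, s, fundamental_peak, third_peak) for s in shifts
        ]
        phase_a.append(currents[0])
        neutral.append(sum(currents))  # shared neutral carries the sum
    return rms(neutral), rms(phase_a)


if __name__ == "__main__":
    # Hypothetical amplitudes chosen only for illustration.
    n_rms, p_rms = neutral_and_phase_rms(10.0, 7.0)
    print(f"phase RMS:   {p_rms:.2f} A")
    print(f"neutral RMS: {n_rms:.2f} A")
```

With these assumed amplitudes the shared neutral carries more RMS current than any single phase conductor, while a purely sinusoidal load (third harmonic set to zero) leaves the neutral essentially unloaded.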
More information about the Beowulf mailing list
