MP S2460 Problem
Ken Chase math at velocet.ca
Wed Feb 26 11:39:50 PST 2003
On Wed, Feb 26, 2003 at 12:50:46PM -0500, Robert G. Brown's all...
> On Wed, 26 Feb 2003, John Morelle wrote:
>
> > Hi all,
> >
> > I would like to notice here to everyone about our complaining problems on
> > the TYAN MP based : the Tiger S2460 motherboard.
>
> You can probably find plenty of them in the list archives. I've
> detailed ours numerous times.
>
> a) They persist in crashing (even to this day) when hammered by a
> memory-intensive application. We have some 23 2460's, and probably 2 or
> 3 times that many 2466's. When the owners run the computation(s) on
> them that they got the workstations for, they crash, often with three
> days to a week.

We put Tbird 1.33s into ours like we were NOT ADVISED TO DO, into 2460s
with the original bios, and we've had NO PROBLEMS at all. I wish there was
a way to flash back to this old bios, because you can still get 1.33 and
1.4 Tbirds, and they're rock solid.

> b) Even that is maddeningly inconsistent. The same job run with the
> same parameters might crash in one day one time, four days another time,
> and not crash at all on a different 2460 -- until the time it does.

Now you need a RAIC - Redundant Array of Inexpensive Clusters. Then you
verify the results and toss out the garbage.

I actually did this with a bunch of overclocked boxen to weed out the bad
ones. I ran each job 3 times, and started flagging which boxes gave results
that differed from others. When all 3 differed, I ran it two more times and
hand picked through results. (I don't condone OC anymore though at all, this
was back in the C300A days -- tho it is interesting to note my C300A OC'd
to 450 is going on 5 years at home, as my firewall -- I WISH they built
things stable as that still! ;)

> c) We had incredibly horrible problems initially getting them to work
> with off the shelf risers. Some cards would work, in some slots,
> sometimes.
> Some of the same cards that failed would work in some of the
> slots on the motherboard if plugged directly in (no riser at all).
> We're not talking odd cards, either -- things like 3c905's and
> off-the-shelf PCI video.

Man that bios musta been garbage. How's the new bios? We had no problems
with the bios on the 2460 (mfg'd between Oct and Dec 2001) and on the 2466
(initial release mfg'd between Dec 2001 and Feb 2002). We NEVER upgraded the
bios and don't intend to for fear of having the same problems.

Oh do you mean PCI riser cards to let 2U servers have full height PCI cards
in them? Bad karma on those - we've used them in a few machines around here
and we've had problems with almost every board we've tried them on. PCI
doesn't like to have extra goop in the way of its cards.

> e) Such as the fact that if you flash the BIOS, it resets the serial
> console (which doesn't work horribly well, as it requires a keyboard to
> be plugged directly in if you want to do all sorts of important things
> but which does work). So if you actually bought 2466's WITHOUT a video
> card, expecting to use the serial console, if you reflash the BIOS you
> have to disassemble the case, insert a video card, reenter the BIOS,
> turn the serial console on again, shut it down and take out the video
> card, rerack it, power it up and do whatever via the serial console, and
> God help you if you made any sort of mistake or anything failed to
> "take" because you then get to do it all over again.

WOW. That's amazing. We can run them without keyboards fine. Occasionally
the BIOS is reset by a power flux or something like that, and yes you have
to get out the video card and stuff, but that's a minor deal. It happens
infrequently enough - when there's a power fail and subsequent very
unfriendly repowering (where it flickers and comes back on and just stresses
ALL the gear as much as possible..) we find only 2-3% of machines have a
problem. [ don't ask why there's no ups.
it's political at the customer end. ]

> Overall, the 2460's are just plain broken unstable pieces of shit that
> suck systems administration time like a black hole and have cost us
> something like 1/3 of the productivity of the cluster in question and
> infinite annoyance at the management level. We are finally biting the
> bullet and trying to gradually replace them with 2466's (reusing all the
> rest of the hardware).

My experience has been completely contrary to this. Perhaps it's because we
bought them at different times, or perhaps because we don't have MPs in
them. We have regular Tbirds, which were "not supported" by AMD. We used
them after 2 months of testing and having no problems, and here we are a
year later with no problems after the install. We've had to replace two
boards, but they were both 2466s (and then we had to buy MPs because the
replacements were the 4.01 bios that doesn't work with tbirds. Those new
BIOS boards have had more problems than all the 2460s and 2466s in the
cluster since then as well! Yay!).

> If you can get Tyan to replace them for free, please let us know. God

We got RMAs on dead boards pretty easily through our supplier. Not sure what
your deal is. The extra 10% for a 3 year warranty on CPUs and boards was
worth it.

> knows that they should -- they should replace ours as well and those
> belonging to any other poor suckers who bought them. These systems
> overall drove us to seriously consider e.g. dual Xeons (at a fairly
> similar price) just because they are relatively stable. Alas, the Xeons
> don't run my particular problem as well as Athlons...

We use a lot of Athlon gear around here and have built two clusters of them,
so far, no complaints. They blow around the same speed as all our Intel
gear, in the long run.

Again, however, if you buy inexpensive gear, just set up so it's most cost
effective to throw it away! You and I have come to this conclusion before
actually.
:) I'm sure it makes environmentalists (including the tiny one inside me)
cringe. As long as you experience regular fail rates, like perhaps a board a
month, then yer set. If you have problems across the whole cluster, where
30-60% of your gear is affected, you have a different problem.

I'm just waiting to see what happens at 18-24 months -- see if the 246x's
get bit by the exploding capacitor problem, yay! Anyone else experiencing
that?

/kc

> rgb
>
> Robert G. Brown http://www.phy.duke.edu/~rgb/
> Duke University Dept. of Physics, Box 90305
> Durham, N.C. 27708-0305
> Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

--
Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, CANADA
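
A minimal sketch of the triple-run cross-check described in the message above (run each job three times on different boxes, flag the boxes whose results differ from the others, and set aside jobs where every run disagrees for more reruns and hand inspection). The function name, the tuple layout, and the use of exact equality to compare results are assumptions for illustration only; real numerical output would likely need a tolerance or a checksum of normalized output.

```python
from collections import Counter, defaultdict


def flag_suspect_nodes(runs):
    """Cross-check redundant runs of the same jobs.

    runs: iterable of (job_id, node, result) tuples (hypothetical layout);
          result must be hashable, e.g. a checksum of the output.
    Returns (suspects, needs_rerun):
      suspects    -- Counter mapping node -> times it disagreed with the majority
      needs_rerun -- job ids where no result won a strict majority
    """
    by_job = defaultdict(list)
    for job_id, node, result in runs:
        by_job[job_id].append((node, result))

    suspects = Counter()
    needs_rerun = []
    for job_id, entries in by_job.items():
        counts = Counter(result for _, result in entries)
        winner, votes = counts.most_common(1)[0]
        if votes * 2 <= len(entries):
            # No strict majority (e.g. all 3 runs differ): rerun and inspect by hand.
            needs_rerun.append(job_id)
            continue
        for node, result in entries:
            if result != winner:
                suspects[node] += 1
    return suspects, needs_rerun


if __name__ == "__main__":
    runs = [
        ("job1", "node1", "a"), ("job1", "node2", "a"), ("job1", "node3", "b"),
        ("job2", "node1", "x"), ("job2", "node2", "y"), ("job2", "node3", "z"),
    ]
    suspects, needs_rerun = flag_suspect_nodes(runs)
    print(dict(suspects), needs_rerun)
```

A node that keeps showing up in `suspects` across many jobs is a candidate for the bad-box pile; a single disagreement is weak evidence on its own.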
