[Beowulf] Tyan S2882

Mark Hahn hahn at physics.mcmaster.ca
Tue Sep 26 08:14:45 PDT 2006


> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have

these are older, well-known, widely installed and certainly _can_ run stable.

> found the system to be quite unstable. After BIOS updates and kernel
> changes we still get random kernel panics when under load.

have you run memtest86?  are you monitoring temperatures?
(and perhaps voltages)
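
for the temperature side, a minimal sketch of what "monitoring" could look like, assuming a kernel that exposes sensors through the hwmon sysfs interface (tempN_input files holding millidegrees Celsius). the default path, the helper names and the 70 C limit are illustrative assumptions, not values for this board; on some kernels the files sit one level deeper, under a device/ subdirectory.

```python
from pathlib import Path


def read_temps(root="/sys/class/hwmon"):
    """Return {sensor: degrees C} for every tempN_input file under root.

    Assumes the hwmon sysfs layout, where values are in millidegrees C.
    """
    temps = {}
    for f in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(f.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor listed but unreadable; skip it
        temps[f"{f.parent.name}/{f.name}"] = millideg / 1000.0
    return temps


def too_hot(temps, limit_c=70.0):
    """Return only the sensors at or above limit_c (hypothetical limit)."""
    return {name: deg for name, deg in temps.items() if deg >= limit_c}


if __name__ == "__main__":
    for name, deg in read_temps().items():
        print(f"{name}: {deg:.1f} C")
```

logging this once a minute to another machine while the node is under load would show whether the panics line up with a temperature excursion.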

> So far we have solved the
> - broken BIOS problem with an update to the most recent BIOS.

due to a newer cpu?  the cluster I have with S2882's (mixed with 
S2881's, I think) hasn't needed any updates, but it's not using 
dual-core or anything exotic.

> - Discovered that some power supplies can produce problems
> http://www.anandtech.com/mb/showdoc.aspx?i=2608

I have a hard time believing this is specific to antec+tyan.
yes, certainly, PS's are a sensitive point, especially if you've
got heavily-configured systems.

> - FS corruption due to a firmware problem in a RAID hardware board

therefore not related to the MB, right?

> - MCE chipkill errors (non-fatal) due to apparent bad RAM

also not related to the MB, right?  also, you really should expect
some small rate of corrected ECC errors on any system; it's only a high
rate that's a problem (or uncorrectable ones, of course...)
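
to make "rate, not count" concrete, here is a rough sketch, assuming a kernel with the EDAC driver loaded so that per-controller ce_count and ue_count files appear under /sys/devices/system/edac/mc. the function names and the one-per-hour threshold are made up for illustration; what counts as "high" depends on the amount of memory and the workload.

```python
from pathlib import Path


def read_counts(root="/sys/devices/system/edac/mc"):
    """Sum corrected (ce) and uncorrected (ue) error counts over all
    memory controllers found under root. Assumes the EDAC sysfs layout."""
    totals = {"ce_count": 0, "ue_count": 0}
    for mc in Path(root).glob("mc[0-9]*"):
        for name in totals:
            try:
                totals[name] += int((mc / name).read_text())
            except (OSError, ValueError):
                pass  # counter missing or unreadable on this controller
    return totals["ce_count"], totals["ue_count"]


def assess(before, after, hours, max_ce_per_hour=1.0):
    """Compare two (ce, ue) samples taken `hours` apart.

    max_ce_per_hour is a hypothetical threshold, not a vendor figure.
    """
    ce_delta = after[0] - before[0]
    ue_delta = after[1] - before[1]
    if ue_delta > 0:
        return "uncorrectable"
    if hours > 0 and ce_delta / hours > max_ce_per_hour:
        return "high corrected rate"
    return "ok"
```

sampling read_counts() before and after a long load run and feeding both samples to assess() separates a node that logs a couple of corrected errors a day from one with a failing DIMM.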

> To be solved:
> - random kernel panics that take out the logging even when all debug
> flags are set in the kernel, as it fails to sync the disc during the
> kernel panic.

but kernel panics never sync - after all, a panic is specifically
an event from which you can't continue in any way.  or am I misunderstanding
what you're saying?

it sounds like you've done a lot of debugging already, but I'd recommend 
going back to basics.  remove all the io devices, disks, etc and see 
whether the board+cpu+memory can run stably, etc.


