Tyan S2466N boot problems...

Robert G. Brown rgb at phy.duke.edu
Wed May 22 05:55:02 PDT 2002


On Wed, 22 May 2002, Alberto Ramos wrote:

> 
> 
>   Hi all.
>   
>   We are mounting a beowulf cluster with 4 dual nodes with Tyan S2466N
> motherboards.
> 
>   We have 2 problems:
>   
>   - How can we make a node boot without a video card? A node that works
> perfectly with a video card seems not to boot without it.

This one I can offer real help with, having spent the last few days
working out an installation protocol.  See below.  This is the sheet our
vendor is going to use building our 2U 2466N nodes, starting from the
ground up.  You can ignore the gigabit instructions.  You WILL need to
visit the Tyan website and get the flash image for the 4.01q bios
update, as it fixes several bugs, one of them critical to serial console
operation (and it is all of six days old:-).

BTW, I'd welcome comments on the protocol from other 2466N owners.  For
example, I set ECC scrub -- is this necessary/desirable?  Given our
occasional troubles (see below), I've tried to use the most aggressive
memory setting to avoid possible memory-based errors (to no avail,
actually).

>   - We have 512MB PC2100 DDR ECC Reg. memory modules (ATP AB64L72A8S8BOS),
> which according to Tyan's web page should work perfectly. With just 1 DIMM it
> works fine, but when we put 2 DIMMs in the same machine, the machine doesn't boot.

We have seen both odd and inconsistent behavior.  One of our nodes (of
the two we've brought up so far) flashed and installed more or less
"perfectly", and can be rebooted at will and appears completely stable
on the basis of two or three days of tests.  The other node has been
trouble from the beginning.

Although we successfully flashed it, got PXE-boot-install working, and
configured it for serial console only operation, it had a tendency to
hang part way through a POST or very early in a boot.  If it achieved a
full boot, it would sometimes run for a while and then generate a whole
string of faults in running applications.  Needless to say, this is a
totally stable linux (RH 7.2 based) distribution that we run flawlessly
on well over 100 hosts and beowulf nodes in our department alone.  Then
it would run for a while normally, maybe even boot a few times normally,
then suddenly on a boot it would once again hang in the POST.

Once booted into a hang, it would NEVER boot unless one rearranged the
physical memory.  If I took a DIMM out, it would then boot and run for a
few cycles of boot normally, then hang again.  I could put the DIMM
>>back<< and it would boot and run a few times.  Sometimes I could just
move the DIMM from one socket to another and get a few boots out of it.

We cycled three different ECC registered DIMMs through the box from two
different manufacturers, both on Tyan's approved list, and at least one
of which was known to work perfectly in another host.  We used them one
at a time, in different slots, and in different pairwise combinations
and all exhibited this trait, so we don't think it is the memory per se.

Our solution so far has been to give the system back to our vendor to
bench test by component, as this appears to be a hardware problem, not a
matter of using the right settings in the bios.  At a guess, SOMETHING
is marginal -- the motherboard itself, one of the CPUs (we didn't play
the swap-n-remove game with the CPUs as we didn't have enough extras to
swap through at the time), the power supply -- or there is a cable with
a hidden flaw in it (the latter because in swapping the memory around
one inevitably wiggles the power supply wires and the IDE ribbon cable
to the system's single disk).

Because the problem is intermittent, it is difficult to be sure that any
particular swap has fixed anything.  For a while I thought that I had a
solution in setting the bios to ALWAYS test all memory (quickboot
disabled), as it seemed most often to hang right after completing the
memory test in quickboot mode, but this just displaced the systematic
point of failure to after the system started to boot from hard disk and
loaded the initrd image.

It does have the look of a memory problem, and may even be a bios bug in
the new flash image (something in the POST, perhaps), but if so it is
one that we have only managed to trigger in one of two more or less
identical boxes of hardware.

My advice to you:  Try the protocol below on several of your boxes, not
just one.  Test the boot/reboot phase especially carefully, both with
full powerdown in between and without.  If you see the problem above
(hanging on a reboot) try altering your memory configuration -- if you
see this and we see this and others see this on systems built far apart
from one another with different components altogether, it makes it very
likely that it is Tyan or AMD's problem and we can bug them for a fix.
So to speak.  If you have a completely consistent failure, try the swap
memory/CPU game, and also try running it "stripped" -- once you get
things configured in the bios, you should be able to get to the PXE
booting phase with absolutely nothing attached to the motherboard but
power and a network cable (and a serial line, of course, to watch).  We
would hang even before this point pretty consistently.
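
If several of us are going to compare notes, it helps to log every boot
attempt the same way and tally the hangs by node, memory layout, and
cold vs. warm boot.  What follows is only a sketch in Python of how one
might do that; the record layout, the node names, and the sample entries
are all made up for illustration and are not data from our boxes.

```python
from collections import defaultdict


def hang_rates(records):
    """Tally hangs per (node, dimm_layout, cold) from boot-test records.

    Each record is a tuple (node, dimm_layout, cold, booted_ok), where
    `cold` is True for a boot after a full powerdown and False for a
    warm reboot.  Returns {(node, dimm_layout, cold): (hangs, attempts)}.
    """
    tally = defaultdict(lambda: [0, 0])
    for node, dimm_layout, cold, booted_ok in records:
        entry = tally[(node, dimm_layout, cold)]
        entry[1] += 1          # one more attempt
        if not booted_ok:
            entry[0] += 1      # one more hang
    return {key: tuple(counts) for key, counts in tally.items()}


# Hypothetical sample log -- NOT real measurements.
SAMPLE_LOG = [
    ("node1", "1 DIMM, slot 1", True, True),
    ("node1", "1 DIMM, slot 1", False, True),
    ("node2", "2 DIMMs, slots 1+2", True, False),
    ("node2", "2 DIMMs, slots 1+2", True, True),
    ("node2", "2 DIMMs, slots 1+2", False, False),
]

if __name__ == "__main__":
    for (node, layout, cold), (hangs, attempts) in sorted(
        hang_rates(SAMPLE_LOG).items()
    ):
        kind = "cold" if cold else "warm"
        print(f"{node:6} {layout:20} {kind}: {hangs} hang(s) in {attempts} boot(s)")
```

Keeping cold and warm boots in separate buckets matters here, since a
hang that only shows up after a full powerdown points in a different
direction than one that shows up on every reboot.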

Hope this helps.

The protocol sheet (make your own bios flash floppy):




                         Tyan 2466N Installation
                        (Duke Physics Department)



  a) Assemble basic system -- connect all cables and jumpers for
"normal" operation with a 266 MHz FSB, 66 MHz PCI, and the switches and
LEDs all hooked up.  Connect all fans directly to power supply (not to
fan connectors on the motherboard) for continuous operation.  

Initially, put the Gigabit ethernet adapter (dismounted from the
backplate) in (64 bit) slot 1 or 2 and put a cheap video adapter (also
with backplate disassembled) in (32 bit) slot 3.  This will provide
keyboard and video for the burn in and bios install.  You will also need
a floppy, temporarily, to facilitate both burn in and bios reflashing
below.

  b) Boot (with floppy) and test/burn in as usual.  The system must be
reasonably stable and reliable before the BIOS reflashing step as
interrupting it is a Bad Thing(tm).

  c) Boot with accompanying floppy into DOS.  This floppy should
AUTOMATICALLY run:

     b57util pxee 0

(pxe enable) to reset the Gigabit ethernet adapter to PXE mode.

It should then reflash the motherboard BIOS to 4.01q with the command
below.  DO NOT INTERRUPT THIS PROCESS with the bios half-flashed or you
can render the motherboard "permanently" dysfunctional and have to
return the board to the factory for repair.

  d) As the system reboots after the flash and enters the POST phase,
press F2 to enter the BIOS setup.  Set the following BIOS options:

    i) under Keyboard, toggle the system to ignore a keyboard error on
boot.  This is essential for operation of a serial console and will not
work unless 4.01q has been installed via flash.

   ii) under advanced/chipset/ecc-config, set memory to ECC Scrub

  iii) under advanced/io, disable the parallel printer port

   iv) under power, disable power savings

    v) under advanced/io, disable the USB ports

   vi) under advanced, disable the ps2 mouse.

  vii) under advanced, select console redirection.  Set it to use serial
port A, direct connection, 115200 baud, no flow, vt100 terminal, CR
after POST is on.

 viii) Under boot, move MB/PXE device(s) up to the first position,
followed by the hard disk (remember, the floppy is not a part of the
system).

   ix) Under the main panel, remove/disable the floppy drive A.

    x) Save and exit.

  e) Power the system down.  Remove the floppy drive.

  f) At this point the system SHOULD be controllable from the serial
console.  To test it, connect a null-modem serial cable to Com A and to
the comm port of a nearby system with a terminal program.  Set the
terminal program's comm parameters to match the settings of the serial
console (115.2 kbaud, no flow, vt100).  When you boot, you SHOULD be
able to see the boot occur both on the (still attached) video and the
serial console, and should be able to control most of it from either
location.

  g) After verifying this, power the system down.  Remove the video card
and keyboard, reinstall the network card's backplate and install it in
the 64-bit riser.

  h) Install the riser.  The wired key goes in slot 2, with the wired
part towards the REAR of the case (these are address lines) NOT toward
the middle of the motherboard.

  i) At this point the system should be "completed" and just about ready
to button up for delivery.  Reboot one last time with no video and no
floppy, using only the serial console to monitor the boot.  You should
see the system come up, POST, and then attempt to PXE boot via its
onboard NIC.  As this times out it will sequentially test the second NIC
and the hard disk.  If it gets through all three and stops, announcing
that it has no operating system, it is done and ready for us to
PXE-install here.

  j) RECORD THE NIC ADDRESSES!  Please attach labels to each unit with
the ethernet addresses of both the onboard interface and the gigabit
interface (indicating which is which).  The onboard interface address is
printed on top of the RJ45 plug housing on the motherboard; the gigabit
interface address is printed directly on the card.

  k) The system should now be ready for delivery.  Pack it up to go.
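
For the final boot in step i), one could capture the serial console
output to a file and check that the stages show up in the expected
order: POST, onboard NIC PXE attempt, second NIC PXE attempt, hard disk
with no operating system.  A rough sketch in Python follows.  The marker
strings are placeholders invented for this example; the real text
printed by the BIOS and the PXE ROMs is not reproduced here, so take the
markers from a capture of a known-good node.

```python
def first_missing_stage(console_log, stages):
    """Return the name of the first stage not seen in order, or None.

    `stages` is a list of (name, marker) pairs.  Each marker must occur
    in `console_log` after the previous stage's marker.
    """
    pos = 0
    for name, marker in stages:
        found = console_log.find(marker, pos)
        if found < 0:
            return name            # the boot never got this far
        pos = found + len(marker)  # later stages must come after this one
    return None


# Placeholder markers -- replace with text from a known-good capture.
EXPECTED_STAGES = [
    ("POST", "MARK-POST"),
    ("onboard NIC PXE attempt", "MARK-PXE-ONBOARD"),
    ("gigabit NIC PXE attempt", "MARK-PXE-GIGABIT"),
    ("hard disk, no operating system", "MARK-NO-OS"),
]

if __name__ == "__main__":
    # Hypothetical capture of a node that hung after the first PXE attempt.
    capture = "... MARK-POST ... MARK-PXE-ONBOARD ..."
    stuck_at = first_missing_stage(capture, EXPECTED_STAGES)
    print("all stages seen" if stuck_at is None else f"never reached: {stuck_at}")
```

A node that hangs partway through then reports the first stage it never
reached, which is easier to compare across units than a remembered
impression of where the console stopped.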

Thank you.  Any questions on this protocol can be addressed to
rgb at phy.duke.edu or ....(sorry, he wants to remain anonymous --
beowulfers should contact only me:-).  If necessary one of us will come
in to walk you through an install.

   rgb
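
P.S.  For the labels in step j), a small script can normalize the
addresses as they are typed in, so that a mistyped or truncated one gets
caught before the unit ships.  A minimal sketch in Python, assuming
48-bit ethernet addresses; the function names, the label layout, the
unit name, and the sample addresses are all invented for illustration.

```python
import re


def normalize_mac(text):
    """Return a MAC address as lowercase colon-separated hex pairs.

    Accepts the usual separators (colon, dash, dot, whitespace) and
    raises ValueError if what is left is not exactly 12 hex digits.
    """
    digits = re.sub(r"[\s:.\-]", "", text).lower()
    if not re.fullmatch(r"[0-9a-f]{12}", digits):
        raise ValueError(f"not a 48-bit MAC address: {text!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def label_text(unit, onboard, gigabit):
    """Build the text for one unit's label, saying which NIC is which."""
    return (
        f"{unit}\n"
        f"  onboard: {normalize_mac(onboard)}\n"
        f"  gigabit: {normalize_mac(gigabit)}"
    )


if __name__ == "__main__":
    # Placeholder unit name and addresses, not from a real node.
    print(label_text("unit-01", "00-00-5E-00-53-0A", "0000.5e00.530b"))
```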


> 
>   Thank you very much.
>   
>   Alberto.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





