node problems
Kim Branson Kim.Branson at csiro.au
Thu Apr 4 07:38:50 PST 2002
Hi all
I have a 64-node Athlon cluster. At the moment about 19 nodes are
flaky: they stay up for a bit and then fall over. One can still ping
them, but not telnet or ftp to them. I'm trying to keep as many up as
possible (more nodes means I can get the final calculations done for my
PhD thesis faster....)
This may be an unrelated problem, but I see errors in the logs about
telnet:
node01 telnetd[16941]: ttloop: peer died: EOF
xinetd[17099]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:21 node01 rlogind[17099]: Can't get peer name of remote host: Transport endpoint is not connected
Apr 5 00:32:21 node01 rshd[17098]: getpeername: Transport endpoint is not connected
Apr 5 00:32:21 node01 ftpd[17097]: getpeername (in.ftpd): Transport endpoint is not connected
Apr 5 00:32:31 node01 rlogind[17100]: Can't get peer name of remote host: Transport endpoint is not connected
Apr 5 00:32:31 node01 xinetd[17101]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 xinetd[17102]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 xinetd[17103]: warning: can't get client address: Connection reset by peer
Apr 5 00:32:31 node01 ftpd[17101]: getpeername (in.ftpd): Transport endpoint is not connected
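With this many near-identical entries it can help to tally them per daemon before reading them one by one. The following is only a rough sketch: the regular expression assumes the usual "daemon[pid]: message" syslog layout seen in the lines above, and the function name tally and the sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches the "daemon[pid]: message" part of a syslog line. Any leading
# timestamp and hostname are ignored. (Assumed layout, based on the
# entries quoted in this post.)
ENTRY = re.compile(r"(?P<daemon>[\w.-]+)\[(?P<pid>\d+)\]:\s*(?P<msg>.*)")

def tally(lines):
    """Count log entries per (daemon, message) pair."""
    counts = Counter()
    for line in lines:
        m = ENTRY.search(line)
        if m:
            counts[(m.group("daemon"), m.group("msg").strip())] += 1
    return counts

# A few of the entries from the post, used here only as sample input.
sample = [
    "Apr 5 00:32:21 node01 rshd[17098]: getpeername: Transport endpoint is not connected",
    "Apr 5 00:32:31 node01 xinetd[17101]: warning: can't get client address: Connection reset by peer",
    "Apr 5 00:32:31 node01 xinetd[17102]: warning: can't get client address: Connection reset by peer",
]

if __name__ == "__main__":
    for (daemon, msg), n in tally(sample).most_common():
        print(f"{n:3d}  {daemon}: {msg}")
```

Feeding the function the lines of a node's log file instead of the sample list would show whether one daemon dominates or whether every xinetd-spawned service is affected at the same moment.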
I am using enfuzion to do job dispatch and collection. Looking at the
packets, I see the enfuzion director on the head node attempt to send a
UDP packet to the node. All UDP ports on the nodes are blocked; I
checked this by scanning a node with nmap. Older installs of Red Hat
(e.g. my workstation) seem to have UDP ports enabled.
Regardless of the ttloop error, the machine appears to work for a while,
i.e. enfuzion logs in, jobs run, etc., until suddenly everything stops.
The machines remain up and can be pinged, but no other services (rsh,
ssh, etc.) respond. If I connect a monitor and keyboard to the node, it
is also unresponsive.
This is a problem across many nodes.
Has anyone who uses enfuzion seen this error on nodes with an RH 7.1
install?
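To keep track of which nodes are in the "pings but nothing else answers" state, a small probe can be run from the head node. The sketch below is an assumption-laden illustration, not a tested monitoring tool: the names service_responds, classify and SERVICES are invented here, and the ping result has to be supplied from elsewhere. It waits for the server to send its first byte rather than just connecting, because the kernel can complete a TCP handshake on behalf of a daemon that is no longer running.

```python
import socket

def service_responds(host, port, timeout=5.0):
    """Return True if the server at host:port sends at least one byte.

    A completed TCP handshake alone is not enough: the kernel can accept
    connections for a daemon that is no longer running user-space code.
    Only suitable for protocols where the server speaks first.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return len(conn.recv(1)) > 0
    except OSError:  # refused, unreachable, or timed out
        return False

def classify(pingable, services_up):
    """Label a node from a ping result and a dict of {service: responded}."""
    if not pingable:
        return "down"
    if not any(services_up.values()):
        return "hung"      # answers ping, but no service responds
    if not all(services_up.values()):
        return "degraded"
    return "ok"

# Hypothetical service list: standard ports for server-speaks-first protocols.
SERVICES = {"ftp": 21, "ssh": 22, "telnet": 23}

def check_node(host, pingable=True):
    """Probe every service in SERVICES on one node and classify the node."""
    results = {name: service_responds(host, port)
               for name, port in SERVICES.items()}
    return classify(pingable, results)
```

Calling check_node on each node name in turn would give a list of nodes labelled "hung", which can then be compared against the times the enfuzion jobs stopped.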
On one node I have twice seen:
CPU 0: Machine Check Exception: 0000000000000004
Bank 2: d40040000000017a at 540040000000017a
Decoding this using a utility I found on the net gives:
Status: (4) Machine Check in progress.
Restart IP invalid.
parsebank(2): f60020000000017a @ 760020000000017a
External tag parity error
Correctable ECC error
MISC register information valid
Memory heirarchy error
Request: Generic error
Transaction type : Generic
Memory/IO : I/O
Can anyone tell me what "Restart IP invalid" means? Is this a dead CPU,
or a memory problem causing an MCE?
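For reference, the flag bits in the two values printed by the kernel can be picked apart by hand. The sketch below assumes the standard x86 machine-check layout (RIPV, EIPV and MCIP in the global status word; VAL, OVER, UC, EN, MISCV, ADDRV and PCC in the top bits of a bank status word). It does not attempt the Athlon-specific fields, and the helper name decode is made up. Under that assumption, "Restart IP invalid" is simply the RIPV bit being clear, meaning the saved instruction pointer cannot be relied on to resume the interrupted program; on its own it does not say whether the CPU or the memory is at fault.

```python
# Bit positions follow the x86 machine-check architecture as commonly
# documented; treat them as an assumption for this particular CPU.
MCG_STATUS_BITS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}
BANK_STATUS_BITS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                    59: "MISCV", 58: "ADDRV", 57: "PCC"}

def decode(value, layout):
    """Return {flag_name: bool} for each bit position listed in layout."""
    return {name: bool((value >> bit) & 1) for bit, name in layout.items()}

# Values copied from the kernel message quoted above.
mcg = decode(0x0000000000000004, MCG_STATUS_BITS)
bank2 = decode(0xd40040000000017a, BANK_STATUS_BITS)

if __name__ == "__main__":
    print("global status:", mcg)
    print("bank 2 status:", bank2)
    # The low 16 bits hold the machine-check error code.
    print("bank 2 error code: 0x%04x" % (0xd40040000000017a & 0xffff))
```

For the global value 4, only MCIP (machine check in progress) comes out set, so RIPV is clear, which matches the first two lines of the decoded output.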
cheers
Kim
--
______________________________________________________________________
Kim Branson
Phd Student
Structural Biology
CSIRO Health Sciences and Nutrition
Walter and Eliza Hall Institute
Royal Parade, Parkville, Melbourne, Victoria
Ph 61 03 9662 7136
Email kbranson at wehi.edu.au
______________________________________________________________________
