silent networking death

matthew mcglynn matt@debris.com
Mon Jul 12 20:33:38 1999


I have a 2.0.36 / RHL 5.2 box with an EtherExpress Pro card,
running the latest version of the eepro100 driver (v. 1.05).
Periodically this host goes offline; it just falls silent.

There are no pertinent messages in the system logs.

This can happen while people are connected to the host
(via ssh) or not.

We're running ftpd, apache, sshd, and little else. When
the box is reachable, everything works fine. 

On one occasion, two people had a total of four pending
connections to the box (ssh and http) from remote locations.
All connections were "pending" in the sense that the host
was not responding. After a few seconds, a third person,
standing at the server's console, pinged an outside
host... as soon as the ping went through, all four pending
ssh and http connections went through. It was as if
the ping woke up the networking system.

This host is plugged into a Cisco Catalyst 2900 (24-port)
10/100 switch with the latest Cisco firmware.

According to the ISP, the problem is that their switch
is losing the MAC address of our server. Their solution
was to enable a static (rather than dynamic) MAC address
for the port we're plugged into. For the past 4 days
our server has indeed not gone offline, whereas for the
two days prior, it had been unreachable about 50%
of the time.

The theory of the tech at the ISP is that, for unknown
reasons, the switch was forgetting the MAC address of
our NIC -- but that traffic from the host side (such as
the ping in the above example) would contain our 
server's MAC address, re-enabling two-way communication
between the world and our server.
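
To make the tech's theory concrete, here is a minimal Python sketch of
a learning switch whose dynamic MAC table entries age out. It is a toy
model only, not the Catalyst's actual behavior: the class and method
names, the 300-second aging time, the port number, and the MAC address
are all assumptions made up for illustration. Note that a standard
switch floods frames for an unknown destination rather than dropping
them, so the "no port" result here models the symptom we saw, not
documented switch behavior.

```python
class LearningSwitch:
    """Toy model of a switch's dynamic MAC address table with aging."""

    def __init__(self, aging_seconds=300):
        # Assumed aging time; the real value on the ISP's switch is unknown.
        self.aging_seconds = aging_seconds
        self.table = {}  # mac -> (port, time the MAC was last seen as a source)

    def frame_from_host(self, src_mac, port, now):
        # Entries are learned (or refreshed) from the SOURCE address
        # of each frame the switch receives on a port.
        self.table[src_mac] = (port, now)

    def lookup(self, dst_mac, now):
        # Return the port for dst_mac, or None if the entry is
        # missing or has aged out (the "forgotten MAC" in the theory).
        entry = self.table.get(dst_mac)
        if entry is None:
            return None
        port, last_seen = entry
        if now - last_seen > self.aging_seconds:
            del self.table[dst_mac]
            return None
        return port


def demo():
    sw = LearningSwitch(aging_seconds=300)
    mac = "00:00:5e:00:53:01"  # made-up address for illustration
    results = []

    sw.frame_from_host(mac, port=7, now=0)     # server sends a frame
    results.append(sw.lookup(mac, now=100))    # entry still fresh
    results.append(sw.lookup(mac, now=1000))   # server was silent: aged out
    sw.frame_from_host(mac, port=7, now=1001)  # the ping from the console
    results.append(sw.lookup(mac, now=1002))   # entry is back
    return results


if __name__ == "__main__":
    print(demo())
```

In this model, the third lookup succeeds only because the server
originated a frame, which is the role the console ping plays in the
tech's explanation. A static entry corresponds to one that is never
subject to the aging check.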

Has anyone ever seen anything like this? Does this indicate
any incompatibility between our NIC and our ISP, and/or
between the eepro100 driver and the Catalyst switch?

Is there any action we should take, or is it likely
that hardcoding our MAC address is a perfectly viable
solution?

Should we be looking at replacing the NIC? In other
words, does this sound like a hardware failure?

Thanks.

--
matt.