p4_error: net_recv read: probable EOF on socket: 1

Dr. David F. Robinson drobinson at aletheon.com
Mon Jan 28 17:51:37 PST 2002


Several people have responded saying the most likely problem is a memory
leak in my code.  The only difficulty I have with that theory is that
the code doesn't have any problems on a variety of other platforms,
including Cray T90, T3E, SP2, SGI Origin, and even a number of
different Linux Beowulf clusters.  It is currently running on several
large Linux clusters set up by VALinux as well as a 512-processor system
developed by IBM (Maui supercomputing center).

Because of the large number of users and platforms, I am thinking that
it's a problem with the Scyld setup on my cluster.  I have talked to
several other groups running this software, and the only consistent
difference is the Scyld software.  Has anyone else run into problems
running with the Scyld software that aren't duplicated on other
platforms?

Thanks to all who have responded.... And if anyone else has any further
suggestions, I'm definitely interested....


David


-----Original Message-----
From: walt at splinter.parl.clemson.edu
[mailto:walt at splinter.parl.clemson.edu] On Behalf Of Walter B. Ligon III
Sent: Monday, January 28, 2002 11:33 AM
To: drobinson at aletheon.com
Cc: beowulf at beowulf.org
Subject: Re: p4_error: net_recv read: probable EOF on socket: 1 

--------

I'm not the expert on this, but as best I understand it this is a
fairly generic MPICH error which says a task tried to receive and
the process it tried to receive from has gone away.  More often
than not, unless you have lots of OTHER error messages, this is
some kind of program error (IOW, *your* program's error).

There are so many ways your program could be failing there is no
way to do a reasonable job of telling you what to look for, but
usually adding some kind of debugging output to your code will
help.  If you can isolate which task is failing, you can crash
it in gdb and see exactly where and why it crashed.
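
As a rough illustration of the kind of debugging output meant here, a
minimal sketch in C follows.  The helper name debug_log and the
debug.NNNN.log file naming are made up for this example, and it assumes
each task passes in its own rank (for instance the value returned by
MPI_Comm_rank).  Each call appends one line to a per-task file and closes
it right away, so the last line written survives if that task dies.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not part of MPICH): append one formatted line to
 * a per-task log file named debug.NNNN.log, where NNNN is the rank.
 * Returns 0 on success, -1 if the file could not be opened. */
int debug_log(int rank, const char *fmt, ...)
{
    char path[64];
    FILE *fp;
    va_list ap;

    snprintf(path, sizeof path, "debug.%04d.log", rank);
    fp = fopen(path, "a");          /* append, so earlier lines are kept */
    if (fp == NULL)
        return -1;

    fprintf(fp, "[rank %d] ", rank);
    va_start(ap, fmt);
    vfprintf(fp, fmt, ap);
    va_end(ap);
    fputc('\n', fp);

    fclose(fp);                     /* closing flushes the line to disk */
    return 0;
}

/* Possible use inside the main loop, with rank from MPI_Comm_rank:
 *     debug_log(rank, "iteration %d: before receive", iter);
 */
```

The task whose log stops first is usually the one to look at, and its
last line gives a rough idea of where to set a breakpoint in gdb.
Opening and closing the file on every call is slow, so this is only
meant for coarse checkpoints, not inner loops.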

Since this is probably your code, there isn't a generic solution,
and that's why it isn't posted.

Walt

> I am receiving the following errors while running my MPI-enabled code.
>
> p4_error: net_recv read:  probable EOF on socket: 1
> 
> This error occurs after running the code for several hours using all
> processors in my cluster.  I have seen several postings similar to
> this on the web, however, I have not seen any posted solutions.  My
> configuration is as follows:
> 
> Mpich_1.2.1 compiled w/ Portland compilers
> Scyld 27cz-8 (Red Hat Linux 6.2)
> Linux 2.2.19
> 
> I have tried to update my eepro100 drivers by downloading and
> compiling the netdrivers.tgz file from the Scyld ftp site.  They
> compiled and installed fine using 'make' and 'make install', however,
> the driver on the slave nodes has not been updated.  When I reboot the
> master node and do a dmesg, the latest driver is in use on the master.
> The slave nodes are still booting with the old driver.  How do I get
> the boot image for the slaves to use the updated modules?  Are my
> problems caused by the old eepro100 drivers?
> 
> Any help is greatly appreciated.
> Thanks, David
-- 
Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University