HPL residual check failure

Patrick Geoffray patrick at myri.com
Thu Nov 8 03:53:08 PST 2001


Yoon Jae Ho wrote:

> but I guess if you use Myrinet instead of 10/100 LAN. then please check the Cable & Myrinet mpich version.

FYI, bad Myrinet cables do not produce corrupted data: there is 
a hardware CRC check on the NIC. Corrupted packets are just dropped, 
so the symptoms of bad cables are messages timing out or running 
very slowly. You can look at the number of bad CRCs (badcrc_cnt) 
with "gm_counters" (if you are using GM).
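
To make the drop-on-bad-CRC behaviour concrete, here is a minimal 
Python sketch. It is a software analogy only: the real check happens 
in NIC hardware, and the framing, the use of zlib.crc32, and the 
function names frame() and receive() are assumptions made for 
illustration, not Myrinet's wire format or the GM API. Only the 
counter name badcrc_cnt comes from gm_counters.

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC trailer (hypothetical framing, not Myrinet's)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def receive(packet: bytes, counters: dict):
    """Return the payload if the CRC matches; otherwise count and drop it."""
    payload, trailer = packet[:-4], packet[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        counters["badcrc_cnt"] += 1   # the packet is dropped, never delivered
        return None
    return payload

counters = {"badcrc_cnt": 0}
good = frame(b"HPL panel data")

# Simulate a bad cable flipping one bit in transit.
damaged = bytearray(good)
damaged[0] ^= 0x01

delivered_good = receive(good, counters)
delivered_bad = receive(bytes(damaged), counters)
```

The receiver never sees the damaged payload; it only sees a missing 
message, which is why a bad cable shows up as timeouts or slowness 
and a rising bad-CRC count rather than as wrong numerical results.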

In the context of Keaton's failure, bad memory is certainly the 
problem. Usually, if things work after cooling the unit, it's 
very likely to be overheating hardware.

Patrick

----------------------------------------------------------
|   Patrick Geoffray, Ph.D.      patrick at myri.com 
|   Myricom, Inc.                http://www.myri.com
|   Cell:  865-389-8852          685 Emory Valley Rd (B)
|   Phone: 865-425-0978          Oak Ridge, TN 37830
----------------------------------------------------------


