[Beowulf] Infinipath memory parity errors

Nifty niftyompi Mitch niftyompi at niftyegg.com
Wed Aug 13 17:12:40 PDT 2008


On Wed, Aug 13, 2008 at 05:03:46PM +0100, Dave Love wrote:
> [I know in an ideal world the vendor between us and PathScale^WQlogic
> would sort this out.]
> 
> I'm interested in the cause (and possible cure!) of intermittent errors
> on various nodes in our Infinipath system which stop MPI jobs with
> kernel messages like this, in case anyone's familiar with them:
> 
>   lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]}
> 
> They seem to be new with an upgrade to Linux 2.6.22 from 2.6.11, but
> probably just manifested themselves in some other way previously.
> 
> Google didn't produce any leads, and a brief look in the source suggests
> that tracking it down where it's generated in the ib_ipath module is
> non-trivial and likely won't tell me a lot.
> 
> For what it's worth, the adaptors are
> 
>   06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02)
> 
> in two different sorts of Supermicro whose model numbers I don't know.
> 

Dave,

Which driver is active?  Which Infinipath software release
is installed?  The tool "ipath_control -i" can show which...

The kernel.org/OFED driver does not have as rich a set of error recovery
code for this card as the vendor-shipped driver.  The kernel.org folk saw
the recovery code as undesirable and did not accept it.

After a kernel update the shipped driver will not have been recompiled
for the new kernel, so the kernel.org driver would become active instead.
The Install Guide covers this:

	#   To rebuild the drivers, do the following (as root):
	# cd /usr/src/infinipath/drivers
	# ./make-install.sh
	# /etc/init.d/infinipath restart
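
To check which build of ib_ipath a node would load after a kernel
upgrade, something like the sketch below may help.  It is only a guess
based on the module's path: it assumes the usual layout in which stock
in-tree modules live under /lib/modules/<release>/kernel/ and anything
else under /lib/modules/<release>/ was installed separately.  Where the
InfiniPath package actually installs its rebuilt module is not known
here, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: classify a kernel module file by its install path.
# Assumption: in-tree modules sit under /lib/modules/<release>/kernel/;
# any other location under /lib/modules/<release>/ is treated as a
# separately installed (out-of-tree) build.
classify_module_path() {
    case "$1" in
        /lib/modules/*/*) ;;                    # looks like a module path
        *) echo "unknown"; return 0 ;;
    esac
    rest=${1#/lib/modules/}   # drop the fixed prefix
    rest=${rest#*/}           # drop the kernel release component
    case "$rest" in
        kernel/*) echo "in-tree" ;;
        *)        echo "out-of-tree" ;;
    esac
}

# Report the ib_ipath module file that modinfo finds for the running kernel.
report_ipath_driver() {
    path=$(modinfo -F filename ib_ipath 2>/dev/null) || path=""
    if [ -z "$path" ]; then
        echo "ib_ipath: no module found for kernel $(uname -r)"
        return 1
    fi
    echo "ib_ipath: $path ($(classify_module_path "$path"))"
}
```

Calling report_ipath_driver on a node before and after running
make-install.sh should show whether the module path changed; compare
the result with what "ipath_control -i" reports.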

-- 
	T o m  M i t c h e l l 
	Got a great hat... now what.



