[Beowulf] Infinipath memory parity errors

Dave Love d.love at liverpool.ac.uk
Wed Aug 13 09:03:46 PDT 2008


[I know in an ideal world the vendor between us and PathScale^WQlogic
would sort this out.]

I'm interested in the cause (and possible cure!) of intermittent errors
on various nodes in our InfiniPath system.  They stop MPI jobs with
kernel messages like this, in case anyone's familiar with them:

  lvinfi095:21.Hardware problem: {[RXE EAGERTID Memory Parity]}

They seem to have appeared with an upgrade from Linux 2.6.11 to 2.6.22,
but the underlying problem probably just manifested itself in some other
way previously.
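
To check whether the errors cluster on particular nodes, the messages
could be tallied per host.  This is only a rough sketch: the line shape
(host, colon, a number, then ".Hardware problem: {[...]}") is assumed
from the single example above, and the function name is made up for
illustration.

```python
import re
from collections import Counter

# Assumed message shape, inferred from the one example in this post:
#   <host>:<number>.Hardware problem: {[<description>]}
# The meaning of <number> is not known here, so it is matched and ignored.
PATTERN = re.compile(
    r"(?P<host>[\w.-]+):\d+\.Hardware problem: \{\[(?P<desc>[^\]]*)\]\}"
)


def tally_hardware_errors(lines):
    """Count (host, description) pairs over an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            counts[(match.group("host"), match.group("desc"))] += 1
    return counts
```

Feeding it the collected kernel logs and looking at
counts.most_common() would show whether a few nodes account for most of
the parity errors or whether they are spread evenly.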

Google didn't produce any leads, and a brief look at the source suggests
that tracking down where the message is generated in the ib_ipath module
is non-trivial and likely won't tell me a lot.

For what it's worth, the adaptors are

  06:00.0 InfiniBand: PathScale, Inc InfiniPath HT-400 (rev 02)

in two different sorts of Supermicro machine whose model numbers I don't
know.

Thanks for any leads.



