[Beowulf] Anyone with really large clusters seeing memory leaks with OFED 1.5 for tcp based apps?
skylar at cs.earlham.edu
Sun Jan 31 17:17:27 PST 2010
Joe Landman wrote:
> Hi folks
> Trying to trace something annoying down, and see if we are running
> into something that is known.
> OFED 1.5 on a 220.127.116.11 kernel. Running a file system atop IPoIB
> (many reasons, none I care to get into here at the moment). Under
> light load, the file system gradually grabs memory. Possibly a leak,
> not entirely sure. Could be the OFED stack underneath. Backing file
> system is xfs. That has been (on this hardware in other
> situations) rock solid stable. Here, xfs, OFED/IPoIB all toss their
> cookies (and fail allocations) under moderate to heavy load.
> Working with the file system vendor on this. I am not sure we have
> the answer nailed, so I wanted to see who out there is running a big (
> >512 nodes) cluster, doing large data transfers (preferably over
> IPoIB), for data storage, and running a late model OFED. If you fall
> into this category, please let me know, as I'd like to ask a few
> questions offline about any observed OFED/IPoIB failure modes. I am
> not convinced it is OFED/IPoIB, but I'd like to see what other people
> have run into ... if anything.
We're running OFED 1.4 on our GPFS cluster, with RDMA used for data
and IPoIB used for metadata and backups. We're looking at an upgrade to
1.5, so if you do find anything out, I'd be very interested in knowing.
-- Skylar Thompson (skylar at cs.earlham.edu)