[Beowulf] Infiniband Subnet Manager
Nifty niftyompi Mitch
niftyompi at niftyegg.com
Sat Aug 30 12:20:02 PDT 2008
On Thu, Aug 28, 2008 at 08:41:18AM -0400, Prentice Bisbal wrote:
> Since an infiniband fabric needs a subnet manager, should the master
> node have an IB HCA and be connected to the IB network in order to run
> the subnet manager?
> My logic behind this is that the master node will be full
> enterprise-level hardware (redundant every thing), and should never go
> down or be rebooted during normal use. I expect the nodes to go down
> more frequently (not fully redundant hardware, higher operating loads,
> Exactly what functions does the subnet manager perform, and what happens
> if it disappears from the IB fabric?
> I've been doing research into IB all day yesterday, and I'm continuing
> today, so please no RTFM answers.
How big a fabric?
The subnet manager (SM) manages the fabric.
The most obvious functions are
* assign LIDs (local identifiers)
* set up routing (routing is static BTW)
* notice changes.
i.e. discovery, configuration and continuous monitoring of the fabric.
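
A toy sketch of those three jobs in Python may make them more concrete. This is not how OpenSM or any vendor SM is implemented: the function names (next_hops, sweep) and the four-node example fabric are made up for illustration, and a real SM talks to the hardware with management datagrams rather than walking a dictionary. It only shows the idea: discover what is reachable, hand out one LID per node, and compute a fixed next hop for every destination.

```python
from collections import deque


def next_hops(links, dest):
    """Breadth-first search outward from dest.

    Returns {node: neighbour to forward to in order to reach dest}.
    dest itself maps to None.
    """
    hop = {dest: None}
    queue = deque([dest])
    while queue:
        node = queue.popleft()
        for peer in sorted(links[node]):
            if peer not in hop:
                hop[peer] = node  # peer reaches dest by going through node
                queue.append(peer)
    return hop


def sweep(links, sm_node):
    """One toy SM sweep: discovery, LID assignment, static routing."""
    # Discovery: everything reachable from the node running the SM.
    reachable = next_hops(links, sm_node)
    # LID assignment: one small integer per discovered node (toy numbering).
    lids = {node: lid for lid, node in enumerate(sorted(reachable), start=1)}
    # Static routing: per-node table of destination LID -> next-hop neighbour.
    tables = {node: {} for node in reachable}
    for dest in reachable:
        for node, via in next_hops(links, dest).items():
            if via is not None:
                tables[node][lids[dest]] = via
    return lids, tables


# Hypothetical fabric: a head node and two compute nodes on one switch.
fabric = {
    "head": {"sw"},
    "sw": {"head", "n1", "n2"},
    "n1": {"sw"},
    "n2": {"sw"},
}
lids, tables = sweep(fabric, "head")
```

The "notice changes" part would amount to running sweep() again and comparing the set of discovered nodes with the previous result. Note that the tables are computed once and then just sit there, which is why a live fabric keeps working when nothing changes.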
Once a fabric is live and correctly set up, if the subnet manager dies
nothing bad happens unless something changes. The assigned LIDs
continue to be valid and the routes continue to be valid. You only
Some vendor switches can manage the fabric themselves with a built-in
subnet management card (extra $). In many cases this is the best solution...
If the SM is on the head node it might be easier to watch the SM ....
In the subnet management specification there is stuff about fail over...
It is possible to have a second subnet manager running on the fabric. The second SM should go idle
and only be active if the other one goes silent.
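
The standby behaviour described above can be sketched as a tiny state machine. This is a simplified illustration in Python, not the handover protocol from the specification and not OpenSM's logic: the class name StandbySM, the timeout parameter, and the idea of passing in timestamps by hand are all assumptions made to keep the example small and testable.

```python
class StandbySM:
    """Toy second subnet manager: idle while the master is heard,
    active only after the master has been silent for too long."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # allowed silence before takeover (assumed knob)
        self.last_heard = now       # last time the master showed signs of life
        self.state = "STANDBY"

    def heard_master(self, now):
        """Record a sign of life from the master and go (or stay) idle."""
        self.last_heard = now
        self.state = "STANDBY"

    def poll(self, now):
        """Periodic check: take over if the master has gone silent."""
        if self.state == "STANDBY" and now - self.last_heard > self.timeout_s:
            self.state = "MASTER"
        return self.state


sm = StandbySM(timeout_s=10, now=0)
sm.heard_master(8)    # master still alive at t=8
early = sm.poll(17)   # 9 s of silence: within the timeout
late = sm.poll(19)    # 11 s of silence: standby takes over
```

Even in this toy the awkward cases are visible: a master that is slow rather than dead, or two SMs that each think the other is silent.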
Caution #1 -- failover is hard to test and multiple SMs may introduce instability, so test, test, but
do not tinker on a production fabric. Do monitor -- gently is fine.
Caution #2 -- do not mix subnet managers. If you run a second SM run one that
is identical! Do not mix OpenSM and a managed switch without
vendor approval and testing.... do not mix versions of any SM...
Caution #3 -- Like so many things, one is good (required in this case), two might be nice, but many is just wrong.
This is a good URL to read and bookmark...
Google for OpenSM, Cisco pages have some good stuff too.
T o m M i t c h e l l
Got a great hat... now what.