[Beowulf] (no subject)

tegner at renget.se tegner at renget.se
Mon Feb 1 23:45:56 PST 2010


This will boil down to a question eventually, but I need to give some
background first. We are a small group doing CFD, and when we realized
several years ago that beowulfs would be the right choice for us, we
decided to extend our computational capabilities gradually. Every year,
or every second year, we have bought two gigabit switches and a bunch of
nodes connected to these switches. One of the switches is used for MPI
communication and the other for connecting the nodes to a fileserver and
a master node.

As of today we have five "subclusters", all connected to the same
fileserver and master node (torque/maui is used to distribute the jobs
across the different subclusters).

This has worked out great for us, and we do believe the strategy of buying
gradually has been advantageous to us (instead of doing larger purchases
less often), and we want to continue extending our hardware in this
fashion.

Up till now we have not been hurt by the fact that we have a single
fileserver (connected to a bunch of raided drives), but we anticipate
that there will be issues when we further extend the number of nodes. We
therefore plan to build a separate "infiniband storage network"
(consisting of a 24-port DDR switch) and to connect a number of "gluster
nodes" to it. Each subcluster will then be connected to this "infiniband
storage network" via one (or maybe several) ports.

However, we will still limit the jobs to run within their own
subcluster, and we are going to accept lower bandwidth between the
subclusters. By doing this we gain the following:

(i) We can get more computational nodes, since we are limiting the number
of ports used to connect the switches to each other.
(ii) For our application I/O is not as demanding as the MPI communication,
but we are still getting - hopefully - acceptable I/O performance.
(iii) We can extend our storage by adding more "gluster nodes" to the
"infiniband storage network" when needed.
(iv) We can continue adding subclusters when we have the money. And we can
also remove old ones when they "cost" too much (in terms of
electricity/performance, maintenance etc.).
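
To make the port trade-off in (i) a bit more concrete, here is a rough
back-of-the-envelope sketch. It takes the planned switch to have 24 ports
and uses our current five subclusters. The 16 nodes per subcluster and the
helper names are made up for illustration only, and the 16 Gbit/s figure is
the nominal 4x DDR data rate, not something we have measured. It also
ignores how the gigabit side of each subcluster is bridged to infiniband.

```python
# Rough port and bandwidth budget for the planned storage switch.
# All node counts below are illustrative assumptions, not measurements.

SWITCH_PORTS = 24        # assumed port count of the planned DDR switch
DDR_4X_DATA_GBIT = 16.0  # nominal 4x DDR data rate (20 Gbit/s signalling, 8b/10b)


def port_budget(subclusters, uplinks_per_subcluster, switch_ports=SWITCH_PORTS):
    """Return the ports left for gluster nodes once every subcluster has its uplinks."""
    used = subclusters * uplinks_per_subcluster
    if used > switch_ports:
        raise ValueError("not enough switch ports for the requested uplinks")
    return switch_ports - used


def per_node_share_gbit(nodes_in_subcluster, uplinks_per_subcluster,
                        link_gbit=DDR_4X_DATA_GBIT):
    """Upper bound on storage bandwidth per node if all nodes do I/O at once."""
    return uplinks_per_subcluster * link_gbit / nodes_in_subcluster


if __name__ == "__main__":
    for uplinks in (1, 2):
        free = port_budget(subclusters=5, uplinks_per_subcluster=uplinks)
        share = per_node_share_gbit(nodes_in_subcluster=16,
                                    uplinks_per_subcluster=uplinks)
        print(f"{uplinks} uplink(s) per subcluster: {free} ports left for "
              f"gluster nodes, at most {share:.1f} Gbit/s per node")
```

Every extra uplink per subcluster costs one port per subcluster that could
otherwise hold a gluster node or a future subcluster, which is the balance
we are trying to strike.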

Since we haven't worked with infiniband before, the question is simply
whether there could be issues with this approach.

Regards, and thanks,

/jon




