[Beowulf] 3d rendering cluster

Robert G. Brown rgb at phy.duke.edu
Mon May 23 10:53:22 PDT 2005


On Mon, 23 May 2005, Paul K Egell-Johnsen wrote:

> Hi,
> I'm tasked to find prices for a solution involving a rack mounted 3d
> rendering farm, but after searching a bit on the net (and more
> importantly on this list) I've yet to find some simple answers to the
> following question(s):
> 
> What is the most important part for a 3d rendering farm; memory to
> each processor, processor speed and/or access to SAN?

That's because this question doesn't have a simple answer.  It has a
simple (enough) method for determining an answer.  To provide a possibly
useful metaphor: 

 What's the most important part of a pizza delivery service: the speed
 of the cook making the pizza, the speed of the people taking the
 orders, or the speed of the delivery boy?

Hopefully it is >>obvious<< that this doesn't have a universal answer.
I haven't told you what kinds of pizza are being offered or how good the
cook is.  I haven't told you how far the pizzas have to go to be
delivered.  I haven't told you how many people you expect to have
hammering your order taker, or whether they have to use a pencil and
order form or an electronic ordering machine to take and bill the
orders.  I haven't told you how many ovens you have, how long it takes
to make a pizza, how long it takes to make different KINDS of pizza, or
what the pattern of pizza ordering looks like.  The answer clearly is
"it depends", and it depends on so many things that the answer can
>>change<< dynamically as conditions change.

This is a decent metaphor for your problem: replace the cook with
"processor", the order taker with "program", the time in the oven with
"memory latency", and the delivery boy with "transfer speed to and from
the SAN".  Not exact, but you get the idea.

Just as is the case with the pizza metaphor, to >>answer<< the question
intelligently you need to get specific and not leave things all general
and unanswerable and everything.  You cannot just say "for a 3d
rendering farm"; you need to say (as you do below) "for a 3d rendering
farm running the following 3d rendering package" (recognizing, as you
do, that this reduces the number of people who can answer your
question, quite possibly to zero).  After all, one pizza cook could be
a virtuoso of pizza making and turn out one pizza per minute on an oven
with enough capacity.  Another could turn out no better than one every
five minutes, regardless of the oven used, just because it takes them a
very long time to get the instructions for the next pizza from the
order person and read them.

So what you need to do is to >>measure<< the important bottlenecks for
>>your task<<.  How long does it take to render a "work unit" (say, a
single frame) on a given processor and with your particular software?
How much data goes >>in<< to the image being rendered and how long does
it take to get the data from the store?  How does the rendered image
delivery speed depend on the speed of memory (note that this last
question is NOT EASY to answer -- you have to vary memory speed or type
holding processor speed more or less constant, which usually means
borrowing or buying some prototyping platforms)?  Finally, once
rendered, how long does it take to deliver the rendered image to a
store?
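
For concreteness, the kind of trivial harness I have in mind might look
like the following sketch in Python.  The "render" command and the /san
paths are made-up placeholders for whatever actually fetches a scene,
renders a frame, and stores the result:

  # Minimal per-work-unit timing harness -- a sketch, not a benchmark
  # suite.  Every command and path below is a placeholder.
  import subprocess, time

  def timed(label, cmd):
      t0 = time.time()
      subprocess.call(cmd, shell=True)
      dt = time.time() - t0
      print("%-8s %8.2f s" % (label, dt))
      return dt

  t_in  = timed("fetch",  "cp /san/scenes/frame0001.scn /tmp/")
  t_cpu = timed("render", "render /tmp/frame0001.scn /tmp/frame0001.img")
  t_out = timed("store",  "cp /tmp/frame0001.img /san/rendered/")
  print("CPU fraction: %.0f%%" % (100.0 * t_cpu / (t_in + t_cpu + t_out)))

If the CPU fraction comes out near 100%, you buy processors; if the
fetch/store fractions dominate, you buy bandwidth.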

When you've answered those questions for ONE processor/memory/SAN
channel, you can then start thinking about the cluster.  Now you've got
several order takers, a bunch of cooks each with their own oven, a
certain delivery capacity.  At some point any given part of this can
overwhelm the others -- cooks turning out pizzas too fast for the
delivery boys to deliver, orders coming in faster than the cooks can
fill them, delivery boys sitting around getting paid but having nothing
to deliver.  However, you can now think about the problem SENSIBLY,
since you have a rough idea how long it takes each processor to get its
next set of instructions, render an image, and deliver the result
wherever you want it to go.  You can then start estimating the
complicated stuff: matching up all these capacities and rates,
correcting for resource contention (or organizing your task so that it
does not occur, at least not often), and figuring out how to spend your
money optimally on cooks, order takers, ovens, and delivery boys so
that none of them sits terribly idle while waiting on another.
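
To make "matching capacities and rates" concrete: the farm's sustained
throughput is just the minimum over the per-stage rates.  Something
like this back-of-envelope sketch, where every number is an invented
placeholder to be replaced by your measurements:

  # Back-of-envelope rate matching from single-node measurements.
  n_procs  = 24        # rendering processors
  t_render = 120.0     # measured seconds per frame per processor
  frame_mb = 10.0      # MB per rendered frame
  san_mbps = 100.0     # sustained SAN bandwidth, MB/s (~gigabit)

  render_rate  = n_procs / t_render      # frames/s the farm produces
  deliver_rate = san_mbps / frame_mb     # frames/s the store absorbs
  print("produce %.2f frames/s, deliver %.2f frames/s"
        % (render_rate, deliver_rate))
  print("bottleneck: %s"
        % ("delivery" if deliver_rate < render_rate else "cooks"))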

Does this help?

FWIW, my recollection of render farms, as tasks go, is that they are
often CPU bound -- they do a lot of work per processor relative to the
data input, although they also spend a lot of time writing to memory.
I'd guess that a machine with balanced fast floating point and memory
I/O would be a good thing -- maybe an Opteron, maybe something with a
large cache (depending VERY MUCH on the size of the rendered image
produced).
Something with decent SSE instruction support might be useful, although
most high end machines will have that these days.

I doubt that the SAN will be the bottleneck at first, but of course this
depends on how big the render farm is, doesn't it?  There will clearly
come a point (number of nodes rendering images in parallel) where
processors sit idle waiting on storage.  Where that point is depends on
how many processors you have, how big "an image" is, how long it takes a
processor to produce an image, and whether you can arrange for images
to complete fairly synchronously or whether some images take a lot
longer to complete than others (so you end up with a roughly Poissonian
distribution of completion times, and therefore sometimes have lots of
images trying to write through at once while at other times the store
sits idle).
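
As a rough formula (again with invented numbers standing in for your
measurements), the crossover is where the aggregate write trickle from
all the nodes equals the SAN's sustained bandwidth:

  # Where does the SAN saturate?  Each processor writes image_mb every
  # t_render seconds; the SAN chokes when the sum of those trickles
  # exceeds its sustained bandwidth.  All numbers are placeholders.
  t_render = 120.0    # s per image per processor (measure it)
  image_mb = 10.0     # MB per rendered image
  san_mbps = 100.0    # sustained SAN write bandwidth, MB/s

  n_saturate = san_mbps * t_render / image_mb
  print("SAN saturates near %d processors" % n_saturate)
  # With bursty (Poissonian) completions, leave headroom below that
  # count, or buffer images locally on each node (see below).

With these particular made-up numbers the answer is well over a
thousand processors, which is why I doubt the SAN is your first
bottleneck -- but burstiness can eat a lot of that margin.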

> The kicker is that we're also going to use the same hardware for 2d
> composition in AfterFX and synth/fx processing for sound production.
> Ie. when we're through rendering we're not going to let that hardware
> rest but will utilise it in other parts of the workflow.
> 
> I know that for fx processing processor speed is paramount, and for
> sample playback memory size is important, as well as local access to
> the samples, but what about 3d? We're going to use Discreet's 3d
> Studio Max w/Brazil.

See above.  I really think that you'll have to prototype your own
answers to the basic questions if you really want to get "the right
answer" from the beginning in terms of spending your money.

Alternatively, you can do the "grow a cluster" thing -- home in on a
particular node design (which you can do with fairly minimal prototyping
-- a few loaner or purchased boxes to do single CPU runs), build
yourself a SMALL cluster (or even two small clusters with competing
architectures) and then ramp up the design that works best after
determining the optimal hardware configuration and scaling.  This
basically mixes prototyping and production, but ensures that you cannot
make a bad mistake by spending your whole budget on an unbalanced system
and hence wasting a bunch of the resource with no money to rebalance it.
If your nodes need e.g. a better network at the expense of CPU speed or
memory configuration, you can just retrofit your pilot cluster and buy
expansion nodes with the ideal cost-benefit configuration.  Your SAN is
difficult to "fix" this way, but as I said, you'll know very early if
your SAN is going to be a bottleneck and can then consider solutions
for getting around it if and only if necessary.

> 
> We're aiming for off the shelf 1U units or blades. Any pros/cons
> would be appreciated, as for price they are pretty equal. I've been
> looking mostly at Intel solutions, since they are easily available
> from Dell and IBM.

I'd at least consider Opterons, available from Sun and Penguin.  The
Opteron is most people's first choice as an HPC platform these days, I
think, on the basis of its excellent numerical performance coupled
with a very reasonable price.

Blades tend to cost more than 1U nodes; they run "cooler" individually
but hotter in aggregate, reaching really phenomenally high power
densities.  This in turn requires more/better cooling and power
infrastructure, and is less robust against even a transient failure of
cooling.  OTOH they require a bit less space and, depending on what
your bottlenecks are, may give you more uniform throughput.

I personally "prefer" regular rackmount boxes as being closer to the OTS
ideal (less single-vendor tie in -- you can usually change to a
completely different vendor for future expansions and not significantly
alter anything in your installation/maintenance/operation stream), being
cooler, letting you use processors closer to the bleeding edge, letting
you use networks closer to the bleeding edge (or not, as needed), giving
you more memory, local disk, etc. capacity (or not, as needed).  With
a node design that uses a local disk as an image buffer, for example,
you can decouple the scheduling of image delivery from the nodes back
to the SAN store from the rate at which the images are produced, and so
avoid resource contention while running much closer to full capacity.
If the images live only in memory, and are large enough that memory
cannot buffer more than one image while waiting for delivery, the CPU
can end up blocked.  Again, there are SO many ways to organize and
optimize, depending on the DETAILS of your production cycles, that I
hesitate to make general statements.
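
A sketch of the local-disk buffering idea, in Python with assumed
paths: the renderer drops finished frames into a local spool, and a
separate loop drains the spool to the SAN at whatever rate the SAN will
take, so a contended SAN stalls the shipper rather than the CPU:

  import os, shutil, time

  SPOOL = "/scratch/spool"    # local disk on the node (assumed path)
  STORE = "/san/rendered"     # SAN-backed store (assumed path)

  def ship_forever(poll=5):
      # The renderer writes finished frames into SPOOL at whatever rate
      # it produces them; this loop drains them at whatever rate the
      # SAN sustains.  (A real version would have the renderer write to
      # a temp name and rename into SPOOL only when a frame is complete,
      # so half-written files never get shipped.)
      while True:
          for name in sorted(os.listdir(SPOOL)):
              src = os.path.join(SPOOL, name)
              shutil.copy(src, os.path.join(STORE, name))
              os.unlink(src)          # delivered; reclaim local space
          time.sleep(poll)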

> Furthermore, but this is off topic I guess, would a larger iron from
> SGI or SUN be better than a cluster for this kind of use (though we'd
> be losing sound fx and AfterFX processing)?
> 
> Investment timeframe is 6 months, and we're looking at < USD 50K

I'm almost certain that you're better off with a cluster.  Rendering and
sound fx, etc., are almost always embarrassingly parallel tasks that can be
run independently on independent processors.  You have relatively
straightforward questions to answer concerning bandwidths, latencies,
and compute capacities in order to be able to figure out a good design.
Also, I think that for MOST purposes you wouldn't be able to get a whole
lot of computer from SGI or Sun for only $50K.
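
"Embarrassingly parallel" here means frames have no dependencies on one
another, so even a dispatcher as naive as this sketch keeps a farm busy
(the hostnames and the render command are placeholders):

  import subprocess
  from itertools import cycle

  nodes  = ["node%02d" % i for i in range(1, 13)]   # assumed hostnames
  frames = range(1, 1001)

  procs = []
  for frame, node in zip(frames, cycle(nodes)):
      # One frame per node at a time: after len(nodes) launches we are
      # back at the first node, and procs[0] is its previous job.
      if len(procs) >= len(nodes):
          procs.pop(0).wait()
      procs.append(subprocess.Popen(
          ["ssh", node, "render --frame %d --out /scratch/spool" % frame]))
  for p in procs:
      p.wait()

A real shop would use a proper queueing system, but the point stands:
no inter-node communication is needed at all.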

You don't mention how much of your budget is for SAN -- my own pricing
of this indicates that this is EXTREMELY variable and high enough that
you could blow $50K on a SAN alone and not even get that much storage.
As far as the cluster is concerned, the number of nodes you can buy depends
quite a bit on their memory requirement, but assuming that the images
themselves are in the 10 MB each range (so that 1 GB of memory per
processor would buffer many tens of images) you can probably put
together >>a<< cluster design based on e.g. dual Opterons with 1 GB of
memory per processor for about $1500/processor ($3000/system) and get a
pretty high end system out of it.  12 units (24 processors) at $36000
including rack leaves you $14K for network, SAN, backup, front end nodes,
miscellaneous (and nothing for infrastructure per se, so one hopes that
you've got a suitable space with 5 KW of power and a couple of tons of
AC capacity handy on a separate budget).  This will get you gig ethernet
between the nodes and/or SAN for starters, a couple of graphics-equipped
heads, and a TB (or even two) of SAN.  Or just get six dual nodes for
starters and save the second $18K or so for EITHER getting more nodes
OR shoring up whatever your weak point is and THEN getting more nodes.
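
To make the arithmetic explicit (these are 2005-era estimates from the
paragraph above, not vendor quotes):

  budget    = 50000
  node_cost = 3000      # dual-Opteron node, 1 GB per processor
  nodes     = 12
  spent     = nodes * node_cost
  print("nodes: $%d, left for network/SAN/backup/heads: $%d"
        % (spent, budget - spent))
  # -> nodes: $36000, left: $14000
  # Six-node variant: 6 * 3000 = $18000 up front, with $18000 held back
  # to fix whatever the pilot cluster reveals as the weak point.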

> I'm sure this is an asked and answered question, but I've been going
> through this the last few days, doing searches and still nothing in
> the first 30 results on google..., even when constraining on a
> particular site, like the mailing list archives.

That's because it isn't a "recipe", it's a "recipe for a recipe".  Your
question probably cannot be answered by other people, unless you're
really lucky and someone has done EXACTLY what you want to do and
doesn't view you as a future competitor (and hence refuse to answer).
However, you can answer it yourself.  Or, if you post more specific
numbers (how fast, how big, how many, etc. for the measured/prototyped
workflow) you can probably get some help answering it yourself.

   rgb

> 
> Best regards,
> Paul K Egell-Johnsen

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu




