Newbie question

Mike mladwig at comcast.net
Sun May 26 17:07:45 PDT 2002


First, thanks for the detailed response.  I'm continuing to work through it, 
but here is some additional information.

On Sunday 26 May 2002 12:53, Robert G. Brown wrote:
> To see how
> far you can scale it, you need to estimate or measure some more numbers,
> in particular how long your IPC's will take relative to your
> computation.

The small blocks are all precalculated "constants" of the system (the 1.5M 
component is the variable), and they will rarely change.  The problem is 
determining in what order to apply the constants to the variable in order to 
convince yourself that you have a work-unit answer.
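
To make that concrete, something like this brute-force sketch is what I mean 
by the ordering problem; apply_fn() and looks_valid() are hypothetical 
stand-ins for the real calculation and the real acceptance test:

from itertools import permutations

def find_valid_order(variable, constants, apply_fn, looks_valid):
    # Brute force over orderings; the real system would prune and
    # parallelize this, since len(constants)! blows up quickly --
    # which is why the work needs a cluster at all.
    for order in permutations(constants):
        result = variable
        for c in order:
            result = apply_fn(result, c)
        if looks_valid(result):
            return order, result
    return None, None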

Right now I know the sizes of the information travelling through the system 
and how long the core calculation takes; I don't yet know how the components 
of the parallelized design will perform or how they will interact.  I think I 
can only determine those numbers statistically, through experimentation, once 
I build it, because they will vary wildly depending on the variable being 
processed.
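
For the measurement itself I'm picturing nothing fancier than a harness like 
this, where op() is a hypothetical zero-argument wrapper around whatever is 
being timed (the core calculation, or a round-trip send of a work unit):

import math
import time

def sample_times(op, n=100):
    # Time n runs of op() and summarize; n must be at least 2.
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        op()
        samples.append(time.perf_counter() - t0)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, math.sqrt(var), min(samples), max(samples)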

> Now, I have no idea how many slaves you can afford or the relative value
> of your result.  100 cpu clusters aren't THAT expensive.  Looking back

At this stage of the work, I can probably justify purchasing enough nodes to 
prove how many nodes I need to really solve the problem :-).  I'm trying to 
get to that first step.  In the long run, if the problem can be solved, 
sufficient nodes will be purchased.

>   c) Go to a multiple masters parallelization -- arrange a sort of a
> tree of masters, each with 200 slaves, themselves communicating back
> with a supermaster.

Hmmm.  In some ways the problem lends itself to this, but determining the 
optimum number of slaves per master is difficult, as it can vary continuously 
with the situation.  Because of this variability I would ideally be able to 
reassign these resources dynamically, which would in turn complicate resource 
prelocation (mentioned below).
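
As a toy sketch of the reassignment I have in mind (Master, queue_depth, and 
the rebalance rule are all made up here; a real version would sit on top of 
the actual IPC layer):

class Master:
    def __init__(self, name):
        self.name = name
        self.slaves = []       # slave node IDs currently assigned here
        self.queue_depth = 0   # pending work units

def rebalance(masters):
    # Move one slave from the least loaded master to the most loaded
    # one whenever the imbalance passes an (arbitrary) threshold.
    by_load = sorted(masters, key=lambda m: m.queue_depth)
    lightest, heaviest = by_load[0], by_load[-1]
    if heaviest.queue_depth - lightest.queue_depth > 2 and lightest.slaves:
        heaviest.slaves.append(lightest.slaves.pop())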
 
> cluster (plus a master) isn't particularly expensive -- with rackmount
> dual Athlons I'd guesstimate less about $1000-1100 per cpu, not much

This is a good time to ask another question.  The calculation is highly 
optimized for the P4; does anyone have 1U-style P4 nodes that they like in 
this context?

> In others it isn't -- each slave gets a unique set of numbers being
> (say) read from a big file on the master's disk (which, by the way,
> might take a nontrivial amount of time to locate on disk, read into a
> buffer, and arrange to send to the next free slave, hence the need to
> estimate the total amount of SERIAL work done by the master to send an
> IPC packet off to the next free slave!).

An optimization would be to prelocate the constants on the node(s) that will 
be using them, keep them in memory or on local disk, and avoid transferring 
them while a work unit is in progress.  When I look at some of the batch 
processing tools (e.g. SGE; thanks for the suggestion, Rayson), I don't see a 
way to support this kind of resource "affinity".  Did I miss something?
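
The dispatch logic I'd want looks roughly like the sketch below, where 
node_holds (what each node has prelocated) and ship_constants() are 
hypothetical hooks into the real system:

def dispatch(work_unit, free_nodes, node_holds, ship_constants):
    # node_holds maps node -> set of constant-set IDs already resident;
    # work_unit.constant_set is the ID of the set this unit needs.
    needed = work_unit.constant_set
    # Prefer a free node that already has the constants prelocated.
    for node in free_nodes:
        if needed in node_holds[node]:
            return node
    # Otherwise pay the transfer cost once and remember the placement
    # (assumes at least one free node).
    node = free_nodes[0]
    ship_constants(node, needed)
    node_holds[node].add(needed)
    return node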

>   Hope this helps.

Yes, tremendously - much thanks.

>    rgb

mike.



