Walter B. Ligon III
walt at parl.ces.clemson.edu
Wed Jun 21 06:45:52 PDT 2000
> Kragen Sitaker wrote:
> > W Bauske writes:
> > > Also, do you control the source code that does the processing?
> > > If not, then the only way to split the work is split the log into
> > > chunks and run the log processing on each chunk. Then you have
> > > the question of is the data partitionable such that you get the
> > > same analysis when it's split.
> > This is not correct.
> We'll see, see comments below.
> > There are several ways to partition problems in
> > general, and log-processing problems in particular, and splitting up
> > the input data is only one of them.
> > Some examples:
> > - if you're running a pipelinable problem --- separable, sequential
> > stages, each with a relatively high computation-to-data ratio (say, a
> > billion or more instructions for every twelve megabytes, thus a
> > thousand instructions for every twelve bytes or so) --- you can build
> > a pipeline with different stages on different machines. In an ideal
> > world, you'd be able to migrate pipeline stages between machines to
> > load-balance.
> Pipelining is good if the processing stages are dependent.
> The original request is too vague to say whether it would work
> though. One could always call the "chunk" the whole file and
> give it to separate programs on separate machines, depending
> on whether the processing is dependent or not on previous
> steps, similar to your last example below.
I'll agree that the other issues raised essentially reduce to forms of
the data parallelism that your original post claimed was the "only" way
to split the work. But not this one. This amounts to what used to (a
long time ago) be called MISD processing - the same data is processed by
multiple programs, and it is NOT a form of data parallelism - it is a form
of control parallelism. Your argument that 'One could always call the
"chunk" the whole file' is really weak. You implied data parallelism
was the ONLY option and it ISN'T. Let's just admit that.
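To make the distinction concrete, here is a minimal sketch of the two
splits. Everything in it is an assumption made for illustration: the
toy log format ("host status bytes"), the two analyses, and the use of
threads as stand-ins for separate machines. None of it comes from the
original request.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log lines in a made-up "host status bytes" format.
LOG = [
    "alpha 200 512",
    "beta 404 0",
    "alpha 200 1024",
    "gamma 500 0",
]

def count_by_host(lines):
    """One analysis: number of requests per host."""
    return Counter(line.split()[0] for line in lines)

def total_bytes(lines):
    """A second, unrelated analysis: total bytes transferred."""
    return sum(int(line.split()[2]) for line in lines)

def data_parallel(lines, nchunks=2):
    """Data parallelism: ONE program, the input split into chunks."""
    chunks = [lines[i::nchunks] for i in range(nchunks)]
    with ThreadPoolExecutor(max_workers=nchunks) as pool:
        partials = list(pool.map(count_by_host, chunks))
    merged = Counter()
    for part in partials:
        merged.update(part)  # Counter.update adds counts together
    return merged

def control_parallel(lines):
    """Control parallelism (MISD-style): SEVERAL programs, each
    given the whole, unsplit input."""
    analyses = {"hosts": count_by_host, "bytes": total_bytes}
    with ThreadPoolExecutor(max_workers=len(analyses)) as pool:
        futures = {name: pool.submit(fn, lines)
                   for name, fn in analyses.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

Note that data_parallel() only works because per-host counts can be
merged after the fact, which is exactly the partitionability question
raised earlier. control_parallel() never splits the log, so that
question does not come up for it.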
On the other hand, data parallelism is generally the better approach,
and the rest of your comments were valuable. No point in beating this
dead horse further, but the lesson should be learned: there are always
other ways, and it is sometimes worth considering them if only to
convince yourself the given solution is best.
Dr. Walter B. Ligon III