kragen at pobox.com
Tue Jun 20 13:37:14 PDT 2000
W Bauske writes:
> Also, do you control the source code that does the processing?
> If not, then the only way to split the work is split the log into
> chunks and run the log processing on each chunk. Then you have
> the question of is the data partitionable such that you get the
> same analysis when it's split.
This is not correct. There are several ways to partition problems in
general, and log-processing problems in particular, and splitting up
the input data is only one of them.
- if you're running a pipelinable problem --- separable, sequential
stages, each with a relatively high computation-to-data ratio (say, a
billion or more instructions for every twelve megabytes, thus a
thousand instructions for every twelve bytes or so) --- you can build
a pipeline with different stages on different machines. In an ideal
world, you'd be able to migrate pipeline stages between machines to
balance the load.
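A minimal sketch of such a pipeline, in Python for illustration (the
stage names and log format are hypothetical, not from the original
post); on a cluster, each stage would run on its own machine with a
socket or file stream between stages:

```python
# Hypothetical two-stage log pipeline: parse, then aggregate.
# Each stage is separable and sequential, so each could live on
# a different node, streaming records to the next.

def parse(lines):
    # Stage 1: split raw log lines into (host, bytes) records.
    for line in lines:
        host, nbytes = line.split()
        yield host, int(nbytes)

def aggregate(records):
    # Stage 2: total bytes served per host (the compute-heavy stage).
    totals = {}
    for host, nbytes in records:
        totals[host] = totals.get(host, 0) + nbytes
    return totals

log = ["a.example 100", "b.example 50", "a.example 25"]
totals = aggregate(parse(log))
```

Because parse() is a generator, the stages already overlap in time even
in one process; replacing the generator with a network stream is what
turns it into a cross-machine pipeline.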
- if you want to generate ten reports for ten different web sites whose
logs are interleaved in the same log file, you can run the log into
one guy whose job it is to divvy it up, line by line, among ten
machines doing analysis, one for each web site.
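The divvying-up step above can be sketched as a demultiplexer; here,
per-site lists stand in for the network connections to the ten analysis
machines, and the assumption that the site name is the first
whitespace-separated field is mine, not the original poster's:

```python
# Sketch of the one "divvying" process: route each interleaved log
# line to the queue for its web site. In a cluster, each queue would
# be a socket to a dedicated analysis machine.

from collections import defaultdict

def demux(lines, site_of):
    queues = defaultdict(list)
    for line in lines:
        # site_of() extracts the routing key from a log line.
        queues[site_of(line)].append(line)
    return queues

# Assumed format: site name is the first field of each line.
log = ["siteA GET /", "siteB GET /x", "siteA GET /y"]
queues = demux(log, lambda line: line.split()[0])
```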
- if you're looking for several different kinds of information in the
log file --- again, with a high computation-to-data ratio --- you can
send a copy of the log file to several processes, each extracting one
of the kinds of information.
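A sketch of that broadcast scheme, with two made-up extractors (the
functions and log lines are illustrative assumptions); on a cluster,
each extractor would be a separate process on its own node receiving
its own copy of the stream:

```python
# Send a copy of the log to several independent extractors, each
# pulling out one kind of information.

def count_errors(lines):
    # Extractor 1: how many error lines?
    return sum(1 for line in lines if "ERROR" in line)

def count_hits(lines):
    # Extractor 2: how many GET requests?
    return sum(1 for line in lines if "GET" in line)

log = ["GET /a 200", "ERROR timeout", "GET /b 404"]
# Each extractor gets its own copy of the buffered log.
results = [extract(list(log)) for extract in (count_errors, count_hits)]
```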
Of course, all of this depends on the problem. My guess is that the
original querent can, as you suggested, rewrite his log-processing
script in C instead of Perl and get the performance boost he needs, and
it will be easier than parallelizing by anything but the simplistic
input-splitting approach.
[I'm just guessing that the log-processing code is currently in Perl. :) ]
<kragen at pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
The power didn't go out on 2000-01-01 either. :)