Robert G. Brown
rgb at phy.duke.edu
Tue Jun 20 14:03:39 PDT 2000
On Tue, 20 Jun 2000, Kragen Sitaker wrote:
> This is not correct. There are several ways to partition problems in
> general, and log-processing problems in particular, and splitting up
> the input data is only one of them.
> Some examples:
> - if you're running a pipelinable problem --- separable, sequential
> stages, each with a relatively high computation-to-data ratio (say, a
> billion or more instructions for every twelve megabytes, thus a
> thousand instructions for every twelve bytes or so) --- you can build
> a pipeline with different stages on different machines. In an ideal
> world, you'd be able to migrate pipeline stages between machines to
> - if you want to generate ten reports for ten different web sites whose
> logs are interleaved in the same log file, you can run the log into
> one guy whose job it is to divvy it up, line by line, among ten
> machines doing analysis, one for each web site.
> - if you're looking for several different kinds of information in the
> log file --- again, with a high computation-to-data ratio --- you can
> send a copy of the log file to several processes, each extracting one
> of the kinds of information.
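The last of these, sending a copy of the log to several extractors, might be sketched roughly as below. This is only an illustration under stated assumptions: the input is taken to be in Common Log Format, the names (parse, EXTRACTORS, fan_out, SAMPLE) are made up for the example, and everything runs in one process, where a real cluster would put each extractor on its own node.

```python
from collections import Counter


def parse(line):
    """Split one Common Log Format line into (path, status, nbytes).

    Assumes the request is the first double-quoted field on the line.
    """
    _, _, tail = line.partition('"')
    request, _, rest = tail.partition('"')
    path = request.split()[1]
    status, nbytes = rest.split()[:2]
    return path, status, 0 if nbytes == "-" else int(nbytes)


# Each extractor pulls out exactly one kind of information.
def status_counts(lines):
    return Counter(parse(line)[1] for line in lines)


def hits_per_path(lines):
    return Counter(parse(line)[0] for line in lines)


def bytes_sent(lines):
    return sum(parse(line)[2] for line in lines)


# Hypothetical registry of extractors; in a cluster, one per machine.
EXTRACTORS = {
    "status": status_counts,
    "paths": hits_per_path,
    "bytes": bytes_sent,
}


def fan_out(lines, extractors=EXTRACTORS):
    """Hand every extractor its own full copy of the log."""
    lines = list(lines)
    return {name: fn(list(lines)) for name, fn in extractors.items()}


# Made-up sample data, for illustration only.
SAMPLE = [
    '10.0.0.1 - - [20/Jun/2000:14:03:39 -0700] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [20/Jun/2000:14:03:40 -0700] "GET /missing.html HTTP/1.0" 404 -',
    '10.0.0.1 - - [20/Jun/2000:14:03:41 -0700] "GET /index.html HTTP/1.0" 200 1043',
]

if __name__ == "__main__":
    for name, result in fan_out(SAMPLE).items():
        print(name, result)
```

Because the extractors share nothing but the input, they can be moved to separate processes or hosts without changing their logic; only fan_out would need to ship the lines over a pipe or socket instead of copying a list.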
All good points. One more: if the reports are generated from syslogd
output, a sensible /etc/syslog.conf can often do a lot of the
partitioning for you. If they come from a centralized syslog loghost
that receives all the syslog output of (say) 100+ hosts, you might look
into "syslog-ng", which filters input as it arrives at the loghost and
squirrels it away in a set of host/loglevel-specific files according to
your specification.
Either approach yields significantly smaller files to process, with a
lot of the sorting already done.
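To show the kind of partitioning meant here, below is a minimal sketch that splits an already-collected syslog file by originating host. It is not how syslog-ng works internally; it assumes the traditional "Mon DD HH:MM:SS host tag: message" line layout, and the function name split_by_host, the "_unparsed" bucket, and the sample lines are all invented for the example.

```python
from collections import defaultdict


def split_by_host(lines):
    """Group syslog lines by hostname (the 4th whitespace-separated field).

    Lines that do not have at least five fields go into "_unparsed".
    """
    buckets = defaultdict(list)
    for line in lines:
        # Month, day, time, host, remainder; runs of spaces count as one.
        fields = line.split(None, 4)
        if len(fields) < 5:
            buckets["_unparsed"].append(line)
        else:
            buckets[fields[3]].append(line)
    return dict(buckets)


# Made-up sample data, for illustration only.
SAMPLE = [
    "Jun 20 14:03:39 node01 sshd[412]: Accepted password for rgb",
    "Jun 20 14:03:40 node02 kernel: eth0: link up",
    "Jun  5 09:00:01 node01 crond[88]: (root) CMD (run-parts /etc/cron.hourly)",
    "garbage line",
]

if __name__ == "__main__":
    for host, entries in sorted(split_by_host(SAMPLE).items()):
        print(host, len(entries))
```

Each bucket could then be written to its own file and handed to a different node, which is the same effect the loghost filtering gives you up front, without the extra pass over the combined log.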
> Of course, all of this depends on the problem. My guess is that the
> original querent can, as you suggested, rewrite his log-processing
> script in C instead of Perl and get the performance boost he needs, and
> it will be easier than parallelizing by anything but the simplistic
> split-the-log-into-chunks approach.
> [I'm just guessing that the log-processing code is currently in Perl. :) ]
Agreed and agreed.
> <kragen at pobox.com> Kragen Sitaker <http://www.pobox.com/~kragen/>
> The Internet stock bubble didn't burst on 1999-11-08. Hurrah!
> The power didn't go out on 2000-01-01 either. :)
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu