ashley at quadrics.com
Fri Apr 13 01:36:40 PDT 2007
On Fri, 2007-04-13 at 01:14 +0900, Naoya Maruyama wrote:
> On 4/12/07, Ashley Pittman <ashley at quadrics.com> wrote:
> > My advice would be first and foremost to look at the core file; I assume
> > your program is receiving a SEGV and exiting? Core files can be
> > problematic, partly because they aren't always enabled and partly
> > because to extract anything useful from them you need to run the
> > debugger in the same environment the application ran in. This isn't
> > always as easy as it sounds if you are using modules or something like
> > that.
> One question. When the debuggee app was a 32-PE MPI job, you would end
> up with 32 core files. Would you check each of them manually? Or do
> you have any trick to parallelize the checking process? Say, using a
> parallel debugger?
Typically the job is torn down after the first process has exited, so
only one or two core dumps are preserved. I've never needed to examine
every core dump from a job. RMS has automatic core file analysis, so for
every "core" file there is a corresponding "core.out" which contains all
the information I'm likely to need. You could do this yourself using a
wrapper script around the application if required.
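
A minimal sketch of such a wrapper follows. It is not what RMS does internally. It assumes gdb is installed, core dumps are enabled (for example via "ulimit -c unlimited"), and the kernel writes cores to the working directory as "core" or "core.<something>". The script name corewrap.py and the function names are made up for illustration. Because gdb is launched from the wrapper, it inherits the same environment as the application.

```python
#!/usr/bin/env python3
"""Run an application; if a signal kills it, save a gdb summary per core file."""
import os
import shutil
import subprocess
import sys


def gdb_command(exe, core):
    # Batch-mode gdb: print a backtrace for every thread, then exit.
    return ["gdb", "--batch", "-ex", "thread apply all bt", exe, core]


def is_core(name):
    # Assumed naming: "core" or "core.<suffix>"; skip our own ".out" files.
    return name == "core" or (name.startswith("core.") and not name.endswith(".out"))


def analyse_cores(exe, directory=".", run=subprocess.run):
    """Write <core>.out for every core file in `directory` that lacks one."""
    written = []
    for name in sorted(os.listdir(directory)):
        core = os.path.join(directory, name)
        if not is_core(name) or os.path.exists(core + ".out"):
            continue
        with open(core + ".out", "w") as out:
            run(gdb_command(exe, core), stdout=out, stderr=subprocess.STDOUT)
        written.append(core + ".out")
    return written


def main(argv):
    status = subprocess.call(argv[1:])
    if status < 0:  # a negative return code means the child died from a signal
        analyse_cores(shutil.which(argv[1]) or argv[1])
        return 128 - status  # mimic the shell's 128+signal exit status
    return status


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Each process would then be launched as "corewrap.py ./app args..." in place of "./app args...", leaving a "core.out" style file beside any core that gets written.
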
It's also quite common for jobs to hang, which is where debuggers become
more useful. The trick here is not to look at every process but just the
interesting ones; we have a tool developed in-house for doing just this.
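
One way to pick out the interesting processes in a hung job can be sketched as below. This is not the in-house tool, and it assumes "interesting" means "stuck somewhere different from the majority". It also assumes the per-rank call stacks have already been collected somehow (for example by attaching gdb to each process). The function names in the sample data, such as exchange_halo, are hypothetical.

```python
from collections import defaultdict


def group_by_trace(traces):
    """Group ranks by identical call stack, smallest groups first."""
    groups = defaultdict(list)
    for rank, frames in traces.items():
        groups[tuple(frames)].append(rank)
    # The odd ones out sort to the top: few ranks sharing an unusual stack.
    return sorted(
        ((stack, sorted(ranks)) for stack, ranks in groups.items()),
        key=lambda item: (len(item[1]), item[1]),
    )


# Made-up stacks for a 4-process job: rank 2 is waiting somewhere else.
traces = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "exchange_halo", "MPI_Recv"],
    3: ["main", "solve", "MPI_Allreduce"],
}

for stack, ranks in group_by_trace(traces):
    print(ranks, " > ".join(stack))
```

With the sample data, rank 2 is listed first because it is alone in its stack, so that is the process to attach a debugger to.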