[Beowulf] Computation on the head node

Perry E. Metzger perry at piermont.com
Mon May 19 06:47:44 PDT 2008


"Jeffrey B. Layton" <laytonjb at charter.net> writes:
> Here comes the $64 question - how do you benchmark the IO portion of
> your code so you can understand whether you need a parallel file
> system, what kind of connection you need from a client to the
> storage, etc. This is a difficult problem and one in which I have an
> interest.

This is straightforward, though not easy to explain compactly. The
key is to know how to run tools like top, vmstat, etc., and how to
read their output.
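
For instance (an illustration; exact flags and column layouts vary a
bit between systems):

    # sample system-wide activity every 5 seconds while the job runs
    $ vmstat 5
    # in the cpu columns, us+sy is real work, id is idle time, and
    # wa is time spent stalled waiting on i/o

    # per-process view; check whether your process sits near 100% CPU
    $ top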

If you run your code on a real machine, you can swiftly see whether
you are using 100% of your CPU or not. The goal, naturally, is to keep
the CPU busy at all times. If you are CPU bound, congratulations: you
can then turn to tools like cache profilers to determine whether you
can tune your CPU utilization somehow (which you almost certainly can).
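
One freely available option (just an illustration; many profilers can
do this job) is valgrind's cachegrind tool:

    # simulate the cache hierarchy and record hit/miss counts
    $ valgrind --tool=cachegrind ./mycode
    # then annotate the results per function and source line
    $ cg_annotate cachegrind.out.<pid>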

If, however, your CPU is not at 100% utilization, you are somehow I/O
bound. There are several reasons this could be happening.

First, you could be using lots of virtual memory -- the tools will
tell you in a moment -- in which case the single best thing to do is
not to increase the speed of the file system at all but to increase
the amount of memory you have available so your working set fits very
comfortably in RAM.
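
vmstat makes this case obvious at a glance:

    # si/so are pages swapped in and out per second; anything
    # persistently nonzero means you are paying disk latency for
    # ordinary memory references
    $ vmstat 5
    # overall memory and swap usage, in megabytes
    $ free -m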

Second, you could be doing lots of file i/o paging in program text
segments, which is another flavor of the first problem. Again, more
memory will help, but so will proper tuning of the page cache
parameters.
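
On Linux, for instance, those knobs live under /proc/sys/vm; the right
values depend entirely on the workload, so treat these as things to
experiment with rather than a recipe:

    # how willing the kernel is to evict process pages in favor of
    # file cache (0-100); run as root
    $ sysctl -w vm.swappiness=10
    # how much dirty data may accumulate before writeback begins
    $ sysctl -w vm.dirty_background_ratio=5
    $ sysctl -w vm.dirty_ratio=20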

Third, you could be doing lots of file i/o to legitimate data
files. Here again, if the files are small enough and your access
patterns repetitive enough, increasing your RAM may be enough to make
everything fit in the buffer cache and radically lower the i/o
bandwidth you need. On the other hand, if you're dealing with files
that are tens or hundreds of gigabytes rather than tens of megabytes
in size, and your access patterns are very scattered, that clearly
isn't going to help, and at that point you need to improve your I/O
bandwidth substantially.
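
iostat (part of the sysstat package on Linux) is a quick way to find
out whether you are actually saturating the disks:

    # extended per-device statistics, sampled every 5 seconds
    $ iostat -x 5
    # a device pegged near 100 in %util with climbing await times
    # is your bottleneck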

> The best way I've found is to look at the IO pattern of your
> code(s). The best way I've found to do this is to run an strace
> against the code. I've written an strace analyzer that gives you a
> higher-level view of what's going on with the IO.

That will certainly give you some idea of access patterns for case 3
(above), but on the other hand, I've gotten pretty far just glancing
at the code in question and looking at the size of my files.
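
That said, when you do want the full trace, plain strace gets you most
of the way there (the program name below is just a stand-in):

    # log file-related syscalls with timestamps, following forks
    $ strace -f -tt -e trace=file,read,write -o trace.out ./mycode
    # or just get a per-syscall summary of counts and time spent
    $ strace -c ./mycode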

I have to say, though, that really dumb measures (like increasing the
amount of RAM available for buffer cache -- gigs of memory are often a
lot cheaper than a Fibre Channel card -- or just having a fast, cheap
local drive for intermediate data i/o) can in many cases make the
problem go away more effectively than complicated storage back-end
hardware can.

If you really are hitting disk and can't help it, a drive on every
node means a lot of spindles and independent heads, versus a fairly
small number of them at a central storage point. 200 spindles always
beat 20.

In any case, let me note the most important rule: if your CPUs aren't
doing work most of the time, you're not allocating resources
properly. If the task is really I/O bound, there is no point in having
more CPU than the I/O can possibly keep busy. You're better off having
half the number of nodes with gargantuan amounts of cache memory than
having CPUs that spend 80% of their time twiddling their thumbs. The
goal is to have the CPUs crunching 100% of the time, and if they're
not doing that, you're not doing things as well as you can.

Of course, if your CPU is crunching 100% of the time, there is no
point in spending money on faster i/o, since by definition the extra
bandwidth will simply go to waste.

> I'm also working on a tool that can take the strace output and
> create a "simulator" that will run in a similar manner to the
> original code but actually perform the IO of the original code using
> dummy data. This allows you to "give" away a simple dummy code to
> various HPC storage vendors and test your application.  This code is
> taking a little longer than I'd hoped to develop :(

It sounds cool, but I suspect that with even simpler tools you can
probably deduce most of what is going on and get around it.

-- 
Perry E. Metzger		perry at piermont.com


