utilization / efficiency - was: Re: [Beowulf] Can one Infiniband net support MPI and a

Fri Aug 15 09:03:41 PDT 2008

Mark Kosmowski wrote:
>> Message: 4
>> Date: Fri, 15 Aug 2008 00:08:27 -0500
>> From: Gerry Creager <gerry.creager at tamu.edu>
>> Subject: Re: [Beowulf] Can one Infiniband net support MPI and   a
>>        parallel        filesystem?
>>
>> Alan Louis Scheinine wrote:
>>> This thread has moved to the question of utilization,
>>> discussed by Mark Hahn, Gus Correa and Håkon Bugge.
>>> In my previous job most people developed code, though test runs
>>> could run for days and use as many as 64 cores.  It was
>>> convenient for most people to have immediate access due to
>>> the excess computation capacity whereas some people in top
>>> management wanted maximum utilization.
>>>
>>> I was at a parallel computing workshop where other people
>>> described the contrast between their needs and the goals of
>>> their computer centers.  The computer centers wanted maximum
>>> utilization whereas the spare capacity of the various clusters
>>> in the labs was especially useful for the researchers.  They
>>> could bring to bear the computational power of their informally
>>> administered clusters for special tasks such as when a huge
>>> block of data needed to be analyzed in nearly realtime to see
>>> if an experiment of limited duration was going well.
>>>
>>> When most work involves code development, waiting for jobs in
>>> a batch queue means that the human resources are not being
>>> used efficiently.  Of course, maximum utilization of computer
>>> resources is necessary for production code, I just want to
>>> emphasize the wide range of needs.
>>>
>>> I would like to add that maximum utilization and fast turn-
>>> around are contradictory goals, it would seem to me based
>>> on the following reasoning.  Consider packing a truck with
>>> boxes where the height of the boxes represents the number
>>> of cores and the width of the boxes represents the time of
>>> execution (leaving aside the third spatial dimension).  To most
>>> efficiently solve the packing problem we would like to have
>>> all boxes visible on the loading dock before we start packing.
>>> On the other hand, if boxes arrive a few at a time and we must
>>> put the boxes into the truck as they arrive (low queue wait time)
>>> then the packing will not be efficient.  Moreover, as a very
>>> rough estimate, the size of the box defines the scale of the
>>> problem, specifically, if the average running time is 4 hours,
>>> then to have efficient "packing" the time spent waiting in a
>>> queue must be on the order of at least 4 and more likely 8 hours
>>> in order to have enough requests visible to be able to find
>>> an efficient solution to the scheduling problem.
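
To put rough numbers on the truck analogy above, here is a minimal Python
sketch.  It is only an illustration under assumptions that are not from the
thread: a hypothetical 4-core machine, three made-up jobs, and a simple
greedy placement rule (schedule() is an invented helper, not any real
scheduler's algorithm).  It compares placing jobs strictly in arrival order
with sorting them longest-first once all of them are visible.

```python
def schedule(jobs, total_cores):
    """Greedy placement: each job takes the cores that free up earliest.

    jobs is a list of (cores, hours) tuples, processed in the given order.
    Returns (makespan_hours, utilization).  Hypothetical helper for
    illustration only.
    """
    free_at = [0] * total_cores          # time at which each core is free
    for cores, hours in jobs:
        free_at.sort()
        start = free_at[cores - 1]       # wait until enough cores are free
        for i in range(cores):
            free_at[i] = start + hours
    makespan = max(free_at)
    work = sum(cores * hours for cores, hours in jobs)
    return makespan, work / (total_cores * makespan)


TOTAL_CORES = 4                          # assumed machine size
arrivals = [(2, 4), (4, 2), (2, 4)]      # made-up (cores, hours) jobs

# "Load the truck as boxes arrive": keep arrival order.
online = schedule(arrivals, TOTAL_CORES)

# "All boxes on the dock first": sort longest-running first, then place.
offline = schedule(sorted(arrivals, key=lambda j: j[1], reverse=True),
                   TOTAL_CORES)

print("arrival order:", online)
print("sorted first: ", offline)
```

With these made-up jobs, arrival order finishes in 10 hours at 60%
utilization, while sorting first finishes in 6 hours at 100%, but only
because the scheduler waited until all three jobs were visible.
Longest-first is just one heuristic and is not guaranteed to win in general.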
> 
> So far the utilization discussion has treated the number of CPUs as a
> bottleneck.  Especially for general use clusters, RAM may also be a
> bottleneck.  It is easy to imagine where giving a large RAM
> requirement job 16 instead of 32 cores with each node allocating 75%
> total node RAM to the job might be preferable so that a small RAM,
> cpu-intensive job could then use the remaining 16 cores with each node
> allocating 10% total node RAM to this second job.  Multi-dimensional
> scheduling gets difficult quickly when different jobs have very
> different resource profiles.

In our experience, we've found this to be true especially of throughput 
problems rather than HPC problems.  The cluster we've just stood up is 
designed to facilitate high-throughput computing (HTC), as evidenced by
the gigabit Ethernet interconnect instead of a modern InfiniBand or
Myricom interconnect.
This is limiting in some regards, but was a conscious tradeoff.
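
As a rough illustration of the two-dimensional fit Mark describes, here is
a small Python sketch.  The 32-core pool, the percentages, and the names
(Job, fits) are assumptions taken from or invented around his example; no
real scheduler's API is implied.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int      # cores requested from the shared pool
    ram_pct: int    # percent of each node's RAM the job needs


def fits(jobs, total_cores=32, total_ram_pct=100):
    """True if the jobs can run side by side on both resources."""
    cores_ok = sum(j.cores for j in jobs) <= total_cores
    ram_ok = sum(j.ram_pct for j in jobs) <= total_ram_pct
    return cores_ok and ram_ok


big_ram_32 = Job("big-ram", cores=32, ram_pct=75)
big_ram_16 = Job("big-ram", cores=16, ram_pct=75)
cpu_bound = Job("cpu-bound", cores=16, ram_pct=10)

# Giving the large-RAM job all 32 cores leaves nothing for the second job.
print(fits([big_ram_32, cpu_bound]))   # False

# At 16 cores each, both jobs fit: 32 cores and 85% of node RAM in use.
print(fits([big_ram_16, cpu_bound]))   # True
```

Every extra resource (local disk, interconnect bandwidth, licenses) adds
another sum to check, which is one way to see why the scheduling gets
difficult quickly.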

>> An interesting analogy, and further, the thread has been interesting.
>> However, it doesn't even begin to really address near-realtime
>> processing requirements.  Examples of these are common in the weather
>> modeling I'm engaged in.  In some cases, looking at severe weather and
>> predictive models, a model needs to initiate shortly after a watch or
>> warning is issued, something that's controlled by humans and is not
>> scheduled, hence somewhat difficult to model for job scheduling.  These
>> models would likely be re-run with new data assimilated into the
>> forcings, and a new solution produced.  Similarly, models of toxic
>> release plumes are unscheduled events with a high priority and low
>> queue-wait time requirement.
>>
>> Other weather models are more predictable but have fairly hard
>> requirements for when output must be available.
>>
>> Conventional batch scheduling handles these conditions pretty poorly.  A
>> full queue with even reasonable matching of available cores to request
>> isn't likely to get these jobs out very quickly on a loaded system.
>> Preemption is the easy answer but unpopular with administrators who have
>> to answer the phone, users whose jobs are preempted (some never to see
>> their jobs return), and the guy who's the preemptor... who gets blamed
>> for all the problems.  Worse, arbitrary preemption assignment means
>> someone made a value judgment that someone's science is more important
>> than someone else's, a sure plan for troubles when the parties all
>> gather somewhere... like a faculty meeting.
> 
> This may mark me as hopelessly naive, but for the emergency critical
> use clusters, couldn't there be a terms of use agreement in place
> stating that the purpose of the cluster is for the emergency events
> and that non-emergency usage, while allowed to make the cluster create
> more value for itself, is subject to preemption in emergency
> situations?  Maybe have some sort of policy in place to give restarts
> of preempted jobs an earlier place in the post-emergency queue?  At
> least this way folks might be upset that their jobs died unexpectedly
> due to preemption, but reasonable folks (I know, a big assumption
> here) will understand that this was explained at the beginning.

This is great for a special purpose cluster but these are getting harder 
to create a business case for.  Instead, utilization is, indeed, a 
metric folks look at.  A cluster essentially reserved for near-real-time 
requirements and backfilled is likely to be less utilized.  It's also 
likely to be smaller because you're not going to find several groups 
willing to finance a cluster, but then say, "Your 
research/application's so much more important than mine that I'll be a 
background task all the time."  At least in my experience, that's not 
happened so far.
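
For concreteness, the requeue policy suggested in the quoted text
(emergency work first, preempted jobs restarted ahead of everything that
was merely waiting) could be sketched as a priority queue.  This is a
hypothetical Python sketch: the three rank levels, the class name, and the
job names are invented for illustration and do not correspond to any
particular batch system.

```python
import heapq
import itertools

# Lower rank runs first.  These three levels are an assumption.
EMERGENCY, REQUEUED, NORMAL = 0, 1, 2


class JobQueue:
    """Queue ordered by rank, then by submission order within a rank."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()   # tie-breaker: first come, first out

    def submit(self, name, rank=NORMAL):
        heapq.heappush(self._heap, (rank, next(self._seq), name))

    def preempt(self, running):
        """Put killed jobs back in the queue ahead of normal waiting jobs."""
        for name in running:
            self.submit(name, REQUEUED)

    def drain(self):
        """Return the remaining job names in the order they would start."""
        order = []
        while self._heap:
            order.append(heapq.heappop(self._heap)[2])
        return order


queue = JobQueue()
queue.submit("job-a")                       # waiting before the emergency
queue.submit("job-b")
running = ["job-x", "job-y"]                # on the cluster when it hits

queue.submit("plume-model", EMERGENCY)      # unscheduled, high priority
queue.preempt(running)

print(queue.drain())
# ['plume-model', 'job-x', 'job-y', 'job-a', 'job-b']
```

The sketch says nothing about who gets to assign the EMERGENCY rank, which
is the part that is hard to settle.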

>> OK, so I've laid out a piece of the problem.  I've got some ideas on
>> solutions, and avenues for investigation to address these but I'd like
>> to see others' ideas.  I don't want to influence the outcome any more
>> than I already have.
>>
>> Oh, and, yeah, I'm aware of SPRUCE but I see a few potential problems
>> there, although that framework has some potential.
>>
>> gc
>> --
>> Gerry Creager -- gerry.creager at tamu.edu

-- 
Gerry Creager -- gerry.creager at tamu.edu
Texas Mesonet -- AATLT, Texas A&M University	
Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983
Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843