Linux cpusets and HPC (was Re: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?)
Chris Samuel csamuel at vpac.org
Thu Aug 14 03:42:10 PDT 2008
----- "Paul Jackson" <pj at sgi.com> wrote:

Hi Paul,

> Let me see if I understand this. Is the following right:
>
> Without the cpuset constraint, such a 'bad' job could tell the
> cluster management software (PBS or Torque or ...) it needed just
> one CPU, which could end up putting it on a cluster node with
> say eight CPUs, along with some other jobs that expect to use the
> other seven CPUs.

It's the user that specifies how many CPUs their particular
problem will require, but effectively that's correct.

> But then OpenMP code in that 'bad' job could notice it had eight
> CPUs, think to itself 'wow - cool', and proceed to hog all eight
> CPUs, messing up those other jobs.

That's correct, some OpenMP codes automatically detect the number
of cores in a system and will (if not told otherwise) use them all.

Alternatively some users forget they've reduced the number of cores
they've said the job needs and have (in a config file say) still a
larger number specified.

> With the cpuset constraint, that 'bad' job -will- only be able to
> use that one CPU, and if OpenMP or other code in that job can't
> deal reasonably with that circumstance, well, tough, the owner of
> that job should fix something.

Well, "tough" might be a tad hard on them, but yes.

> But at least the other jobs that were
> hoping to use the other seven CPUs won't be bothered much by this.

Spot on.

> Did I say that right?

That's a pretty fair summary for our main use case, yes.

The memory locality possibilities are also important, just not
currently covered by Torque and will probably require more smarts
in the pbs_mom and how it detects core/socket relationships,
locality, hyperthreading/SMT, etc..

It will also change how it reports that to the pbs_server and then
how the scheduler (Maui or Moab) allocates resources to jobs based
upon policies and what the job has requested.

cheers,
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
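
As a rough illustration of the difference discussed above between "CPUs in the node" and "CPUs this job is allowed to use", here is a minimal Python sketch. It is not taken from Torque or any OpenMP runtime; the helper names usable_cpu_count and choose_thread_count are hypothetical. It assumes a Linux system, where os.sched_getaffinity() reports the CPUs the calling process may run on, which is the set a cpuset narrows.

```python
import os


def usable_cpu_count():
    """Return how many CPUs this process is allowed to run on.

    Hypothetical helper for illustration only.
    """
    try:
        # Linux-only call: the set of CPU ids the calling process
        # (pid 0 means "self") may be scheduled on.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Platforms without sched_getaffinity: fall back to the
        # node-wide count, which is what the 'bad' job effectively uses.
        return os.cpu_count() or 1


def choose_thread_count(requested=None):
    """Pick a thread count that never exceeds the allowed CPUs.

    'requested' stands in for a value from a config file or the
    environment; it is an assumption, not a Torque or OpenMP setting.
    """
    allowed = usable_cpu_count()
    if requested is None:
        return allowed
    # Clamp to at least 1 and at most the allowed CPU count.
    return max(1, min(requested, allowed))


if __name__ == "__main__":
    print("CPUs in the node:     ", os.cpu_count())
    print("CPUs this job may use:", usable_cpu_count())
    print("Threads to start:     ", choose_thread_count())
```

Run inside a one-CPU cpuset on an eight-CPU node, the first figure would still describe the whole node while the second would reflect only the permitted CPU. A code that sizes its thread pool from the first figure would oversubscribe the CPU it was given, but would no longer disturb the other jobs.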
