[Beowulf] Mixing 32-bit compute nodes with 64-bit head nodes

Wed May 10 18:53:22 PDT 2006

Joe Landman wrote:
> Andrew D. Fant wrote:
> 
>> I know that the common wisdom on this subject is "don't do that", but for
> 
> 
> Shouldn't be an issue if you have a sane distribution and distribution 
> load system, a way to automatically handle the ABI (bit width) during 
> installation/package selection.  Distros which do this (mostly) 
> correctly include FCx, SuSE, Centos, ...

We're using Gentoo, so I'm not worried about the system having ABI issues. 
Compiling from scratch has eliminated most of the shared library and bit-width 
issues I used to worry about.

>> into.  The 64-bit motivation is mostly about providing adequate memory for
>> multiple users running gui applications.
> 
> 
> Hmmm... so you want to provide a single 64 bit machine to run GUI code 
> on rather than hacking stuff for the cluster?  Assuming I understood 
> this right, apart from contention for that resource, this should be 
> fine.  Is there any reason why the SGE/PBS methods (qrsh/qsub -I) 
> wouldn't work?  Or is this the pain of which you speak?

We're using a certain proprietary distributed load management (DLM) facility that
uses its own job launch protocol and doesn't support ssh tunneling of $DISPLAY
back to the desktop ($0.0021 if you can guess which one it is).  We are forced 
to have wrapper scripts that launch the application on a compute node after 
asking the DLM for the least loaded system.  We've found that it's not trivial 
to validate things like $CWD and pass it cleanly to the remote system, so the 
scripts are pretty limited in what they can do.  Also, because the jobs don't 
run under the auspices of the DLM, they don't show up in the batch accounting 
logs and I end up having to mangle the pacct results from 40 different systems 
to satisfy the reporting requirements of my (multiple) management chains.  My 
uncle used to tell me that there is no problem that cannot be solved by the 
suitable application of high explosives.  Given my limited staff and time, 
throwing a big box at the problem seems easiest.
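
A minimal sketch of the working-directory check such a wrapper needs, assuming 
the application is started over "ssh -X" and that a few directories are mounted 
at the same path on every node.  The SHARED_ROOTS list, the function names and 
the host argument are all hypothetical; asking the DLM for the least loaded 
host is left to the caller.

```python
import os
import shlex

# Hypothetical: directories assumed to be mounted identically on every node.
SHARED_ROOTS = ("/home", "/scratch")


def is_shared_path(path, shared_roots=SHARED_ROOTS):
    """Return True if path lies inside one of the shared roots."""
    real = os.path.realpath(path)
    for root in shared_roots:
        root = os.path.realpath(root)
        if real == root or real.startswith(root + os.sep):
            return True
    return False


def build_remote_command(host, cwd, argv, shared_roots=SHARED_ROOTS):
    """Build an ssh argument list that runs argv in cwd on host.

    Raises ValueError when cwd is not on a shared filesystem, since the
    directory would not exist (or would differ) on the compute node.
    """
    if not is_shared_path(cwd, shared_roots):
        raise ValueError("working directory is not on a shared filesystem: " + cwd)
    # Quote every piece so spaces and shell metacharacters survive the
    # remote shell unchanged.
    remote = "cd {} && exec {}".format(
        shlex.quote(cwd), " ".join(shlex.quote(a) for a in argv)
    )
    return ["ssh", "-X", host, remote]
```

The returned list can be handed straight to subprocess.call().  Quoting the 
directory and arguments on the local side is what keeps a path with spaces 
from being split by the remote shell.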

>> Has anyone had any success with this approach, or failing that, any horror
>> stories that would support the more flexible approach of separating the shell
>> server from the head node?
> 
> 
> I think this is actually a good practice.  You really don't want users 
> logging onto a management node to run jobs. You would likely prefer them 
> to run on some sort of user-login-node.  Lots of cluster distros do fuse 
> these two.  This is assuming a non-SSI machine (e.g. not 
> Scyld/bproc/Clustermatic/...).

Yeah.  I don't see many people doing this, and it surprises me.  If the users 
can connect to a machine that isn't the super-duper must-be-up system, it means 
that adding front-end capacity and fault-tolerance with simple round-robin DNS 
becomes easier, and it means that a user bzipping up their last set of runs on 
the head node won't siphon off all the CPU cycles to the point that system 
logging and batch scheduling are starved.  (Don't laugh: I had a user fire off 10 
bzips on a 2-CPU system and set my pager off in the middle of a staff meeting. 
If cluster management had been on the user node, bad things could have happened.)

> The only major issue is that if they then submit a job with a binary 
> which happens to be the wrong ABI, you will get lots of dud runs and 
> unhappy users.  You can fix that with some clever defaults on the 
> submission side for each user-login-node.
> 

This is what my boss and I were worried about as we wargamed various scenarios 
this afternoon.  My bias is to put some prelaunch scripts in place that verify 
binary formats for jobs coming from the "central" login node, and bomb out if 
someone tries to run a 64-bit binary on a 32-bit machine.  For jobs that 
originate on the "local" user login node, I am more inclined to avoid that pain, 
since people who are power-user enough to use the local cluster user node ought 
to be smart enough to understand executable formats.  It's people using the 
central shell server who might get confused.
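
A minimal sketch of what such a prelaunch check could look like, assuming the 
job's executable is an ELF file and that the script is told the word size of 
the target node.  The function names are hypothetical.  The only format fact 
relied on is that byte 4 of an ELF header (EI_CLASS) is 1 for 32-bit and 2 for 
64-bit objects.

```python
ELF_MAGIC = b"\x7fELF"

# EI_CLASS values from the ELF specification.
ELF_CLASS_BITS = {1: 32, 2: 64}


def elf_bits(path):
    """Return 32 or 64 for an ELF file, or None if the file is not ELF."""
    with open(path, "rb") as f:
        ident = f.read(5)
    if len(ident) < 5 or ident[:4] != ELF_MAGIC:
        return None
    return ELF_CLASS_BITS.get(ident[4])


def check_job_binary(path, node_bits):
    """Return (ok, message) for running path on a node of node_bits width.

    Only the 64-bit-binary-on-32-bit-node case is rejected.  Non-ELF files
    (shell scripts, for example) are passed through unchecked.
    """
    bits = elf_bits(path)
    if bits is None:
        return True, "not an ELF binary, not checked"
    if bits > node_bits:
        return False, "%d-bit binary cannot run on a %d-bit node" % (bits, node_bits)
    return True, "%d-bit binary is acceptable on a %d-bit node" % (bits, node_bits)
```

On a 32-bit node the prelaunch hook would call check_job_binary(path, 32) and 
exit non-zero when the first element of the result is False.  A 32-bit binary 
on a 64-bit node is allowed through here, which assumes the 32-bit 
compatibility libraries are installed on that node.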

Now, if we could only get more application vendors to accept that separating the 
GUI and the computational engine via a batch engine is a good idea.  If the 
government can make this work with Ecce, and Ansys can do it, why can't (for the 
sake of argument) Fluent understand that someone might want to set up a big 
calculation in the GUI without running the simulation on the same system?  And 
don't get me started on vendors that ship a version of MPI that depends on a 
static list of hosts.  How many people really build a cluster for each application?

Andy

-- 
Andrew Fant    | And when the night is cloudy    | This space to let
Molecular Geek | There is still a light          |----------------------
fant at pobox.com | That shines on me               | Disclaimer:  I don't
Boston, MA     | Shine until tomorrow, Let it be | even speak for myself