[Beowulf] Joe Blaylock's notes on running a MacOS cluster, Nov. 2007

Robert G. Brown rgb at phy.duke.edu
Tue Nov 20 12:27:12 PST 2007


On Tue, 20 Nov 2007, Kragen Javier Sitaker wrote:

> This is not strictly about Beowulfs, but it is probably of interest to
> their users.

Hey Kragen, long time no, um, read. Or see.  Or something.

Glad to see you're still on the list.

   rgb

>
> My friend Joe's team from Indiana University just fielded a MacOS
> cluster for the Supercomputing '07 Cluster Challenge.  His experiences
> weren't that great; I encouraged him to jot something quick down so that
> other people could benefit from his hard-won lessons.
>
> There's more information about the challenge at
> http://sc07.supercomp.org/?pg=clusterchlng.html&pp=conference.html.
>
> ----- Forwarded message from Kragen Javier Sitaker <kragen at pobox.com> -----
>
> From: Kragen Javier Sitaker <kragen at pobox.com>
> To: kragen-fw
> Subject: Joe Blaylock's notes on running a MacOS cluster, Nov. 2007
>
>  Disordered thoughts on using MacOS X for HPC.
>
> By Joe Blaylock, 2007-11.
>
>
>    Recollections:
>
>    * We were the first people ever to try that particular combination:
>      Tiger on Xeons with Intel's ICC 10 compiler suite and MKL linear
>      algebra libraries. Breaking new ground is never easy.
>    * We didn't use XGrid or Apple's cluster management stuff, only
>      Server Admin and ARD.
>    * POV-Ray was easy; Open MPI was easy; using Myrinet over 10-Gigabit
>      Ethernet was easy.
>    * GAMESS was more challenging, but we got it working somewhat. We
>      still don't know how to run jobs of type CCSD(T), which require
>      System V shared memory.
>    * We never got POP (the Parallel Ocean Program) to work.
>    * Apparently, ICC 10 has some bugs. Several times, while building
>      (IIRC) GAMESS or POP, it would give illegal-instruction errors
>      during the compile itself. Or it would build a binary that, when
>      run, would do something horrible like hang the machine (probably a
>      bad interaction between icc and Mac OS X).
>    * Open Directory doesn't seem ready for prime time. It's pretty easy
>      to set up, but it's unreliable and mysterious. In Mac OS X, there
>      seems to be a fundamental disconnect between things in the CLI
>      world and things in the GUI world. Setting something up in one
>      place won't necessarily be reflected in the other place. I'm sure
>      that this is all trivial if you're a serious Darwin user, but none
>      of us were. For example, you set up your NFS exports in the Server
>      Admin tool, rather than by editing /etc/exports; the Admin tool
>      won't put anything into /etc/exports. So if you're on the command
>      line, how do you check what you're exporting? (One approach is
>      sketched after this list.) With the complexity of LDAP, this
>      becomes a real problem. You set up accounts on your head node and
>      tell it to export that information. But sometimes you'd create an
>      account and then be unable to log into it on a node. If you're
>      ssh'd in from the outside, where do you check to see (from the
>      command line) what the authentication system is doing? Our local
>      Mac guru couldn't tell us. And then you'd create another account,
>      and the first one would start working again. WTF?
>    * This may be the most frustrating thing about working with OS X
>      Server. The CLI is the redheaded stepchild, and lots of HPC is
>      mucking about on the command line. You can use VNC to connect to
>      ARD (but only if a user is logged in on the desktop and running
>      ARD!), but it's slow, and it only provides desktop control, not
>      cluster management. ARD can then be run on the desktop to provide
>      desktop control of the nodes in the cluster, and some cluster
>      management: run a Unix command everywhere, shut nodes down, etc.
>      There were a handful of tasks that seemed important but that I
>      couldn't figure out how to do on the command line at all. The most
>      heinous of these is adding and removing users to/from LDAP (a
>      possible dscl incantation is sketched after this list).
>    * Most of the time, I found it more convenient to use a 'for' loop
>      that would ssh to each node to run some command for me (a minimal
>      version appears after this list).
>    * Mac OS X lacks a way to do CPU frequency scaling. This killed us
>      in the competition: we couldn't scale cores down to save on our
>      power budget, we could only leave them idle.
>    * Being a Linux dude, I found having to have license keys for my
>      operating systems, and (separately) my administration and
>      management tools, to be odious in the extreme. Having to
>      separately license ICC and IFORT and MKL just added frustration
>      and annoyance.
>    * We didn't make detailed performance comparisons between stuff
>      built with the Intel suite and things built with, e.g., the GNU
>      suite and GotoBLAS. We were too busy just trying to get everything
>      to work. I'm sure that Intel produces better code under normal
>      circumstances, but we had lots of cases where version 10 couldn't
>      even produce viable binaries. So, make of that what you will.
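>
>    A sketch of checking exports from the command line, regardless of
>    which tool configured them: showmount asks the running mountd what
>    is actually being exported, so it sidesteps the question of where
>    the configuration lives. Untested on Tiger Server, so treat it as a
>    starting point; 'headnode.local' is a placeholder hostname.
>
>        # list what this host's mountd is currently exporting
>        showmount -e localhost
>        # or ask the head node from one of the compute nodes
>        showmount -e headnode.local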
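>
>    A possible command-line route for the LDAP user problem is dscl, the
>    Directory Service CLI that ships with OS X. We never proved this out
>    against the Open Directory/LDAP node ourselves; the directory admin
>    name, password, user name, and IDs below are all placeholders.
>
>        # 'diradmin', 'adminpass', 'jblaylock', and the IDs are placeholders
>        DS="dscl -u diradmin -P adminpass /LDAPv3/127.0.0.1"
>        # create the user record in the shared LDAP domain
>        $DS -create /Users/jblaylock
>        $DS -create /Users/jblaylock UserShell /bin/bash
>        $DS -create /Users/jblaylock UniqueID 1050
>        $DS -create /Users/jblaylock PrimaryGroupID 20
>        $DS -create /Users/jblaylock NFSHomeDirectory /Users/jblaylock
>        $DS -passwd /Users/jblaylock changeme
>        # and to remove it again
>        $DS -delete /Users/jblaylock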
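>
>    The for-loop-over-ssh approach is nothing fancy; a minimal version
>    (node names are whatever yours happen to be):
>
>        # run one command on every node, one node at a time
>        for n in node01 node02 node03 node04; do
>            echo "== $n =="
>            ssh "$n" 'uptime'
>        done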
>
>
>    What I would recommend (if you were going to use MacOS X):
>
>    * Learn Darwin, in detail. Figure out the CLI way to do everything,
>      and do it. In fact, forget Mac OS X; just use Darwin. Learn the
>      system's error codes, figure out how to manipulate fat binaries
>      with lipo (and how to strip them down to skinny ones; see the
>      sketch after this list), be able to manipulate users, debug the
>      executing binaries, etc. Consider looking into the Apple disk
>      imaging widget so you can boot the nodes diskless.
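>
>    A quick lipo sketch for the fat-binary manipulation mentioned above;
>    './a.out' and the i386 architecture are just examples.
>
>        # see which architectures a fat binary contains
>        lipo -info ./a.out
>        # strip it down to a single (skinny) architecture
>        lipo ./a.out -thin i386 -output ./a.out.i386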
>
>
>    What I would do differently (whether I stick with MacOS X or not):
>
>    * diskless clients
>    * Flash drive for head node
>    * no GPUs
>    * Get Serial Console set up and available, even if you don't use it
>      routinely
>    * CPU Frequency Scaling!!
>    * Many more, smaller cores. We had 36 at 3 GHz; this was crazy. We
>      were way power hungry.
>    * Go to Intel 45nm dies.
>
>
> ----- End forwarded message -----
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


