[Beowulf] Joe Blaylock's notes on running a MacOS cluster, Nov. 2007
Robert G. Brown
rgb at phy.duke.edu
Tue Nov 20 12:27:12 PST 2007
On Tue, 20 Nov 2007, Kragen Javier Sitaker wrote:
> This is not strictly about Beowulfs, but it is probably of interest to
> their users.
Hey Kragen, long time no, um, read. Or see. Or something.
Glad to see you're still on the list.
> My friend Joe's team from Indiana University just fielded a MacOS
> cluster for the Supercomputing '07 Cluster Challenge. His experiences
> weren't that great; I encouraged him to jot something quick down so that
> other people could benefit from his hard-won lessons.
> There's more information about the challenge at
> ----- Forwarded message from Kragen Javier Sitaker <kragen at pobox.com> -----
> From: Kragen Javier Sitaker <kragen at pobox.com>
> To: kragen-fw
> Subject: Joe Blaylock's notes on running a MacOS cluster, Nov. 2007
> Disordered thoughts on using MacOS X for HPC.
> By Joe Blaylock, 2007-11.
> * We were the first people to ever try that particular combination:
> Tiger on Xeons with Intel's ICC 10 compiler suite and MKL linear
> algebra libraries. Blazing new territory is never easy.
> * We didn't use XGrid or Apple's cluster management stuff, only
> Server Admin and ARD.
> * Pov-Ray was easy; OpenMPI was easy; using Myrinet over 10Gig
> Ethernet was easy.
> * GAMESS was more challenging, but we got it working somewhat. We
> still don't know how to run jobs of type ccsd(t), which require
> System V shared memory.
> * We never got POP to work.
> * Apparently, ICC 10 has some bugs. There were several times when we
> were trying to use it to build, IIRC, GAMESS or POP, and it would
> give illegal instruction errors during compile. Or it would build
> a binary that we would run, and then it would do something
> horrible, like hang the machine (probably a bug interaction
> between icc and MacOS X).
> * OpenDirectory doesn't seem ready for prime time. It's pretty easy
> to set up, but it's unreliable and mysterious. In MacOS X, there
> seems to be a fundamental disconnect between things in the CLI
> world and things in the GUI world. Setting something up in one
> place won't necessarily be reflected in the other place. I'm sure
> that this is all trivial, if you're a serious Darwin user. But
> none of us were. So for example, you set up your NFS exports in
> the Server Admin tool, rather than by editing /etc/exports. The
> Admin tool won't put anything into /etc/exports. So if you're on
> the command line, how do you check what you're exporting? With the
> complexity of LDAP, this becomes a real problem. You set up
> accounts on your head node, and say to export that information.
> But perhaps you create an account, but can't log into it on a
> node. If you're ssh'd in from the outside, where do you check to
> see (from the command-line) what the authentication system is
> doing? Our local Mac guru couldn't tell us. And then you'd create
> another account, and the first one would start working again. WTF?
> * This may be the most frustrating thing about working with OS X
> Server. The CLI is the redheaded stepchild, and lots of HPC is
> mucking about on the command-line. You can use VNC to connect to
> ARD (but only if a user is logged in on the desktop and running
> ARD!), but it's slow, and only provides desktop control, not
> cluster management. ARD can then be run on the desktop, to provide
> desktop control of the nodes in the cluster, and some cluster
> management: run a Unix command everywhere, shut nodes down, etc.
> There were a handful of tasks which seemed important, but which I
> couldn't figure out how to do on the command-line at all. The most
> heinous of these is adding and removing users to/from LDAP.
> * Most of the time, I found it more convenient to use a 'for' loop
> that would ssh to nodes to run some command for me.
> * MacOS X lacks a way to do CPU frequency scaling. This killed us in
> the competition. We couldn't scale cores to save on our power
> budget, we could only leave them idle.
> * Being a Linux dude, I found having to have license keys for my
> operating systems, and (separately) my administration and
> management tools, to be odious in the extreme. Having to
> separately license ICC and IFORT and MKL just added frustration
> and annoyance.
> * We didn't make detailed performance comparisons between stuff
> built with the intel suite and things built with, e.g., the GNU
> suite and GotoBLAS. We were too busy just trying to get everything
> to work. I'm sure that Intel produces better code under normal
> circumstances, but we had lots of cases where version 10 couldn't
> even produce viable binaries. So, make of that what you will.
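The "'for' loop that would ssh to nodes" mentioned above is the usual poor man's cluster shell. A minimal sketch of the idea, written in Python rather than shell, might look like the following. The node names are hypothetical, and it assumes passwordless ssh keys are already in place on every node; it is an illustration, not what Joe's team actually ran.

```python
import subprocess

# Hypothetical node names; substitute the cluster's real hostnames.
NODES = ["node01", "node02", "node03"]


def ssh_argv(node, command):
    """Build the argv that runs one command on one node over ssh."""
    # BatchMode=yes makes ssh fail instead of stopping to ask for a password.
    return ["ssh", "-o", "BatchMode=yes", node, command]


def run_everywhere(nodes, command, runner=subprocess.run):
    """Run command on each node in turn; return {node: (returncode, stdout)}."""
    results = {}
    for node in nodes:
        proc = runner(ssh_argv(node, command), capture_output=True, text=True)
        results[node] = (proc.returncode, proc.stdout)
    return results


# Example call (not executed here, since it needs reachable nodes):
# for node, (rc, out) in run_everywhere(NODES, "uptime").items():
#     print(node, rc, out.strip())
```

The loop is serial on purpose: output stays in node order and a dead node is easy to spot by its nonzero return code. The runner argument exists only so the loop can be exercised without a real cluster.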
> What I would recommend (if you were going to use MacOS X):
> * Learn Darwin, in detail. Figure out the CLI way to do everything,
> and do it. In fact, forget Mac OS X; just use Darwin. Learn the
> system's error codes, figure out how to manipulate fat binaries
> (and how to strip them to make skinny ones), be able to manipulate
> users, debug the executing binaries, etc. Consider looking into
> the Apple disk imaging widget so you can boot the nodes diskless.
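On the fat binary point: the stock tool for this on MacOS X is lipo, which can list the architectures in a universal binary and extract a single one. A small sketch of wrapping it from Python is below. The file names and the choice of architecture are hypothetical, and it only does anything useful on a machine that actually has lipo installed.

```python
import subprocess


def lipo_info_argv(path):
    """argv that asks lipo which architectures a binary contains."""
    return ["lipo", "-info", path]


def lipo_thin_argv(fat_path, arch, thin_path):
    """argv that extracts one architecture from a fat binary."""
    return ["lipo", fat_path, "-thin", arch, "-output", thin_path]


def thin(fat_path, arch, thin_path, runner=subprocess.run):
    """Write a single-architecture copy of fat_path to thin_path."""
    # check=True raises CalledProcessError if lipo reports a failure,
    # for example when the requested architecture is not in the file.
    runner(lipo_thin_argv(fat_path, arch, thin_path), check=True)
    return thin_path


# Example call (hypothetical file names, needs lipo on the PATH):
# thin("gamess.fat", "x86_64", "gamess.x86_64")
```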
> What I would do differently (whether I stick with MacOS X or not):
> * diskless clients
> * Flash drive for head node
> * no GPUs
> * Get Serial Console set up and available, even if you don't use it
> * CPU Frequency Scaling!!
> * Many more, smaller cores. We had 36 at 3GHz. This was crazy. We
> were way power hungry.
> * Go to Intel 45nm dies.
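The reasoning behind "many more, smaller cores" and frequency scaling can be sketched with the usual first-order CMOS power model. This is a rough illustration only: it assumes supply voltage can be lowered roughly in proportion to clock within the scaling range, assumes ideal parallel scaling, and ignores static leakage and per-core overhead. None of the symbols or numbers come from the team's measurements.

```latex
\begin{align*}
P_{\text{dyn}} &\approx \alpha\, C\, V^{2} f
  && \text{(dynamic power of one core)} \\
V &\propto f
  && \text{(assumed, within the scaling range)} \\
P_{\text{total}} &\propto N f^{3}, \qquad T \propto N f
  && \text{($N$ cores, ideal parallel throughput $T$)} \\
P_{\text{total}} &\propto T f^{2}
  && \text{(at fixed $T$, more cores at lower $f$ cost less power)}
\end{align*}
```

Under those assumptions, halving the clock and doubling the core count keeps throughput the same at about a quarter of the dynamic power, which is why a fixed power budget tends to favor many slower cores.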
> ----- End forwarded message -----
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977