From csamuel at vpac.org Mon Jan 1 18:06:54 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri May 9 01:05:33 2008
Subject: [Beowulf] Which distro for the cluster?
In-Reply-To:
References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <4594F3E6.5010803@gmail.com>
Message-ID: <200701021306.57800.csamuel@vpac.org>

On Saturday 30 December 2006 04:24, Robert G. Brown wrote:

> On Fri, 29 Dec 2006, Geoff Jacobs wrote:
>
> > What I'd like to see is an interested party which would implement a
> > good, long term security management program for FC(2n+b) releases. RH
> > obviously won't do this.
>
> I thought there was such a party, but I'm too lazy to google for it.

Fedora Legacy. It's pretty much dead these days. :-(

http://fedoralegacy.org/

Important Notice: December 12, 2006

The current model for supporting maintenance distributions is being re-examined. In the meantime, we are unable to extend support to older Fedora Core releases as we had planned. As of now, Fedora Core 4 and earlier distributions are no longer being maintained.

--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: not available
Url : http://www.scyld.com/pipermail/beowulf/attachments/20070102/5c891c9d/attachment.bin

From csamuel at vpac.org Mon Jan 1 18:10:48 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] Which distro for the cluster?
In-Reply-To: <4594E874.9060905@gmail.com>
References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <4594E874.9060905@gmail.com>
Message-ID: <200701021310.48609.csamuel@vpac.org>

On Friday 29 December 2006 21:05, Geoff Jacobs wrote:

> Here's a bare bones kickstart method (not Kickstart[tm] per se):
> http://linuxmafia.com/faq/Debian/kickstart.html

Good old Rick, he crops up everywhere & is a mine of information. ;-)

> Regarding kickstart, among choices for pre-scripted installers it is one
> of many. I personally favor the likes of SystemImager, even though it's
> not quite in the same category (FAI is though, IMO). Even dd with netcat
> is pretty powerful for homogeneous nodes.

FAI is the one I've heard of before, but never had the chance to play with it yet. I hear tell that Warewulf is distro neutral and will deploy J.Random Distro onto hardware (and maybe even 'doze, shudder).

> Once you've chosen your distro based on experience/need, there are
> usually a few ways to put it on your spindles.

Oh indeed!

cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

From dag at sonsorol.org Tue Jan 2 13:06:08 2007
From: dag at sonsorol.org (Chris Dagdigian)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] picking out a job scheduler
In-Reply-To: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com>
References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com>
Message-ID:

For what it's worth I'm a biased Grid Engine and Platform LSF user ...
On Dec 29, 2006, at 11:40 AM, Nathan Moore wrote: > I've presently set up a cluster of 5 AMD dual-core linux boxes for > my students (at a small college). I've got MPICH running, shared > NIS/NFS home directories etc. After reading the MPICH installation > guide and manual, I can't say I understand how to deploy MPICH for > my students to use. So far as I can tell, there no load balancing > or migration of processes in the library, and so now I'm trying to > figure out what piece of software to add to the cluster to (for > example) prevent the starting of an MPI job when there's already > another job running. > > (1) Is openPBS or gridengine the appropriate tool to use for a > multi-user system where mpich is available? Are there better > scheduling options? > Both should be fine although if you are considering *PBS you should look at both Torque (a fork of OpenPBS I think) and PBSPro (commercial but last time I checked they had very good options for academic sites). I can't speak intelligently about the PBS variants these days... it's been too long since I've been hands on. Lots of people use Grid Engine with MPICH using both loose and tight integration methods. The mailing list (users@gridengine.sunsource.net) has a very helpful community with an excellent signal to noise ratio. Despite being an SGE zealot there are times when I can make both a technical and business argument for why Platform LSF is the "best" solution for a particular project or problem -- you may want to add this to your evaluation plate if you are considering (at all) commercial options. If not, don't sweat it. For a small cluster in an academic environment LSF may be hard to justify but if you can get good academic pricing it is often worthwhile to crunch the numbers -- LSF in some cases can 'win' from a features, lower-administrative- burden and support perspective but this a case-by-case thing. > (1.5) Can mortals install and configure Gridengine? 
Thus far it > seems too wonderful for me to understand. Grid Engine is easy to install. I've posted an article here that covers the stuff I wish someone had told me beforehand about SGE: "Things to think about before installing Grid Engine" http://gridengine.info/articles/2005/09/29/things-to-think-about- before-installing ... it boils down to the fact that during installation SGE is unusually sensitive to issues regarding hostnames and forward/reverse DNS resolution. > > (2) Also, if my cluster is made up of a mix of single and dual > processor machines, what's the proper way to tell mpd about that > topology? Depends on which MPI implementation and which of the many available methods you are using to bootstrap the process. > > (3) Its likely that in the future I'll have part-time access to > another cluster of dual-boot (XP/linux) machines. The machines > will default to booting to Linux, but will occasionally (5-20 hours > a week) be used as windows workstations by a console user (when a > user is finished, they'll restart the machine and it will boot back > to linux). If cluster nodes are available in this sort of > unpredictable and intermittent way, can they be used as compute > nodes in some fashion? Wil gridengine/PBS /??? take care of this > sort of process migration? > Grid Engine will not transparently preserve and migrate running jobs off of machines that get bounced suddenly. This sort of transparent and automatic checkpointing and migration is actually pretty hard to do in practice. If you know in advance which machines are going to be shut down and rebooted into windows then there are tools in all the common scheduling packages for "draining" a particular machine or queue. You can also "kill and reschedule" jobs that are running on any one queue instance or cluster queue. One can even do this on a calendar basis when the "need windows" schedule is predictable (does not seem possible in your case). 
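For Grid Engine, the "drain" and "kill and reschedule" steps described above might be scripted roughly as follows. This is only a sketch under assumptions: it relies on the `qmod -d` (disable a queue instance) and `qmod -rq` (reschedule the jobs running in a queue) options as I understand them from SGE 6, the queue instance name `all.q@node07` is made up, and the function names are hypothetical. Check qmod(1) on your own installation before trusting any of it; rescheduling generally requires manager privileges and jobs that can be rerun.

```python
import subprocess


def drain_commands(queue_instance):
    """Return the qmod invocations that drain one queue instance.

    queue_instance is a name such as "all.q@node07" (hypothetical).
    """
    return [
        # Disable the queue instance so no new jobs are dispatched to it.
        ["qmod", "-d", queue_instance],
        # Ask Grid Engine to reschedule the jobs currently running there.
        ["qmod", "-rq", queue_instance],
    ]


def drain(queue_instance, dry_run=True):
    """Print the commands (dry run) or actually execute them in order."""
    for cmd in drain_commands(queue_instance):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops at the first qmod call that fails.
            subprocess.run(cmd, check=True)
```

Calling `drain("all.q@node07")` only prints the two commands; pass `dry_run=False` on a host with Grid Engine admin rights to run them. A matching `qmod -e` would re-enable the queue instance once the machine has booted back into Linux.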
If the running cluster jobs are short lived, so that you don't have a big runtime investment, then you can bounce machines whenever you want: Grid Engine can be told to reschedule failed jobs automatically to a different available host. The hard case to deal with is the very long running jobs that (a) can't be reliably checkpointed or (b) are difficult to suspend/resume/migrate due to the parallel application itself. The answer may be application specific in your case.

Regards,
Chris

> best regards,
>
> Nathan
>
> - - - - - - - - - - - - - - - - - - - - - - -
> Nathan Moore
> Physics
> Winona State University
> nmoore@winona.edu
> AIM:nmoorewsu
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From rgb at phy.duke.edu Tue Jan 2 15:32:09 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri May 9 01:05:34 2008
Subject: FW: [Beowulf] Which distro for the cluster?
In-Reply-To: <3D92CA467E530B4E8295214868F840FE0A317F81@emss01m12.us.lmco.com>
References: <3D92CA467E530B4E8295214868F840FE0A317F81@emss01m12.us.lmco.com>
Message-ID:

On Thu, 28 Dec 2006, Cunningham, Dave wrote:

> I notice that Scyld is notable by its absence from this discussion. Is
> that due to cost, or bad/no experience, or other factors? There is a
> lot of interest in it around my company lately.

Scyld is a fine choice for a cluster, but not usually for a first time learning cluster for non-professionals. This is in part because it costs money, and in part because it is designed to encapsulate a lot of what one has to do to "make a cluster" to the point where it is nearly entirely hidden from the user/administrator.
This is desireable from a corporate point of view (although I personally think that one needs a certain amount of actual cluster experience to get the most out of even Scyld) but not so good for poor people seeking to learn. It also limits you at least somewhat to the particular parallel computing model that Scyld itself embraces. A good friend of mine at Duke uses Scyld for his biochemistry cluster, and although he's been doing cluster computing for a rather long time (close to 10 years at a guess, maybe even more) and COULD and HAS IN THE PAST done it all himself, he really likes Scyld's general cluster administration and encapsulation features. Of course the grants that fund the research are deep-pocketed enough to afford it, as well. That isn't always the case in academe, and it really isn't the case at home. However, Don Becker is on the list and you've given him an open invitation to present Scyld, who it is really designed and intended for, and maybe even an overview of how it (currently) works. Don? rgb > > Dave Cunningham > > -----Original Message----- > From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] > On Behalf Of Andrew M.A. Cater > Sent: Thursday, December 28, 2006 8:40 AM > To: beowulf@beowulf.org > Subject: Re: [Beowulf] Which distro for the cluster? > > On Wed, Dec 27, 2006 at 06:46:25PM +0100, Chetoo Valux wrote: >> Dear all, >> >> As a Linux user I've worked with several distros as RedHat, SuSE, > Debian and >> derivatives, and recently Gentoo. >> >> Now I face the challenge of building a HPC for scientific > calculations, and >> I wonder which distro would suit me best. As a Gentoo user, I've > recognised >> the power of customisation, optimisation and lightweight system, for >> instance my 4 years old laptop flies like a youngster, and some > desktops >> too. So I thought about building the HPC nodes (8+1 master) with > Gentoo .... 
>> > > Don't use Gentoo unless you've a full, fast connection to the internet > _AND_ you're prepared for your cluster to be internet connected while > you build it. This IMHO. > > Scientific calculations: Quantian? Debian. Debian for the number of math > > and other packages and the ease of install. Over 8 nodes, it should be > relatively easy to set up. But it depends what you want to do, what > other users want to do etc. etc. > >> But then it comes the administration and maintenance burden, which for > me it >> should be the less, since my main task here is research ... so > browsing the >> net I found Rocks Linux with plenty of clustering docs and > administration >> tools & guidelines. I feel this should be the choice in my case, even > if I >> sacrifice some computation efficiency. > > Rocks / Warewulf perhaps. If you just want something you can > build/update/maintain in your sleep, I'd still suggest Debian - if only > because a _minimal_ install on the nodes is as small as you want it to > be - and because it's fairly consistent. Your cluster - your choice but > you may have to justify it to your co-workers. > > Andy >> >> Any advice on this will be appreciated. >> >> Chetoo. > >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From rgb at phy.duke.edu Tue Jan 2 15:44:50 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] picking out a job scheduler
In-Reply-To:
References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com>
Message-ID:

On Tue, 2 Jan 2007, Chris Dagdigian wrote:

>> (3) It's likely that in the future I'll have part-time access to another
>> cluster of dual-boot (XP/linux) machines. The machines will default to
>> booting to Linux, but will occasionally (5-20 hours a week) be used as
>> windows workstations by a console user (when a user is finished, they'll
>> restart the machine and it will boot back to linux). If cluster nodes are
>> available in this sort of unpredictable and intermittent way, can they be
>> used as compute nodes in some fashion? Will gridengine/PBS /??? take care of
>> this sort of process migration?
>>
>
> Grid Engine will not transparently preserve and migrate running jobs off of
> machines that get bounced suddenly. This sort of transparent and automatic
> checkpointing and migration is actually pretty hard to do in practice. If
> you know in advance which machines are going to be shut down and rebooted
> into windows then there are tools in all the common scheduling packages for
> "draining" a particular machine or queue. You can also "kill and reschedule"

For what it is worth, the current generation of Condor can, for some code that is linked with its own migration library, permit transparent checkpointing and code migration. It also has a very complex "policy" engine that lets one specify in great detail how to turn jobs on and off as user/owners use the systems in the pool. It has recently become "true open source", although the download website is still a PITA to navigate and requires a kind of "registration", and its license is still not a straight GPL.
This is kind of funny because as I read it, the toolset can now be wrapped up in source RPMs and distributed as a standard component of e.g. FC in extras or elsewise without violating any aspect of its license agreement. Doing this (for Duke, but if it is in one of Duke's public repos it is pretty public) is on my list of things to do this week or next. One of the bitches that I and many others have about all of the alternatives is that they are too damn complicated. Many sites -- I won't say most but many -- have very, very simple needs for a scheduler/queuing system. Needs that could be met without requiring the admin to read a 1000 page manual, join a mailing list, work through a really complicated build, and try to figure out several distinct security models and policy models. What is really needed is a fully open source "scheduler lite" that pretty much sets up a simple queue for a simple list of machines with a simple cron-like policy statement, maybe all defined with an XMLish config file that permitted classes of machines (like a bunch that belong to user A) to share a policy. Some people on list (Mark Hahn, e.g.) have IIRC even written their own lightweight schedulers out of sheer pique with this situation. However, I don't know if any of them have been developed to where they are moderately portable and packagable for general use. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From dsimas at imageworks.com Tue Jan 2 16:03:51 2007 From: dsimas at imageworks.com (David Simas) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> Message-ID: <20070103000350.GB19664@kadee.spimageworks.com> On Tue, Jan 02, 2007 at 06:44:50PM -0500, Robert G. 
Brown wrote:
>
> One of the bitches that I and many others have about all of the
> alternatives is that they are too damn complicated. Many sites -- I
> won't say most but many -- have very, very simple needs for a
> scheduler/queuing system. Needs that could be met without requiring the
> admin to read a 1000 page manual, join a mailing list, work through a
> really complicated build, and try to figure out several distinct
> security models and policy models. What is really needed is a fully
> open source "scheduler lite" that pretty much sets up a simple queue for
> a simple list of machines with a simple cron-like policy statement,
> maybe all defined with an XMLish config file that permitted classes of
> machines (like a bunch that belong to user A) to share a policy.

Ruby Queue?

http://raa.ruby-lang.org/project/rq/
http://www.artima.com/rubycs/articles/rubyqueue.html

DGS

From csamuel at vpac.org Tue Jan 2 17:23:43 2007
From: csamuel at vpac.org (Chris Samuel)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] picking out a job scheduler
In-Reply-To:
References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com>
Message-ID: <200701031223.46333.csamuel@vpac.org>

On Wednesday 03 January 2007 08:06, Chris Dagdigian wrote:

> Both should be fine although if you are considering *PBS you should
> look at both Torque (a fork of OpenPBS I think)

That's correct. It and ANU-PBS, another fork, seem to be the de facto queuing systems in the state and national HPC centers down here. Torque is just *so* much better than OpenPBS used to be (not that it was particularly hard).

cheers,
Chris
--
Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
Victorian Partnership for Advanced Computing http://www.vpac.org/
Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070103/27ed5e14/attachment.bin From landman at scalableinformatics.com Tue Jan 2 20:54:03 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] OT: Announcing MPI-HMMER Message-ID: <459B36EB.1060509@scalableinformatics.com> Hi folks: Short OT break. http://code.google.com/p/mpihmmer/ an MPI implementation of HMMer 2.3.2. Back to your regularly scheduled cluster. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From nixon at nsc.liu.se Wed Jan 3 02:54:03 2007 From: nixon at nsc.liu.se (Leif Nixon) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: (Robert G. Brown's message of "Fri, 29 Dec 2006 02:48:04 -0500 (EST)") References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: "Robert G. Brown" writes: > Also, plenty of folks on this list have done just fine running "frozen" > linux distros "as is" for years on cluster nodes. If they aren't broke, > and live behind a firewall so security fixes aren't terribly important, > why fix them? Because your users will get their passwords stolen. If your cluster is accessible remotely, that firewall doesn't really help you very much. The attacker can simply login as a legitimate user and proceed to walk through your wide-open local security holes. But you know this already. 
--
Leif Nixon - Systems expert
------------------------------------------------------------
National Supercomputer Centre - Linkoping University
------------------------------------------------------------

From reuti at staff.uni-marburg.de Wed Jan 3 04:01:26 2007
From: reuti at staff.uni-marburg.de (Reuti)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] picking out a job scheduler
In-Reply-To: <200701031223.46333.csamuel@vpac.org>
References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701031223.46333.csamuel@vpac.org>
Message-ID:

Hi,

On 03.01.2007 at 02:23, Chris Samuel wrote:

> On Wednesday 03 January 2007 08:06, Chris Dagdigian wrote:
>
>> Both should be fine although if you are considering *PBS you should
>> look at both Torque (a fork of OpenPBS I think)

Although I'm somewhat biased towards suggesting SGE, I also check the Torque mailing list from time to time.

> That's correct, it (and ANU-PBS, another fork) seem to be the
> de facto queuing
> systems in the state and national HPC centers down here.

Whether any queuing system is a standard might not matter. More important for choosing one may be the technical points. To compare SGE and Torque, e.g.:

- Do you need support for tightly integrated Linda (I think this will most often mean Gaussian) and PVM parallel jobs? Use SGE.

- Do you have some special nodes inside your cluster, and do you need to specify the resource requests for a parallel job (i.e. the combination of different types of machines you need for it) in a fine-grained manner? Use Torque.

It's of course impossible to know in advance a) the needs of all the applications, and b) all the features of the queuing systems, if you are just starting to look into queuing systems. And I must admit: some years ago I was also shocked by the many pages of the manuals of the queuing systems - we only wanted to submit some jobs at that point in time. Nowadays I see many possible enhancements, which would make the manuals even thicker.
-- Reuti

From glen.beane at jax.org Wed Jan 3 05:24:31 2007
From: glen.beane at jax.org (Glen Beane)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] picking out a job scheduler
Message-ID: <24120637.1167830671803.JavaMail.ocsadmin@jcs-mid-prod.jax.org>

If you are doing mostly MPI, I would strongly recommend TORQUE (a free, open source, OpenPBS fork with *many* enhancements). I would not recommend OpenPBS, as Altair no longer updates it and hasn't for quite some time. TORQUE has great integration with mpich by using mpiexec from Pete at (http://www.osc.edu/~pw/mpiexec/index.php). LAM and OpenMPI have native PBS (and TORQUE) support as well.

Glen L. Beane
The Jackson Laboratory
Software Engineer II
Phone 207-288-6153

From rgb at phy.duke.edu Wed Jan 3 06:51:44 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] Which distro for the cluster?
In-Reply-To:
References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk>
Message-ID:

On Wed, 3 Jan 2007, Leif Nixon wrote:

> "Robert G. Brown" writes:
>
>> Also, plenty of folks on this list have done just fine running "frozen"
>> linux distros "as is" for years on cluster nodes. If they aren't broke,
>> and live behind a firewall so security fixes aren't terribly important,
>> why fix them?
>
> Because your users will get their passwords stolen.
>
> If your cluster is accessible remotely, that firewall doesn't really
> help you very much. The attacker can simply login as a legitimate user
> and proceed to walk through your wide-open local security holes.

So:

a) Our cluster wasn't remotely accessible. In fact, it was on a 192.168 network and in order to even touch it, one had to log in to an up to date, carefully defended desktop workstation login server in the department.
b) If an attacker has compromised a user account on one of these workstations, IMO the security battle is already largely lost. They have a choice of things to attack or further evil they can try to wreak. Attacking the cluster is one of them, and as discussed if the cluster is doing real parallel code it is likely to be quite vulnerable regardless of whether or not its software is up to date because network security is more or less orthogonal to fine-grained code network performance. Still, a cluster is paradoxically one of the best monitored parts of a network. Although it would make a gangbusters DoS platform, network traffic on the cluster, cpu consumption on the cluster, user access to the cluster are all relatively carefully monitored. The cluster installation is likely to be different enough and "odd" enough to make standard rootkit encapsulations fail for anyone but the legendary Ubercracker (who can always do whatever they want anyway, right?;-) In an organization that tightly monitors everything all the time on general security principles (first line of defense, really, as one can NEVER be sure all exploitable holes are closed even with a yum-updated, stable, currently supported distro and human eyes are better at picking up anomalies in system operation than any automated tool) I think it is pretty likely that any attempt to take over a cluster and use it for diabolical ends would be almost instantly detected. BTW, the cluster's servers were not (and I would not advise that servers ever be) running the old distro -- we use a castle keep security model where servers have extremely limited access, are the most tightly monitored, and are kept aggressively up to date on a fully supported distro like Centos. The idea is to give humans more time to detect intruders that have successfully compromised an account at the workstation LAN level and squash them like the nasty little dung beetles that they are. 
FWIW, our department is entirely linux at the server level, and almost entirely linux at the workstation level. A very few experimental groups and individuals run either Windows boxes (usually to be able to use some particular software package) or Macs (because they are, umm, "that kind of user":-). I'm guessing that the ratio is something like 4:1 linux to Win at the workstation level (Macs down there in the noise) and maybe 10:1 linux to win if you include cluster nodes, whatever OS they might be running. Since Seth introduced yup on top of RH (maybe 7-8 years ago? How time flies...), and then proceeded to write yum to replace yup for RPM distros in general, we haven't had a single successful promotion to root in the department. Nothing done locally can prevent some grad student's password from being trapped as they login from some compromised win-based system in their hometown over fall break, but the very few of these that have occurred have been quickly detected and quickly squashed without further compromise. In that same interval, we had a WinXX system compromised and turned into a pile of festering warez rot something like twice a year. Pretty amazing given that they are kept up to date as best as possible and they make up only 10-20% of our total system count. > But you know this already. Oh yeah;-) And we didn't do this "willingly" and aren't that likely to repeat it ourselves. We had some pretty specific reasons to freeze the node distro -- the cluster nodes in question were the damnable Tyan dual Athlon systems that were an incredible PITA to stabilize in the first place (they had multiple firmware bugs and load-based stability issues under the best of circumstances). Once we FINALLY got them set up with a functional kernel and library set so that they wouldn't crash, we were extremely loathe to mess with it. 
So we basically froze it and locked down the nodes so they weren't easily accessible except from inside the department, and then monitored them with xmlsysd and wulfstat in addition to the usual syslog-ng and friends admin tools. Odd usage patterns (that is, almost any sort of running binary that wasn't a well-known numerical task associated with one of the groups, logins by anyone who wasn't a known user) would have been noticed by any of a half-dozen people, one of whom was me, almost immediately. The kernel was "barely stable" as it was and couldn't easily have been replaced with a hacker kernel (to e.g. erase /proc trace) without a VERY high probability that the hacker kernel would crash the system and reveal the hacker on the first try. xmlsysd reads all sorts of stuff from all over /proc and was custom code that I was working on and periodically updating, even while Seth was working on yum and updating THAT. Somebody would have had to literally custom craft some very advanced C code to stay hidden on the cluster and even then would have been revealed by e.g. an update of xmlsysd unless they were a bit beyond even Ubercracker status. In general, though, it is very good advice to stay with an updated OS. My real point was that WITH yum and a bit of prototyping once every 12-24 months, it is really pretty easy to ride the FC wave on MANY clusters, where the tradeoff is better support for new hardware and more advanced/newer libraries against any library issues that one may or may not encounter depending on just what the cluster is doing. Freezing FC (or anything else) long past its support boundary is obviously less desireable. However, it is also often unnecessary. 
On clusters that add new hardware, usually bleeding edge, every four to six months as research groups hit grant year boundaries and buy their next bolus of nodes, FC really does make sense as Centos probably won't "work" on those nodes in some important way and you'll be stuck backporting kernels or worse on top of your key libraries e.g. the GSL. Just upgrade FC regularly across the cluster, probably on an "every other release" schedule like the one we use. On clusters (or sub-clusters) with a 3 year replacement cycle, Centos or other stable equivalent is a no-brainer -- as long as it installs on your nodes in the first place (recall my previous comment about the "stars needing to be right" to install RHEL/Centos -- the latest release has to support the hardware you're buying) you're good to go indefinitely, with the warm fuzzy knowledge that your nodes will update from a "supported" repo most of their 3+ year lifetime, although for the bulk of that time the distro will de-facto be frozen except for whatever YOU choose to backport and maintain. And really, there isn't much stopping folks from adopting a range of "mixed" strategies -- running FC-whatever on new nodes for a year or whatever as needed in order to support their hardware or use new libraries, then reinstalling them with Centos/RHEL (which is basically FC-even-current-at-release-time frozen and supported or so it seems recently anyway) as Centos support catches up with the hardware by syncing with an FC-current on a new release. Nowadays, with PXE/Kickstart/Yum (or Debian equivalents, or the OS of your choice with warewulf, or...) reinstalling OR upgrading a cluster node is such a non-event in terms of sysadmin time and effort that it can pretty much be done at will. 
Except for pathological cases (like the Tyans) we're talking at most a few days of sysadmin time to set up a prototyping node or four, flash over to the new distro via a discrete node reboot (unattended automated reinstall or a new node diskless image), and let selected users whack on it for a week or two. If it proves invisibly stable and satisfactory -- the rule rather than the exception -- crank it on up across the cluster. Even if it "fails" on some untested pathway after you do this, it costs you at most a reboot (again to a reinstall/replacement of a node image) to put things back as they were while you fix things. The worst thing that such a strategy might require is a rebuild of user applications for both distros, but with shared libraries to my own admittedly anecdotal experience this "usually" isn't needed going from older to newer (that is, an older Centos built binary will "probably" still work on a current FC node, although this obviously depends on the precise libraries it uses and how rapidly they are changing). It's a bit harder to take binaries from newer to older, especially in packaged form. There you almost certainly need an rpmbuild --rebuild and a bit of luck. Truthfully, cluster installation and administration has never been simpler. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Wed Jan 3 06:59:47 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <24120637.1167830671803.JavaMail.ocsadmin@jcs-mid-prod.jax.org> References: <24120637.1167830671803.JavaMail.ocsadmin@jcs-mid-prod.jax.org> Message-ID: On Wed, 3 Jan 2007, Glen Beane wrote: > If you are doing mostly MPI, I would strongly reccoment TORQUE (a > free, open source, OpenPBS fork with *many* enhancements). 
I would not > reccoment OpenPBS, as Altair no longer updates it and hasn't for quite > some time. TORQUE has great integration with mpich by using mpiexec > from Pete at (http://www.osc.edu/~pw/mpiexec/index.php). LAM and > OpenMPI have native PBS (and TORQUE) support as well. FWIW (and to me it is worth a lot:-) torque appears to be in FC 6 extras, ready to install and run. This may or may not mean that FC is being used as (one of) its primary development/maintenance platform(s) -- this is often the case. I'll have to give it a try along with condor and yes, ruby queue. We're more interested in using it for simple EP job distribution, though. Not so many people here do MPI or PVM computations, it is more parallel simulations or parametric explorations. rgb > > Glen L. Beane > The Jackson Laboratory > Software Engineer II > Phone 207-288-6153 > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From hahn at physics.mcmaster.ca Wed Jan 3 07:53:35 2007 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's) In-Reply-To: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> References: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> Message-ID: > " p1_8544: p4_error: Timeout in Establishing connection to remote process: > 0 " > rm_l_1_8667: (359.417969) net_send: could not write to fd=5, errno=104 > > We have been trying the same for the past two days and we didnt get any > solution for the above. but what have you tried? I would guess that this is a simple rsh config problem, nothing to do with mpich. 
> Also we downloaded the Latest MPICH 1.2.7p1 and configured the same. now for but why do you think the problem lies with mpich? > The same testing with LAM/MPI and OPENMPI are working fine. openmpi being mostly just a later version of lam, and I think inheriting lam's agent-based process-starting, no? personally, I'm pretty convinced that MPI implementations should stay out of the jobstarter business, and go with straight agentless (ssh-based) job spawning. From hahn at physics.mcmaster.ca Wed Jan 3 08:05:00 2007 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] SW Giaga, what kind? In-Reply-To: <1bef2ce30612300153o4d1ae055n1374b976e846d258@mail.gmail.com> References: <1bef2ce30612272322p4a0d1807m3c6d9ea615f58873@mail.gmail.com> <4594E95E.6060102@streamline-computing.com> <4594F4B3.4070602@gmail.com> <4594FB6D.6070402@streamline-computing.com> <45957713.2080001@gmail.com> <1bef2ce30612300153o4d1ae055n1374b976e846d258@mail.gmail.com> Message-ID: > But, originally, my question was about the quality and reliability of the > brand of *LevelOne* SW (Unmanaged, Gigabit ports), in comparison to its I've never heard "level one" used in this context. the closest would be "layer 2", which refers to mac-based switching, and might be what you mean. > fairly low price, on one hand, and the brand of *3COM* SW (Unmanaged, > Gigabit ports) on the other hand. 3com is not an exceptional switch vendor, except in the historic sense. they make OK stuff, but I don't think I'd give them special credit against any of the well-known brands (dlink, netgear for commodity, hp-procurve, cisco and some others for higher-end, enterpriseish stuff.) > The number of nodes in our initial plan is 6 nodes, AMD DualCore, desktop > type systems. a dime-store, no-name 8-port gigabit switch would serve just fine. at gigabit latencies (with a normal stack, etc: ~50 us), internal details of the switch are basically irrelevant.
small switches like this are probably single-chip, giving appliance-like reliability, insensitivity to the name on the case, and probably line speed. From malallen at indiana.edu Wed Jan 3 08:52:40 2007 From: malallen at indiana.edu (Matt Allen) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's) In-Reply-To: References: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> Message-ID: <459BDF58.9010205@indiana.edu> Mark Hahn wrote: > personally, I'm pretty convinced that MPI implementations should stay > out of the jobstarter business, and go with straight agentless (ssh-based) > job spawning. I'm curious about your reasoning, Mark. We've had nightmare situations for years with ssh-based job spawning. The most common case is where sshd processes terminate on nodes without the child mpi processes exiting. Then we have orphaned mpi processes, owned by init, scattered throughout the cluster. If any of these processes are using limited resources (like Myrinet adapters), subsequent jobs can (more likely, will) exit immediately upon dispatch to the node. We've found ways around this with prolog/epilog scripts, and scheduling policy, but the slickest solutions so far, in my opinion, have been mpiexec (admittedly not part of an MPI implementation) and lam/openmpi. Allowing the resource manager to completely handle job spawning has provided better post-job cleanup, and more complete job statistics (cpu-time, mostly) for us. Do you not have to deal with these sorts of issues? If not, lay some wisdom on me; I could use it. Matt -- Matt Allen | Systems Analyst malallen@indiana.edu | Research and Technical Services 812-855-7318 | Indiana University From mathog at caltech.edu Wed Jan 3 09:27:04 2007 From: mathog at caltech.edu (David Mathog) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] RE: OT: Announcing MPI-HMMER Message-ID: > Short OT break.
http://code.google.com/p/mpihmmer/ an MPI > implementation of HMMer 2.3.2. There's also my PVM version of 2.3.2 from 2003/2004, with a few minor fixes since then rolled up into the latest distribution: ftp://saf.bio.caltech.edu/pub/software/molbio/parallelhmmer.tar.gz Joe and I apparently exist in parallel software universes ;-). Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ed at eh3.com Wed Jan 3 09:30:13 2007 From: ed at eh3.com (Ed Hill) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: References: <24120637.1167830671803.JavaMail.ocsadmin@jcs-mid-prod.jax.org> Message-ID: <20070103123013.50def508@ernie> On Wed, 3 Jan 2007 09:59:47 -0500 (EST) "Robert G. Brown" wrote: > On Wed, 3 Jan 2007, Glen Beane wrote: > > > If you are doing mostly MPI, I would strongly recommend TORQUE (a > > free, open source, OpenPBS fork with *many* enhancements). I would > > not recommend OpenPBS, as Altair no longer updates it and hasn't > > for quite some time. TORQUE has great integration with mpich by > > using mpiexec from Pete at > > (http://www.osc.edu/~pw/mpiexec/index.php). LAM and OpenMPI have > > native PBS (and TORQUE) support as well. > > FWIW (and to me it is worth a lot:-) torque appears to be in FC 6 > extras, ready to install and run. This may or may not mean that FC is > being used as (one of) its primary development/maintenance platform(s) > -- this is often the case. I'll have to give it a try along with > condor and yes, ruby queue. TORQUE packages have been available in Fedora Extras since April 2006. Since then, versions have been built and pushed for Fedora Extras 3, 4, 5, and 6. And if you run into any problems with the Fedora torque packages then please create a Fedora bugzilla entry! Ed -- Edward H. Hill III, PhD | ed@eh3.com | http://eh3.com/ -------------- next part -------------- A non-text attachment was scrubbed...
Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070103/ac8d28c2/signature.bin From landman at scalableinformatics.com Wed Jan 3 10:29:24 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] RE: OT: Announcing MPI-HMMER In-Reply-To: References: Message-ID: <459BF604.1090006@scalableinformatics.com> David Mathog wrote: >> Short OT break. http://code.google.com/p/mpihmmer/ an MPI >> implementation of HMMer 2.3.2. > > > There's also my PVM version of 2.3.2 from 2003/2004, with a few > minor fixes since then rolled up into the lastest distribution: > > ftp://saf.bio.caltech.edu/pub/software/molbio/parallelhmmer.tar.gz > > Joe and I aparently exist in parallel software universes ;-). Heh... :) Will pull it down and look. My fault David, I was not aware of this, or it never mentally clicked. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From ntmoore at gmail.com Tue Jan 2 20:55:10 2007 From: ntmoore at gmail.com (Nathan Moore) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <200701031223.46333.csamuel@vpac.org> References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701031223.46333.csamuel@vpac.org> Message-ID: <9988D8B1-94CF-4C93-AD1D-54274993C00C@gmail.com> Torque was really easy to install, but it seems like my /etc/hosts file must be screwed up, as I can't get the cluster nodes to respond. 
Specifically, within a cluster of 3 machines, each having an /etc/hosts file of:

    127.0.0.1       localhost.localdomain localhost
    199.17.152.17   runner
    199.17.152.135  muscovey
    199.17.152.13   pekin
    (( other workstations follow ))

Now, when I have the pbs_server running on runner, and the pbs_mom daemons running on muscovey, pekin, and runner, I get the following status message,

    [root@runner torque-2.1.6]# pbsnodes -a
    pekin
        state = down
        np = 1
        ntype = cluster
    muscovey
        state = down
        np = 1
        ntype = cluster
    runner
        state = down
        np = 1
        ntype = cluster

I realize this is a pretty low-level question, but what the heck is wrong with my /etc/hosts file?

regards, NT

ps, the troubleshooting message given by torque is,

    [root@runner torque-2.1.6]# momctl -d 3
    Host: runner/runner Version: 2.1.6
    WARNING: server not specified (set $pbsserver)
    PID: 30531
    HomeDirectory: /var/spool/torque/mom_priv
    MOM active: 2518 seconds
    Server Update Interval: 45 seconds
    LOGLEVEL: 0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model: RPP
    TCP Timeout: 20 seconds
    NOTE: no prolog configured
    Alarm Time: 0 of 10 seconds
    Trusted Client List: 199.17.152.17,127.0.0.1
    Configured to use /usr/bin/scp -rpB
    NOTE: no local jobs detected
    diagnostics complete

- - - - - - - - - - - - - - - - - - - - - - - Nathan Moore Physics Winona State University nmoore@winona.edu AIM:nmoorewsu - - - - - - - - - - - - - - - - - - - - - - -

On Jan 2, 2007, at 7:23 PM, Chris Samuel wrote: On Wednesday 03 January 2007 08:06, Chris Dagdigian wrote: > Both should be fine although if you are considering *PBS you should > look at both Torque (a fork of OpenPBS I think) That's correct, it (and ANU-PBS, another fork) seem to be the de facto queuing systems in the state and national HPC centers down here. Torque is just *so* much better than OpenPBS used to be (not that it was particularly hard).
cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http:// www.beowulf.org/mailman/listinfo/beowulf From m.janssens at opencfd.co.uk Wed Jan 3 04:08:31 2007 From: m.janssens at opencfd.co.uk (Mattijs Janssens) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] cluster trips power switch In-Reply-To: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> References: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> Message-ID: <200701031208.31933.m.janssens@opencfd.co.uk> When I switch off our small (16 node) cluster it trips the power switch. Guess there is a temporary power surge. Are there any devices (line conditioners?) that will prevent this? Experiences? Mattijs -- Mattijs Janssens From wharman at prism.net Wed Jan 3 09:59:11 2007 From: wharman at prism.net (wharman@prism.net) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <20070103123013.50def508@ernie> References: <24120637.1167830671803.JavaMail.ocsadmin@jcs-mid-prod.jax.org> <20070103123013.50def508@ernie> Message-ID: <52586.10.238.10.70.1167847151.webmail@10.238.10.70> or download TORQUE: http://www.clusterresources.com/pages/products.php -Bill -----Original Message----- From: "Ed Hill" Sent: Wednesday, January 3, 2007 12:30 pm To: "Robert G. Brown" Cc: "beowulf@beowulf.org" Subject: Re: [Beowulf] picking out a job scheduler On Wed, 3 Jan 2007 09:59:47 -0500 (EST) "Robert G. Brown" wrote: > On Wed, 3 Jan 2007, Glen Beane wrote: > > > If you are doing mostly MPI, I would strongly reccoment TORQUE (a > > free, open source, OpenPBS fork with *many* enhancements). 
I would > > not recommend OpenPBS, as Altair no longer updates it and hasn't > > for quite some time. TORQUE has great integration with mpich by > > using mpiexec from Pete at > > (http://www.osc.edu/~pw/mpiexec/index.php). LAM and OpenMPI have > > native PBS (and TORQUE) support as well. > > FWIW (and to me it is worth a lot:-) torque appears to be in FC 6 > extras, ready to install and run. This may or may not mean that FC is > being used as (one of) its primary development/maintenance platform(s) > -- this is often the case. I'll have to give it a try along with > condor and yes, ruby queue. TORQUE packages have been available in Fedora Extras since April 2006. Since then, versions have been built and pushed for Fedora Extras 3, 4, 5, and 6. And if you run into any problems with the Fedora torque packages then please create a Fedora bugzilla entry! Ed -- Edward H. Hill III, PhD | ed@eh3.com | http://eh3.com/ _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From csamuel at vpac.org Wed Jan 3 14:39:36 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <9988D8B1-94CF-4C93-AD1D-54274993C00C@gmail.com> References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701031223.46333.csamuel@vpac.org> <9988D8B1-94CF-4C93-AD1D-54274993C00C@gmail.com> Message-ID: <200701040939.36395.csamuel@vpac.org> On Wednesday 03 January 2007 15:55, Nathan Moore wrote: > WARNING: server not specified (set $pbsserver) This has already been answered on the Torque list, but for the folks on the Beowulf list this was the issue. cheers!
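For readers following along, the warning above points at the pbs_mom configuration rather than /etc/hosts. What follows is only a rough sketch of the sort of fix implied, assuming the mom reads a file named config under the HomeDirectory that momctl reported (/var/spool/torque/mom_priv) and that "runner" is the host running pbs_server. The function name set_pbsserver and the MOM_PRIV and PBS_SERVER_HOST variables are made up for illustration, and MOM_PRIV defaults to a scratch directory so the snippet can be tried safely.

```shell
#!/bin/sh
# Sketch only: point pbs_mom at the host running pbs_server.
# MOM_PRIV defaults to a scratch directory; on a real node it would be the
# HomeDirectory reported by momctl (/var/spool/torque/mom_priv in the post).
MOM_PRIV="${MOM_PRIV:-$(mktemp -d)}"
PBS_SERVER_HOST="${PBS_SERVER_HOST:-runner}"   # head node name from the thread

set_pbsserver() {
    # $1 = mom_priv directory, $2 = server host name (hypothetical helper)
    conf="$1/config"
    touch "$conf"
    # Drop any existing $pbsserver line, then append the new one.
    grep -v '^\$pbsserver' "$conf" > "$conf.tmp" || true
    printf '$pbsserver %s\n' "$2" >> "$conf.tmp"
    mv "$conf.tmp" "$conf"
}

set_pbsserver "$MOM_PRIV" "$PBS_SERVER_HOST"
cat "$MOM_PRIV/config"
```

On a real cluster the same line would be needed on every node running pbs_mom, and the daemon would presumably have to be restarted before pbsnodes -a reports the nodes as anything other than down.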
Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From csamuel at vpac.org Wed Jan 3 14:50:06 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701031223.46333.csamuel@vpac.org> Message-ID: <200701040950.06887.csamuel@vpac.org> On Wednesday 03 January 2007 23:01, Reuti wrote: > - Do you need support for Tight Integrated Linda (I think this will > most often mean Gaussian) (and PVM) parallel jobs: use SGE Interesting, why so? I know a number of sites around Australia (including a 1900+ CPU cluster) run Gaussian using PBS (I don't know how much pain, if any, they went through for that but my understanding is that anything else that involves Gaussian involves pain and many dead chickens). My memory is that the one time I've had to set up a PVM dependent application (TGICL) it wasn't particularly hard to get going. Mind you PVM seems pretty dead, TGICL was the only application we've ever had requested that needed it and that was a couple of years ago and was only for a couple of weeks work. cheers! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070104/8344377a/attachment.bin From csamuel at vpac.org Wed Jan 3 14:59:58 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] running MPICH on AMD Opteron Dual Core Processor Cluster( 72 Cpu's) In-Reply-To: References: <9fe360270612290226yb9e3ccbua77a1febf4123fc6@mail.gmail.com> Message-ID: <200701040959.58626.csamuel@vpac.org> On Thursday 04 January 2007 02:53, Mark Hahn wrote: > personally, I'm pretty convinced that MPI implementations should stay > out of the jobstarter business, and go with straight agentless (ssh-based) > job spawning. Noooooo... please not ssh again, make the pain go away! Seriously though, this is what the PBS TM interface is for (not used SGE, so I don't know if it has a similar interface, I'd be surprised if it didn't).. The TM interface is important as it means Torque can keep a close beady eye on the MPI processes spawned and kill off the processes when needed (which all too often get left behind otherwise and need hacks like epilogue scripts to fix). It also stops users changing their previous 32 CPU job script to ask for 4 CPUs in the queue and then forgetting to change the -np parameter for mpirun as well. Nodes don't like that sort of load. cheers! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070104/6866f1bf/attachment.bin From csamuel at vpac.org Wed Jan 3 15:23:10 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] RE: OT: Announcing MPI-HMMER In-Reply-To: References: Message-ID: <200701041023.10843.csamuel@vpac.org> On Thursday 04 January 2007 04:27, David Mathog wrote: > Joe and I aparently exist in parallel software universes ;-). Being MPI means it can take advantage of high speed interconnects (e.g. building it with MPICH-GM to use native Myrinet). Of course whether that would benefit HMMER is something I don't know! cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070104/593a8923/attachment.bin From landman at scalableinformatics.com Wed Jan 3 17:48:20 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] RE: OT: Announcing MPI-HMMER In-Reply-To: <200701041023.10843.csamuel@vpac.org> References: <200701041023.10843.csamuel@vpac.org> Message-ID: <459C5CE4.1040005@scalableinformatics.com> Chris Samuel wrote: > On Thursday 04 January 2007 04:27, David Mathog wrote: > >> Joe and I aparently exist in parallel software universes ;-). > > Being MPI means it can take advantage of high speed interconnects (e.g. > building it with MPICH-GM to use native Myrinet). Of course whether that > would benefit HMMER is something I don't know! It does. The code does very nicely with low latency (and low OS jitter). 
XD1's, while discontinued, are sweet boxes. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From reuti at staff.uni-marburg.de Thu Jan 4 03:16:07 2007 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <200701040950.06887.csamuel@vpac.org> References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701031223.46333.csamuel@vpac.org> <200701040950.06887.csamuel@vpac.org> Message-ID: Am 03.01.2007 um 23:50 schrieb Chris Samuel: > On Wednesday 03 January 2007 23:01, Reuti wrote: > >> - Do you need support for Tight Integrated Linda (I think this will >> most often mean Gaussian) (and PVM) parallel jobs: use SGE > > Interesting, why so ? I know a number of sites around Australia > (including a > 1900+ CPU cluster) run Gaussian using PBS (I don't know how much > pain, if > any, they went through for that but my understanding is that > anything else > that involves Gaussian involves pain and many dead chickens). Linda and PVM* need some kind of rsh/ssh between the nodes, and I didn't get a clue up to now to convince Linda to use the PBS TM of Torque. As you mentioned in your other post about keeping control of MPI processes, the similar thing to TM is the qrsh command in SGE, which will replace rsh/ssh and SGE is controlling this way these spawned processes on the nodes. I'm also always looking in a cluster setup, without any common rsh/ssh between the nodes at all, where users could by accident start processes out of control of the queuing system on the nodes. -- Reuti * I'm aware, that PVM can be started interactively without any rsh/ ssh between the nodes, by supplying some strings to the daemons and its response back to the startup process. 
> > My memory is that the one time I've had to set up a PVM dependant > application > (TGICL) it wasn't particularly hard to get going. Mind you PVM > seems pretty > dead, TGICL was the only application we've ever had requested that > needed it > and that was a couple of years ago and was only for a couple of > weeks work. > > cheers! > Chris > -- > Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager > Victorian Partnership for Advanced Computing http://www.vpac.org/ > Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Thu Jan 4 08:54:44 2007 From: mathog at caltech.edu (David Mathog) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] RE: OT: Announcing MPI-HMMER Message-ID: > There's also my PVM version of 2.3.2 from 2003/2004 I ought to clarify that slightly. Sean Eddy's original code had a PVM mode. My variant retained the original but defaults to a new "database sliced" mode - where each query runs on every node against a fraction of the database. (Similar to what Joe's MPI-BLAST and my PVM parallel BLAST do.) To support that hmmfetch was also modified to work over PVM, which wasn't required in SE's original since the master node always had a complete database, so there was no need to fetch entries back from the database slices on the compute nodes. Now back to your regularly scheduled topics... 
Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From robl at mcs.anl.gov Thu Jan 4 09:46:11 2007 From: robl at mcs.anl.gov (Robert Latham) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] mpiJava + MPICH In-Reply-To: <45929555.5090108@duke.edu> References: <45929555.5090108@duke.edu> Message-ID: <20070104174610.GY24143@mcs.anl.gov> On Wed, Dec 27, 2006 at 10:46:29AM -0500, Sean Dilda wrote: > I'm working on setting up mpiJava for a cluster user. I'm compiling it > against Sun's Java 1.5.0 and MPICH 1.2.5, on a cluster running CentOS 4 Is mpiJava dependent on a specific version of MPI? Not only is MPICH-1.2.5 a quite old version of the MPICH release, the entire series has been superseded by MPICH2 (unless you require heterogeneous support). Does mpijava work with MPICH2-1.0.5? > I've also noticed that when a normal MPI program runs, the process tree > shows mpirun, which has a child of your program. That child of mpirun > then has a child that's your program running locally, and a bunch of > children that are all the 'rsh' command for launching remote copies. > Whenever I run a mpiJava program, the only thing in the process tree is > mpirun and a single child of mpirun. > > Has anyone run across this, or have any ideas of what I could do to fix > this problem? The job management in MPICH is a little hairy. Things have been cleaned up somewhat in MPICH2. For example, running MPICH2 programs under valgrind is a lot easier than doing so with MPICH1-based programs. Perhaps the same will hold for mpijava, though I know nothing about that project.
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From csamuel at vpac.org Thu Jan 4 16:09:52 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701040950.06887.csamuel@vpac.org> Message-ID: <200701051109.56867.csamuel@vpac.org> On Thursday 04 January 2007 22:16, Reuti wrote: > Linda and PVM* need some kind of rsh/ssh between the nodes, and I > didn't get a clue up to now to convince Linda to use the PBS TM of > Torque. Torque provides a pbsdsh command that uses the TM interface and acts like the various DSH variants. What it doesn't appear to be able to do (which I've just discovered) is to be able to only run once per node in the job. Hmm.. > As you mentioned in your other post about keeping control of > MPI processes, the similar thing to TM is the qrsh command in SGE, > which will replace rsh/ssh and SGE is controlling this way these > spawned processes on the nodes. Sounds very similar to pbsdsh in the way it works. > I'm also always looking in a cluster setup, without any common rsh/ssh > between the nodes at all, where users could by accident start processes out > of control of the queuing system on the nodes. Exactly. What we do here is a hack in the /etc/profile that checks for the existence of $PBS_ENVIRONMENT and kicks them off with a message about only being permitted to access the node if you have a job on it. Ugly, but it works. Newer versions of Torque have a PAM module contributed by Jim Prewett which will check the user against the current list of Torque jobs on a node and only permit access if they have a job on the node. We prefer to only allow access via PBS jobs which is why we still use our hack, but the PAM module might be a handy backstop for us. cheers!
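The /etc/profile hack described above is not shown in the post, so what follows is only a guess at its general shape, not the actual VPAC script. It assumes that $PBS_ENVIRONMENT is non-empty inside a job and empty for a plain login; the function name pbs_login_allowed, the root exemption and the wording of the message are all invented for illustration.

```shell
#!/bin/sh
# Sketch of a login guard in the spirit of the /etc/profile hack described
# above; the function name, the root exemption and the message are assumptions.

pbs_login_allowed() {
    # $1 = value of PBS_ENVIRONMENT (empty outside a job), $2 = numeric uid
    [ -n "$1" ] && return 0      # shell was started inside a PBS job: allow
    [ "$2" -eq 0 ] && return 0   # let root in for administration
    return 1                     # plain login with no job context: refuse
}

# In /etc/profile on a compute node this might be wired up as:
#   if ! pbs_login_allowed "${PBS_ENVIRONMENT:-}" "$(id -u)"; then
#       echo "Access to this node requires a running job on it." >&2
#       exit 1
#   fi
```

Keeping the decision in a small function makes it easy to check by hand before it is allowed anywhere near a login path, since a mistake in /etc/profile can lock everyone out of a node.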
Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070105/877e54aa/attachment.bin From reuti at staff.uni-marburg.de Fri Jan 5 04:58:28 2007 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] picking out a job scheduler In-Reply-To: <200701051109.56867.csamuel@vpac.org> References: <58106678-33A8-40D7-BA1E-4DF128F1A7FC@gmail.com> <200701040950.06887.csamuel@vpac.org> <200701051109.56867.csamuel@vpac.org> Message-ID: <440DA641-65DA-4B9D-B975-F440B4851367@staff.uni-marburg.de> Am 05.01.2007 um 01:09 schrieb Chris Samuel: > On Thursday 04 January 2007 22:16, Reuti wrote: > >> Linda and PVM* need some kind of rsh/ssh between the nodes, and I >> didn't get a clue up to now to convince Linda to use the PBS TM of >> Torque. > > Torque provides a pbsdsh command that uses the TM interface and > acts like the > various DSH variants. What it doesn't appear to be able to do > (which I've > just discovered) is to be able to only run once per node in the > job. Hmm.. You can run it once per node with the -n option. Trying to simulate rsh would simply mean to map the hostname of the requested machine to an index in the list of granted machines - no big deal. The bigger problem seems to be, that there is no real environment on the nodes where the slave tasks are started. I.e. no environment variables set. -- Reuti >> As you mentioned in your other post about keeping control of >> MPI processes, the similar thing to TM is the qrsh command in SGE, >> which will replace rsh/ssh and SGE is controlling this way these >> spawned processes on the nodes. 
> > Sounds very similar to pbsdsh in the way it works. > >> I'm also always looking in a cluster setup, without any common rsh/ >> ssh >> between the nodes at all, where users could by accident start >> processes out >> of control of the queuing system on the nodes. > > Exactly. What we do here is a hack in the /etc/profile that checks > for the > existence of $PBS_ENVIRONMENT and kicks them off with a message > about only > being permitted to access the node if you have a job on it. Ugly, > but it > works. > > Newer versions of Torque have a PAM module contributed by Jim > Prewett which > will check the user against the current list of Torque jobs on a > node and > only permit access if they have a job on the node. > > We prefer to only allow access via a PBS jobs which is why we still > use our > hack, but the PAM module might be a handy backstop for us. > > cheers! > Chris > -- > Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager > Victorian Partnership for Advanced Computing http://www.vpac.org/ > Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From amacater at galactic.demon.co.uk Sun Jan 7 03:22:30 2007 From: amacater at galactic.demon.co.uk (Andrew M.A. Cater) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> Message-ID: <20070107112230.GA7654@galactic.demon.co.uk> On Wed, Jan 03, 2007 at 09:51:44AM -0500, Robert G. Brown wrote: > On Wed, 3 Jan 2007, Leif Nixon wrote: > > >"Robert G. 
Brown" writes: > > > b) If an attacker has compromised a user account on one of these > workstations, IMO the security battle is already largely lost. They > have a choice of things to attack or further evil they can try to wreak. > Attacking the cluster is one of them, and as discussed if the cluster is > doing real parallel code it is likely to be quite vulnerable regardless > of whether or not its software is up to date because network security is > more or less orthogonal to fine-grained code network performance. > Amen, brother :) > > BTW, the cluster's servers were not (and I would not advise that servers > ever be) running the old distro -- we use a castle keep security model > where servers have extremely limited access, are the most tightly > monitored, and are kept aggressively up to date on a fully supported > distro like Centos. The idea is to give humans more time to detect > intruders that have successfully compromised an account at the > workstation LAN level and squash them like the nasty little dung beetles > that they are. > Can I quote you for Security 101 when I need to explain this stuff for senior management ? > > And we didn't do this "willingly" and aren't that likely to repeat it > ourselves. We had some pretty specific reasons to freeze the node > distro -- the cluster nodes in question were the damnable Tyan dual > Athlon systems that were an incredible PITA to stabilize in the first > place (they had multiple firmware bugs and load-based stability issues > under the best of circumstances). Once we FINALLY got them set up with > a functional kernel and library set so that they wouldn't crash, we were > extremely loath to mess with it. So we basically froze it and locked > down the nodes so they weren't easily accessible except from inside the > department, and then monitored them with xmlsysd and wulfstat in > addition to the usual syslog-ng and friends admin tools. > It is _always_ worth browsing the archives of this list.
Somebody, somewhere has inevitably already seen it/done it/got the scars and is able to explain stuff lucidly. I can't recommend this list highly enough both for its high signal/noise ratio and its smart people [rgb 1-8 inclusive, for example]

>
> In general, though, it is very good advice to stay with an updated OS.
> My real point was that WITH yum and a bit of prototyping once every
> 12-24 months, it is really pretty easy to ride the FC wave on MANY
> clusters, where the tradeoff is better support for new hardware and more
> advanced/newer libraries against any library issues that one may or may
> not encounter depending on just what the cluster is doing. Freezing FC
> (or anything else) long past its support boundary is obviously less
> desirable. However, it is also often unnecessary.
>

Fedora Legacy just closed its doors - if you take a couple of months to get your Uebercluster up and running, you're 1/3 of the way through your FC cycle :( It doesn't square. Fedora looks set to lose its way again for Fedora 7 as they merge Fedora Core and Extras and grow to n-000 packages again - the fast upgrade cycle, lack of maintainers and lack of structure do not bode well. They're apparently moving to a 13 month upgrade cycle - so your Fedora odd releases could well be three years apart. The answer is to take a stable distribution, install the minimum and work with it OR build your own custom infrastructure as far as I can see. Neither Red Hat nor Novell are cluster-aware in any detail - they'll support their install and base programs but don't have the depth of expertise to go further :(

> On clusters that add new hardware, usually bleeding edge, every four to
> six months as research groups hit grant year boundaries and buy their
> next bolus of nodes, FC really does make sense as Centos probably won't
> "work" on those nodes in some important way and you'll be stuck
> backporting kernels or worse on top of your key libraries e.g. the GSL.
> Just upgrade FC regularly across the cluster, probably on an "every
> other release" schedule like the one we use.
>

Chances are that anything Red Hat Enterprise based just won't work. New hardware is always hard.

> On clusters (or sub-clusters) with a 3 year replacement cycle, Centos or
> other stable equivalent is a no-brainer -- as long as it installs on
> your nodes in the first place (recall my previous comment about the
> "stars needing to be right" to install RHEL/Centos -- the latest release
> has to support the hardware you're buying) you're good to go
> indefinitely, with the warm fuzzy knowledge that your nodes will update
> from a "supported" repo most of their 3+ year lifetime, although for the
> bulk of that time the distro will de-facto be frozen except for whatever
> YOU choose to backport and maintain.
>

Absolutely.

>
> Nowadays, with PXE/Kickstart/Yum (or Debian equivalents, or the OS of
> your choice with warewulf, or...) reinstalling OR upgrading a cluster
> node is such a non-event in terms of sysadmin time and effort that it
> can pretty much be done at will.

I've had the pleasure/pain of watching cluster admins from a distance as they worked on a fully commercial cluster from major vendors. For most on this list, it's a no-brainer. I wish I had seen the same.

> The worst thing that such a strategy might require is a rebuild of user
> applications for both distros, but with shared libraries to my own
> admittedly anecdotal experience this "usually" isn't needed going from
> older to newer (that is, an older Centos built binary will "probably"
> still work on a current FC node, although this obviously depends on the
> precise libraries it uses and how rapidly they are changing). It's a
> bit harder to take binaries from newer to older, especially in packaged
> form. There you almost certainly need an rpmbuild --rebuild and a bit
> of luck.
>

I use Debian - I've never had to learn about more than one repository and one distribution for everything I need. What is this "rebuilding" of which you speak :)

> Truthfully, cluster installation and administration has never been
> simpler.
>

I think you underestimate your expertise - and the expertise on this list. My mantra is that cluster administration should be simple and straightforward: in reality, it's seldom so.

> rgb

Andy

From ed at eh3.com Sun Jan 7 05:50:21 2007
From: ed at eh3.com (Ed Hill)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] Which distro for the cluster?
In-Reply-To: <20070107112230.GA7654@galactic.demon.co.uk>
References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk>
Message-ID: <20070107085021.6af6efab@ernie>

On Sun, 7 Jan 2007 11:22:30 +0000 amacater@galactic.demon.co.uk (Andrew M.A. Cater) wrote:
>
> Fedora Legacy just closed its doors - if you take a couple of months
> to get your Uebercluster up and running, you're 1/3 of the way
> through your FC cycle :( It doesn't square.

Speaking of which, I recently built a tiny (<20 nodes) cluster using:

  Sun X2200 (2X Opteron 2210)
  InfiniBand
  Fedora Core 6 for x86_64

and it was remarkably easy to get MPI working with IB using OpenMPI, libibverbs, and libmthca (the latter two being available in Fedora Extras and installed with 'yum install ...').

I can certainly appreciate how long it takes to build medium to large clusters for larger and more diverse types of users. But I don't see why there is a pressing need to upgrade the compute nodes as soon as a particular Fedora release is no longer current. If your setup is working then it's working -- it's no less valid just because your Fedora version is one (or perhaps even three) behind the latest.
As others have thoughtfully described, cluster security is typically a gateway or other "choke point" that mostly divorces it from the actual compute nodes. So, once you have things working nicely, you should have vanishingly few reasons to waste time chasing after the latest 'n greatest distro version.

At least, that's my view... :-)

Ed

--
Edward H. Hill III, PhD | ed@eh3.com | http://eh3.com/

From landman at scalableinformatics.com Sun Jan 7 07:06:17 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Fri May 9 01:05:34 2008
Subject: [Beowulf] Which distro for the cluster?
In-Reply-To: <20070107112230.GA7654@galactic.demon.co.uk>
References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk>
Message-ID: <45A10C69.4030908@scalableinformatics.com>

Andrew M.A. Cater wrote:
> On Wed, Jan 03, 2007 at 09:51:44AM -0500, Robert G. Brown wrote:
>> On Wed, 3 Jan 2007, Leif Nixon wrote:
>>
>>> "Robert G. Brown" writes:
>>>
>> b) If an attacker has compromised a user account on one of these
>> workstations, IMO the security battle is already largely lost. They

s/largely/completely/g

At least for this user, if they have single factor passwordless login set up between workstation and cluster. Of course if they are using a malware-ridden, keylogger hosting machine, they have ... uh ... somewhat worse things to deal with than just their accounts on the cluster being open to attack.

The solution to this is simple. Never let this happen. Which means, don't use a system which is significantly vulnerable to malware or keylogger insertion.
It is left as an exercise to the reader to figure out which platforms are more vulnerable. >> have a choice of things to attack or further evil they can try to wreak. >> Attacking the cluster is one of them, and as discussed if the cluster is >> doing real parallel code it is likely to be quite vulnerable regardless >> of whether or not its software is up to date because network security is >> more or less orthogonal to fine-grained code network performance. >> > > Amen, brother :) > >> BTW, the cluster's servers were not (and I would not advise that servers >> ever be) running the old distro -- we use a castle keep security model >> where servers have extremely limited access, are the most tightly >> monitored, and are kept aggressively up to date on a fully supported >> distro like Centos. The idea is to give humans more time to detect >> intruders that have successfully compromised an account at the >> workstation LAN level and squash them like the nasty little dung beetles >> that they are. Yup. Even better is never letting the users log in to admin machines. Provide machines for them to log into, submit and run jobs from. Just not the admin nodes. [...] >> In general, though, it is very good advice to stay with an updated OS. ... on threat-facing systems, yes, I agree. For what I call production cycle shops, those places which have to churn out processing 24x7x365, you want as little "upgrading" as possible, and it has to be tested/functional with everything. Ask your favorite CIO if they would consider upgrading their most critical systems nightly. It all boils down to a CBA (as everything does). Upgrading carries risk, no matter who does it, and how carefully things are packaged. The CBA equation should look something like this: value_of_upgrade = positive_benefits_of_upgrade - potential_risks_of_upgrade And if the value_of_upgrade is not strongly positive, you probably should not do it if you are supplying a service to a user base. 
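As a rough illustration of the cost-benefit expression above, here is a minimal Python sketch. The function names, the idea of pricing both benefits and risks in admin hours, the `margin` threshold standing in for "strongly positive", and every number used are assumptions made for this example, not figures from the thread.

```python
def value_of_upgrade(positive_benefits, potential_risks):
    """Net value of an upgrade: benefits minus risks, in the same units."""
    return positive_benefits - potential_risks


def should_upgrade(positive_benefits, potential_risks, margin=0.0):
    """True only if the net value exceeds `margin`.

    `margin` is a hypothetical threshold standing in for
    "strongly positive"; a production shop would set it high.
    """
    return value_of_upgrade(positive_benefits, potential_risks) > margin


# Illustrative numbers only: an upgrade expected to save 10 admin hours,
# with a 20% chance of breaking something that costs 40 hours to fix.
benefits = 10.0
risks = 0.2 * 40.0

print(value_of_upgrade(benefits, risks))
print(should_upgrade(benefits, risks, margin=5.0))
```

With a margin of zero the same upgrade would pass, which is roughly the difference between a personal cluster and one that serves a user base.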
Sure, you can do it on your own personal cluster. I appreciate that people on this list do this for their systems. Regardless of this, you need to be of the (somewhat paranoid) mindset when looking at an upgrade, and the potential for loss of time/data/... A (not so great) example would be someone packaging up a recent 2.6.19 kernel with that oh-so-nice ext3-vm interaction which gave us compromised files. It hit mmap based files from what I could see. All you need is an end user with a corner case that happens to tickle the trigger and whammo. You are now spending time fixing their problem (which requires downgrading/upgrading). You have a perfectly valid reason to upgrade threat facing nodes. Keep them as minimal and as up-to-date as possible. The non-threat facing nodes, this makes far less sense. If you are doing single factor authentication, and have enabled passwordless access within the cluster: ssh keys or certificates or ssh-agent based, once a machine that holds these has been compromised, the game is over. Multi-factor authentication for launching cluster runs is still a challenge, as queuing systems may schedule jobs to start at 3am local time, and no one wants to wait around for job start to enter additional factors. You want to test any upgrade, and only upgrade what needs upgrading. Just like other aspects of security 101, threat facing nodes need to be running as little (important) stuff as possible, and need as limited access as you can give them. Upgrades can and do carry their own bugs and security holes, and you really don't want to be chasing those as well. >> My real point was that WITH yum and a bit of prototyping once every >> 12-24 months, it is really pretty easy to ride the FC wave on MANY >> clusters, where the tradeoff is better support for new hardware and more >> advanced/newer libraries against any library issues that one may or may >> not encounter depending on just what the cluster is doing. 
Freezing FC >> (or anything else) long past its support boundary is obviously less >> desireable. However, it is also often unnecessary. >> > > Fedora Legacy just closed its doors - if you take a couple of months > to get your Uebercluster up and running, you're 1/3 of the way through > your FC cycle :( It doesn't square. Fedora looks set to lose its way > again for Fedora 7 as they merge Fedora Core and Extras and grow to Hmmm. Fedora is the testing framework for RHEL. We know this. I like 6, it looks to be a fine test distro, and has lots of nice things in it. Works on lots of hardware. If I were building a cluster on it, I would not upgrade the compute nodes. Once they are set, unless there is a good reason to upgrade (newer packages that do not add needed or missing features is not a valid reason IMO), I would leave the compute nodes alone. Probably the head node as well. The login nodes are a different story. Upgrade them (security patches) as quickly as possible. > n-000 packages again - the fast upgrade cycle, lack of maintainers and > lack of structure do not bode well. They're apparently moving to a 13 month > upgrade cycle - so your Fedora odd releases could well be three years apart. > The answer is to take a stable distribution, install the minimum and work > with it OR build your own custom infrastructure as far as I can see. > Neither Red Hat nor Novell are cluster-aware in any detail - they'll > support their install and base programs but don't have the depth of > expertise to go further :( Both are happy to sell licenses to the unwary. At the end of the day, if you are going to build a RHEL cluster, use Centos/Scientific Linux unless you absolutely wish to pay RH for security patches. With SuSE, use OpenSuSE. If you are going to settle on Fedora, pick a distro, and remember that it will be out of support in a year, which shouldn't matter to the compute/head node once they are up. 
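The split between threat-facing and non-threat-facing nodes described above can be written down as a small decision rule. This is only a sketch of one reading of that advice: the role names, the returned strings, and the assumption that login nodes are the only threat-facing role are all hypothetical.

```python
# Assumption for this sketch: login nodes are the only threat-facing role.
THREAT_FACING_ROLES = {"login"}


def upgrade_action(role, fixes_security_hole=False, adds_needed_feature=False):
    """Suggest what to do with a pending update for a node of a given role."""
    if role in THREAT_FACING_ROLES:
        # Threat-facing nodes: apply security patches as quickly as possible,
        # and otherwise keep the installed set as small as possible.
        return "patch promptly" if fixes_security_hole else "keep minimal"
    if adds_needed_feature:
        # Compute/head nodes: a needed or missing feature is a valid reason,
        # but only after the upgrade has been tested.
        return "test, then upgrade"
    # "Newer" alone is not a reason to touch a working compute or head node.
    return "leave alone"


for role in ("login", "head", "compute"):
    print(role, "->", upgrade_action(role, fixes_security_hole=True))
```

Note that a security fix alone does not move a compute node off "leave alone" in this sketch; whether that is acceptable depends on how well the choke point in front of the cluster is defended.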
>> On clusters that add new hardware, usually bleeding edge, every four to >> six months as research groups hit grant year boundaries and buy their >> next bolus of nodes, FC really does make sense as Centos probably won't >> "work" on those nodes in some important way and you'll be stuck >> backporting kernels or worse on top of your key libraries e.g. the GSL. >> Just upgrade FC regularly across the cluster, probably on an "every >> other release" schedule like the one we use. >> > > Chances are that anything Red Hat Enterprise based just won't work. New > hardware is always hard. Heh. Try to point this out to a purchasing agent on an RFP which demands a) newest possible hardware and b) RHEL 4 support. You get to pick one or the other, not both. Which one do you want? Hint: "b" is far less valuable. The other (not-so-funny) aspect of this is when we deliver new hardware with an OS load that supports the newer hardware and someone wants to pull it back to the "corporate standard". In doing so, they give up stability, performance, and often file system support. Or in the case of our JackRabbit unit, when we deliver 30TB of 5U system and we get the "ext3 is almost as good as xfs" line. Uh.... er.... no. Those who really insist upon this must only want 16TB units with no possibility to ever grow beyond this (we have a design cooked up to show how to do a 1 PB in 4 racks as a single file system, or better, an HA 1 PB in 9 racks as a single file system). 16TB is great for some folks, but it is a fundamental ext3 limit. You need the untried-in-the-real-world ext4 to break that limit. Or xfs and jfs. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From rgb at phy.duke.edu Sun Jan 7 08:41:57 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <20070107112230.GA7654@galactic.demon.co.uk> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk> Message-ID: On Sun, 7 Jan 2007, Andrew M.A. Cater wrote: >> BTW, the cluster's servers were not (and I would not advise that servers >> ever be) running the old distro -- we use a castle keep security model >> where servers have extremely limited access, are the most tightly >> monitored, and are kept aggressively up to date on a fully supported >> distro like Centos. The idea is to give humans more time to detect >> intruders that have successfully compromised an account at the >> workstation LAN level and squash them like the nasty little dung beetles >> that they are. >> > > Can I quote you for Security 101 when I need to explain this stuff for > senior management ? Sure. If you arrange to get me there and pay me an exorbitant fee, I'll bring along my sucker rod and explain it to them myself on your behalf. (I wouldn't try to extort the exorbitant fee for it except that to my experience, if management isn't paying you $150/hour plus expenses for your expertise, they devalue it. Besides, I've still got to get SOMEBODY to pay for my kids' Nintendo W(here )i(s )i(t) -- once I actually find one to purchase;-) > It is _always_ worth browsing the archives of this list. Somebody, > somewhere has inevitably already seen it/done it/get the scars and is > able to explain stuff lucidly. I can't recommend this list highly enough > both for it's high signal/noise ratio and it's smart people [rgb 1-8 > inclusive, for example] Make that $200/hour...;-) > Fedora Legacy just closed its doors - if you take a couple of months > to get your Uebercluster up and running, you're 1/3 of the way through > your FC cycle :( It doesn't square. 
Fedora looks set to lose its way > again for Fedora 7 as they merge Fedora Core and Extras and grow to > n-000 packages again - the fast upgrade cycle, lack of maintainers and > lack of structure do not bode well. They're apparently moving to a 13 month > upgrade cycle - so your Fedora odd releases could well be three years apart. > The answer is to take a stable distribution, install the minimum and work > with it OR build your own custom infrastructure as far as I can see. > Neither Red Hat nor Novell are cluster-aware in any detail - they'll > support their install and base programs but don't have the depth of > expertise to go further :( Yeah, well, if RH asked me to be on their board of directors, I could probably do something about a lot of this. Their business plan is very conservative and is "working" in that they are making money and keeping their investors and the community simultaneously at bay, but they are also missing multiple opportunities to really solidify and attack Microsoft head on. Fedora is working very well in many respects for them -- I've been very happy with it since roughly FC 2 (FC 1 sucked, partly because of the emergence of x86_64 and partly for other reasons). I mean what the heck, they're right down the road, right? I could probably drive to board meetings and they could pay me in options...;-) > Chances are that anything Red Hat Enterprise based just won't work. New > hardware is always hard. Yeah, starting right here. There are several boats RH is missing, but this is the biggest one. RH can freeze all sorts of things in any given distro and just maintain them, but the kernel and its associated hardware layer of support tools (and applications that directly access them) are NOT AMONG THEM! Having a fixed release and support cycles in terms of months or years is just silly. 
Release cycles should be determined by the way the product itself evolves, which in turn is a marvelous and somewhat erratic function of the rapidly changing hardware market and the whims of the toplevel developers (kernel, compiler, main unix libraries, X). Application space needs to be DECOUPLED from some sort of sane base. This is what they haven't yet grokked -- it is long since past time for linux in general to separate into two distinct pieces, e.g. Fedora >>core<< (which should be really minimal but well maintained for a "long time") and Fedora >>applications<< which should be entirely separate. Precisely the same split should be visible in e.g. RHEL -- a core that is large enough to support commercial applications with aggressive kernel and hardware-layer updates and a number of distinct layers of applicationware -- X all by itself (separate from the core for RHEL since servers don't need it and it really isn't desirable there as it's more to validate, more to secure), DB ware, userspace applications in general, etc. With yum, all the work of being able to support partitioned maintenance on the server or workstation itself is DONE, but the num-nums don't seem to realize it.

Microsoft would go mad (and go broke) if they tried to enforce a clean rebuild of every application in the Universe for every new OS version they release. And of course they can't -- even though they've been systematically engulfing makers of WinXX software for years as rapidly as antitrust laws permit them to do so there are still so many companies out there that make hardware with device drivers on disk or standalone software packages that they pretty much have to distribute a core OS and leave it up to the user to break the hell out of things with Installshield and battling libraries from ill-built or out of date software packages.

This is where RH missed the boat entirely.
Faced with a resource problem as they tried to do the undoable and given a space of possible solutions, they opted for one of the simplest, but least efficient, of those solutions. What they NEEDED to do -- and still need to do -- is think long and hard about just how to reorganize support of RHE linuces (and or FC) so it is BOTH efficient enough to remain within their means and the abilities of their software people to deliver AND capable of both staying up to date on the kernel/core across the board. I can then think of all sorts of ways they could choose to layer successive updates of application space. In fact, "Fedora" could refer ONLY to the aggressiveness of updates in the application layer. At any rate, I empirically have found Centos to be nearly useless for roughly 1/2 of each upgrade cycle on whole classes of hardware. On laptops it is a joke (except for one 6 month window perhaps right after it comes out). On x86_64 hardware it has been a crap shoot. Even on i386 hardware, one has the usual problem with this device or that device, especially in a desktop environment where users DO want their onboard video or sound or network to work (on server class hardware and apps it is more likely to work). Even FC makes me wait on laptops and some desktop hardware. THIS is one of two or three places where Lin still suffers relative to Win -- Windows "always" works on any platform you buy because it is "always" preinstalled and vendors experience pain and suffering if it doesn't preinstall in a functional state. Lin requires me to spend a quiet hour of moderately expert time googling and reading stuff from specialized sites to determine which (if any) firewire PCMCIA cards are known to work before I dare to buy one, which cameras are likely to work, which video adapters or sound cards are supported, which motherboard CHIPSETS are known to work. Bitch, bitch, bitch. Sigh. 
>> Nowadays, with PXE/Kickstart/Yum (or Debian equivalents, or the OS of >> your choice with warewulf, or...) reinstalling OR upgrading a cluster >> node is such a non-event in terms of sysadmin time and effort that it >> can pretty much be done at will. > > I've had the pleasure/pain of watching cluster admins from a distance > as they worked on a fully commercial cluster from major vendors. For > most on this list, its a no-brainer. I wish I had seen the same. Rather than say it is a no-brainer, perhaps it is fairer to say that once one makes a relatively modest investment in training the brain to learn how to use certain well-supported toolsets and ideas, it becomes easy and the investment is paid back tenfold. We're not quite to where we have a "build-a-bear" GUI front end for cluster building or a complete "cluster package" in any of the major distros, as far as I know, although the warewulf folks and maybe the scyld folks and possibly some others are getting there in their own distinct ways. Again this is fairly silly. Installing a cluster in this way and installing a workstation or office LAN in this way (via PXE/KS/Y) are really pretty much the same general task -- they differ only in package selection and possibly -- I say possibly -- in the way workstations or office systems are named. Imagine a Red Hat sales rep walking into an office with a laptop (with a gigE interface, an 8 port gigE switch with cables, and a halfway decent fast disk). He sets up the laptop and "borrows" four or five office desktops and cables them into the switch. He powers them on and sets their BIOS to boot from the network first, with a standard 3 second or whatever timeout. They boot up, and -- magic! -- they are running RHEL-whatever, with ooffice etc installed and ready to run. WinXX is still untouched on their native disks. Everything is bulletproof and automaintaining, with a clear partition between userspace and rootspace, full control over user accounts and access, etc. 
He removes the systems from the switch and puts them back on their native LAN and reboots them to WinXX, and points out that installing and maintaining Lin is just that easy. He could have them set up with a Lin server that supports WinXX clients and Lin clients that boot just that way overnight, and that permit the office staff to gradually convert to Lin as they learn that it is mostly virusproof, that ooffice pretty much just "works" like msoffice, that a browser is a browser and firefox is a decent one, that there are several hundred free nifty desktop games to while away those tedious cubicle hours when nobody's looking instead of three. At a cost of $50/seat and they can get rid of 2/3 of their admin staff at the same time because one admin can easily support 100-200 desktop seats...

Hey, I can dream, can't I?

>> The worst thing that such a strategy might require is a rebuild of user
>> applications for both distros, but with shared libraries to my own
>> admittedly anecdotal experience this "usually" isn't needed going from
>> older to newer (that is, an older Centos built binary will "probably"
>> still work on a current FC node, although this obviously depends on the
>> precise libraries it uses and how rapidly they are changing). It's a
>> bit harder to take binaries from newer to older, especially in packaged
>> form. There you almost certainly need an rpmbuild --rebuild and a bit
>> of luck.
>>
>
> I use Debian - I've never had to learn about more than one repository
> and one distribution for everything I need. What is this "rebuilding" of
> which you speak :)

Ha. I remember well the time that we considered Debian in our department and rejected it because its stable distro suffered from precisely the same problem that RHEL/Centos suffer from now. It was very stable, it worked excellently well, and it was way, way behind the hardware curve in libraries and kernel support.
It may well be that they've done a better job than RH at recognizing this as a core user requirement in pretty much any environment so that the stable release tracks the kernel and new hardware better (dealing with libraries and dependencies as required). It would be pretty easy to do. It's just unfortunate that Linux has never QUITE managed to turn the corner and create clean layers of separation between the hardware and kernel, the core libraries and compiler, and application space. Hence the need for distributions at all per se, hence the need for distribution "releases" with applications pretty much all rebuilt just for the functional core in question. The weird thing is that in principle, both rpm and apt permit one to do much better. This is really a problem in computer science and software design and OS organization that is SOLVABLE. Packaging schema contain the hooks required to do so, and the open source community has worked out truly awesome methodology for maintaining a >>huge<< collection of packages (I just grabbed images of FC 6 for i386 and x86_64 from Duke's repo, and they ate close to 30 GB of disk for just the binary RPMs!) The problem is all in the partitioning -- creating "independently" maintainable layers. I have modest hopes for HAL -- it was something of a joke previous to FC 5 or 6, but in 6 it actually works perfectly and transparently a lot of time. This is the kind of thing that is necessary -- with enough abstraction it might be possible to maintain a kernel snapshot "indefinitely" by simply updating its collection of modules and hal itself, so that applications "just work" with new hardware without having to upgrade to an unstable/rawhide release. >> Truthfully, cluster installation and administration has never been >> simpler. >> > > I think you underestimate your expertise - and the expertise on this > list. My mantra is that cluster administration should be simple and > straightforward: in reality, it's seldom so. 
It depends on the paradigm you adopt, and how lucky you are in terms of hardware matching the capabilities of your distro/release. Which perhaps "shouldn't" be a matter of luck, but often is as there is nothing that can protect you from "lemon" hardware but buying from a vendor that will if necessary completely replace it. (Even prototyping won't always reveal a problem -- it just "probably" will.) IF you select hardware from a vendor that guarantees hardware compatibility with any of the current/mainline distros -- and there are several that do -- AND you select one of those mainline (well-supported and automagically installable) distros AND you learn to master its automagic installation techniques, then managing any sort of linux operation from a single machine to an organization-spanning LAN consisting of an arbitrary mixture of servers, workstation/office LANs, and clusters has never been simpler. That is a true statement. A single repo mirror set, a single homemade package repo, and PXE permit a single individual to provide ALL the software installation and maintenance support required by a large company under these circumstances. Individuals can install linux on their own hardware (at their own risk) at will from the repo(s), departments that follow the hardware rules can install and maintain standardized systems any of a number of ways, and in all cases a pro-class distribution updates all of these systems in a fully automatic way e.g. nightly to the current repo update level, making it easy to install new software or update old software. Cluster admins have it even easier, as their (linux distro compatible) nodes are likely to be all IDENTICAL (in groups, at least, over several generations) and homogeneity is the friend of the administrator just as heterogeneity is Evil Incarnate. 
Give me a switch and cables and a rackful of Penguin boxes (please!:-), one equipped with a row of hot-swappable disks and a tape library, and I'll take my laptop and its currently FC6-full backpack disk and return you a functional cluster in the amount of time required to physically assemble the nodes plus less than a day to (re)install them with a perfectly reasonable cluster configuration, very nearly independent of the number of nodes or racks. Give me a couple or three days and I can probably arrange to install the cluster a couple or three different ways -- diskful, diskless, mixed, scyldified. Not ALL cluster needs would be satisfied by this of course. That's the basic problem described in detail above. If the cluster "required" RHEL/Centos release X so it could run commercial package Y (and it didn't just run anyway on FC6, which it probably would do:-) and the penguin hardware "required" FC6 because older RHEL/Centos kernels just don't support the network device or dual core dual CPU AMD x86_64 BIOS, then yeah, you enter one form of Linux Hell from which there is no easy escape but to not get the unsupported hardware in the first place, no matter how much your users beg for bleeding edge hardware, OR getting your #&!@ software vendor that you are PAYING to REBUILD their damn application for FC6 (and in the process, package the thing up so it autobuilds as RPMs or whatever) at least as well as all the maintainers of the 6000-odd FREE packages in FC6 manage to package them up (grrr) OR backporting kernels and key libraries from FC6 to RHEL/Centos whatever -- maybe, possibly, don't hold your breath. Hell. Yup, then yum and friends, permitted RPM-derived linuces to emerge from the long night of software dependency hell (where Debian had long since stepped into the light). 
It is time to really focus on hardware dependency hell and conditional provisioning trees, both of which are well within the capabilities of modern packaging systems and the general linux design. Conditional provisioning trees, in particular, could really revolutionize things and perhaps make it possible to get away from the notion of the "complete distribution release". The current paradigm, which worked amazingly well for order of a few hundred packages, does not scale to a few thousand particularly well, and we're well on our way to 10 Kpkg and up distribution releases, which will be a maintenance nightmare under the current scheme. I think, anyway. The future should be interesting... as always. It would be funny, in a sick sort of way, if Windows manages to hold on in the face of linux because it supports LESS software (but all of the hardware, nearly perfectly). Most people don't need more than a few hundred of the ~10 Kpkgs available. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun Jan 7 12:49:50 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <45A10C69.4030908@scalableinformatics.com> References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk> <45A10C69.4030908@scalableinformatics.com> Message-ID: On Sun, 7 Jan 2007, Joe Landman wrote: >>> BTW, the cluster's servers were not (and I would not advise that servers >>> ever be) running the old distro -- we use a castle keep security model >>> where servers have extremely limited access, are the most tightly >>> monitored, and are kept aggressively up to date on a fully supported >>> distro like Centos. 
The idea is to give humans more time to detect >>> intruders that have successfully compromised an account at the >>> workstation LAN level and squash them like the nasty little dung beetles >>> that they are. > > Yup. Even better is never letting the users log in to admin machines. > Provide machines for them to log into, submit and run jobs from. Just > not the admin nodes. That would be the "servers have extremely limited access" part -- as in sysadmins only. > For what I call production cycle shops, those places which have to churn > out processing 24x7x365, you want as little "upgrading" as possible, and > it has to be tested/functional with everything. Ask your favorite CIO > if they would consider upgrading their most critical systems nightly. > > It all boils down to a CBA (as everything does). Upgrading carries > risk, no matter who does it, and how carefully things are packaged. The > CBA equation should look something like this: > > value_of_upgrade = positive_benefits_of_upgrade - > potential_risks_of_upgrade I completely agree with this. As I pointed out earlier in the thread, companies such as banks make "conservative" seem downright radical when it comes to OS upgrades. They have to do a complete, thorough, comprehensive security audit to change ANYTHING on their machines -- as a requirement in federal law, IIRC. To get them to take you seriously, you MUST be prepared to support the OS they install on (once it is successfully audited) forever -- until the hardware itself falls apart into itty-bitty bits. >>> On clusters that add new hardware, usually bleeding edge, every four to >>> six months as research groups hit grant year boundaries and buy their >>> next bolus of nodes, FC really does make sense as Centos probably won't >>> "work" on those nodes in some important way and you'll be stuck >>> backporting kernels or worse on top of your key libraries e.g. the GSL. 
>>> Just upgrade FC regularly across the cluster, probably on an "every >>> other release" schedule like the one we use. >>> >> >> Chances are that anything Red Hat Enterprise based just won't work. New >> hardware is always hard. > > Heh. Try to point this out to a purchasing agent on an RFP which > demands a) newest possible hardware and b) RHEL 4 support. You get to > pick one or the other, not both. Which one do you want? Hint: "b" is > far less valuable. > > The other (not-so-funny) aspect of this is when we deliver new hardware > with an OS load that supports the newer hardware and someone wants to > pull it back to the "corporate standard". In doing so, they give up > stability, performance, and often file system support. Or in the case > of our JackRabbit unit, when we deliver 30TB of 5U system and we get the > "ext3 is almost as good as xfs" line. Uh.... er.... no. Those who > really insist upon this must only want 16TB units with no possibility to > ever grow beyond this (we have a design cooked up to show how to do a 1 > PB in 4 racks as a single file system, or better, an HA 1 PB in 9 racks > as a single file system). 16TB is great for some folks, but it is a > fundamental ext3 limit. You need the untried-in-the-real-world ext4 to > break that limit. Or xfs and jfs. Proving once again that Joe's company provides a valuable service, because companies like this fill in an important gap between e.g. FC and a customer's conservative needs. However, I'll bet Joe is still just as vulnerable to the other problem -- customer wants to run commercial package X (which "requires" RHEL) but ALSO wants to run it on bleeding edge hardware. I'll bet you really earn your keep on those ones... ;-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Sun Jan 7 13:25:17 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk> <45A10C69.4030908@scalableinformatics.com> Message-ID: <45A1653D.4060605@scalableinformatics.com> Robert G. Brown wrote: > Proving once again that Joe's company provides a valuable service, Well thank you (the check will be in the mail :) ) > because companies like this fill in an important gap between e.g. FC and > a customer's conservative needs. However, I'll bet Joe is still just as > vulnerable to the other problem -- customer wants to run commercial > package X (which "requires" RHEL) but ALSO wants to run it on bleeding > edge hardware. I'll bet you really earn your keep on those ones... ... and some rather deep scars, traumatic head wounds, and related ... Still have most of my fingers ... Humor aside, we have a little download area that we point our customers to (http://downloads.scalableinformatics.com). Each file there usually represents a solved problem ... though some have caused problems on their own (the areca drivers and xfs bits for RHEL based distros are there ... I am loath to rewrite anyone's initrd without asking them, nicely, and giving them a way to recover should it go horribly wrong ... I still want to come up with a good solution ... ).
> > ;-) > > rgb > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Sun Jan 7 19:49:55 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:34 2008 Subject: [Beowulf] Any Gaussian users out there? Message-ID: <45A1BF63.4010203@scalableinformatics.com> I found a neat ... feature ... of Linux while getting g03 running in SMP on cluster nodes. Long story, but the folks I am doing this for don't have/want to use Linda. They asked us to help them get g03 operational in SMP parallel. This wasn't painful. Have it integrated into SGE and our SICE interface now as well. Basic idea is that we are getting a kernel exception in the VFS layer only when running with 2 or more CPUs on an SMP node. Shows up only on SuSE 9.3 nodes. The other nodes are RHEL 3 based (2.4 kernel, but hey, its really stable). I don't want to post a nasty-looking trap here. The problem occurs with both xfs and jfs. Haven't had the chance to try ext3 yet, though if the issue is in the vfs layer, I can't see how changing the underlying block device is going to alter the layers (VFS) above it. The net effect of this is that it runs great on the 2.4 based machines, but gets SIGKILLs when running on the 2.6 based SuSE 9.3 machines. Looks like the app is tickling the OS bug. I can repeatably cause this trap, though it seems to occur at "random" places, well, not really. The way Gaussian runs, it has "links" which are binary modules which execute a particular portion of the calculation (its pretty neat really). Each link is read in from the disk. This VFS bug gets triggered regardless of local or remote FS. Any Gaussian users out there see that? Does a kernel upgrade fix it? Inquiring minds want to know ... 
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From nixon at nsc.liu.se Mon Jan 8 03:03:00 2007 From: nixon at nsc.liu.se (Leif Nixon) Date: Fri May 9 01:05:35 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: <45A10C69.4030908@scalableinformatics.com> (Joe Landman's message of "Sun, 07 Jan 2007 10:06:17 -0500") References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk> <45A10C69.4030908@scalableinformatics.com> Message-ID: Joe Landman writes: > Andrew M.A. Cater wrote: >> On Wed, Jan 03, 2007 at 09:51:44AM -0500, Robert G. Brown wrote: >>> On Wed, 3 Jan 2007, Leif Nixon wrote: >>> >>>> "Robert G. Brown" writes: >>>> >>> b) If an attacker has compromised a user account on one of these >>> workstations, IMO the security battle is already largely lost. They > > s/largely/completely/g > > At least for this user, if they have single factor passwordless login > set up between workstation and cluster. Of course. But you want to contain the intrusion to that single user, as far as possible. If your security hinges on no user passwords ever being stolen, you can very easily wind up in a situation that traditionally is said to involve creeks, but not paddles. I have two thick binders sitting on my desk, containing stolen passwords from an impressive range of commercial, academic and military institutions. >>> In general, though, it is very good advice to stay with an updated OS. > > ... on threat-facing systems, yes, I agree. > > For what I call production cycle shops, those places which have to churn > out processing 24x7x365, you want as little "upgrading" as possible, and > it has to be tested/functional with everything. 
Ask your favorite CIO > if they would consider upgrading their most critical systems nightly. I see this in hospitals a lot. Some healthcare systems can't be patched without reapplying for FDA approval, which is of course a hideously complicated process. So hospitals wind up running software which you can push over with a feather. Theoretically, they should be running on an isolated network ("It's no problem, we have firewalls!!!"), but it only takes a single mistake: somebody plugs in an infected laptop, or somebody misconfigures a VLAN. Our local hospital has fallen over due to worm infestations a couple of times. > It all boils down to a CBA (as everything does). Upgrading carries > risk, no matter who does it, and how carefully things are packaged. The > CBA equation should look something like this: > > value_of_upgrade = positive_benefits_of_upgrade - > potential_risks_of_upgrade With the security benefits being really hard to quantify. > You have a perfectly valid reason to upgrade threat facing nodes. Keep > them as minimal and as up-to-date as possible. The non-threat facing > nodes, this makes far less sense. If you are doing single factor > authentication, and have enabled passwordless access within the cluster: > ssh keys or certificates or ssh-agent based, once a machine that holds > these has been compromised, the game is over. I don't get this. What's the point of having a "secure" frontend if the systems behind it are insecure? OK, there's one big point - hopefully you can buy some time - but other than that? The goal is to be able to contain user level intrusions. If you can do this, the game *isn't* over even if you have an intrusion spreading to a cluster machine. A user level intrusion isn't too hard to deal with, but a cluster-wide root intrusion... isn't much fun. Sure, you can probably reinstall the entire cluster in an hour. To a vulnerable state. Hooray. 
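The cost/benefit equation quoted above can be made concrete with a small sketch. Everything in it is an assumption for illustration: the class and field names, the probability-weighted form, and the numbers are hypothetical and were not proposed by anyone in this thread. It treats both the benefit (exposure avoided) and the risk (breakage caused by the upgrade) as expected values, which is exactly where the "hard to quantify" objection bites, since the answer is only as good as the guessed probabilities.

```python
from dataclasses import dataclass


@dataclass
class UpgradeEstimate:
    """Hypothetical inputs for one proposed upgrade (all values are guesses)."""
    p_exploit_if_unpatched: float  # chance the hole is exploited if left alone
    cost_of_exploit: float         # damage if it is exploited
    p_breakage_if_patched: float   # chance the upgrade breaks production
    cost_of_breakage: float        # downtime/rework cost if it does


def value_of_upgrade(e: UpgradeEstimate) -> float:
    """value_of_upgrade = positive_benefits_of_upgrade - potential_risks_of_upgrade."""
    positive_benefits = e.p_exploit_if_unpatched * e.cost_of_exploit
    potential_risks = e.p_breakage_if_patched * e.cost_of_breakage
    return positive_benefits - potential_risks


# Threat-facing login node: exposed, and cheap to test and roll back.
login_node = UpgradeEstimate(0.5, 100_000.0, 0.125, 8_000.0)
# Production compute node behind the front door: rarely exposed, costly to break.
compute_node = UpgradeEstimate(0.0625, 100_000.0, 0.25, 40_000.0)

print(value_of_upgrade(login_node))    # positive: upgrade
print(value_of_upgrade(compute_node))  # negative: hold off and test first
```

With these made-up inputs the threat-facing node comes out clearly in favour of upgrading and the production node does not, which is the split both sides of the argument describe. Change the exploit probability on the compute node and the sign flips, so the disagreement is really about that one input.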
-- Leif Nixon - Systems expert ------------------------------------------------------------ National Supercomputer Centre - Linkoping University ------------------------------------------------------------ From landman at scalableinformatics.com Mon Jan 8 07:20:34 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:35 2008 Subject: [Beowulf] Which distro for the cluster? In-Reply-To: References: <1d151d3b0612270946m2b09039ct538339e487cad6e8@mail.gmail.com> <200612290939.59593.csamuel@vpac.org> <20061229005749.GA13471@galactic.demon.co.uk> <20070107112230.GA7654@galactic.demon.co.uk> <45A10C69.4030908@scalableinformatics.com> Message-ID: <45A26142.9050408@scalableinformatics.com> I am posting before coffee (PBC) so if I ramble more than usual, my apologies. Leif Nixon wrote: > Joe Landman writes: >>>> b) If an attacker has compromised a user account on one of these >>>> workstations, IMO the security battle is already largely lost. They >> s/largely/completely/g >> >> At least for this user, if they have single factor passwordless login >> set up between workstation and cluster. > > Of course. But you want to contain the intrusion to that single user, > as far as possible. I think there are two different issues. First: security is meant to be an access control and throttle/choke point. Second: how you view your cluster. Is it "one-big-machine" in some sense (not necessarily Scyld, but with a security model such that if you are on the access node you are on the machine), or is it really a collection of individual machines each with their own administrative domain? One of these models works really well for "cluster" use. > If your security hinges on no user passwords ever > being stolen, you can very easily wind up in a situation that > traditionally is said to involve creeks, but not paddles. Your security model should mirror your intended usage model as indicated above. If you are using a cluster, security is the front door.
If you are using something else, the security model may be different. Since we are into analogies, why not look at it like the front of a very exclusive club? If you get in, you are in. If you want, you can even implement different security room to room, which very quickly causes your club members to leave as it gets hard to move room to room. Security is in part about containment. Containment is not necessarily putting a lock on every door, and a different required key or three to unlock. More importantly, security is about minimizing the maximum damage an attack can do. A different lock on every door may stop the casual attacker, but as you have large binders of stolen passwords (the authorities might wish to ask you how you got them :( ), I have some not so nice log files of years of hackers, some script kiddies, and some very good ones, beating on everything but the front door. Put another way, I've been mimicking a few others for the better part of a decade, saying security is a process, not a product. Making a process hard doesn't necessarily make it secure. What matters is making sure that when the process breaks down, and it will, the damage as a result of that breakage is as low as you can make it. > I have two > thick binders sitting on my desk, containing stolen passwords from an > impressive range of commercial, academic and military institutions. > >>>> In general, though, it is very good advice to stay with an updated OS. >> ... on threat-facing systems, yes, I agree. >> >> For what I call production cycle shops, those places which have to churn >> out processing 24x7x365, you want as little "upgrading" as possible, and >> it has to be tested/functional with everything. Ask your favorite CIO >> if they would consider upgrading their most critical systems nightly. > > I see this in hospitals a lot. I see this in every single production cycle shop we have been in. Not just FDA-regulated.
So much so that they have a process that involves building a second (or Nth) test machine, called a sandbox, specifically to test things until they believe them to work before deploying them. Back to this in a second. > Some healthcare systems can't be > patched without reapplying for FDA approval, which is of course a > hideously complicated process. So hospitals wind up running software > which you can push over with a feather. Theoretically, they should be > running on an isolated network ("It's no problem, we have > firewalls!!!"), but it only takes a single mistake: somebody plugs in > an infected laptop, or somebody misconfigures a VLAN. Our local > hospital has fallen over due to worm infestations a couple of times. The analogy fails to hold up. Zero-day viruses and malware on fully patched Windows systems burn through the desktop/laptop population of many. What is terrifying to me is that my government still mandates/allows the use of systems which are easily compromised in its most sensitive inner reaches. Specifically in the military and related areas. I don't know details, only heard faint mutterings online, but something like this appears to have knocked some portion of government computers in a highly sensitive area offline for several days very recently. As indicated before, security is not a product (e.g. an updated patch), it is a process (minimizing the maximum damage). If you act otherwise, the zero-day viruses and malware are going to wreak havoc. Or if you (mistakenly) believe your systems are secure because you use multifactor access control with long random passwords and secure id cards, you don't pay attention to some ... misfeatures that are being exercised by people of nefarious intent. If all I ever do is send random garbage to port 22 after doing the handshaking, and eventually blow ssh out of the water, it really doesn't matter if you have multifactor authentication running.
I would be in as the user running the daemon. Hence privilege separation. Change the code so that if there is a break in, the maximum damage that can be done is done as the sshd_daemon user. Since they are no longer root user, and they are isolated, in their own group, the damage they can do is contained. Minimizing the maximum damage. > >> It all boils down to a CBA (as everything does). Upgrading carries >> risk, no matter who does it, and how carefully things are packaged. The >> CBA equation should look something like this: >> >> value_of_upgrade = positive_benefits_of_upgrade - >> potential_risks_of_upgrade > > With the security benefits being really hard to quantify. Not really. If you have a huge gaping hole that needs patching (OpenSSL off-by-one or weakness), the benefits are easy. Again, it is a process. You test the upgrade, and if it breaks nothing else, you do it. In fact, this suggests (usually) doing upgrades in smaller incremental bits rather than large complex bits. A huge bolus of patches and fixes often has a few new (mis)features (I could name a company here, but they know who they are) which are unfortunately potentially exploitable. To keep risks low, make as few changes as possible. To keep benefits high, update important threat facing things. To keep risks lower, do not introduce more changes than absolutely needed. Patches should not include new (mis)features. >> You have a perfectly valid reason to upgrade threat facing nodes. Keep >> them as minimal and as up-to-date as possible. The non-threat facing >> nodes, this makes far less sense. If you are doing single factor >> authentication, and have enabled passwordless access within the cluster: >> ssh keys or certificates or ssh-agent based, once a machine that holds >> these has been compromised, the game is over. > > I don't get this. What's the point of having a "secure" frontend if > the systems behind it are insecure? 
OK, there's one big point - > hopefully you can buy some time - but other than that? It's the model of how you use the machine. If you lock all the doors tight with impenetrable seals, and the attacker goes through the weaker windows, those impenetrable seals haven't done much for you. The idea is you minimize the exposed footprint of the machine to threat facing access. This is why lots of the secure sites are disabling USB ports on the motherboards (but mistakenly then running systems which can install keyloggers and other malware ... ). If the USB does not electrically work, it is not a possible attack vector. You can always take the approach of compartmentalization; locking *everything* down. Put those impenetrable seals up. Have one port exposed. Allow no back channels whatsoever. No shared storage. No single factor authentication. This is not a cluster computing model that I have heard of. Would break too many things. Yeah yeah, grid this and that. > The goal is to be able to contain user level intrusions. If you can do I disagree. I think the goal is to minimize the maximum damage. I do not think it is possible to completely contain a smart and resourceful attacker with multiple attack vectors. I know lots of security folks who used to think that their firewalls could, and then watched said resourceful hackers go through them. > this, the game *isn't* over even if you have an intrusion spreading to > a cluster machine. A user level intrusion isn't too hard to deal with, > but a cluster-wide root intrusion... isn't much fun. Sure, you can > probably reinstall the entire cluster in an hour. To a vulnerable > state. Hooray. Again, I disagree. I do not believe patching is a magic solution.
A well designed security model that, in the event that the assumptions of the model break down (say all the doors and ports suddenly, magically spring open, because the attacker muttered the appropriate phrase into the wire), still limits the damage that can be done, might be an approach worth considering. Again, I watch (in horror) as military organizations, with some really nice rules and procedures behind them designed to contain and control these bits, proceed to use systems that are known to be insecure by design. If your system can be keylogged, it should never ever be on a network, anywhere. Or change your system so that keylogging is harder/impossible. Security is a process. Like never downloading important information to a laptop only to let it be stolen / lost later on. The current fad is encrypting the disk, and this might prevent some attacks, or slow the rate of information release. Or not. The point is that if the maximum damage an attacker can do is contained or minimized, then you can gather valuable threat infor