From deadline at clusterworld.com Sat Jan 1 07:40:44 2005 From: deadline at clusterworld.com (Douglas Eadline, Cluster World Magazine) Date: Sat, 1 Jan 2005 10:40:44 -0500 (EST) Subject: [Beowulf] Xgrid and Mosix (fwd from john@rudd.cc) In-Reply-To: <20041230122059.GL9221@leitl.org> Message-ID: These two pages are useful when considering Mosix What does not migrate: http://howto.x-tend.be/openMosixWiki/index.php/don't what does migrate: http://howto.x-tend.be/openMosixWiki/index.php/work%20smoothly ClusterWorld just ran a feature on OpenMosix http://www.clusterworld.com/issues/dec-04-preview.shtml Doug On Thu, 30 Dec 2004, Eugen Leitl wrote: > ----- Forwarded message from John Rudd ----- > > From: John Rudd > Date: Wed, 29 Dec 2004 19:37:02 -0800 > To: xgrid-users at lists.apple.com > Subject: Xgrid and Mosix > X-Mailer: Apple Mail (2.619) > > > I see in the archives that someone asked about OpenMosix back in > September ( > http://lists.apple.com/archives/xgrid-users/2004/Sep/msg00023.html ), > but I didn't see any responses. So I thought I'd ask too, but with a > little more detail. > > The thing that I find interesting about the Mosix style distributed > computing environment is that applications do NOT need to be re-written > around them. Mosix abstracts the distributed computing cluster away > from the program and developer in the same way that threads abstract > multi-processing away from the program and developer. Under Mosix, any > program, without having to be written around any special library, > without having to be relinked or recompiled, can be moved off to > another processing node if there are nodes that are significantly less > busy than yours. And, AFAIK, any multi-threaded application can make > use of multiple nodes (with threads being spawned on any host that is > less loaded than the current node). Imagine taking a completely > mundane but multi-threaded application (I'll assume Photoshop is > multi-threaded and use that as an example). Suddenly, without having > to get Adobe to support Xgrid, you can use Xgrid to speed up your > Photoshop rendering. > > It seems to me that a similar set of features could be added to Xgrid. > The threading and processing spawning code within the kernel could be > extended by Xgrid to check for lightly loaded Agents, and move the new > process or thread to that Agent. Only the IO routines would need to > exist on the Client (and even then, maybe not: if every node has > similar filesystem image, then only the UI (for user bound > applications) or primary network interface code (for network > daemons/servers) needs to run on the original Client system). From > what I recall, the mach microkernel already makes some infrastructure > for this type of thing available, it just needs to be utilized, and > done deep enough in the kernel that an application doesn't need to know > about it. > > > Though, that does bring up one consideration: I have a friend who did a > lot of distributed computing work when he was working for Flying > Crocodile (a web hosting company that specialized in porn sites, where > his distributed computing code had to support multiple-millions of hits > per second). His experience there gave him a concern about Mosix style > distributed computing. 
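The transparency John describes is worth making concrete. The program sketched below is plain C against nothing but fork()/wait() -- no cluster library, no relink, no recompile -- and under openMosix each forked child is an ordinary process and therefore a candidate for transparent migration to a less loaded node. (The "what does not migrate" page above covers the exceptions, notably processes using shared memory, which in practice includes most threaded programs, so the multi-process pattern below is the safe case.) This is an illustrative sketch only, not code from the thread:

/* Illustrative sketch: a plain fork()-based worker farm with no
 * cluster-specific calls.  Under openMosix each child is an ordinary
 * process the kernel is free to migrate; nothing here links against
 * any cluster library. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static double burn_cpu(long iters)
{
    double x = 0.0;
    long i;
    for (i = 1; i <= iters; i++)        /* stand-in for a real per-task computation */
        x += 1.0 / (double)i;
    return x;
}

int main(void)
{
    const int ntasks = 8;               /* e.g. one task per node you hope to borrow */
    int i;
    pid_t pid;

    for (i = 0; i < ntasks; i++) {
        pid = fork();
        if (pid < 0) {
            perror("fork");
            return 1;
        }
        if (pid == 0) {                 /* child: compute and exit */
            fprintf(stderr, "task %d: %f\n", i, burn_cpu(200000000L));
            _exit(0);
        }
        /* parent just loops on to fork the next task */
    }
    while (wait(NULL) > 0)              /* reap all the children */
        ;
    return 0;
}

Run on an openMosix node with other nodes idle, a monitor such as mosmon should show the children drifting off around the cluster without the program knowing or caring.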
One of the advantages of something like Beowulf > is that the coder often needs to control what things need to be kept > low latency (must use threads for SMP on the local processor) and what > things can have high latency (can use parallel code on the network), > and the programming interface type of distributed computing gives them > that flexibility. > > The idea that I suggested was something like nice/renice in unix, where > you could specify certain parallelism parameters to a process before > you run it, or after it is already running. For example, instead of > "process priority", you might specify a sort of "process affinity" or > "thread affinity". For process affinity, a low number (which means > high affinity, just like priority and nice numbers) means "when this > process creates a child, it must be kept close to the same CPU as the > one that spawned it". Thread affinity would be the same, but for > threads. A default of zero means "everything must run locally". A > high number means "I can tolerate more latency" (so, "latency > tolerance" would be the opposite of "affinity"). (it occurs to me > after I wrote all of this that it might be easier for the end user to > think in terms of "latency tolerance" instead of "process affinity", > high latency = high number, instead of the opportunity for confusion > that affinity has since the numbers go in the opposite direction ... I > hope all of that made sense) > > A process with a low process affinity (high number) and a high thread > affinity (low number) means that it can spawn new > tasks/processes/applications anywhere in the network, but any threads > for it (or its sub-processes) must exist on the same node as its main > thread. Or, if you want all of the applications to be running on your > workstation/Client, but run their threads all over the network, then > you set a high process affinity (low number), and a low thread affinity > (high number). > > I would have the xgrid command line tool have such a facility (I don't > know if it does already or not, I haven't really done much with xgrid) > similar to both the "nice" and "renice" commands. I would also add a > preference pane that allows the user to set a default process affinity, > a default thread affinity, and a list of applications and default > affinities for each of those applications (so that they can be > exceptions to the default, without the user having to set it via > command line every time). Last, I would add a tool, possibly attached > to the Xgrid tachometer, which would allow me to adjust an affinity > after a program was running. > > The only thing up in the air is the ability to move a running thread > from one node to another while it's running (well, during a context > switch, really). I know a friend of mine at Ga Tech was doing PhD > research on that (portable threads) 10ish years ago, but I don't know > if it got anywhere. But, that would allow someone to lower the number > of an application's affinity while it's running, thus recalling the > threads or processes from a remote Agent to the local Client (the > scenario being I have a laptop that is an Xgrid Client, and I start > running applications that spread out across the network ... then I get > up to leave, so I lower the affinity numbers of everything so that the > tasks and threads come back to my laptop, running slower now that they > have fewer nodes to run upon, but still running (or sleeping, as the > case might be)). > > > So ... 
all of that leads up to: does anyone know if Xgrid is working on > this type of Application-Transparent Distributed Computing that Mosix, > OpenMosix, and I think OpenSSI have? I think it would be a natural > extension to Xgrid: Apple is trying to make this as "it just works" as > possible, so it seems that it should not only be easy for the sysadmin > to set up the distributed computing cluster, but easy/transparent for > the developer, too (in the same way that threads made Multi-Processing > easier and more abstract for the developer, this type of distributed > computing makes threads not just a multi-processing model, but a > distributed computing model). Ultimately, it even makes distributed > computing easy for the user: they don't need to learn how to re-code a > program (or coerce a vendor into making a distributed version of their > application), any multi-threaded application will use multiple nodes, > and even single-threaded non-distributed applications can be run on > remote nodes. That seems like a powerful "it just works" capability to > me. > > (the main drawback of Mosix, OpenMosix, and OpenSSI from my perspective > is that they're Linux only, specifically developed for the Linux kernel > ... but I'd really love to see something like them available for Mac OS > X) > > Thoughts? > > _______________________________________________ > Do not post admin requests to the list. They will be ignored. > Xgrid-users mailing list (Xgrid-users at lists.apple.com) > Help/Unsubscribe/Update your Subscription: > http://lists.apple.com/mailman/options/xgrid-users/eugen%40leitl.org > > This email sent to eugen at leitl.org > > ----- End forwarded message ----- > -- ---------------------------------------------------------------- Editor-in-chief ClusterWorld Magazine Desk: 610.865.6061 Fax: 610.865.6618 www.clusterworld.com From jrajiv at hclinsys.com Mon Jan 3 03:51:41 2005 From: jrajiv at hclinsys.com (Rajiv) Date: Mon, 3 Jan 2005 17:21:41 +0530 Subject: [Beowulf] grid Message-ID: <010501c4f18a$9742e120$0f120897@PMORND> Dear All, 1. I would like to implement grid computing in Linux. Any suggestions on what opensource package to use. 2. Is there any free grid projects in Windows Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: From zogas at upatras.gr Mon Jan 3 09:07:36 2005 From: zogas at upatras.gr (Stavros E. Zogas) Date: Mon, 3 Jan 2005 19:07:36 +0200 Subject: [Beowulf] MPI vs PVM Message-ID: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> Hi at all I want to set up a beowulf cluster for scientific purposes (In a university) using mainly fortran compilers.I want your help to choose between MPI and PVM Thanks in Advance Stavros -------------- next part -------------- An HTML attachment was scrubbed... URL: From mwill at penguincomputing.com Mon Jan 3 09:52:37 2005 From: mwill at penguincomputing.com (Michael Will) Date: Mon, 3 Jan 2005 09:52:37 -0800 Subject: [Beowulf] MPI vs PVM In-Reply-To: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> References: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> Message-ID: <200501030952.37992.mwill@penguincomputing.com> The short answer is: You definitely want MPI for performance reasons. Michael On Monday 03 January 2005 09:07 am, Stavros E. 
Zogas wrote: > Hi at all > I want to set up a beowulf cluster for scientific purposes (In a university) > using mainly fortran compilers.I want your help to choose between MPI and > PVM > Thanks in Advance > Stavros > -- Michael Will, Linux Sales Engineer NEWS: We have moved to a larger iceberg :-) NEWS: 300 California St., San Francisco, CA. Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com From kus at free.net Mon Jan 3 10:19:48 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 03 Jan 2005 21:19:48 +0300 Subject: [Beowulf] grid In-Reply-To: <010501c4f18a$9742e120$0f120897@PMORND> Message-ID: In message from "Rajiv" (Mon, 3 Jan 2005 17:21:41 +0530): >Dear All, > 1. I would like to implement grid computing in Linux. Any >suggestions on what opensource package to use. > 2. Is there any free grid projects in Windows 1) There is not only one "kind" of GRID, for example computational (for example, distribution of batch jobs between batch job systems or "data" grid for data-intensive tasks) Example of relative universal opensource solution - Globus (www.globus.org) One example of solution for batch queues is Sun Grid Engine (see on Sun web-sites). 2) Pls take into account, that it's not the mail list which is responsible for GRID itself. 3) And what is about Windows, I'm not sure that using this OS corresponds to beowulf idea itself, so pls look on sites cited above. Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > >Regards, >Rajiv From lindahl at pathscale.com Mon Jan 3 11:12:45 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Mon, 3 Jan 2005 11:12:45 -0800 Subject: [Beowulf] Re: MPI Implementations for SMP use In-Reply-To: <41C610B0.7070603@uiuc.edu> References: <200412191956.iBJJu25v009271@bluewest.scyld.com> <41C610B0.7070603@uiuc.edu> Message-ID: <20050103191245.GA1734@greglaptop.internal.keyresearch.com> On Sun, Dec 19, 2004 at 05:37:20PM -0600, Isaac Dooley wrote: > Not necessarily. Charm++ uses an abstraction that does not concern the > programmer with the location/node of a given object. That's the problem. I should really use "explicit locality" instead of "perfect locality" -- the perfect was referring to the good performance implications of the programmer never screwing up because he doesn't realize an object isn't local. Explicit locality is harder to use, but helps scaling/performance. BTW, I work with one of the charm++/AMPI authors, so I think I'm pretty familiar with it ;-) -- greg From lusk at mcs.anl.gov Mon Jan 3 11:00:31 2005 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Mon, 03 Jan 2005 13:00:31 -0600 (CST) Subject: [Beowulf] MPI vs PVM In-Reply-To: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> References: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> Message-ID: <20050103.130031.59656021.lusk@localhost> You might also consider "Why are PVM and MPI so Different?" by Bill Gropp and me in the proceedings of the 4th European PVM/MPI Users' Group Meeting, Springer Lecture Notes in Computer Science 1332, 1997, pp. 3--10. In that paper we tried to focus on the sources of the implementation-independent semantic differences. Rusty From rgb at phy.duke.edu Mon Jan 3 12:32:00 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 3 Jan 2005 15:32:00 -0500 (EST) Subject: [Beowulf] MPI vs PVM In-Reply-To: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> References: <000d01c4f1b6$b941b6d0$63ae8c96@zogas> Message-ID: On Mon, 3 Jan 2005, Stavros E. 
Zogas wrote: > Hi at all > I want to set up a beowulf cluster for scientific purposes (In a university) > using mainly fortran compilers.I want your help to choose between MPI and > PVM > Thanks in Advance > Stavros This is an interesting choice. MPI is likely more "portable" as it was originally designed to be a common interface to big iron supercomputers and is still a nearly universal parallel API for supercomputers of all sorts, including beowulfish clusters. MPI typically has support for native hardware drivers for advanced networks as well, although this may depend somewhate on the particular flavor of MPI you select (there are several available). PVM, OTOH, was from the beginning an open source project for building possibly heterogeneous parallel HPC clusters. There are (IIRC) some drivers available for advanced networks, but support for this sort of thing is a bit spottier. If you've been doing cluster supercomputing for more than a decade (noting that this means from before the "beowulf" project itself got started, or at least was publicized) then chances are pretty decent that you got started with PVM and still prefer it overall today (according to my own VERY informal and anecdotal poll on the issue). If your plan is to use ethernet as a network and to write just one or two simple applications, PVM is a pretty reasonable choice. I provide a PVM project template master/slave application, ready to build, on my personal website here: http://www.phy.duke.edu/~rgb/General/project_pvm.php However, it is in C. However, the fortran interface is very similar, I believe (I haven't willingly used fortran for nearly 20 years at this point) and you could probably hack the template to make it a fortran template with a bit of effort. If you plan to use an advanced network, now or later, I'd have to suggest that you use MPI. It is also worth pointing out that there is a regular MPI column in Cluster World Magazine, while PVM only gets occassional attention from folks like me there (I wrote the template for a CWM column). However, I don't recall that it focusses on Fortran. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rgb at phy.duke.edu Mon Jan 3 12:39:47 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 3 Jan 2005 15:39:47 -0500 (EST) Subject: [Beowulf] grid In-Reply-To: <010501c4f18a$9742e120$0f120897@PMORND> References: <010501c4f18a$9742e120$0f120897@PMORND> Message-ID: On Mon, 3 Jan 2005, Rajiv wrote: > Dear All, > 1. I would like to implement grid computing in Linux. Any suggestions > on what opensource package to use. SGE is probably the basic tool of choice, although Open Mosix is another possibility. There are also more advanced and complex tools under development for use in large scale grids shared accross domain boundaries that some googling will doubtless turn up. These do things like manage access and authentication over a WAN. > 2. Is there any free grid projects in Windows Don't know. Don't really care, truthfully. It is probable that some of the same open source tools that work under linux have Windows ports. It is equally probable that they won't work as well there, will have bugs, will be "expensive" by the time you finish paying for compilers and libraries, and -- what's the point? 
You can easily set up a WinXX node with a PXE ethernet interface to dual boot or diskless boot linux (assuming that you ask because you have access to a small pile of windows machines that you can use but cannot completely reinstall as linux). If they have 3-6 GB of free disk, installing dual boot likely makes sense. If they have no free disk to speak of but have adequate physical memory (probably at least 256 MB of main memory) setting them up to run a diskless linux makes sense -- just boot them into linux from the PXE interface, use them as a cluster, and reboot them into WinXX when done, without ever touching the disk or windows installation. rgb > > Regards, > Rajiv -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From hahn at physics.mcmaster.ca Tue Jan 4 07:20:44 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 4 Jan 2005 10:20:44 -0500 (EST) Subject: [Beowulf] MPI vs PVM In-Reply-To: <20050103.130031.59656021.lusk@localhost> Message-ID: > You might also consider "Why are PVM and MPI so Different?" by Bill > Gropp and me in the proceedings of the 4th European PVM/MPI Users' Group > Meeting, Springer Lecture Notes in Computer Science 1332, 1997, > pp. 3--10. In that paper we tried to focus on the sources of the > implementation-independent semantic differences. that paper and other really good stuff is here: http://www-unix.mcs.anl.gov/mpi/papers/archive/ From srgadmin at cs.hku.hk Mon Jan 3 20:45:33 2005 From: srgadmin at cs.hku.hk (srg-admin) Date: Tue, 04 Jan 2005 12:45:33 +0800 Subject: [Beowulf] ICPP 2005 CFP: Deadline Extended to 1/10/05 Message-ID: <41DA1F6D.4020203@cs.hku.hk> ************************************************************** ** 1 Week Extension * 1 Week Extension * 1 Week Extension * * ************************************************************** CALL FOR PAPERS - 34th Annual Conference - 2005 International Conference on Parallel Processing (ICPP 2005) http://www.dnd.no/icpp2005 Georg Sverdrups House, Univ. of Oslo, Norway June 14-17, 2005 Sponsored by The International Association for Computers and Communications (IACC) Simula Research Laboratory Norwegian Computer Society In cooperation with The University of Oslo, Norway The Ohio State University, USA Scope --------- The conference provides a forum for engineers and scientists in academia, industry and government to present their latest research findings in any aspects of parallel and distributed computing. Topics of interest include, but are not limited to: * Architecture * Petaflop Computing * Algorithms & Applications * Programming Methodologies * Cluster Computing * Tools * Compilers and Languages * Parallel Embedded Systems * OS and Resource Management * Multimedia & Network Services * Network-Based/Grid Computing * Wireless & Mobile Computing Paper Submission ----------------------- Form of Manuscript: Not to exceed 20 double-spaced, 8.5 x 11-inch pages (including figures, tables and references) in 10-12 point font. Number each page. Include an abstract, five to ten keywords, the technical area(s) most relevant to your paper, and the corresponding author's e-mail address. Electronic Submission: Web-based submissions only. Please see the conference web page for details. Dates and Deadlines ---------------------------- Submission Deadline January 10, 2005, 9pm EST (Final Deadline Extension!) 
Author Notification March 21, 2005 Final Manuscript Due April 11, 2005 Workshops will be held June 14 and 17, 2005 Proceedings of the conference and workshops will be available at the conference and will be published by the IEEE Computer Society For Further Information Please contact - Professor Olav Lysne, Simula Research Laboratory, olavly at simula.no - Professor Lionel Ni, Hong Kong U. of Sci. & Tech., ni at cs.ust.hk - Professor Jose Duato, U. Politecnica de Valenica, jduato at disca.upv.es - Dr. Wu-chun Feng, Los Alamos National Laboratory, feng at lanl.gov From noel.das at gmail.com Tue Jan 4 02:58:47 2005 From: noel.das at gmail.com (Noel Tanmoy Das) Date: Tue, 4 Jan 2005 16:58:47 +0600 Subject: [Beowulf] search engine Message-ID: <9ebf30c10501040258279c4d11@mail.gmail.com> how can i build a search engine (e.g. something like google) in a beowulf cluster? help wanted. -- noeL From john.hearns at streamline-computing.com Tue Jan 4 02:03:03 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Tue, 04 Jan 2005 10:03:03 +0000 Subject: [Beowulf] grid In-Reply-To: <010501c4f18a$9742e120$0f120897@PMORND> References: <010501c4f18a$9742e120$0f120897@PMORND> Message-ID: <1104832983.8064.6.camel@Vigor45> On Mon, 2005-01-03 at 17:21 +0530, Rajiv wrote: > Dear All, > 2. Is there any free grid projects in Windows Others on the list have recommended looking at Sun Gridengine. There are Windows execution host clients in the future for Gridengine (to make that clear, as far as I know there will be Windows clients, so you can run Windows machines as part of the cluster. The SGE master machine is Solaris/Unix/Linux ) From laytonjb at charter.net Tue Jan 4 07:56:10 2005 From: laytonjb at charter.net (Jeffrey B. Layton) Date: Tue, 04 Jan 2005 10:56:10 -0500 Subject: [Beowulf] Scalapack with Pathscale compiler Message-ID: <41DABC9A.9020903@charter.net> Good morning, I'm looking for a nice SLmake.inc file for building Scalapack using the Pathscale compilers. Does anyone have one they can share? (I've googled around and can't seem to find one). TIA! Jeff From dag at sonsorol.org Tue Jan 4 07:56:51 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Tue, 04 Jan 2005 10:56:51 -0500 Subject: [Beowulf] grid In-Reply-To: <1104832983.8064.6.camel@Vigor45> References: <010501c4f18a$9742e120$0f120897@PMORND> <1104832983.8064.6.camel@Vigor45> Message-ID: <41DABCC3.8010005@sonsorol.org> Just a clarification ... The Windows execution clients for Grid Engine are not going to be part of the "free" Grid Engine software stack. They will come bundled with the commercially licensed version of SGE sold by Sun Microsystems (they call the product "N1 Grid Engine 6". The last unofficial word I heard on the state of this effort was "end of 2004" -- no word yet on when in '05 we'll see anything. Sun started this trend when SGE went to version 6.0 -- in the current version of "N1 Grid Engine 6" (aka "SGE 6.0u1") there are additional accounting and reporting tools not found in the open source product. I'm pleased that they are doing it this way -- the core Grid Engine codebase remains free and of amazing quality and Sun gets to make some money by selling add-on modules delivering extra functionality presumably to the enterprise folks who need the layered products. -Chris John Hearns wrote: > On Mon, 2005-01-03 at 17:21 +0530, Rajiv wrote: > >>Dear All, >> 2. Is there any free grid projects in Windows > > Others on the list have recommended looking at Sun Gridengine. 
> > There are Windows execution host clients in the future > for Gridengine (to make that clear, as far as I know there will be > Windows clients, so you can run Windows machines as part of the cluster. > The SGE master machine is Solaris/Unix/Linux ) -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net From landman at scalableinformatics.com Tue Jan 4 08:40:53 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 4 Jan 2005 11:40:53 -0500 Subject: [Beowulf] Scalapack with Pathscale compiler In-Reply-To: <41DABC9A.9020903@charter.net> References: <41DABC9A.9020903@charter.net> Message-ID: <20050104164024.M30777@scalableinformatics.com> Hi Jeff: Yes, stashed somewhere. I'll get it over to you later tonight/tomorrow. joe On Tue, 04 Jan 2005 10:56:10 -0500, Jeffrey B. Layton wrote > Good morning, > > I'm looking for a nice SLmake.inc file for building > Scalapack using the Pathscale compilers. Does anyone > have one they can share? (I've googled around and can't > seem to find one). > > TIA! > > Jeff > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://scalableinformatics.com phone: +1 734 612 4615 From rgb at phy.duke.edu Tue Jan 4 10:01:04 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 4 Jan 2005 13:01:04 -0500 (EST) Subject: [Beowulf] MPI vs PVM In-Reply-To: References: Message-ID: On Tue, 4 Jan 2005, Mark Hahn wrote: > > You might also consider "Why are PVM and MPI so Different?" by Bill > > Gropp and me in the proceedings of the 4th European PVM/MPI Users' Group > > Meeting, Springer Lecture Notes in Computer Science 1332, 1997, > > pp. 3--10. In that paper we tried to focus on the sources of the > > implementation-independent semantic differences. > > that paper and other really good stuff is here: > http://www-unix.mcs.anl.gov/mpi/papers/archive/ While we're on it, there is also: http://www.csm.ornl.gov/pvm/PVMvsMPI.ps from back in 1996. The features of the two haven't really changed since this comparison was written, EXCEPT that vendor support in the form of non-TCP hardware drivers for the major low latency interconnects has perhaps been directed somewhat more at MPI than at PVM although one could argue that this is also a decision of the PVM maintainers. Still, even this isn't all that big a differentiator. There is PVM-GM (for myrinet). SCI (Dolphinics) can also apparently be embedded into PVM, where TCP is used to set up an SCI-based memory map to perform the actual communication. This is described in a nice (and fairly current) white paper on the dolphinics site that reiterates more of the same stuff that I've been saying. That is, PVM is generally preferred by Old Guys Like Me (who started with PVM), people who run on a heterogeneous cluster, and "programmers who prefer its dynamic features". 
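Those "dynamic features" are mostly about growing a job and the virtual machine at run time: a task enrols with pvm_mytid(), a master spawns slaves with pvm_spawn(), and hosts can be added to or dropped from the running virtual machine from the pvm console. A minimal master/slave sketch in C -- an illustration only, not the project_pvm template linked earlier -- assuming PVM3 is installed and the compiled binary is placed where pvm_spawn() can find it (e.g. $HOME/pvm3/bin/$PVM_ARCH under the name "pvmdemo" used below):

/* Minimal PVM master/slave sketch.  The same binary acts as master when
 * started by hand (no PVM parent) and as slave when spawned. */
#include <stdio.h>
#include <pvm3.h>

#define NSLAVES    4
#define TAG_WORK   1
#define TAG_RESULT 2

int main(void)
{
    int mytid = pvm_mytid();                /* enrol this process in PVM */
    int i, n, work, result, who;
    int tids[NSLAVES];

    if (pvm_parent() == PvmNoParent) {      /* no parent => we are the master */
        n = pvm_spawn("pvmdemo", (char **)0, PvmTaskDefault, "", NSLAVES, tids);

        for (i = 0; i < n; i++) {           /* hand each slave one integer of "work" */
            work = 100 * (i + 1);
            pvm_initsend(PvmDataDefault);
            pvm_pkint(&work, 1, 1);
            pvm_send(tids[i], TAG_WORK);
        }
        for (i = 0; i < n; i++) {           /* collect results in whatever order they arrive */
            pvm_recv(-1, TAG_RESULT);
            pvm_upkint(&result, 1, 1);
            pvm_upkint(&who, 1, 1);
            printf("slave tid %d returned %d\n", who, result);
        }
    } else {                                /* we were spawned => act as a slave */
        pvm_recv(pvm_parent(), TAG_WORK);
        pvm_upkint(&work, 1, 1);
        result = work * work;               /* stand-in for the real computation */
        who = mytid;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&result, 1, 1);
        pvm_pkint(&who, 1, 1);
        pvm_send(pvm_parent(), TAG_RESULT);
    }

    pvm_exit();                             /* leave the virtual machine cleanly */
    return 0;
}

Something like "cc -I$PVM_ROOT/include -o $HOME/pvm3/bin/$PVM_ARCH/pvmdemo pvmdemo.c -lpvm3" builds and installs it, and it can then be started from the shell or the pvm console. The Fortran interface (pvmfmytid, pvmfspawn, pvmfsend and friends) maps onto the same calls nearly one for one, which is what is meant above by "very similar".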
MPI is preferred by people who want the best possible performance (maybe -- I haven't seen a lot of data to support this but am willing to believe it) especially on a homogeneous cluster, people who have code they want to be able to port to a "real supercomputer" (which generally run MPI as noted in the document above which describes their mutual history but which probably WON'T run PVM), and people who started out with IT. As in, whatever you take the time to learn first will likely become your favorite -- both work pretty well and have the essential features required to write efficient message-passing parallel code. At one time there was even talk of merging the two; its sort of a shame that this never was pursued. In fact, here is a lovely project for some bright CPS student (or professor, looking for a class assignment) who might be reading this list. Write (or have your students write) a PVM "emulator" on top of MPI or better, an MPI emulator on top of PVM. Or get REALLY ambitious, and separate the interconnect layer, the message passing layer, and the actual communications calls (wrappers) so that a single tool provides either the PVM API or MPI API. Speaking for myself, I'd like the merged tool to have a PVM-like control interface for managing the "virtual cluster" -- it is one of the things that I think is one of those "dynamic" features people prefer, especially when one learns to use the built in diagnostics. I'd like to have MPI's back-end communications stack, including all the native drivers for low-latency high-bandwidth interconnects. I'd like things like broadcast and multicast (one to many, many to many) communications to be transparently and EFFICIENTLY implemented -- to really exploit the hardware features and/or topology of the cluster without needing to really understand exactly how they work. I'd like a new set of "ps"-like commands for monitoring cluster tasks running under the PVMPI tool, so one doesn't have to run a bproc-like kernelspace tool to be able to simply monitor task execution. I'd like the tool to be >>secure<< and not a bleeding wound on any LAN or WAN virtual cluster, whether or not it is firewalled. And I'd REALLY like a very few features to be added to support making parallel tasks robust -- not necessarily checkpointing, but perhaps a "select"-like facility that can take certain actions if a node dies, including automagically re-adding it (post reboot, for example) and restarting a task on it After all, message passing is message passing. One establishes a "socket" between tasks running on different hosts. One writes to the socket, reads from the socket (where the socket may be "virtual" through e.g. SCI mapped memory). One maintains a table of state information. Everything else is fancy frills and clever calls. Raw sockets (or mapped memory) through MPI or PVM or PVMPI is just a matter of how one wraps all this up and hides it behind an API. Alas, after a bit of preliminary work back in 1996, e.g.: http://www.netlib.org/utk/papers/pvmpi/paper.html Fagg and Dongarra seem to have let this particular project slip. A shame... rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rgb at phy.duke.edu Tue Jan 4 10:14:42 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Tue, 4 Jan 2005 13:14:42 -0500 (EST) Subject: [Beowulf] search engine In-Reply-To: <9ebf30c10501040258279c4d11@mail.gmail.com> References: <9ebf30c10501040258279c4d11@mail.gmail.com> Message-ID: On Tue, 4 Jan 2005, Noel Tanmoy Das wrote: > how can i build a search engine (e.g. something like google) in a > beowulf cluster? help wanted. Wrong cluster type. This is called a "high availability" type cluster, although it certainly shares a lot of features with beowulf or HPC clusters. There are several answers possible here. One is to contact google and buy/rent their engine. It is a very, very good one and for a professional enterprise project that requires an internal/private search engine well worth the cost. A second one (if all you want to do is let people search for stuff you have up on a big website) is to use google for free -- it is fairly trivial to add a google box to any web page. If you want to WRITE an open-source search engine to e.g. COMPETE with google -- well, using google with something like "search engine open source" as the string turns up a list of free and open source tools at e.g. http://www.searchtools.com/tools/tools-opensource.html. I'd look over these projects, pick the best one that has the most active group working on it, and join the project rather than starting your own from scratch. It is very likely that one or more of the projects listed on this page already run on a cluster of some sort, as building and searching a very, very large database is a task with lots of natural parallelism. It is also very nontrivial -- I couldn't begin to tell you exactly how it all works as I don't know. To me google is just plain black magic -- it seems to crossreference EVERYTHING on the web all the way down to fairly deep embedded text (at a guess, well over a petabyte of distributed data) and still returns hits on most searches in a matter of seconds, no matter what the search string and no matter when you use it. It's like a tiny piece of the mind of God... or if you prefer a less blasphemous metaphor derived from "The Lucifer Principle", it is the memory function of the extended neural network that forms the superorganism known as "The Web", where we, and the websites we contribute and maintain, are the neurons themselves. If the human race has a developing collective intelligence, this is a core piece of it. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From mprinkey at aeolusresearch.com Tue Jan 4 08:31:01 2005 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Tue, 4 Jan 2005 11:31:01 -0500 (EST) Subject: [Beowulf] OpenFOAM Message-ID: Hi everyone, I don't think this has been mentioned here yet. Many of you are CFD users so it may be of interest. FOAM, an object-oriented CFD toolkit, has been recently released under the GPL. It is parallel using MPI (lam) and has a lot of useful CFD features (unstructured mesh, lots of physical models). Perhaps most interesting is its very flexible and concise mechanisms for defining new phyiscal laws and algorithms. The released version includes modules to do LES, multiphase flow, FEM of solids, electromagnetism, and even financial modeling. http://www.openfoam.org I am interested if anyone has experience with OpenFOAM or is currently planning to make use of it. 
Thanks, Mike Prinkey From eugen at leitl.org Wed Jan 5 12:12:27 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 5 Jan 2005 21:12:27 +0100 Subject: [Beowulf] Re: [Bioclusters] FW: cluster newbie (fwd from dag@sonsorol.org) Message-ID: <20050105201227.GH9221@leitl.org> ----- Forwarded message from Chris Dagdigian ----- From: Chris Dagdigian Date: Wed, 05 Jan 2005 14:43:55 -0500 To: "Clustering, compute farming & distributed computing in life science informatics" Subject: Re: [Bioclusters] FW: cluster newbie Organization: Bioteam Inc. User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.5) Gecko/20041217 Reply-To: "Clustering, compute farming & distributed computing in life science informatics" Hi Nick, I've written about this in the past; you can find stuff in the list archives. Some of the articles and presentations I've done on "bioclusters" are linked off this URL: http://bioteam.net/dag/ - something there may be of use. In general I'd go with BioBrew or the ROCKS cluster kit ("roll") that comes with Grid Engine as the scheduler if you are starting out. If you want to roll your own cluster to have maximum flexibility and really learn the behind the scenes stuff just pick the "best" (or your favorite) componants to match the following requirements: 1. A Linux distribution 2. A Resource manager & scheduler software (Grid Engine, etc.) 3. Software for doing unattended "bare metal" installs and incremental updates (SystemImager, etc.) 4. Management & monitoring packages (ganglia, nagios, bigbrother, etc.) Take what you chose for #1-4 and put all the compute nodes on a private gigabit ethernet switch. There should be a common NFS share for user home directories and maybe for the scheduler system. Pick one node to be the "master" node and connect one of its NICs to the "private" cluster network and the 2nd NIC to the company/department network. This way you and your users only have to login to and deal with the one single "master" node. -Chris Nick D'Angelo wrote: > >All, > >I am sure this has been asked many times before, but what is the preferred >method or perhaps 'best' method of clustering a few 3-5 RedHat or other >Linux flavours to best suit our Bioinfo R and D group? > >I have come across biobrew with their own cd distribution install and also >this group. > >I was going to originally look at Fedora core 2, but that appeared to be >painful due to the kernel re-compile and to be honest, the documentation >appeared to be quite poor, at least what I found. > > Any suggestions? > > Thanks, -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net _______________________________________________ Bioclusters maillist - Bioclusters at bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From eugen at leitl.org Wed Jan 5 13:37:15 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 5 Jan 2005 22:37:15 +0100 Subject: [Beowulf] Re: [Bioclusters] FW: cluster newbie (fwd from gotero@linuxprophet.com) Message-ID: <20050105213715.GK9221@leitl.org> ----- Forwarded message from Glen Otero ----- From: Glen Otero Date: Wed, 5 Jan 2005 13:13:56 -0800 To: "Clustering, compute farming & distributed computing in life science informatics" Subject: Re: [Bioclusters] FW: cluster newbie X-Mailer: Apple Mail (2.619) Reply-To: "Clustering, compute farming & distributed computing in life science informatics" ROCKS doesn't include BLAST, EMBOSS, or ClustalW, but BioBrew does. BioBrew is based on ROCKS, and therefore has the same installation procedure (minus all the CD swapping). A new release of BioBrew that includes recent versions of BLAST, mpiBLAST, EMBOSS, and ClustalW is very near. Installation procedures won't change between the current version and the upcoming release. Glen On Jan 5, 2005, at 12:21 PM, Nick D'Angelo wrote: > After a quick glance at the install process, it looks very slick > indeed. > > However, is the Blast, EMBOSS and ClustalW included or does it need > to be > bundled in? > > > > -----Original Message----- > From: Matt Harrington [mailto:matt at msg.ucsf.edu] > Sent: Wednesday, January 05, 2005 3:07 PM > To: Clustering, compute farming & distributed computing in life > science > informatics > Subject: Re: [Bioclusters] FW: cluster newbie > > > > > I highly recommend ROCKS: > > http://www.rocksclusters.org > > i even use it for non-clustered compute nodes. i've simplified my > Linux > life > around ROCKS for compute servers and Suse for graphics workstations. > > ---Matt > > _______________________________________________ > Bioclusters maillist - Bioclusters at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/bioclusters > _______________________________________________ > Bioclusters maillist - Bioclusters at bioinformatics.org > https://bioinformatics.org/mailman/listinfo/bioclusters > > > Glen Otero Ph.D. Linux Prophet _______________________________________________ Bioclusters maillist - Bioclusters at bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From jerry at oban.biosc.lsu.edu Thu Jan 6 12:33:39 2005 From: jerry at oban.biosc.lsu.edu (Jerry Xu) Date: 06 Jan 2005 14:33:39 -0600 Subject: [Beowulf] the solution for qdel fail..... Message-ID: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> Hey, Huang: I found one solution that works for me, maybe you can try it and see whether it works for you. in your pbs script, try to add this "kill -gm 5" syntax between the processor number and your program like this mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram it works for me. Jerry. /********************************************************** Hi, We have a new system set up. The vendor set up the PBS for us. 
For administration reasons, we created a new queue "dque" (set to default) using the "qmgr" command: create queue dque queue_type=e s q dqueue enabled=true, started=true I was able to submit jobs using the "qsub" command to queue "dque". However, when I use "qdel" to kill a job, the job disappears from the job list shown by "qstat -a", but the executable is still running on the compute nodes. Every time I have to login the corresponding the compute node and kill the running job. I am wondering if I missed something in setting up the queue so that I am unable to kill the job completely using "qdel". Thanks. From wscullin at cct.lsu.edu Thu Jan 6 15:56:37 2005 From: wscullin at cct.lsu.edu (William Scullin) Date: Thu, 06 Jan 2005 17:56:37 -0600 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> Message-ID: <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> Howdy, The --gm-kill is specific to clusters using myrinet and mostly is there to ensure that slave processes using myrinet's mpi hang up when the master process is done running. The number after the --gm-kill is the timeout in seconds. I am not sure which version, type, or member of the PBS family you are using. If you are using PBS Pro (also probably true for torque and Open PBS), you should be able to place two scripts in /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute node. They must be owned by root and be executable / readable / writable only by root. The prologue script will run before every job and the epilogue script will run after every job. In the epilogue and prologue scripts we use, we clean the nodes of all lingering user processes and do some basic checking of node health. Even if an epilogue script misses a process ??? or a user a user launches a process outside of the queuing system ??? the prologue will still catch it before the next job starts to run. Best, William On Thu, 2005-01-06 at 14:33, Jerry Xu wrote: > Hey, Huang: > > I found one solution that works for me, maybe you can try it and see > whether it works for you. > > in your pbs script, try to add this "kill -gm 5" syntax between the > processor number and your program > > like this > > mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram > > it works for me. > > Jerry. > > /********************************************************** > Hi, > > We have a new system set up. The vendor set up the PBS for us. For > administration reasons, we created a new queue "dque" (set to default) > using the "qmgr" command: > > create queue dque queue_type=e > s q dqueue enabled=true, started=true > > I was able to submit jobs using the "qsub" command to queue "dque". > However, when I use "qdel" to kill a job, the job disappears from the > job list shown by "qstat -a", but the executable is still running on > the compute nodes. Every time I have to login the corresponding the > compute node and kill the running job. > > I am wondering if I missed something in setting up the queue so that I > am unable to kill the job completely using "qdel". > > Thanks. 
> > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ William Scullin System Administrator Center for Computation and Technology 342 Johnston Hall Louisiana State University Baton Rouge, Louisiana 70803 voice: 225 578 6888 fax: 225 578 5362 aim: WilliamAtLSU ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From beaneg at umcs.maine.edu Thu Jan 6 16:31:10 2005 From: beaneg at umcs.maine.edu (Glen Beane) Date: Thu, 06 Jan 2005 19:31:10 -0500 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> Message-ID: <41DDD84E.60800@umcs.maine.edu> a definite solution is to use mpiexec from www.osc.edu/~pw/mpiexec instead of mpirun. This mpiexec is a tm based replacement for mpirun (tm is the PBS task-manager protocol). When tm is used to spawn all the processes instead of ssh/rsh PBS is then aware of all the process that belong to the job and therefore it will properly kill them all in the event of a qdel or hitting the walltime limit. When you use mpirun PBS is only aware of the initial mpirun process since it does not spawn any of the other processes. Glen Beane Advanced Computing Research Lab University of Maine Jerry Xu wrote: > Hey, Huang: > > I found one solution that works for me, maybe you can try it and see > whether it works for you. > > in your pbs script, try to add this "kill -gm 5" syntax between the > processor number and your program > > like this > > mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram > > it works for me. > > Jerry. > > /********************************************************** > Hi, > > We have a new system set up. The vendor set up the PBS for us. For > administration reasons, we created a new queue "dque" (set to default) > using the "qmgr" command: > > create queue dque queue_type=e > s q dqueue enabled=true, started=true > > I was able to submit jobs using the "qsub" command to queue "dque". > However, when I use "qdel" to kill a job, the job disappears from the > job list shown by "qstat -a", but the executable is still running on > the compute nodes. Every time I have to login the corresponding the > compute node and kill the running job. > > I am wondering if I missed something in setting up the queue so that I > am unable to kill the job completely using "qdel". > > Thanks. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From fly at anydata.co.uk Thu Jan 6 16:36:40 2005 From: fly at anydata.co.uk (Fred L Youhanaie) Date: Fri, 07 Jan 2005 00:36:40 +0000 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> Message-ID: <41DDD998.40800@anydata.co.uk> Hi, Since you are using PBS you may want to consider mpiexec, http://www.osc.edu/~pw/mpiexec/, it is basically a replacement for mpirun, but with tight integration with PBS, so once you issue qdel for a job it will kill all the subtasks on the remote nodes. It will also do a better resource accounting, e.g. 
total cpu used by all nodes, and will eliminate the need for ssh/rsh :) Even if you are not using mpi, you can spawn multiple instances of a program with the '--comm=none' option. Cheers f. William Scullin wrote: > Howdy, > > The --gm-kill is specific to clusters using myrinet and mostly is there > to ensure that slave processes using myrinet's mpi hang up when the > master process is done running. The number after the --gm-kill is the > timeout in seconds. > > I am not sure which version, type, or member of the PBS family you are > using. If you are using PBS Pro (also probably true for torque and Open > PBS), you should be able to place two scripts in > /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute > node. They must be owned by root and be executable / readable / writable > only by root. The prologue script will run before every job and the > epilogue script will run after every job. In the epilogue and prologue > scripts we use, we clean the nodes of all lingering user processes and > do some basic checking of node health. > > Even if an epilogue script misses a process ??? or a user a user launches > a process outside of the queuing system ??? the prologue will still catch > it before the next job starts to run. > > Best, > William > > On Thu, 2005-01-06 at 14:33, Jerry Xu wrote: > >>Hey, Huang: >> >> I found one solution that works for me, maybe you can try it and see >>whether it works for you. >> >>in your pbs script, try to add this "kill -gm 5" syntax between the >>processor number and your program >> >>like this >> >>mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram >> >>it works for me. >> >>Jerry. >> >>/********************************************************** >>Hi, >> >>We have a new system set up. The vendor set up the PBS for us. For >>administration reasons, we created a new queue "dque" (set to default) >>using the "qmgr" command: >> >>create queue dque queue_type=e >>s q dqueue enabled=true, started=true >> >>I was able to submit jobs using the "qsub" command to queue "dque". >>However, when I use "qdel" to kill a job, the job disappears from the >>job list shown by "qstat -a", but the executable is still running on >>the compute nodes. Every time I have to login the corresponding the >>compute node and kill the running job. >> >>I am wondering if I missed something in setting up the queue so that I >>am unable to kill the job completely using "qdel". >> >>Thanks. From jrajiv at hclinsys.com Fri Jan 7 04:25:15 2005 From: jrajiv at hclinsys.com (Rajiv) Date: Fri, 7 Jan 2005 17:55:15 +0530 Subject: [Beowulf] Grid Documents Message-ID: <01a501c4f4b3$f1900f50$0f120897@PMORND> Dear All, I am confused with Grid and HPC. Could you provide some good document sites on grid computing which would help me undestand grid better. Regards, Rajiv -------------- next part -------------- An HTML attachment was scrubbed... URL: From rynge at isi.edu Fri Jan 7 10:44:32 2005 From: rynge at isi.edu (Mats Rynge) Date: Fri, 7 Jan 2005 10:44:32 -0800 Subject: [Beowulf] Grid Documents In-Reply-To: <01a501c4f4b3$f1900f50$0f120897@PMORND> References: <01a501c4f4b3$f1900f50$0f120897@PMORND> Message-ID: <20050107184432.GA858@isi.edu> * Rajiv [2005-01-07 17:55:15 +0530]: > I am confused with Grid and HPC. Could you provide some good document > sites on grid computing which would help me undestand grid better. "The Anatomy of the Grid" http://www.globus.org/research/papers/anatomy.pdf "What is the Grid? 
A Three Point Checklist" http://www-fp.mcs.anl.gov/~foster/Articles/WhatIsTheGrid.pdf Grid is a very hyped up word right now, and you will find a lot of other definitions (pretty much depending on what company/product is being sold). -- Mats Rynge USC/Information Sciences Institute - Center for Grid Technologies From jerry at oban.biosc.lsu.edu Mon Jan 10 07:49:12 2005 From: jerry at oban.biosc.lsu.edu (Jerry Xu) Date: 10 Jan 2005 09:49:12 -0600 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> Message-ID: <1105372151.15841.5.camel@strathmill.biosc.lsu.edu> Hi, William, Thank for your information. Just in case somebody still need it for openPBS configuration, here is my epilogue file.it shall be located in $pbshome/mom_priv/ for each node and it need to be set as executable and owned by root. Some others many have better epilogue scripts... /*****************************************************/ echo '------------clean up------------' echo running pbs epilogue script # set key variables USER=$2 NODEFILE=/var/spool/pbs/aux/$1 echo echo killing processes of user $USER on the batch nodes for node in `cat $NODEFILE` do echo Doing node $node su $USER -c "ssh $node skill -KILL -u $USER" done echo Done /****************************************************/ On Thu, 2005-01-06 at 17:56, William Scullin wrote: > Howdy, > > The --gm-kill is specific to clusters using myrinet and mostly is there > to ensure that slave processes using myrinet's mpi hang up when the > master process is done running. The number after the --gm-kill is the > timeout in seconds. > > I am not sure which version, type, or member of the PBS family you are > using. If you are using PBS Pro (also probably true for torque and Open > PBS), you should be able to place two scripts in > /var/spool/PBS/mom_priv/ called prologue and epilogue on every compute > node. They must be owned by root and be executable / readable / writable > only by root. The prologue script will run before every job and the > epilogue script will run after every job. In the epilogue and prologue > scripts we use, we clean the nodes of all lingering user processes and > do some basic checking of node health. > > Even if an epilogue script misses a process ? or a user a user launches > a process outside of the queuing system ? the prologue will still catch > it before the next job starts to run. > > Best, > William > > On Thu, 2005-01-06 at 14:33, Jerry Xu wrote: > > Hey, Huang: > > > > I found one solution that works for me, maybe you can try it and see > > whether it works for you. > > > > in your pbs script, try to add this "kill -gm 5" syntax between the > > processor number and your program > > > > like this > > > > mpirun -machinefile $PBS_NODEFILE -np $NPROCS --gm-kill 5 myprogram > > > > it works for me. > > > > Jerry. > > > > /********************************************************** > > Hi, > > > > We have a new system set up. The vendor set up the PBS for us. For > > administration reasons, we created a new queue "dque" (set to default) > > using the "qmgr" command: > > > > create queue dque queue_type=e > > s q dqueue enabled=true, started=true > > > > I was able to submit jobs using the "qsub" command to queue "dque". 
> > However, when I use "qdel" to kill a job, the job disappears from the > > job list shown by "qstat -a", but the executable is still running on > > the compute nodes. Every time I have to login the corresponding the > > compute node and kill the running job. > > > > I am wondering if I missed something in setting up the queue so that I > > am unable to kill the job completely using "qdel". > > > > Thanks. > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > William Scullin > System Administrator > Center for Computation and Technology > 342 Johnston Hall > Louisiana State University > Baton Rouge, Louisiana 70803 > voice: 225 578 6888 > fax: 225 578 5362 > aim: WilliamAtLSU > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From hahn at physics.mcmaster.ca Tue Jan 11 20:03:26 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 11 Jan 2005 23:03:26 -0500 (EST) Subject: [Beowulf] OK, I'll be first: mac mini Message-ID: anyone working on assembling a large cluster of Mac mini's? I hafta believe that some of you bio or montecarlo types are thinking about it. sure, the box only has a G4, and only 100bT, but it's small, fairly cheap and gosh-darn cute! fairly low power, and the input might be DC (gangable?). if not a compute cluster, how about a video wall? radeon 9200 is nothing to sneeze at (faster than my desktops), and one of these would look pretty cute sitting under a mini-DLP projector. I wonder if there are embarassing-enough parallel applications where even the annoying cat5 could be dispensed with (just opt for the internal 54 Mbps wireless card). regards, mark hahn. From dtj at uberh4x0r.org Tue Jan 11 20:49:44 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Tue, 11 Jan 2005 22:49:44 -0600 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: References: Message-ID: <1105505384.17557.48.camel@terra> On Tue, 2005-01-11 at 22:03, Mark Hahn wrote: > anyone working on assembling a large cluster of Mac mini's? > > I hafta believe that some of you bio or montecarlo types are > thinking about it. sure, the box only has a G4, and only 100bT, > but it's small, fairly cheap and gosh-darn cute! > > fairly low power, and the input might be DC (gangable?). > if not a compute cluster, how about a video wall? radeon 9200 is > nothing to sneeze at (faster than my desktops), and one of these > would look pretty cute sitting under a mini-DLP projector. > > I wonder if there are embarassing-enough parallel applications where > even the annoying cat5 could be dispensed with (just opt for the > internal 54 Mbps wireless card). > One thing that I didn't see was a good definition of what type of drive is sitting inside of it. Due to the sizes being 40G and 80G, I suspect that they are 2.5" drives, which would be a slight bummer. Firewire800 would have been nice too. The fact that 14 of them are about the same size as a normal tower case is kinda nice. I offer to the public domain the term "mac mini blade bracket". ;-) Personally, if my wife wouldn't beat me, I would get one as a settop box. 
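On Mark's point about "embarassing-enough parallel applications" that could live with 100bT or even the wireless card: the classic case is a Monte Carlo run in which every node computes independently and the network is touched only once, to combine the answers. A minimal MPI sketch in C, purely as an illustration -- any MPI implementation on OS X or Linux would do, nothing here is Mac-specific, and the seeding below gives different (if not rigorously independent) streams per rank:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    long nsamples = 10000000L;          /* samples per rank */
    long i, hits = 0, total = 0;
    double x, y;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    srand48(12345L + rank);             /* a different stream on each node */

    for (i = 0; i < nsamples; i++) {    /* throw darts at the unit square */
        x = drand48();
        y = drand48();
        if (x * x + y * y <= 1.0)
            hits++;
    }

    /* the only inter-node communication in the whole run */
    MPI_Reduce(&hits, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.6f from %ld samples on %d ranks\n",
               4.0 * (double)total / ((double)nsamples * (double)size),
               nsamples * size, size);

    MPI_Finalize();
    return 0;
}

Started with mpirun -np N across a stack of minis, wall-clock time should be close to the single-node time divided by N, because the only traffic between nodes is the single MPI_Reduce at the end -- exactly the regime where the interconnect hardly matters.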
-- -Dean From eugen at leitl.org Wed Jan 12 02:39:55 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 12 Jan 2005 11:39:55 +0100 Subject: [Beowulf] Mac Mini Monster (fwd from drewmccormack@mac.com) Message-ID: <20050112103955.GK9221@leitl.org> ----- Forwarded message from Drew McCormack ----- From: Drew McCormack Date: Wed, 12 Jan 2005 09:08:05 +0100 To: Apple Scitech Mailing List Subject: Mac Mini Monster X-Mailer: Apple Mail (2.619) Has anyone out there thought about the possibility of building a Mac Mini cluster? I can imagine that for certain applications, that don't do much in the way of communicating, it could be a cheap option. They couldn't require too much in the way of energy, and you can probably stack 50 on a 2m x 1m desk. The idea reminds me a bit of people that build Sony Playstation clusters, except Mac mini is the real deal. OS X under the hood. Hard disk. You name it. Just an idea, Drew ======================================== Dr. Drew McCormack (Kmr. R153) Afd. Theoretische Chemie Faculteit Exacte Wetenschappen Vrije Universiteit Amsterdam De Boelelaan 1083 1081 HV Amsterdam The Netherlands Email da.mccormack at few.vu.nl Web www.maniacalextent.com Telephone +31 20 5987623 Mobile +31 6 48321307 Fax +31 20 5987629 _______________________________________________ Do not post admin requests to the list. They will be ignored. Scitech mailing list (Scitech at lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen at leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From eugen at leitl.org Wed Jan 12 02:40:19 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 12 Jan 2005 11:40:19 +0100 Subject: [Beowulf] Re: Mac Mini Monster (fwd from bryan.jones@m.cc.utah.edu) Message-ID: <20050112104018.GL9221@leitl.org> ----- Forwarded message from Bryan Jones ----- From: Bryan Jones Date: Wed, 12 Jan 2005 01:23:01 -0700 To: Drew McCormack Cc: Apple Scitech Mailing List Subject: Re: Mac Mini Monster X-Mailer: Apple Mail (2.619) Yeah, I talked about it earlier on Slashdot today. If you run the calculations, there are situations where a little bookshelf of 7 or so Mini Macs has a better payoff than a couple of Xserves. Of course this will depend upon ones task and whether or not the Mini Macs have a Level 3 cache..... At the very least, it is a very cheap way to do cluster development. Bryan On Jan 12, 2005, at 1:08 AM, Drew McCormack wrote: >Has anyone out there thought about the possibility of building a Mac >Mini cluster? I can imagine that for certain applications, that don't >do much in the way of communicating, it could be a cheap option. They >couldn't require too much in the way of energy, and you can probably >stack 50 on a 2m x 1m desk. > >The idea reminds me a bit of people that build Sony Playstation >clusters, except Mac mini is the real deal. OS X under the hood. Hard >disk. You name it. > >Just an idea, >Drew Bryan William Jones, Ph.D. bryan.jones at m.cc.utah.edu University of Utah School of Medicine Moran Eye Center Rm 3339A 75 N. Medical Dr. 
Salt Lake City, Utah 84132 http://prometheus.med.utah.edu/~marclab/ iChat/AIM address: bw_jones at mac.com _______________________________________________ Do not post admin requests to the list. They will be ignored. Scitech mailing list (Scitech at lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen at leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From eugen at leitl.org Wed Jan 12 02:40:30 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 12 Jan 2005 11:40:30 +0100 Subject: [Beowulf] Re: Mac Mini Monster (fwd from tuparev@mac.com) Message-ID: <20050112104030.GM9221@leitl.org> ----- Forwarded message from Georg Tuparev ----- From: Georg Tuparev Date: Wed, 12 Jan 2005 09:50:15 +0100 To: Drew McCormack Cc: Apple Scitech Mailing List Subject: Re: Mac Mini Monster X-Mailer: Apple Mail (2.619) Ha! This is what I call convergence of thoughts ;-) I am just sitting in our Sofia office, drinking the morning coffee and counting how many MacMini's we could fit on top of the Xserve to do file intensive conversions -- e.g. transforming raw images into FITS format (astronomy). Yes, we decided to buy 32 of them for test ;-) gt On Jan 12, 2005, at 9:08 AM, Drew McCormack wrote: >Has anyone out there thought about the possibility of building a Mac >Mini cluster? I can imagine that for certain applications, that don't >do much in the way of communicating, it could be a cheap option. They >couldn't require too much in the way of energy, and you can probably >stack 50 on a 2m x 1m desk. > >The idea reminds me a bit of people that build Sony Playstation >clusters, except Mac mini is the real deal. OS X under the hood. Hard >disk. You name it. > >Just an idea, >Drew > >======================================== > Dr. Drew McCormack (Kmr. R153) > Afd. Theoretische Chemie > Faculteit Exacte Wetenschappen > Vrije Universiteit Amsterdam > De Boelelaan 1083 > 1081 HV Amsterdam > The Netherlands Georg Tuparev Tuparev Technologies Sofijski Geroj 3, Vh.2, 4th Floor, Apt. 27 1612 Sofia, Bulgaria Phone: +359-2-9501505 Mobile: +31-6-55798196 www.tuparev.com _______________________________________________ Do not post admin requests to the list. They will be ignored. Scitech mailing list (Scitech at lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen at leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From eugen at leitl.org Thu Jan 13 03:31:56 2005 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 13 Jan 2005 12:31:56 +0100 Subject: [Beowulf] [Bioclusters] servers for bio web services setup (fwd from michab@dcs.gla.ac.uk) Message-ID: <20050113113155.GQ9221@leitl.org> ----- Forwarded message from Micha Bayer ----- From: Micha Bayer Date: 13 Jan 2005 09:48:52 +0000 To: "bioclusters at bioinformatics.org" Subject: [Bioclusters] servers for bio web services setup Organization: X-Mailer: Ximian Evolution 1.2.2 (1.2.2-5) Reply-To: "Clustering, compute farming & distributed computing in life science informatics" Hi all, I have been asked for advice on hardware by my boss, but I am slightly out of my depth here because I really do software. He wants to set up a bio facility which provides web/grid services (probably Axis or GT3/4) to a substantial user community (UK-wide but with access control, so probably in the region of hundreds or perhaps thousands of potential users). Services will include the usual things things like BLAST, ClustalW, protein structure analysis etc. -- probably a small subset of what EBI offers. The computational back end is likely to be our UK National Grid or similar, but either way he is only providing the server that hosts the middleware and metascheduler. He is wondering what hardware setup setup is best for this. We are probably looking at running the web/grid services out of Tomcat. Would a single high-spec machine be sufficient for this kind of thing? Or would one have several servers doing the same thing in parallel? In which case, what spec should they have and how would they be coordinated? many thanks Micha -- -------------------------------------------------- Dr Micha M Bayer Grid Developer, BRIDGES Project National e-Science Centre, Glasgow Hub 246c Kelvin Building University of Glasgow Glasgow G12 8QQ Scotland, UK Email: michab at dcs.gla.ac.uk Project home page: http://www.brc.dcs.gla.ac.uk/projects/bridges/ Personal Homepage: http://www.brc.dcs.gla.ac.uk/~michab/ Tel.: +44 (0)141 330 2958 _______________________________________________ Bioclusters maillist - Bioclusters at bioinformatics.org https://bioinformatics.org/mailman/listinfo/bioclusters ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From rhamann at uccs.edu Fri Jan 7 13:50:39 2005 From: rhamann at uccs.edu (R Hamann) Date: Fri, 07 Jan 2005 14:50:39 -0700 Subject: [Beowulf] MPICH2 on Scyld In-Reply-To: <20050107184432.GA858@isi.edu> References: <01a501c4f4b3$f1900f50$0f120897@PMORND> <20050107184432.GA858@isi.edu> Message-ID: Hi, I'm trying to (at the last minute) get MPICH2 running on a SCyld cluster, the old version which you get on the free cdrom. The problem is I don't have root on the cluster, so I am working from my home directory. I ssh into the master from outside. I can bpsh to the other nodes in the cluster, but I can't seem to get mpd started on the other nodes using bpsh. It returns the following: Traceback (most recent call last): File "/home/rwhamann/bin/mpd", line 1369, in ? 
_process_cmdline_args() File "/home/rwhamann/bin/mpd", line 1301, in _process_cmdline_args g.myIP = gethostbyname_ex(g.myHost)[2][0] socket.gaierror: (-2, 'Name or service not known') When I bpsh to the other nodes, my path is correct and which points me to the current MPI programs and current version of python, both in my home directory. Has anyone ever done this before without root access, or can give me some pointers? Ron From jesuspiro at hotmail.com Sat Jan 8 13:41:02 2005 From: jesuspiro at hotmail.com (jesus iglesias) Date: Sat, 8 Jan 2005 22:41:02 +0100 Subject: [Beowulf] thesis regarding grid and clusters Message-ID: Hi everyone, I'm Jesus Iglesias, a recently graduated Telecommunications engineer by a spanish university, In a few weeks time I'm starting a doctoral thesis which topic is the following: "Study of reneweable energies and its relationships with meteorolgy variables based on a distritibuted computer system (grid computing style)" The basic idea is to use a cluster (beowulf style) in a certain city to process simultaneously data coming from an energy station and a meteorolgy station, and then repeat this study with another cluster in another different city. In the end, the objective is to connect these computer clusters (making a grid this way) so as to be able to do this study globally (reaching a number of differents cities in Spain). As a beginner in everything related to clusters and grid computing I'll be asking for help at this list quite often from now onwards. My first question is an obvious one, do you guys think this idea for the distributed computing system is a good one taking into account the scientific study that I'm planning to do? Thanks everyone there in advance, Jesus, from Spain -------------- next part -------------- An HTML attachment was scrubbed... URL: From asabigue at fing.edu.uy Sun Jan 9 05:09:40 2005 From: asabigue at fing.edu.uy (Ariel Sabiguero) Date: Sun, 09 Jan 2005 14:09:40 +0100 Subject: [Beowulf] Cooling vs HW replacement Message-ID: <41E12D14.3000402@fing.edu.uy> Hello all. The following question shall only consider costs, not uptime or reliability of the solution. I need to balance costs of hardware replacement after failures over air conditioning costs. The question arises as most current hardware comes with 3 or more years of warranty. During that period of time Moore twofolded twice hardware performance... is it worth spending money cooling down a cluster or just rebuilding it after it "burns out" (and is at least 4 times slower than state-of-the art)? Is it worth cooling down the room to a Class A Computer room standard or save the money for hardware upgrade after three years? In warm countries keeping 18?C the air inside a room (PC-heated) when outside temperature is 30?C average it becomes pretty expensive to pay electricity bills. It is cheaper to "circulate" 30?C air and have from 40-50?C inside the chassis. Do you have figures or graphs plotting MTBF vs temperature for main system components (memory, CPU, mainboard, HDD) ? Links to this information are highly appreciated! I remember old (40MB RLL disks shipping this information with the device, several pages of printed manual) hardware showing the difference in MTBF vs environment conditions, but nowadays commodity harware does not consider this on the sticker on the top of the device... Regards Ariel PS: if the idea is worth the money, then I would like to study reliability and uptime, but it is not the main concern now. 
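For what it is worth, the trade-off Ariel is asking about can at least be framed with a couple of lines of arithmetic before anyone argues about MTBF curves. A rough sketch follows; every figure in it is an assumption to be replaced with local electricity prices and observed failure rates, and it says nothing about downtime or the admin time spent swapping parts:

#!/bin/sh
# Rough comparison: pay for cooling, or run hot and replace nodes sooner?
# Every number below is an assumption -- plug in your own.

NODES=64
NODE_COST=1500             # currency units per replacement node
KWH_PRICE=0.12             # per kWh
NODE_WATTS=200             # average draw per node
COP=3                      # coefficient of performance of the A/C
HOT_FAIL_RATE=0.20         # fraction of nodes lost per year, hot room
COOL_FAIL_RATE=0.05        # fraction of nodes lost per year, cooled room

awk -v n=$NODES -v c=$NODE_COST -v p=$KWH_PRICE -v w=$NODE_WATTS \
    -v cop=$COP -v hf=$HOT_FAIL_RATE -v cf=$COOL_FAIL_RATE 'BEGIN {
    hours = 24 * 365
    cooling_kwh = n * w / 1000 * hours / cop      # energy to remove the heat
    cooling_cost = cooling_kwh * p
    replacement_delta = (hf - cf) * n * c         # extra nodes lost by running hot
    printf "annual cooling electricity : %8.0f\n", cooling_cost
    printf "extra replacements if hot  : %8.0f\n", replacement_delta
}'

With these made-up numbers the extra replacements cost more than the cooling, but the balance can flip with cheaper nodes or more expensive electricity, which is exactly why the inputs have to be local ones.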
From bill at Princeton.EDU Mon Jan 10 09:12:49 2005 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 10 Jan 2005 12:12:49 -0500 Subject: [Beowulf] HP 2848 switch woes Message-ID: <41E2B791.6090103@princeton.edu> Trying to install a new cluster of Tyan 2881 mothers with CentOS 3.3, kernel 2.4.21-27.0.1.ELsmp (Opteron). When running through this switch (Firmware:I.08.55, ROM:I.08.04), the system is forced to do a manual install as a failure occurs in what I believe is the initial discovery phase after the kernel boots. When a direct connection is made to the head node, everything proceeds as normal. During the initial booting, after PXE, the system sends a request out to the network asking for it's MAC address. Right before this time, the network card appears to be reset by the OS. This appears to be the normal progression from within the kernel. On a direct cable, the rarp is seen and the compute node receives the info via the head node, right after the network card is reset. Through the switch though, the rarp is never seen by the head node. At first I thought it was something with autodetection and so set the switch up for just Gig. It certainly isn't the case that rarps don't work as the initial tftp boot works fine, the vmlinuz is downloaded and booting proceeds. It only is when during the boot phase when the network card is reset does communication somehow fail. I've set the timeout in the switch for 15 minutes, made sure spanning tree was off, connected the cables to adjacent ports, all to no avail. If anyone has any suggestions I am all ears as I have run out of ideas at this point. HP just suggests updating the firmware, which I have done to no avail. Thanks, Bill From csamuel at vpac.org Mon Jan 10 14:44:43 2005 From: csamuel at vpac.org (Chris Samuel) Date: Tue, 11 Jan 2005 09:44:43 +1100 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <1105372151.15841.5.camel@strathmill.biosc.lsu.edu> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> <1105372151.15841.5.camel@strathmill.biosc.lsu.edu> Message-ID: <200501110944.46037.csamuel@vpac.org> On Tue, 11 Jan 2005 02:49 am, Jerry Xu wrote: > Hi, William, Thank for your information. Just in case somebody still > need it for openPBS configuration, here is my epilogue file.it shall be > located in $pbshome/mom_priv/ for each node and it need to be set as > executable and owned by root. Some others many have better epilogue > scripts... Hmm, the only thing that worries me about that is that for those of us with SMP clusters it is possible for a user to have two different jobs running on each of the CPUs, so an epilogue script that kills all a users processes on a node would accidentally kill an innocent job. cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From rodmur at maybe.org Tue Jan 11 16:02:39 2005 From: rodmur at maybe.org (Dale Harris) Date: Tue, 11 Jan 2005 16:02:39 -0800 Subject: [Beowulf] 64 bit Xeons? Message-ID: <20050112000239.GB14480@maybe.org> http://www.intel.com/products/server/processors/server/xeon/ So anyone tried the Intel Xeon Processor MP? Is it just a repackaged Itanium? 
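The short answer, which the replies further down spell out, is that the 64-bit Xeon is still x86 rather than a repackaged Itanium. One quick way to tell on a running Linux box is the "lm" (long mode) flag in /proc/cpuinfo: EM64T/AMD64-capable CPUs report it, plain 32-bit x86 parts do not, and an Itanium exposes a completely different cpuinfo layout altogether. A minimal check, assuming a kernel recent enough to decode the extended feature bits:

#!/bin/sh
# Does this CPU support 64-bit x86 (EM64T / AMD64)?
# The "lm" flag in /proc/cpuinfo means "long mode".

if grep -qw lm /proc/cpuinfo; then
    echo "x86-64 capable (EM64T/AMD64): 64-bit kernels and userland will run"
else
    echo "no lm flag: 32-bit x86 only (or not an x86 CPU at all)"
fi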
-- Dale Harris rodmur at maybe.org /.-) From bropers at cct.lsu.edu Tue Jan 11 20:18:57 2005 From: bropers at cct.lsu.edu (Brian D. Ropers-Huilman) Date: Tue, 11 Jan 2005 22:18:57 -0600 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: References: Message-ID: <41E4A531.5040003@cct.lsu.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Funny you should mention this... I had daydreams all day thinking the same thoughts. Time, and my budget, will tell. Mark Hahn said the following on 2005-01-11 22:03: | anyone working on assembling a large cluster of Mac mini's? | | I hafta believe that some of you bio or montecarlo types are | thinking about it. sure, the box only has a G4, and only 100bT, | but it's small, fairly cheap and gosh-darn cute! | | fairly low power, and the input might be DC (gangable?). | if not a compute cluster, how about a video wall? radeon 9200 is | nothing to sneeze at (faster than my desktops), and one of these | would look pretty cute sitting under a mini-DLP projector. | | I wonder if there are embarassing-enough parallel applications where | even the annoying cat5 could be dispensed with (just opt for the | internal 54 Mbps wireless card). | | regards, mark hahn. - -- Brian D. Ropers-Huilman .:. Asst. Director .:. HPC and Computation Center for Computation & Technology (CCT) bropers at cct.lsu.edu Johnston Hall, Rm. 350 +1 225.578.3272 (V) Louisiana State University +1 225.578.5362 (F) Baton Rouge, LA 70803-1900 USA http://www.cct.lsu.edu/ -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.2.6 (GNU/Linux) iD8DBQFB5KUxwRr6eFHB5lgRAsNhAJ0R5dsbKKT375AGAmKWCGPQzM/PigCgrs/M oq9gtyF5nbg8Z+5FK67q1Uc= =ysba -----END PGP SIGNATURE----- From gvinodh1980 at yahoo.co.in Tue Jan 11 23:27:08 2005 From: gvinodh1980 at yahoo.co.in (Vinodh) Date: Tue, 11 Jan 2005 23:27:08 -0800 (PST) Subject: [Beowulf] problem with execution of cpi in two node cluster Message-ID: <20050112072708.55002.qmail@web8501.mail.in.yahoo.com> hi, i setup a two node cluster with mpich2-1.0. the name of the master node is aarya the name of the slave node is desktop2 i enabled the passwordless ssh session. in the mpd.hosts, i included the name of both nodes. the command, mpdboot -n 2 works fine. the command, mpdtrace gives the name of both machines. i copied the example program cpi on /home/vinodh/ on both the nodes. mpiexec -n 2 cpi gives the output, Process 0 of 2 is on aarya Process 1 of 2 is on desktop2 aborting job: Fatal error in MPI_Bcast: Other MPI error, error stack: MPI_Bcast(821): MPI_Bcast(buf=0xbfffbf28, count=1, MPI_INT, root=0, MPI_COMM_WORLD) failed MPIR_Bcast(229): MPIC_Send(48): MPIC_Wait(308): MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait() MPIDI_CH3I_Progress_handle_sock_event(1053): [ch3:sock] failed to connnect to remote process kvs_aarya_40892_0:1 MPIDU_Socki_handle_connect(767): connection failure (set=0,sock=1,errno=113:No route to host) rank 0 in job 1 aarya_40878 caused collective abort of all ranks exit status of rank 0: return code 13 but, the other example hellow works fine. let me know, why theres an error for the program cpi. Regards, G. Vinodh Kumar __________________________________ Do you Yahoo!? Yahoo! Mail - Helps protect you from nasty viruses. 
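On the MPI_Bcast failure just above: errno 113 ("No route to host") at connect time usually means one rank resolved the other's hostname to an address it cannot actually reach -- a stale /etc/hosts entry or a name pointing at 127.0.0.1 is a common culprit -- or that a firewall on one of the boxes is dropping the unprivileged ports MPICH2 opens once the job starts. cpi communicates between ranks while hellow does not, which would explain why only cpi dies. A few checks worth running on both machines; the node names are the ones from the original post:

#!/bin/sh
# Run on each node (aarya and desktop2). A mismatch between what the name
# resolves to and what is reachable, or an iptables chain that filters the
# private network, is enough to produce errno 113.

for host in aarya desktop2; do
    echo "== $host =="
    getent hosts $host            # what address does the name map to?
    ping -c 1 -w 2 $host          # is that address reachable from here?
done

# Firewall check (needs root). MPICH2 ranks talk to each other on
# dynamically chosen high ports, so those must not be filtered:
/sbin/iptables -L -n 2>/dev/null | head -20

If aarya shows up as 127.0.0.1 in desktop2's /etc/hosts (or vice versa), fix that first; if iptables is active, either permit the cluster-internal interface or stop the service on the private network.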
http://promotions.yahoo.com/new_mail From john.hearns at streamline-computing.com Wed Jan 12 03:13:31 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed, 12 Jan 2005 11:13:31 +0000 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: References: Message-ID: <1105528411.4559.19.camel@Vigor45> On Tue, 2005-01-11 at 23:03 -0500, Mark Hahn wrote: > anyone working on assembling a large cluster of Mac mini's? > > I hafta believe that some of you bio or montecarlo types are > thinking about it. sure, the box only has a G4, and only 100bT, > but it's small, fairly cheap and gosh-darn cute! Oooh, nice Does anyone know if they support network booting? > fairly low power, and the input might be DC (gangable?). > if not a compute cluster, how about a video wall? radeon 9200 is > nothing to sneeze at (faster than my desktops), and one of these > would look pretty cute sitting under a mini-DLP projector. http://chromium.sourceforge.net/ but the 100Mb interface will be a big bottleneck. From agrajag at dragaera.net Wed Jan 12 06:10:44 2005 From: agrajag at dragaera.net (Sean Dilda) Date: Wed, 12 Jan 2005 09:10:44 -0500 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: <1105505384.17557.48.camel@terra> References: <1105505384.17557.48.camel@terra> Message-ID: <41E52FE4.9060300@dragaera.net> Dean Johnson wrote: > > > One thing that I didn't see was a good definition of what type of drive > is sitting inside of it. Due to the sizes being 40G and 80G, I suspect > that they are 2.5" drives, which would be a slight bummer. Firewire800 > would have been nice too. The hard drive is a real concern. With the case being so small, its almost certain that they're laptop harddrives, which is a shame. From agrajag at dragaera.net Wed Jan 12 06:14:44 2005 From: agrajag at dragaera.net (Sean Dilda) Date: Wed, 12 Jan 2005 09:14:44 -0500 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: References: Message-ID: <41E530D4.7040204@dragaera.net> Mark Hahn wrote: > > fairly low power, and the input might be DC (gangable?). > if not a compute cluster, how about a video wall? radeon 9200 is > nothing to sneeze at (faster than my desktops), and one of these > would look pretty cute sitting under a mini-DLP projector. It may be a radeon 9200, but it only has 32MB of RAM. A few years ago that would have been decent. But now they're selling video cards with 256MB of RAM, which is much more appealing for a video wall. Another real concern is that there is no expansion slot. So the only way to expand onto the box is through usb or firewire, which is somewhat limited in what you can do. From cluster at hamsta.se Wed Jan 12 08:45:34 2005 From: cluster at hamsta.se (Roger Strandberg /Cluster) Date: Wed, 12 Jan 2005 17:45:34 +0100 Subject: [Beowulf] Cluster Novice Message-ID: <000a01c4f8c6$239be990$d9f443c3@rogercud6b5ksq> Hi I'm a novice in the cluster world. Does there exist any type of cluster that gives redundancy and even a parity check? I don't want to invent the wheel twice :-) I'm intrest is in a "virtual" cpu that can run over N machines and if "virtual" cpu crashes the N-1 takes over. I'm not looking for speed only stableness. Does this exist already or do i need to program one my self? /Roger Strandberg Sweden. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From robert at bay13.de Thu Jan 13 01:39:56 2005 From: robert at bay13.de (Robert Depenbrock) Date: Thu, 13 Jan 2005 10:39:56 +0100 (CET) Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: Message-ID: On Tue, 11 Jan 2005, Mark Hahn wrote: > anyone working on assembling a large cluster of Mac mini's? > > I hafta believe that some of you bio or montecarlo types are > thinking about it. sure, the box only has a G4, and only 100bT, > but it's small, fairly cheap and gosh-darn cute! Hi! I wonder, if i see it right, someone should be able to pack 6 mini iMacs onto one 19" 1.5U Rack module. Should be great for educational systems. regards Robert Depenbrock -- nic-hdl RD-RIPE http://www.bay13.de/ e-mail: robert at bay13.de Fingerprint: 1CEF 67DC 52D7 252A 3BCD 9BC4 2C0E AC87 6830 F5DD From scheinin at crs4.it Thu Jan 13 05:27:45 2005 From: scheinin at crs4.it (Alan Louis Scheinine) Date: Thu, 13 Jan 2005 14:27:45 +0100 Subject: [Beowulf] [Bioclusters] servers for bio web services setup (fwd from michab@dcs.gla.ac.uk) In-Reply-To: <20050113113155.GQ9221@leitl.org> References: <20050113113155.GQ9221@leitl.org> Message-ID: <41E67751.50504@crs4.it> One machine might be enough. With regard to clustering, what you need is high availability, which is different from Beowulf. One source of high availability information is http://www.linux-ha.org/ Micha Bayer wrote: > The computational back end is likely to be our UK National Grid or > similar, but either way he is only providing the server that hosts the > middleware and metascheduler. He is wondering what hardware setup setup > is best for this. We are probably looking at running the web/grid > services out of Tomcat. > -- Alan Scheinine Centro di Ricerca, Sviluppo e Studi Superiori in Sardegna Center for Advanced Studies, Research, and Development in Sardinia From zogas at upatras.gr Thu Jan 13 10:56:04 2005 From: zogas at upatras.gr (Stavros E. Zogas) Date: Thu, 13 Jan 2005 10:56:04 -0800 Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? Message-ID: <000e01c4f9a1$889f5210$52838c96@MOB> Hi at all I intend to setup a beowulf cluster(16+ nodes) for scientific applications(fortran compilers) in a University department.What am i supposd to use???PVFS or NFS for file system?? Stavros -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at mendel.bio.caltech.edu Thu Jan 13 15:18:27 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Thu, 13 Jan 2005 15:18:27 -0800 Subject: [Beowulf] Re: [Bioclusters] servers for bio web services setup Message-ID: > He wants to set up a bio facility which provides web/grid services > (probably Axis or GT3/4) to a substantial user community (UK-wide but > with access control, so probably in the region of hundreds or perhaps > thousands of potential users). Services will include the usual things > things like BLAST, ClustalW, protein structure analysis etc. -- probably > a small subset of what EBI offers. A couple of things to consider in general: 1. Some of these back end jobs can generate enormously large output files. If you let somebody queue up a 1000 entry fasta file and use the default BLAST format with 50 alignments each to search the nt database - Ugh!! You definitely don't want those coming back through your front end machines if at all possible. You might, for instance, set up the back end nodes to email the results directly. Or to email a page with a link to the results. 
Unless a job's results are tiny the most you're probably going to want the front end machine to present is a page that looks like: Your XXXXX job finished at 21:09 GMT Results (link) Error messages (link) Other (link) Parameters (link) where all the links go out to different machines, to spread the load around. 2. Even if the result is only a million bytes or so you do not want the users to be loading those pages directly in their browsers. Browsers can take a really long time to open a file like that, but they can typically download it very fast. Have them right click download and then open it in a faster text viewer. (most of the results will be text.) This may not change the load on your server much but it can make a big difference in the end users' perception of the speed of your service. 3. Sanity check everything for valid parameters and expected run times. Let's say you provide an interface to Phylip. Do you really want to let somebody stuff a 200 sequence alignment into DNAPENNY? Not unless you want to lock up the back end machine for the next hundred years. It can be pretty tricky figuring ahead of time how long a job may run, but do the best you can so that at least in some cases the web interface can tell the users up front to change the job parameters. And on the back end absolutely set some maximum CPU time limit for jobs. Better an email "your job was terminated after one hour" than annoyed end users constantly emailing you asking where their jobs went. 4. If at all possible provide the run time parameters back to the end users. People tend to just print the result off the web page and, if the program doesn't echo the parameters when they go back later they can never remember how they ran a particular program. It's also useful for catching bugs in the web interface. 5. If the load is really significant you're going to want at least two, and maybe more, front end web servers. Ie, www.yourservice.org connects at random to www01.yourservice.org, www02.yourservice.org, etc. That will both split the load and reduce the effect of a downed front end server. If all the computation is going out onto a grid these machines won't need much local storage but would presumably need reasonably fast network connections. > Would a single high-spec machine be sufficient for this kind of thing? > Or would one have several servers doing the same thing in parallel? Depends on what the front end server is doing. If it's just shuffling smallish requests off to the end compute nodes it needn't be very large. If it's spooling hundreds of 10 Mb result files per second and then sending those off to the end users interactively it's going to have to be monstruously large (ditto for your network connections). That is, we can't really answer that question specifically until you tell us how much data needs to be stored locally, processed locally, and shipped in and out through the network. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From james.p.lux at jpl.nasa.gov Sun Jan 16 22:16:53 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Sun, 16 Jan 2005 22:16:53 -0800 Subject: [Beowulf] Cooling vs HW replacement References: <41E12D14.3000402@fing.edu.uy> Message-ID: <004001c4fc5c$28ed2f00$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Ariel Sabiguero" To: Sent: Sunday, January 09, 2005 5:09 AM Subject: [Beowulf] Cooling vs HW replacement > Hello all. 
> The following question shall only consider costs, not uptime or > reliability of the solution. > I need to balance costs of hardware replacement after failures over air > conditioning costs. > The question arises as most current hardware comes with 3 or more years > of warranty. During that period of time Moore twofolded twice hardware > performance... is it worth spending money cooling down a cluster or just > rebuilding it after it "burns out" (and is at least 4 times slower than > state-of-the art)? > Is it worth cooling down the room to a Class A Computer room standard or > save the money for hardware upgrade after three years? In warm countries > keeping 18?C the air inside a room (PC-heated) when outside temperature > is 30?C average it becomes pretty expensive to pay electricity bills. It > is cheaper to "circulate" 30?C air and have from 40-50?C inside the chassis. Fascinating system design question.... > > Do you have figures or graphs plotting MTBF vs temperature for main > system components (memory, CPU, mainboard, HDD) ? Such data is very hard to come by, however, a good rule of thumb is that life (MTBF) is halved for every 10 degree (C) temperature rise (Arrhenius equation). I have seen temperature vs MTBF data for disk drives, a google or search of a site such as Seagates should find it. Of course, they do accelerated life testing at elevated temperatures, so there must be some analysis that equates X hours of operation at temperature Y to Z house of operation at temperature Y-30. The real question would be what's the life limiting component... I'd be willing to gamble (based on personal experience with PC failures over the last 20 years) that it's some component in the power supply. > Links to this information are highly appreciated! > I remember old (40MB RLL disks shipping this information with the > device, several pages of printed manual) hardware showing the > difference in MTBF vs environment conditions, but nowadays commodity > harware does not consider this on the sticker on the top of the device... But it is available at mfr's websites, at least for some components. Jim Lux From alvin at Mail.Linux-Consulting.com Sun Jan 16 22:55:18 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Sun, 16 Jan 2005 22:55:18 -0800 (PST) Subject: [Beowulf] Cooling vs HW replacement - fans In-Reply-To: <004001c4fc5c$28ed2f00$32a8a8c0@LAPTOP152422> Message-ID: hi ya On Sun, 16 Jan 2005, Jim Lux wrote: > > Is it worth cooling down the room to a Class A Computer room standard or > > save the money for hardware upgrade after three years? In warm countries > > keeping 18?C the air inside a room (PC-heated) when outside temperature > > is 30?C average it becomes pretty expensive to pay electricity bills. It > > is cheaper to "circulate" 30?C air and have from 40-50?C inside the > chassis. > > Fascinating system design question.... we had some 1Us where the clients put it in a harsh 99% enclosed environment - 150F is the ambient operating ( normal ) temp and running 24x7 - the ide disks ( basically all ) died within 1yr ( cpu/mem/ps all seems fine ) = = circulating hot air will not help = - cooler air needs to come in and hot air must go out = > Such data is very hard to come by, however, a good rule of thumb is that > life (MTBF) is halved for every 10 degree (C) temperature rise (Arrhenius > equation). where the degradation starts from say 25C or 20C .. 
wherever they spec'd the mtbf starting point > I have seen temperature vs MTBF data for disk drives, a google > or search of a site such as Seagates should find it. those mtbf, some disks are spec'd at a ridulous 1,000,000 mtbf hrs > Of course, they do > accelerated life testing at elevated temperatures, so there must be some > analysis that equates X hours of operation at temperature Y to Z house of > operation at temperature Y-30. The real question would be what's the life > limiting component... I'd be willing to gamble (based on personal experience > with PC failures over the last 20 years) that it's some component in the > power supply. i'd put my $$ on the "cheap" fans dying first - adding better quality and redundant fans seems to work for 1Us in regular operating environment > > Links to this information are highly appreciated! > > I remember old (40MB RLL disks shipping this information with the > > device, several pages of printed manual) hardware showing the > > difference in MTBF vs environment conditions, but nowadays commodity > > harware does not consider this on the sticker on the top of the device... > > But it is available at mfr's websites, at least for some components. yupp c ya alvin From james.p.lux at jpl.nasa.gov Mon Jan 17 07:53:39 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Mon, 17 Jan 2005 07:53:39 -0800 Subject: [Beowulf] Cooling vs HW replacement - fans References: Message-ID: <004a01c4fcac$b7010b90$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Alvin Oga" To: "Jim Lux" Cc: "Ariel Sabiguero" ; Sent: Sunday, January 16, 2005 10:55 PM Subject: Re: [Beowulf] Cooling vs HW replacement - fans > > > hi ya > > > Such data is very hard to come by, however, a good rule of thumb is that > > life (MTBF) is halved for every 10 degree (C) temperature rise (Arrhenius > > equation). > > where the degradation starts from say 25C or 20C .. wherever they spec'd > the mtbf starting point > > > I have seen temperature vs MTBF data for disk drives, a google > > or search of a site such as Seagates should find it. > > those mtbf, some disks are spec'd at a ridulous 1,000,000 mtbf hrs For instance, Cheetah X15s claim 1.2 million hrs mtbf (based on failure rates from a raft of drives running together, 250 power on/off cycles/yr, at some case temp).. The statistics they give look like about 1.4 failures per thousand drives per month for 720 power on hrs/mo, 250 power on/off cycles/yr, 20% usage and case temps of around 50C...(they have a chart showing what part of the drive can be at what max temperature and where the cooling air should go...) They give the 1.2 million hour number as using 0.92 m/sec (180 linear ft/min) inlet air at 25C + 5C rise in the enclosure (drive operating in 30C air)... I note that this is fairly cool for most computers in a "desktop" environment. I'm actually quite impressed by this kind of reliability... From ctierney at HPTI.com Mon Jan 17 08:47:32 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Mon, 17 Jan 2005 09:47:32 -0700 Subject: [Beowulf] 64 bit Xeons? In-Reply-To: <20050112000239.GB14480@maybe.org> References: <20050112000239.GB14480@maybe.org> Message-ID: <1105980452.3166.6.camel@hpti10.fsl.noaa.gov> On Tue, 2005-01-11 at 17:02, Dale Harris wrote: > http://www.intel.com/products/server/processors/server/xeon/ > > So anyone tried the Intel Xeon Processor MP? Is it just a repackaged > Itanium? Yes, it is very fast and much cheaper than the Itanium. It is based on the Xeon family of processors. 
You can still use a EM64T processor in 32-bit mode. It uses the 64-bit extensions that AMD uses for the Opteron. The Itanium has a completely different instruction set. Craig From rgb at phy.duke.edu Mon Jan 17 10:44:15 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 17 Jan 2005 13:44:15 -0500 (EST) Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <41E12D14.3000402@fing.edu.uy> References: <41E12D14.3000402@fing.edu.uy> Message-ID: On Sun, 9 Jan 2005, Ariel Sabiguero wrote: > Hello all. > The following question shall only consider costs, not uptime or > reliability of the solution. > I need to balance costs of hardware replacement after failures over air > conditioning costs. > The question arises as most current hardware comes with 3 or more years > of warranty. During that period of time Moore twofolded twice hardware > performance... is it worth spending money cooling down a cluster or just > rebuilding it after it "burns out" (and is at least 4 times slower than > state-of-the art)? > Is it worth cooling down the room to a Class A Computer room standard or > save the money for hardware upgrade after three years? In warm countries > keeping 18?C the air inside a room (PC-heated) when outside temperature > is 30?C average it becomes pretty expensive to pay electricity bills. It > is cheaper to "circulate" 30?C air and have from 40-50?C inside the chassis. If you circulate 30C air, and have 50+C air inside the chassis, the CPU and memory chips themselves will be at least 10 and more likely 30 or 50 C hotter than that. This will really significantly reduce the lifetime of the components. There is a rule of thumb that every 10F (4.5 C) hotter ambient air temp reduces expected lifetime by a year. You're talking about running some 3x10 F degrees hotter than optimal for a 4+ year lifetime. This could easily reduce the MTBF for your nodes to 1-2 years. However, this "lifetime" thing is going to be highly irregular. All chips are not equal. Some subsystems, especially memory, will flake out (give you odd answers, drop bits) if you habitually run them well above desireable ambient. Some will run for four months, flake out, then break at six months. Some will run for a year and pop. Some will make it to two years, and only a relatively small fraction of your cluster will make it to years 3-4. It is therefore not possible to address "only the costs" without addressing uptime and reliability. Downtime is expensive. Downtime due to a crash can cost you a week's worth of work for the entire cluster for some kinds of problems. Unreliable hardware is AWESOMELY expensive, I know from bitter, personal experience. In addition to the associated downtime, there is all sorts of human time associated with going into the cluster every week or two to pull a downed node, work with it (sometimes for a full day) with spares to identify the blown components, order and replace the blown component, and get it back up. A minimum of say 4 hours per event, and as much as 2-3 DAYS if something isn't broken but is just too flaky -- the system crashes (because of memory running too hot) but it reboots fine when it is cooled and you can't identify a "bad chip" because there isn't one, technically, except when it is under load AND being "cooled" by hot air. Time costs money -- generally more money than either the hardware OR the air conditioning. 
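To put a number on the rule of thumb above (and on the halve-the-MTBF-per-10-C version quoted earlier in the thread): treated purely as a heuristic, not vendor data, the 30 C intake / 50 C chassis scenario comes out like this. The baseline life and both temperatures are assumptions:

#!/bin/sh
# Derate expected component life using the heuristic quoted in this
# thread: life roughly halves for every 10 C of extra temperature.

BASE_LIFE_YEARS=5        # assumed life at the reference ambient
REF_TEMP=20              # reference ambient, deg C
ACTUAL_TEMP=50           # roughly what components see with 30 C intake air

awk -v l=$BASE_LIFE_YEARS -v r=$REF_TEMP -v a=$ACTUAL_TEMP 'BEGIN {
    derated = l / 2 ^ ((a - r) / 10)
    printf "expected life: %.1f years at %d C -> %.1f years at %d C\n", l, r, derated, a
}'

Crude as it is, it lands in the same range as the one-to-two-year node lifetime estimated above, which is the point: the derating is not a few percent, it is a multiple.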
Besides, AC costs are still only about 1/3 the costs of powering the nodes up themselves as a running expense (depending on the COP of your cooling system, assuming a COP of 3-4). The rest is infrastructure investment in building a properly cooled facility. I'd say make the investment. BTW, you might well find that hardware salespersons will balk at replacing the equipment they sell you under extended service if you don't maintain the recommended ambient air. So you might end up having to pay for a constant stream of hardware out of pocket in addition to the labor and downtime. I just don't think it is worth it. rgb > > Do you have figures or graphs plotting MTBF vs temperature for main > system components (memory, CPU, mainboard, HDD) ? > Links to this information are highly appreciated! > I remember old (40MB RLL disks shipping this information with the > device, several pages of printed manual) hardware showing the > difference in MTBF vs environment conditions, but nowadays commodity > harware does not consider this on the sticker on the top of the device... > > Regards > > Ariel > > PS: if the idea is worth the money, then I would like to study > reliability and uptime, but it is not the main concern now. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From landman at scalableinformatics.com Mon Jan 17 11:04:12 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 17 Jan 2005 14:04:12 -0500 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: References: <41E12D14.3000402@fing.edu.uy> Message-ID: <41EC0C2C.602@scalableinformatics.com> Robert G. Brown wrote: [...] > BTW, you might well find that hardware salespersons will balk at > replacing the equipment they sell you under extended service if you > don't maintain the recommended ambient air. So you might end up having > to pay for a constant stream of hardware out of pocket in addition to > the labor and downtime. I just don't think it is worth it. It is far better to do the job right at the outset (cooling) than asking the company to bankroll new nodes for you every few months. I am not sure how the warranties are writting in Uraguay, but here in the states, they have a few clauses about not being liable for damage resulting from operating out of manufacturer/vendor suggested norms. That said, I would strongly recommend reasonable cooling on the master node, the disks, the network and other major bits. Replacing administrative infrastructure can be quite painful for end users. Compute nodes should be viewed as being disposable after hardware warranty runs out (usually 1-2 years on most components). Always, always, always have extra memory and motherboards (and CPU fans and power supplies) sitting around in a box somewhere. You will need them. As RGB indicted, downtime is usually costly, at minimum, in terms of time. 
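One cheap piece of insurance alongside the box of spare motherboards: watch node temperatures so that marginal cooling shows up as an alert rather than as flaky memory. A minimal sketch, assuming passwordless ssh to the nodes (which the epilogue script earlier in the thread already relies on) and lm_sensors installed on them; the node-list path is hypothetical:

#!/bin/sh
# Poll temperature readings across the compute nodes. Assumes the
# "sensors" command (lm_sensors) exists on each node and that the caller
# can ssh to them without a password. Point NODEFILE at your own list,
# one hostname per line.

NODEFILE=/etc/beowulf/nodes

for node in `cat $NODEFILE`; do
    echo "== $node =="
    ssh $node "sensors 2>/dev/null | grep -i temp"
done

Run it from cron and diff against yesterday's output; a node that creeps up 10 C for no obvious reason is often a dying fan on its way to becoming a dead power supply.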
Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 612 4615 From hahn at physics.mcmaster.ca Mon Jan 17 11:32:58 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 17 Jan 2005 14:32:58 -0500 (EST) Subject: [Beowulf] Cluster Novice In-Reply-To: <000a01c4f8c6$239be990$d9f443c3@rogercud6b5ksq> Message-ID: > Does there exist any type of cluster that gives redundancy and even a parity check? > I don't want to invent the wheel twice :-) in general, this level of checking is done only for bank central-offices, and usually is implemented with lockstep/quorum hardware. for instance, HP NonStop (nee Tandem) computers. at the other extreme, if you simply assume that crashes are easy to detect, and are not interested in the more "byzantine" modes of failure, it's rather easy to set up high-availability (HA) clusters. for instance, you might simply have a small set of servers which elect a master (or load-balance), and if a "heartbeat" fails, some number of servers get turned off. > I'm intrest is in a "virtual" cpu that can run over N machines and if > "virtual" cpu crashes the N-1 takes over. well, it depends on your assumptions. for instance, how do you detect a crash? NonStop provides a much more paranoid view of "failure" than a simpler, software-based approach like STONITH HA clusters. > I'm not looking for speed only stableness. beowulf is about speed, not HA. > Does this exist already or do i need to program one my self? the answer to that is almost always that someone else has already done it. it's a big world. (that doesn't mean that existing wheels are perfect!) regards, mark hahn. From george at galis.org Sun Jan 16 22:33:22 2005 From: george at galis.org (George Georgalis) Date: Mon, 17 Jan 2005 01:33:22 -0500 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <41E12D14.3000402@fing.edu.uy> References: <41E12D14.3000402@fing.edu.uy> Message-ID: <20050117063322.GA26246@sta.local> On Sun, Jan 09, 2005 at 02:09:40PM +0100, Ariel Sabiguero wrote: >The question arises as most current hardware comes with 3 or more years >of warranty. During that period of time Moore twofolded twice hardware >performance... is it worth spending money cooling down a cluster or just >rebuilding it after it "burns out" (and is at least 4 times slower than >state-of-the art)? >Is it worth cooling down the room to a Class A Computer room standard or >save the money for hardware upgrade after three years? In warm countries >keeping 18?C the air inside a room (PC-heated) when outside temperature >is 30?C average it becomes pretty expensive to pay electricity bills. It >is cheaper to "circulate" 30?C air and have from 40-50?C inside the chassis. I don't have numbers or proof, but some experience and well... Use a SAN/NAS (nfs) and keep the disks in a separate room than the CPUs. Disk drives generate a lot of heat, and compared to on board components don't really need cooling, circulated air should largely cover them. Minimize the disk count in the CPU room, use efficient power supplies, and they won't need as much capacity since they aren't driving disks. Much less cooling will be required. That's about all I can say for sure. A site I know was doing that and replacing CPU about every 12 months, per Moor's law. Sorry no real numbers about actual or abusive temperatures, but I would avoid abusive temperatures. 
If you have 3% failure at 65F at 3 years, and 15% failure at 80F at 3 years, do you really think your production CPUs are going to wait 3 years to start failing? Unpredictable errors and nontrivial diagnostic and repair. ...A failed disk in a hot swap mirrored raid array, is trivial to detect and replace. (careful not to fry your raid hardware!) If you really want to focus on efficiency and engineering, I bet one (appropriately sized) power-supply per 3 or 5 computers is a sweet spot. They could possibly run outside the CPU room too. Regards, // George PS - sorry my smtp doesn't accept mail from uy subnets. most free webmail gets through if you want to contact me directly. -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george at galis.org From Johan_Sjoholm at yahoo.se Mon Jan 17 01:55:52 2005 From: Johan_Sjoholm at yahoo.se (Johan Sj=?ISO-8859-1?B?9g==?=holm) Date: Mon, 17 Jan 2005 10:55:52 +0100 Subject: [Beowulf] OK, I'll be first: mac mini In-Reply-To: Message-ID: There is an ongoing probject for that to be done here in Sweden featuring 15 machines. I will get back to you with the progress report. > >> > anyone working on assembling a large cluster of Mac mini's? >> > >> > I hafta believe that some of you bio or montecarlo types are >> > thinking about it. sure, the box only has a G4, and only 100bT, >> > but it's small, fairly cheap and gosh-darn cute! > -- Johan 'John' Sjoholm Chief Architect and Head of Development Building 31 Clustering - http://www.phs.se js at phs.se ~ +46 709 43 33 31 ~ +46 520 17 20 4 -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Mon Jan 17 02:09:25 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 17 Jan 2005 13:09:25 +0300 Subject: [Beowulf] 64 bit Xeons? In-Reply-To: <20050112000239.GB14480@maybe.org> Message-ID: In message from Dale Harris (Tue, 11 Jan 2005 16:02:39 -0800): > >http://www.intel.com/products/server/processors/server/xeon/ > >So anyone tried the Intel Xeon Processor MP? Is it just a repackaged >Itanium? MP in Xeon MP means "multiprocessor", i.e. processors which may work in SMP configurations having more than 2-way (what is marked as DP for Itanium). Xeon MP (because of sharing of system bus by processors) may have conflicts at access to main memory. To decrease of this problems Xeon MP is equipped w/additional cache memory having high capacity. But as usually this can't help in the case your applications are "RAM throughput" -limited and their working set of pages can't fit in the cache hierarchy. It may be often in HPC area. Of course, Xeon MP are much more expensive (than usual Xeon's), and are relative popular in servers for business applications, but not in HPC clusters which usually are based on 2-processor SMP nodes. What is about 64-bit Xeon's - they have codename Nocona and they are compatible w/x86 but not w/Itanium IA-64. Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > >-- >Dale Harris >rodmur at maybe.org >/.-) >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From reuti at staff.uni-marburg.de Mon Jan 17 04:54:13 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 17 Jan 2005 13:54:13 +0100 Subject: [Beowulf] the solution for qdel fail..... 
In-Reply-To: <200501110944.46037.csamuel@vpac.org> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> <1105372151.15841.5.camel@strathmill.biosc.lsu.edu> <200501110944.46037.csamuel@vpac.org> Message-ID: <41EBB575.1040603@staff.uni-marburg.de> You may have a look at another queuingsystem: GridEngine from SUN, which offers better choices of the control of tasks on the slave nodes. It's no problem to shutdown just one of two jobs of the same user on a node. It will just kill the whole process group of one of the tasks, as they are started by a special implementation of rshd/sshd private to each task. This works for MPICH and also MPICH2 (forker and smpd startup method). - Reuti Chris Samuel wrote: > On Tue, 11 Jan 2005 02:49 am, Jerry Xu wrote: > > >>Hi, William, Thank for your information. Just in case somebody still >>need it for openPBS configuration, here is my epilogue file.it shall be >>located in $pbshome/mom_priv/ for each node and it need to be set as >>executable and owned by root. Some others many have better epilogue >>scripts... > > > Hmm, the only thing that worries me about that is that for those of us with > SMP clusters it is possible for a user to have two different jobs running on > each of the CPUs, so an epilogue script that kills all a users processes on a > node would accidentally kill an innocent job. > > cheers, > Chris > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From kums at VerariSoft.Com Mon Jan 17 07:49:58 2005 From: kums at VerariSoft.Com (Kumaran Rajaram) Date: Mon, 17 Jan 2005 09:49:58 -0600 (CST) Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: <000e01c4f9a1$889f5210$52838c96@MOB> References: <000e01c4f9a1$889f5210$52838c96@MOB> Message-ID: I would suggest PVFS2. It offers greater bandwidth (proportional to number of I/O nodes dedicated) compared to NFS. Also, you may use PVFS2 native file interface (rather than POSIX I/O interface) which offers flexible parallel I/O interface for scientific workloads. Only caveat, is fault-tolerance support and you might need to employ either software/hardware RAID for disk failures and heartbeat mechanism for server failures. If the environment has less probability of server/disk/netowrk faults, then PVFS2 is a good choice for scientific workloads and parallel applications. -Kums On Thu, 13 Jan 2005, Stavros E. Zogas wrote: > Hi at all > I intend to setup a beowulf cluster(16+ nodes) for scientific > applications(fortran compilers) in a University department.What am i supposd > to use???PVFS or NFS for file system?? > Stavros > From george at galis.org Mon Jan 17 08:01:45 2005 From: george at galis.org (George Georgalis) Date: Mon, 17 Jan 2005 11:01:45 -0500 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <41EB8BC4.6090605@irisa.fr> References: <41E12D14.3000402@fing.edu.uy> <20050117063322.GA26246@sta.local> <41EB8BC4.6090605@irisa.fr> Message-ID: <20050117160145.GA25113@sta.local> On Mon, Jan 17, 2005 at 10:56:20AM +0100, Ariel Sabiguero wrote: > >>If you have 3% failure at 65F at 3 years, and 15% failure >>at 80F at 3 years, >> >What is the source for that figures? >Of course that if it is the case even 80F is too much. 
> I thought it was clear, I just made up those numbers to illustrate, the hypothetical situation of different temperatures: 10% failure at 3 years means 0.01% chance of failure every day (10% / 3 / 365). Failures are not necessarily skewed at the end of the period, but could be evenly distributed, or skewed toward the beginning of the period (which I think is most often the case, some hardware just lasts while others fail at 6 months). My guess is that at higher temperatures, failures will be evenly distributed across the time period, causing continual maintenance issues -- which are more easily addressed with disk failures, than mainboard and/or cpu. Also, I should clarify, I've not setup a site like this, by experience I really meant exposure. I know the hot room and cold room setup does make a difference though. It may well be advantageous to use slow CPU (ie 1.2 Mhz, and possibly under-clocked) for raid systems in a hot room, to help preserve them. Power supplies vary wildly in quality and efficiency. The point about them being the limiting factor in high temperatures is well taken. The new macmini advertises operating temperature: 50? to 95? F (10? to 35? C) http://www.apple.com/macmini/specs.html (that may not be for continous operation) and their design would make it easy to gang several units on one specially designed power supply (for efficiency). I'm not recommending, I don't know anybody who has touched one, but I would look into them; they do run Linux, I think. // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george at galis.org From yamasaki at fis.ua.pt Mon Jan 17 11:43:55 2005 From: yamasaki at fis.ua.pt (Yoshihiro Yamasaki) Date: Mon, 17 Jan 2005 19:43:55 +0000 Subject: [Beowulf] (no subject) Message-ID: Does anyone knows how to set mmpich2 for OPTERON under pgf90 and pgcc ( 64 bits ??) including -DDEC_ALPHA and -byteswapio ?? BEST REGARDS, YYAMAZKI From Pat.Delaney at inewsroom.com Fri Jan 7 13:36:40 2005 From: Pat.Delaney at inewsroom.com (Pat Delaney) Date: Fri, 7 Jan 2005 15:36:40 -0600 Subject: [Beowulf] kickstart install using NFS Message-ID: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> Did you ever get an answer to your post?? I'm trying to do the same thing? Pat I'm preparing to install a large number of new nodes using redhat and have planned on using the kickstart option. I have gotten a kickstart file setup just the way I want it with one exception and I can not get it to work. I ultimately want to boot from a floppy and in the kickstart file tell it to get the rpm's from a nfs mount. So far I have: 1.) Booted from a cd and issued the command: linux ks=floppy This is how I built and debugged my ks file. This gives me what I want, except I get to swap CD's during the install. No nfs option at this point. 2.) I added the line nfs --server=my.local.server.com --dir=/redhat and tried the 'linux ks=floppy' again booting from the CD. It continues to get the rpms from the CD. 3.) I built a floppy from mkbootdisk, with the ks.cfg file and at the boot: prompt typed linux ks=floppy. This time it went directly to the resuce boot from the HD. 4.) I then got a recommendation from someone to modify the syslinux.cfg file on the floppy. I tried that and got errors like the following: mount: error 22 mounting ext2 pviotroot: .{stuff deleted} failed: 2 ... Kernel panic: No init found. Try passing init= ... 5.) 
I've built the system with the CD and my kickstart and made sure I could mount my nfs box and directory once everything was up and I could. I've given just about everyone premission to the directory and the export. I've looked through all the kickstart how-to's, the redhat references and can't find anything wrong. Here is the relevant part of the ks file: install lang en_US langsupport --default en_US.iso885915 en_US.iso885915 keyboard us mouse generic3ps/2 --device psaux skipx rootpw --iscrypted blahblahblah firewall --disabled authconfig --enableshadow --enablemd5 timezone America/Chicago network --bootproto dhcp nfs --server=myserver.name.com --dir=/redhat/RedHat bootloader --location=mbr clearpart --all zerombr yes part / --fstype ext3 --size 5120 part /home --fstype ext3 --size 1024 part swap --size 1024 part /scratch --fstype ext3 --size 1024 --grow The directory tree on the remote machine is: /redhat '-- RedHat |-- RPMS '-- base Are the ks commands echo'd to a file so I can see what is happening, or if there are any errors? I've looked at the anaconda-ks.cfg file and it is a very close replica of my ks.cfg file, with the glaring exception of the nfs command, and some post install stuff. Thanks in advance for any and all help or suggestions. Todd _____ * Previous message: Anyone have information on latest LSU beowulf? * Next message: kickstart install using NFS * Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] _____ More information about the Beowulf mailing list -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at streamline-computing.com Mon Jan 17 00:14:49 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Mon, 17 Jan 2005 08:14:49 +0000 Subject: [Beowulf] [Bioclusters] servers for bio web services setup (fwd from michab@dcs.gla.ac.uk) In-Reply-To: <41E67751.50504@crs4.it> References: <20050113113155.GQ9221@leitl.org> <41E67751.50504@crs4.it> Message-ID: <1105949689.5600.22.camel@Vigor45> On Thu, 2005-01-13 at 14:27 +0100, Alan Louis Scheinine wrote: > One machine might be enough. With regard to clustering, > what you need is high availability, which is different > from Beowulf. > I agree. Depending on the load which the project anticipates, one powerful machine should be enough, speccing dual PSUs and mirrored system disks. Load balancing using a cluster of machines should be considered though, and implemented if the load, or requirements for redundancy, warrant it. From bill at Princeton.EDU Mon Jan 17 05:54:16 2005 From: bill at Princeton.EDU (Bill Wichser) Date: Mon, 17 Jan 2005 08:54:16 -0500 Subject: [Beowulf] HP 2848 switch woes References: <41E2B791.6090103@princeton.edu> Message-ID: <41EBC388.2020809@princeton.edu> Well, I submitted this about a week ago! Not very timely... Maybe it's just my mailer seeing it this morning for the first time. The solution, with help coming from the CentOS mailing list, is that the HP switches come preconfigured with LACP active on every port. When the network interface is reset, the extra delay on the switch port causes the messages to get lost never allowing the head node to respond since it never sees a request. The solution is to disable the LACP on every port. From the switch's config commandline interface: no int all lacp write mem And then the switch functions fine. Bill Bill Wichser wrote: > Trying to install a new cluster of Tyan 2881 mothers with CentOS 3.3, > kernel 2.4.21-27.0.1.ELsmp (Opteron). 
> > When running through this switch (Firmware:I.08.55, ROM:I.08.04), the > system is forced to do a manual install as a failure occurs in what I > believe is the initial discovery phase after the kernel boots. > > When a direct connection is made to the head node, everything proceeds > as normal. > > During the initial booting, after PXE, the system sends a request out to > the network asking for it's MAC address. Right before this time, the > network card appears to be reset by the OS. This appears to be the > normal progression from within the kernel. > > On a direct cable, the rarp is seen and the compute node receives the > info via the head node, right after the network card is reset. Through > the switch though, the rarp is never seen by the head node. > > At first I thought it was something with autodetection and so set the > switch up for just Gig. It certainly isn't the case that rarps don't > work as the initial tftp boot works fine, the vmlinuz is downloaded and > booting proceeds. It only is when during the boot phase when the > network card is reset does communication somehow fail. > > I've set the timeout in the switch for 15 minutes, made sure spanning > tree was off, connected the cables to adjacent ports, all to no avail. > > If anyone has any suggestions I am all ears as I have run out of ideas > at this point. HP just suggests updating the firmware, which I have > done to no avail. > > Thanks, > > Bill > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pathscale.com Mon Jan 17 13:40:14 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Mon, 17 Jan 2005 13:40:14 -0800 Subject: [Beowulf] HP 2848 switch woes In-Reply-To: <41EBC388.2020809@princeton.edu> References: <41E2B791.6090103@princeton.edu> <41EBC388.2020809@princeton.edu> Message-ID: <20050117214014.GA1260@greglaptop.internal.keyresearch.com> On Mon, Jan 17, 2005 at 08:54:16AM -0500, Bill Wichser wrote: > >On a direct cable, the rarp is seen and the compute node receives the > >info via the head node, right after the network card is reset. Through > >the switch though, the rarp is never seen by the head node. >From your second posting it seems that only 1 rarp is sent. This is a bug. -- greg From reuti at staff.uni-marburg.de Mon Jan 17 13:40:51 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 17 Jan 2005 22:40:51 +0100 Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: References: <000e01c4f9a1$889f5210$52838c96@MOB> Message-ID: <1105998051.41ec30e306be0@home.staff.uni-marburg.de> I don't see a contradiction to use both: NFS for the home directories (on some sort of master node with an attached hardware RAID), PVFS2 for a shared scratch space (in case the applications need a shared scratch space across the nodes). So the question is: how big are the input files, what size are the output files and how much scratch space is needed by the applications (local or shared)? - Reuti Quoting Kumaran Rajaram : > > I would suggest PVFS2. It offers greater bandwidth (proportional to > number of I/O nodes dedicated) compared to NFS. Also, you may use PVFS2 > native file interface (rather than POSIX I/O interface) which offers > flexible parallel I/O interface for scientific workloads. 
Only caveat, is > fault-tolerance support and you might need to employ either > software/hardware RAID for disk failures and heartbeat mechanism for > server failures. If the environment has less probability of > server/disk/netowrk faults, then PVFS2 is a good choice for scientific > workloads and parallel applications. > > -Kums > > On Thu, 13 Jan 2005, Stavros E. Zogas wrote: > > > Hi at all > > I intend to setup a beowulf cluster(16+ nodes) for scientific > > applications(fortran compilers) in a University department.What am i > supposd > > to use???PVFS or NFS for file system?? > > Stavros > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From orion at cora.nwra.com Mon Jan 17 15:53:39 2005 From: orion at cora.nwra.com (Orion Poplawski) Date: Mon, 17 Jan 2005 16:53:39 -0700 Subject: [Beowulf] kickstart install using NFS In-Reply-To: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> References: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> Message-ID: <41EC5003.3010208@cora.nwra.com> Pat Delaney wrote: > Did you ever get an answer to your post?? I'm trying to do the same thing? > > Pat > > > I'm preparing to install a large number of new nodes using redhat and have > planned on using the kickstart option. I have gotten a kickstart file setup > just the way I want it with one exception and I can not get it to work. I > ultimately want to boot from a floppy and in the kickstart file tell it to > get the rpm's from a nfs mount. > > So far I have: > > 1.) Booted from a cd and issued the command: > linux ks=floppy > This is how I built and debugged my ks file. This gives me what I want, > except I get to swap CD's during the install. No nfs option at this point. > I vaguely remember trying this before. I think I ultimately found the floppy/nfs combo incompatible. Instead, get the ks file from the server too: linux ks=nfs::/exported/directory/ks.cfg -- Orion Poplawski System Administrator 303-415-9701 x222 Colorado Research Associates/NWRA FAX: 303-415-9702 3380 Mitchell Lane, Boulder CO 80301 http://www.co-ra.com From siegert at sfu.ca Mon Jan 17 17:40:58 2005 From: siegert at sfu.ca (Martin Siegert) Date: Mon, 17 Jan 2005 17:40:58 -0800 Subject: [Beowulf] kickstart install using NFS In-Reply-To: <41EC5003.3010208@cora.nwra.com> References: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> <41EC5003.3010208@cora.nwra.com> Message-ID: <20050118014058.GB17199@stikine.ucs.sfu.ca> On Mon, Jan 17, 2005 at 04:53:39PM -0700, Orion Poplawski wrote: > Pat Delaney wrote: > >Did you ever get an answer to your post?? I'm trying to do the same thing? > > > >Pat > > > > > >I'm preparing to install a large number of new nodes using redhat and have > >planned on using the kickstart option. I have gotten a kickstart file > >setup > >just the way I want it with one exception and I can not get it to work. I > >ultimately want to boot from a floppy and in the kickstart file tell it to > >get the rpm's from a nfs mount. > > > >So far I have: > > > >1.) Booted from a cd and issued the command: > >linux ks=floppy > >This is how I built and debugged my ks file. This gives me what I want, > >except I get to swap CD's during the install. No nfs option at this point. > > > > I vaguely remember trying this before. I think I ultimately found the > floppy/nfs combo incompatible. 
> > Instead, get the ks file from the server too: > > linux ks=nfs::/exported/directory/ks.cfg I have the following syslinux.cfg on the floppy: ============================================================ default ks prompt 0 label ks kernel vmlinuz append ks=floppy initrd=initrd.img lang= devfs=nomount ramdisk_size=8192 ========================================================================== and the ks.cfg file starts with ================================================================== ### Language Specification lang en_US langsupport --default en_CA en_CA ### Network Configuration network --bootproto static --device eth3 --ip 172.17.254.1 --netmask 255.255.0.0 --gateway 172.17.0.1 --hostname ks1 --nameserver 172.17.0.1 ### Source File Location nfs --server 172.17.0.1 --dir /usr/local/ks/dist/7.3 ### Ethernet Device Configuration device ethernet 3c59x ... ========================================================================== Works without any problems. -- Martin Siegert Head, HPC at SFU WestGrid Site Manager Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert at sfu.ca Canada V5A 1S6 From jkrauska at cisco.com Mon Jan 17 17:19:17 2005 From: jkrauska at cisco.com (Joel Krauska) Date: Mon, 17 Jan 2005 17:19:17 -0800 Subject: [Beowulf] distcc on Beowulf? Message-ID: <41EC6415.30305@cisco.com> Does anyone have experiences using distcc on a beowulf cluster? I'm in the process of attempting to setup a Scyld cluster to do distcc, but the BProc processing methods and lack of complete mounts on compute nodes is giving me some trouble. (gcc needs to exec cc1) I'm using kernel builds (2.6.10) as a baseline test. I'd love to get builds under 5 minutes and I'm curious if anyone has attempted or achieved similar goals. Thanks, Joel Krauska From redboots at ufl.edu Mon Jan 17 17:56:58 2005 From: redboots at ufl.edu (Paul Johnson) Date: Mon, 17 Jan 2005 20:56:58 -0500 Subject: [Beowulf] P4_SOCKBUFSIZE question Message-ID: <41EC6CEA.6040108@ufl.edu> Im new to clustering. My question is how do I determine the best P4_SOCKBUFSIZE to use? I have NetPipe already installed and working to see if changing P4_SOCKBUFSIZE will work. My cluster is 4 nodes, kernel 2.4.22-1.2149.nptl, using 100mbit ethernet. Nothing too fancy, I just want to get out as much 'performance' as possible. Oh and what would you recommend to increase performance? Any suggestions? Thanks for the help, Paul From rgb at phy.duke.edu Mon Jan 17 21:21:06 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 18 Jan 2005 00:21:06 -0500 (EST) Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050117160145.GA25113@sta.local> References: <41E12D14.3000402@fing.edu.uy> <20050117063322.GA26246@sta.local> <41EB8BC4.6090605@irisa.fr> <20050117160145.GA25113@sta.local> Message-ID: On Mon, 17 Jan 2005, George Georgalis wrote: > Also, I should clarify, I've not setup a site like this, by experience I > really meant exposure. I know the hot room and cold room setup does make > a difference though. Most of my experience has been inadvertent. AC's that fail. People that turn off the AC just for a silly reason like it is winter outside and cold (so why would you need air conditioning?). People who are trying to paint in the server room without supervision who helpfully cover the servers with plastic. Failing cooling fans. A purchase decision that (as it turned out) left us with a pile of some of the most temperature-temperamental boxes on the planet. 
One thing to make clear is that this isn't just about running ambient air a bit warmer than you should. It is about setting up your facility to remove heat. Remember, the nodes GENERATE heat. It can be cold outside, dead of winter, -20 C and with a cold wind blowing and a midsize room with 64 nodes in it will be burning between 6 and 15 kilowatts. That's enough heat to keep a small log cabin chinked with paper towels toasty warm in the middle of winter. We have at this point somewhere between 100-200 nodes in one medium sized, fairly well insulated, room. When the AC fails, we have a time measured in minutes (usually around 15-20) before the room temperature goes from maybe 15C to 30C (on its way through the roof), independent of the temperature outside. No matter WHAT your design, you'll have to have enough AC to be able to remove the heat you are releasing into the room as fast as you release it, and this is by far the bulk of your engineering requirement as far as AC is concerned. So I'm not certain what you are thinking about. You cannot really not have any AC at all, and whatever AC you have will still have to remove all that heat. What you're really comparing, then, is the MARGINAL cost of conditioning the air at a (too) high temperature vs conditioning the air to a safe operating temperature. In my estimation (which could be wrong) the amount you save keeping the room at 30C instead of the far safer 20C will be trivial, maybe $0.05-0.10/watt/year -- a small fraction of your total expenditure on power for the nodes (in the US, roughly $0.60/watt/year), the AC hardware itself (can be anywhere from tens to hundreds of thousands of dollars), and the power required to remove the heat you MUST remove just to keep the room temperature stable at ANY temperature (perhaps $0.20/watt/year). So you are risking all sorts of catastropic meltdown type situations to save maybe $5-20 per node operating cost per year against an inevitable budget for power for the nodes of $100-200 each per year. I don't even think you'll break even on the additional costs of the hardware that breaks from running things hot, let alone the human and downtime costs. To give you an idea of the magnitude of the problem, the ONE TIME our server room overheated for real, reaching 30-35C for an extended period of time (many hours -- the thermal warning system that was supposedly in place but never tested not, actually, working quite the way it was supposed to) we had node crashes galore, and a string (literally) of hardware failures over the next three months -- some immediate and "obviously" due to immediate overheating, some a week later, two weeks later, four weeks later. Nowadays if the room gets hot we respond immediately, typically getting nodes shut down within minutes of a reported failure and incipient temperature spike. When the overheating occurred I had 15 nodes racked that had run perfectly for a year. 3 blew during the event. 4 more failed over the next few months. 2 more failed after that. Power supplies, motherboards, memory chips -- that kind of heat "weakens" components so that forever afterward they are more susceptible to failure, not just during the event. The overheating can just occur one time, for a few minutes, and you'll be cursing and bitching for months and months later dealing with all the stuff that got almost-damaged, including the stuff that isn't actually broken, just bent out of spec so that it fails, sometimes, under load. 
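To put rough numbers on the per-watt figures above, a quick shell sketch (the node count and per-node wattage are invented purely for illustration; only the $/watt/year rates come from the estimates in this message):

  nodes=100; watts=150     # hypothetical: 100 nodes drawing 150 W each
  echo "$nodes $watts" | awk '{
    load = $1 * $2                                      # total heat load in watts
    printf "heat load:        %.1f kW\n",  load/1000
    printf "node power:       $%.0f/yr\n", load*0.60    # ~$0.60/W/yr to run the nodes
    printf "heat removal:     $%.0f/yr\n", load*0.20    # ~$0.20/W/yr to pump the heat back out
    printf "hot-room saving:  $%.0f-%.0f/yr\n", load*0.05, load*0.10
  }'

For this made-up cluster that works out to roughly $9000/yr of node power and $3000/yr of heat removal against a $750-1500/yr saving for running the room hot -- the same order-of-magnitude conclusion as above.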
Also to think about is that server room temperature is rarely uniform. EVEN if you are running it at 20C, there will be places in the room that are 15C (right in front of the output vents) and other places in the room that are 25-30C (right behind the nodes). Any unexpected mixing or circulation of the air in a room running at "30C" and you could have 35-40C ambient air entering some nodes some of the time, and at those temperatures I'd expect failure in a matter of days to weeks, not years. The warmest I'd ever run ambient air is 25C in a workstation environment, 22C in a server/cluster environment (where hot spots are more likely to occur). rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rgb at phy.duke.edu Mon Jan 17 21:30:57 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 18 Jan 2005 00:30:57 -0500 (EST) Subject: [Beowulf] kickstart install using NFS In-Reply-To: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> References: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> Message-ID: On Fri, 7 Jan 2005, Pat Delaney wrote: > Did you ever get an answer to your post?? I'm trying to do the same > thing? > > Pat > > I'm preparing to install a large number of new nodes using redhat and > have > planned on using the kickstart option. I have gotten a kickstart file > setup > just the way I want it with one exception and I can not get it to work. > I > ultimately want to boot from a floppy and in the kickstart file tell it > to > get the rpm's from a nfs mount. My advice here (and I do this all the time) is: a) Invest in PXE network cards and boot from the network, not from floppy. Most current linuces have to be massaged a bit to boot from floppy any more, and if you are just starting out in this you don't want to be building custom kernels or messing with initrd. b) Use http, not nfs, to get the floppies. There are numerous advantages to this. NFS isn't terribly secure. NFS isn't terribly fast. An http-based repository (perhaps a RH or Fedora mirror) is just gangbusters for both install and post-install maintenance via yum. It isn't horrible difficult to do this any more. I do it at HOME for all my personal workstations and personal cluster there. You need a single server to install whatever you like. The server needs to run tftp (to handle the boot kernel and messages). dhpc (to give out network addresses at boot time and direct the boot loader to a network-based kernel). http (to actually serve the install files). The installation files (rpms) are served read only, and you can verify their retrievability with an ordinary web browser without opening up an NFS port into your server. This is even MORE useful than floppies in so many ways. I have a nice little list of what I can boot on a node. A kickstart install. A "redhat install" where I can select packages. In principle, a "rescue" kernel and image, although the interactive install kernel and image can be used for most rescue purposes if you know what you are doing. A choice of architectures and revisions. A DOS floppy boot image for doing certain chores. memtest86. All via PXE. Most of how to do this is in HOWTOs on the web. I'd personally recommend going with FC2 or even FC3 over RH-anything, but suit yourself. KS is pretty much the same either way. rgb > > So far I have: > > 1.) 
Booted from a cd and issued the command: > linux ks=floppy > This is how I built and debugged my ks file. This gives me what I want, > except I get to swap CD's during the install. No nfs option at this > point. > > 2.) I added the line > nfs --server=my.local.server.com --dir=/redhat > and tried the 'linux ks=floppy' again booting from the CD. It continues > to > get the rpms from the CD. > > 3.) I built a floppy from mkbootdisk, with the ks.cfg file and at the > boot: > prompt typed linux ks=floppy. This time it went directly to the resuce > boot > from the HD. > > 4.) I then got a recommendation from someone to modify the syslinux.cfg > file on the floppy. I tried that and got errors like the following: > > mount: error 22 mounting ext2 > pviotroot: .{stuff deleted} failed: 2 > ... > Kernel panic: No init found. Try passing init= ... > > 5.) I've built the system with the CD and my kickstart and made sure I > could mount my nfs box and directory once everything was up and I could. > I've given just about everyone premission to the directory and the > export. > > I've looked through all the kickstart how-to's, the redhat references > and > can't find anything wrong. > > Here is the relevant part of the ks file: > > install > lang en_US > langsupport --default en_US.iso885915 en_US.iso885915 > keyboard us > mouse generic3ps/2 --device psaux > skipx > rootpw --iscrypted blahblahblah > firewall --disabled > authconfig --enableshadow --enablemd5 > timezone America/Chicago > network --bootproto dhcp > nfs --server=myserver.name.com --dir=/redhat/RedHat > bootloader --location=mbr > clearpart --all > zerombr yes > part / --fstype ext3 --size 5120 > part /home --fstype ext3 --size 1024 > part swap --size 1024 > part /scratch --fstype ext3 --size 1024 --grow > > The directory tree on the remote machine is: > > /redhat > '-- RedHat > |-- RPMS > '-- base > > Are the ks commands echo'd to a file so I can see what is happening, or > if > there are any errors? I've looked at the anaconda-ks.cfg file and it is > a > very close replica of my ks.cfg file, with the glaring exception of the > nfs > command, and some post install stuff. > > Thanks in advance for any and all help or suggestions. > Todd > > > _____ > > > * Previous message: Anyone have information on latest LSU beowulf? > > * Next message: kickstart install using NFS > > * Messages sorted by: [ date ] > [ > thread ] > > [ subject ] > > [ author ] > > > _____ > > More information about the Beowulf mailing list > > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rgb at phy.duke.edu Mon Jan 17 21:41:35 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 18 Jan 2005 00:41:35 -0500 (EST) Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: References: <000e01c4f9a1$889f5210$52838c96@MOB> Message-ID: On Mon, 17 Jan 2005, Kumaran Rajaram wrote: > > I would suggest PVFS2. It offers greater bandwidth (proportional to > number of I/O nodes dedicated) compared to NFS. Also, you may use PVFS2 > native file interface (rather than POSIX I/O interface) which offers > flexible parallel I/O interface for scientific workloads. Only caveat, is > fault-tolerance support and you might need to employ either > software/hardware RAID for disk failures and heartbeat mechanism for > server failures. 
If the environment has less probability of > server/disk/netowrk faults, then PVFS2 is a good choice for scientific > workloads and parallel applications. Hmmm, I'd have recommended the usual "it depends". If all you are doing is starting jobs from home directories or mounted project space and writing occasional results back to same (not exactly hammering the disks) and your storage needs are modest -- within the means of a single RAID 5 server (say ~1 TB or less, these days) I'd personally advise sticking with tried-and-true NFS. NFS has all sorts of well-known and in some cases venerable "problems", but it has also been the absolute bulwark of client-server computing in unix environments for some 20 years, the sine qua non of the workstation LAN. Most of its problems are manageable and invisible under ordinary usage. Only if your needs are extreme -- needs for a LARGE filespace, for lots of data parallelism, etc -- would I recommend looking into a non-NFS solution, especially for a first cluster of very modest size. I mean, disk is SO cheap at less than $1/GB. I have more disk in EACH of my home computers than existed in the world when NFS was invented. Setting up a RAID server for a small cluster with 100's of GB costs a total of well under $1000 -- so little you can get two and use one to back up the other, if you'd rather do that than mess with a tape backup unit (which will cost quite a bit MORE than $1000). Setting up NFS on a server takes a few minutes. It would take longer just to find PVFS and read the documentation on HOW to install it... If NFS doesn't work for you, then sure, look beyond it. But start out simple, especially if you are already familiar with NFS servers and clients (likely) and not so familiar with PVFS (also likely). rgb > > -Kums > > On Thu, 13 Jan 2005, Stavros E. Zogas wrote: > > > Hi at all > > I intend to setup a beowulf cluster(16+ nodes) for scientific > > applications(fortran compilers) in a University department.What am i supposd > > to use???PVFS or NFS for file system?? > > Stavros > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From rgb at phy.duke.edu Mon Jan 17 21:50:08 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 18 Jan 2005 00:50:08 -0500 (EST) Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050117063322.GA26246@sta.local> References: <41E12D14.3000402@fing.edu.uy> <20050117063322.GA26246@sta.local> Message-ID: On Mon, 17 Jan 2005, George Georgalis wrote: > If you really want to focus on efficiency and engineering, I bet one > (appropriately sized) power-supply per 3 or 5 computers is a sweet spot. > They could possibly run outside the CPU room too. For a smallish cluster, I actually was just communicating with somebody who has just such a cluster -- laid out on open shelves, one OTC PS per shelf, three mobos/shelf, no chassis at all, largish fans blowing right over the shelf mount. All it required is a bit of custom wiring harness to distribute the power on down the shelves. Regarding disks, most computers don't NEED local hard drives any more for many/most computations. 
So skip the floppy, the HD, any CD drive -- just get lots of memory (to act as a de facto ramdisk), CPU, PXE NIC and video (the latter onboard). This saves power, saves money, gives you fewer components to fail, and leaves you with money to buy better AC. But remember, also -- you MUST remove all the heat that you generate or things will get hotter and hotter as they operate. Putting e.g. PS's outside the room or inside the room just alters where you have to remove the heat from or what components you're going to choose to run hotter. I'll try to talk the owner of the cluster into posting his cluster URL. I really want him to consider writing it up for e.g. CWM. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From josip at lanl.gov Tue Jan 18 08:30:03 2005 From: josip at lanl.gov (Josip Loncaric) Date: Tue, 18 Jan 2005 09:30:03 -0700 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: References: <41E12D14.3000402@fing.edu.uy> Message-ID: <41ED398B.9080306@lanl.gov> At my old job, we had the unfortunate experience of AC failing on the hottest days of the year. Despite providing plenty of circulating fresh 35-40 deg. C air, we lost hardware, mainly disks. In fact, we'd start losing hard drives (even high quality SCSI drives in our servers) any time the ambient temperature approached 30 deg. C. Based on this experience, I'd say that keeping the ambient temperature under about 25-27 deg. C is a good policy. As Robert has pointed out, the cost of lost productivity while the system is down for hard drive replacement and reconstruction, not to mention the manpower required, can make an unreliable system "AWESOMELY expensive." In fact, I'd recommend installing a temperature activated kill switch in any cluster computing room. Remember: dissipating 5-10 KW in a small enclosed space can overheat your expensive cluster within minutes of AC failure, certainly faster than your system administrator can respond to an alarm triggered on a Sunday at 2am. Even a forced shutdown (when ambient temperature exceeds about 30 deg. C for more than a few minutes) is cheaper to fix than replacing and rebuilding several failed hard drives. Sincerely, Josip From alex at DSRLab.com Mon Jan 17 20:10:23 2005 From: alex at DSRLab.com (Alex Vrenios) Date: Mon, 17 Jan 2005 21:10:23 -0700 Subject: [Beowulf] Cluster Novice Message-ID: <200501180406.j0I46mfF001709@bluewest.scyld.com> > -----Original Message----- > > I'm intrest is in a "virtual" cpu that can run over N > machines and if > > "virtual" cpu crashes the N-1 takes over. > > well, it depends on your assumptions. for instance, how do > you detect a crash? NonStop provides a much more paranoid > view of "failure" than a simpler, software-based approach > like STONITH HA clusters. > > > I'm not looking for speed only stableness. > > beowulf is about speed, not HA. > I believe the original poster is looking for the Linux-HA Project home page at http://www.linux-ha.org/ where he can register himself for their email list in the same way he registered here. He may also want to look into the Red Hat Enterprise Linux solution called the Cluster Manager. This is expensive, the above solution is a free download. Hope this helps, From scunacc at yahoo.com Mon Jan 17 21:36:21 2005 From: scunacc at yahoo.com (scunacc) Date: Tue, 18 Jan 2005 00:36:21 -0500 Subject: [Beowulf] Re: distcc on Beowulf? 
Message-ID: <1106026581.2863.111.camel@scuna-gate> Dear Joel, > Does anyone have experiences using distcc on a beowulf cluster? >From below, it seems you mean specifically a BProc-based cluster. Yes, I've been doing that for some little while now. > I'm in the process of attempting to setup a Scyld cluster to do > distcc, but the BProc processing methods and lack of complete mounts > on compute nodes is giving me some trouble. > (gcc needs to exec cc1) Not sure what you mean by the BProc processing methods. That's not *really* relevant to what you need to achieve. If you mean the way things are configured then I agree. But it's not really a hindrance, - just need to be creative with NFS :-) The lack of mounts, libraries, binaries, etc. is a tad more important. By altering the config files and making sure you have the correct binaries and libraries in a mountable place on the master, you can kludge it to make distcc work. I'll try and fill in with details tomorrow as I have time. However, I wouldn't recommend leaving things that way long term owing to the potential amount of NFS traffic you will generate. The issue is letting distcc be in charge rather than BProc and friends for the remote distribution. If your cluster isn't being used for anything else at the time, it might be OK for you depending upon what you're trying to build. I have been building MPI-based libraries, math libraries, tools, applications etc. this way for a while but on an otherwise quiescent cluster. It's late right now, but I'll try and dig the details out on Tuesday. An alternative that I've also developed for providing a "general-case" alternative to BProc control of a cluster is to boot the cluster in a non-BProc fashion with a custom-built diskless/NFS setup. That gives you the ability to better regulate the "normality" of what the compute nodes see, but again, it's only really a workable solution if a.) You already have access to such a diskless/NFS setup (or could grow one quickly), b.) You have administrative and "time" control over the cluster c.) You have a spare machine with sufficient disk space able to act as a diskless/NFS server. You wouldn't have to alter the clients at all except to get them to remote boot from this alternate server rather than the BProc master. a.) might be tackled with ClusterKnoppix or Quantian or EduOscar or some other cluster livecd if you didn't want to develop your own diskless/NFS master/client setup. Kind regards Derek Jones. From angelv at iac.es Tue Jan 18 00:14:10 2005 From: angelv at iac.es (Angel de Vicente) Date: Tue, 18 Jan 2005 08:14:10 +0000 Subject: [Beowulf] the solution for qdel fail..... In-Reply-To: <200501110944.46037.csamuel@vpac.org> References: <1105043618.11139.5.camel@strathmill.biosc.lsu.edu> <1105055797.14387.9793.camel@blackflag.cct.lsu.edu> <1105372151.15841.5.camel@strathmill.biosc.lsu.edu> <200501110944.46037.csamuel@vpac.org> Message-ID: <16876.50514.454690.981295@guinda.iac.es> Hi Chris, Chris Samuel writes: > On Tue, 11 Jan 2005 02:49 am, Jerry Xu wrote: > > > Hi, William, Thank for your information. Just in case somebody still > > need it for openPBS configuration, here is my epilogue file.it shall be > > located in $pbshome/mom_priv/ for each node and it need to be set as > > executable and owned by root. Some others many have better epilogue > > scripts... 
> > Hmm, the only thing that worries me about that is that for those of us with > SMP clusters it is possible for a user to have two different jobs running on > each of the CPUs, so an epilogue script that kills all a users processes on a > node would accidentally kill an innocent job. We have a SMP cluster, and to avoid the death of innocent processes we use the script in section "Cleanup of MPICH/PBS jobs" in http://bellatrix.pcl.ox.ac.uk/%7Eben/pbs/ It doesn't always work, and some jobs are left lingering sometimes, but at least it doesn't kill innocents (some day I hope I will have the time to look into it and try to find out why). Hope it helps. Cheers, Angel de Vicente -- ---------------------------------- http://www.iac.es/galeria/angelv/ PostDoc Software Support Instituto de Astrofisica de Canarias From jkrauska at cisco.com Tue Jan 18 01:47:53 2005 From: jkrauska at cisco.com (Joel Krauska) Date: Tue, 18 Jan 2005 01:47:53 -0800 Subject: [Beowulf] Re: distcc on Beowulf? In-Reply-To: <1106026581.2863.111.camel@scuna-gate> References: <1106026581.2863.111.camel@scuna-gate> Message-ID: <41ECDB49.3030801@cisco.com> scunacc wrote: > Yes, I've been doing that for some little while now. Awesome. Good to hear, Derek. I'd love to hear your solution. I got it running myself, and it's looking hopeful, but having some problems doing a kernel build. Your comments about how Bproc and distcc are orthogonal was right on the mark.. distcc was having problems because the compute nodes didn't mount /usr. (Bproc doesn't really come in to the picture at all.) By adding a /usr mount point to /etc/exports and /etc/beowulf/fstab I was able to get distcc running. However as I hinted above, I seem to be running in to a kernel build problem using distcc. make -f scripts/Makefile.build obj=arch/x86_64/ia32 gcc -Wp,-MD,arch/x86_64/ia32/.syscall32.o.d -nostdinc -iwithprefix include -D__KERNEL__ -Iinclude -Wall -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -O2 -fomit-frame-pointer -mno-red-zone -mcmodel=kernel -pipe -fno-reorder-blocks -Wno-sign-compare -fno-asynchronous-unwind-tables -Wdeclaration-after-statement -DKBUILD_BASENAME=syscall32 -DKBUILD_MODNAME=syscall32 -c -o arch/x86_64/ia32/syscall32.o arch/x86_64/ia32/syscall32.c {standard input}: Assembler messages: {standard input}:5: Error: file not found: arch/x86_64/ia32/vsyscall-syscall.so {standard input}:8: Error: file not found: arch/x86_64/ia32/vsyscall-sysenter.so distcc[20537] ERROR: compile arch/x86_64/ia32/syscall32.c on .0 failed make[1]: *** [arch/x86_64/ia32/syscall32.o] Error 1 make: *** [arch/x86_64/ia32] Error 2 These files exist and show up with a bpsh -a ls -l arch/x86_64/ia32/ The above gcc command works perfectly when called locally on the master node, and fails when run under bpsh. The best I can think is that perhaps something in the make or shell environment is missing? (PATH?) I realize x86-64 is relatively new territory, so I might downgrade my cluster to a 32-bit system, since this could be difficult to find others who can reproduce, and I could figure out if it has something to do with building for this arch. Again, any insight in to your methods would be appreciated. 
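One quick way to narrow down whether this is an environment or working-directory difference under bpsh (a rough sketch -- node number 0 is just an example, and it assumes env and /bin/pwd are reachable from the compute node, e.g. via the /usr mount mentioned above):

  /bin/pwd                              # working directory on the master
  bpsh 0 /bin/pwd                       # same directory as seen from the node?
  env | sort > /tmp/env.master
  bpsh 0 env | sort > /tmp/env.node0
  diff /tmp/env.master /tmp/env.node0   # look for PATH, TMPDIR and similar differences

Since the assembler is complaining about relative paths (arch/x86_64/ia32/vsyscall-*.so), a differing working directory or TMPDIR on the node could produce this kind of "file not found" even though bpsh ls shows the files.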
Thanks, --joel From Luc.Vereecken at chem.kuleuven.ac.be Tue Jan 18 05:41:36 2005 From: Luc.Vereecken at chem.kuleuven.ac.be (Luc Vereecken) Date: Tue, 18 Jan 2005 14:41:36 +0100 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: References: <41E12D14.3000402@fing.edu.uy> <20050117063322.GA26246@sta.local> <41EB8BC4.6090605@irisa.fr> <20050117160145.GA25113@sta.local> Message-ID: <6.0.1.1.0.20050118140238.01c796c8@arrhenius.chem.kuleuven.ac.be> Hi list, I usually just lurk on this mailinglist, but i think it time to share some experience about not having Cooling... I have been running a cluster (of variable size depending on the season) in an average room without AC for several years. Not by choise, I must say, but my request for AC was rejected, and it took years before the necessary infrastructure was present to move to another room that already had AC installed. It is a horrorstory. During summer (yes, I'm on the sunny side of the building) temperatures in excess of 35? Celcius. During winter, even with the (small) window open while it was freezing outside, I couldn't get the temperature below 20?C. I just could not get rid of the generated heat, despite that this is a chemistry building and the ventilation replaces the air 7 times per hour (or is it 15 times? can't remember). Note that other rooms in this part of the building tend to be chilly in winter because it's so hard to heat them with the ventilation taking out the heated air. The first summer I had a failure rate of over 60%. Some motherboards failed, plenty of powersupplies failed, I had 10 brandnew disks that ran so hot at times i couldn't put my hand on them at these ambient temperatures. 5 of them failed in the first 6 months, the other 5 a few months later. Some CPUs just stopped working. Some memory modules burned out. I have 2 or 3 nodes where i can reproducibly crash certain jobs or get faulty results just depending on the temperature of the room. I found that during the hot season, new computers ran for about 3 months, then started to go awry. In an attempt to get rid of the hot air, I attached flexible airducts to the exhaust of the powersupplies (where most of the hot air comes out) and the ventilation sucked the hot air out directly. This idea actually works pretty well for a DIY solution, especially as we have uber-ventilation given that this room used to be a chemical lab. I might actually implement this also for our new cluster (in an AC-ed room) just to reduce the AC-requirements. On average, it reduced the temperature in the room several degrees, but I had to let go of the idea because I still had to fix nodes too often and handling the airducts was a bit too cumbersome to do on daily basis (some nodes are still attached to this system). The next summer, I wisened up, and preemtively turned off some of the slower nodes. This time failure rate of the remaining nodes was _only_ 45%, but I think this is partly because I just stopped fixing nodes at a certain point (ran out of spare powersupplies )-: ) and left the faulty nodes turned off waiting for the colder season. After more than two years, I now have access to an AC-ed room; I plan on building a completely new cluster, as all the current hardware has been overheating and is prone to produce faulty results because of this despite that the room is cooler now than in summer. My advice: don't even think about trying HW replacement instead of cooling. - Failure rates are horrible at temperatures above 30? ambient: I lost thousands of euros by failures. 
- Downtime is also killing you: my scientific output has dropped to less than 30% compared to before these heating problems started. With a bit of bad luck, I won't get a new grant because of this. - TCO is horrible due to the operator time: you have to manually walk over to the cluster, take out the node, figure out the problem, fix it, get spare parts or contact the warranty supplier,... this takes too much time especially given the high failure rates. I never worked as hard as during these last 2 years, and as I said, scientific output was still strongly reduced. - You will have difficulty with warranty. Some components you can get replaced without questions as they show no obvious signs of abuse, but I had MoBos with several components blown and blackened. No way you can claim this is not caused by overheating. After you send in a couple of these, you start to get questions... - The biggest killer of all is the non-visible problems. At certain moments I started to get different results on different nodes for the same job. A job would crash at 28?, but run OK at 24?. I get unexplainable outliers when calculating what should be a smooth trend. Rerunning the calculations exactly the same gives different results. You just cant trust your results anymore. OK. Better stop here, as i don't intend to rival rgb in length of a post :-) Anyway: going the HW-replacement road would be, in my view and based on extensive experience, a wrong decission. Luc Vereecken From yudong at hsb.gsfc.nasa.gov Tue Jan 18 07:25:46 2005 From: yudong at hsb.gsfc.nasa.gov (Yudong Tian) Date: Tue, 18 Jan 2005 10:25:46 -0500 Subject: [Beowulf] kickstart install using NFS In-Reply-To: <41EC5003.3010208@cora.nwra.com> Message-ID: You do not even need the floppies. You can just use PXE to boot, and kickstart to install/reinstall OS. Here are my notes of doing that: Installing Linux over Network: PXE, DHCP, TFTP, NFS and Kickstart http://lis.gsfc.nasa.gov/yudong/notes/net-install.txt Regards, Yudong ------------------------------------------------------------ Falun Dafa: The Tao of Meditation (http://www.falundafa.org) ------------------------------------------------------------ Yudong Tian, Ph.D. NASA/GSFC (301) 286-2275 >-----Original Message----- >From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org]On >Behalf Of Orion Poplawski >Sent: Monday, January 17, 2005 6:54 PM >To: Pat Delaney >Cc: beowulf at beowulf.org >Subject: Re: [Beowulf] kickstart install using NFS > > >Pat Delaney wrote: >> Did you ever get an answer to your post?? I'm trying to do the >same thing? >> >> Pat >> >> >> I'm preparing to install a large number of new nodes using >redhat and have >> planned on using the kickstart option. I have gotten a >kickstart file setup >> just the way I want it with one exception and I can not get it >to work. I >> ultimately want to boot from a floppy and in the kickstart file >tell it to >> get the rpm's from a nfs mount. >> >> So far I have: >> >> 1.) Booted from a cd and issued the command: >> linux ks=floppy >> This is how I built and debugged my ks file. This gives me what I want, >> except I get to swap CD's during the install. No nfs option at >this point. >> > >I vaguely remember trying this before. I think I ultimately found the >floppy/nfs combo incompatible. 
> >Instead, get the ks file from the server too: > >linux ks=nfs::/exported/directory/ks.cfg > > >-- >Orion Poplawski >System Administrator 303-415-9701 x222 >Colorado Research Associates/NWRA FAX: 303-415-9702 >3380 Mitchell Lane, Boulder CO 80301 http://www.co-ra.com >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From yamasaki at fis.ua.pt Tue Jan 18 07:39:06 2005 From: yamasaki at fis.ua.pt (Yoshihiro Yamasaki) Date: Tue, 18 Jan 2005 15:39:06 +0000 Subject: [Beowulf] OPTERON MPICH2 Message-ID: Hi ! Does anyone knows how to set mmpich2 (configure) for OPTERON under pgf90 and pgcc ( 64 bits) including -DDEC_ALPHA and -byteswapio FLAGS?? BEST REGARDS, YYAMAZKI From shaeffer at neuralscape.com Tue Jan 18 10:59:30 2005 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Tue, 18 Jan 2005 10:59:30 -0800 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <41ED398B.9080306@lanl.gov> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> Message-ID: <20050118185930.GA30223@synapse.neuralscape.com> On Tue, Jan 18, 2005 at 09:30:03AM -0700, Josip Loncaric wrote: > At my old job, we had the unfortunate experience of AC failing on the > hottest days of the year. Despite providing plenty of circulating fresh > 35-40 deg. C air, we lost hardware, mainly disks. In fact, we'd start > losing hard drives (even high quality SCSI drives in our servers) any > time the ambient temperature approached 30 deg. C. Hello, I would certainly agree with the assertion that disk drive MTBF has a strong, nonlinear dependency on operating temperature. While I have not run disks at out of spec temperatures, I did work at Seagate for a few years, where I learned of this very strong dependence. This thread began with the assertion that you do not need to cool disks, but I think this is a very ill-advised strategy. YMMV, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From rajesh at petrotel.com Tue Jan 18 11:31:17 2005 From: rajesh at petrotel.com (Rajesh Bhairampally) Date: Tue, 18 Jan 2005 13:31:17 -0600 Subject: [Beowulf] Mosix References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> Message-ID: <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> Hi, I am newbee to cluster computing; so if my question sounds stupid, please excuse me. i am wondering when we have something like mosix (distributed OS available at www.mosix.org ), why we should still develop parallel programs and strugle with PVM/MPI etc. Tough i never used either mosix or PVM/MPI, I am genunely puzzled about it. Can someone kindly educate me? thanks, rajesh From mwill at penguincomputing.com Tue Jan 18 12:29:47 2005 From: mwill at penguincomputing.com (Michael Will) Date: Tue, 18 Jan 2005 12:29:47 -0800 Subject: [Beowulf] Mosix In-Reply-To: <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> Message-ID: <200501181229.48118.mwill@penguincomputing.com> On Tuesday 18 January 2005 11:31 am, Rajesh Bhairampally wrote: > i am wondering when we have something like mosix (distributed OS available > at www.mosix.org ), why we should still develop parallel programs and > strugle with PVM/MPI etc. Because Mosix does not work? 
This of course is not really true, for some applications Mosix might be appropriate, but what it really does is transparently move processes around in a cluster, not have them become suddenly parallelized. Let's have an example: Generally your application is solving a certain problem, like say taking an image and apply a certain filter to it. You can write a program for it that is not parallel-aware, and does not use MPI and just solves the problem of creating one filtered image from one original image. This serial program might take one hour to run (assuming really large image and really complicated filter). Mosix can help you now run this on a cluster with 4 nodes, which is cool if you have 4 images and still want to wait 1 hour until you see the first result. Now if you want to really filter only one image, but in about 15 minutes, you can program your application differently so that it only works on a quarter of the image. Mosix could still help you run your code with different input data in your cluster, but then you have to collect the four pieces and stitch them together and would be unpleasently surprised because the borders of the filter will show - there was information missing because you did not have the full image available but just a quarter of it. Now when you adjust your code to exchange that border-information, you are actually already on the path to become an MPI programmer, and might as well just run it on a beowulf cluster. So the mpi aware solution to this would be a program that splits up the image into the four quadrants, forks into four pieces that will be placed on four available nodes, communicates the border-data between the pieces and finally collects the result and writes it out as one final image, all in not much more than the 15 minutes expected. Thats why you want to learn how to do real parallel programming instead of relying on some transparent mechanism to guess how to solve your problem. Michael > Tough i never used either mosix or PVM/MPI, I am > genunely puzzled about it. Can someone kindly educate me? > > thanks, > rajesh > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at LinuxWorld 2005! Hynes Convention Center, Boston, MA February 15th-17th, 2005 Booth 609 From james.p.lux at jpl.nasa.gov Tue Jan 18 12:52:50 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Tue, 18 Jan 2005 12:52:50 -0800 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <20050118185930.GA30223@synapse.neuralscape.com> Message-ID: <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> OK.. we're all agreed that running things hot is bad practice.. BUT, it seems we're all talking about "office" or "computer room" environments on problems where a failure in a processor or component has high impact. Say you have an application where you don't need long life (maybe you've got a field site where things have to work for 3 months, and then it can die), but the ambient temperature is, say, 50C. Maybe some sort of remote monitoring platform system. You've got those Seagate drives with the spec for 30C, and some small number will fail every month at that temp (about 0.5% will fail in the three months). 
But, you'll have to go through all kinds of hassle to cool the 50C down to 30C. Maybe your choice is between "sealed box at 50C" and "vented box at 30C", in a dusty dirty environment, where the reliability impact of sucking in dust is far greater than the increased failure rate due to running hot. You just run at 50C, accepting a 10 times higher (or maybe, only 4-5 times higher) failure rate. You're still down at 5% failure rate over the three months. If you've got half a dozen units, and you write your software/design your system so you can tolerate a single failure without disaster, and you might have a cost effective solution. Yes, it requires more sophistication in writing the software. Dare I say, better software design, something that is fault tolerant? There's also the prospect, not much explored in clusters, but certainly used in modern laptops, etc. of dynamically changing computation rate according to the environment. If the temperature goes up, maybe you can slow down the computations (reducing the heat load per processor) or just turn off some processors (reducing the total heat load of the cluster). Maybe you've got a cyclical temperature environment (that sealed box out in the dusty desert), and you can just schedule your computation appropriately (compute at night, rest during the day). This kind of "resource limited" scheduling is pretty common practice in spacecraft, where you've got to trade power, heat, and work to be done and keep things viable. There are very well understood ways to do it autonomously in an "optimal" fashion, although, as far as I know, nobody is brave enough to try it on a spacecraft, at least in an autonomous way. Say you have a distributed "mesh" of processors (each in a sealed box), but scattered around, in a varying environment. You could move computational work among the nodes according to which ones are best capable at a given time. I imagine a plain with some trees, where the shade is moving around, and, perhaps, rain falls occasionally to cool things off. You could even turn on and off nodes in some sort of regular pattern, waiting for them to cool down in between bursts of work. People (perhaps, some are even on this list) are developing scheduling and work allocation algorithms that could do this kind of thing (or, at least, they SHOULD be). It's a bit different than the classical batch handler, and might require some awareness within the core work to be done. Ideally, the computational task shouldn't care how many nodes are working how fast, or which nodes, but not all applications can be that divorced from knowledge of the computational environment. Jim Lux Flight Communications Systems Section Jet Propulsion Lab ----- Original Message ----- From: "Karen Shaeffer" To: "Josip Loncaric" Cc: Sent: Tuesday, January 18, 2005 10:59 AM Subject: Re: [Beowulf] Cooling vs HW replacement > On Tue, Jan 18, 2005 at 09:30:03AM -0700, Josip Loncaric wrote: > > At my old job, we had the unfortunate experience of AC failing on the > > hottest days of the year. Despite providing plenty of circulating fresh > > 35-40 deg. C air, we lost hardware, mainly disks. In fact, we'd start > > losing hard drives (even high quality SCSI drives in our servers) any > > time the ambient temperature approached 30 deg. C. > > Hello, > > I would certainly agree with the assertion that disk drive MTBF has > a strong, nonlinear dependency on operating temperature. 
While I have > not run disks at out of spec temperatures, I did work at Seagate for a > few years, where I learned of this very strong dependence. This thread > began with the assertion that you do not need to cool disks, but I > think this is a very ill-advised strategy. > > YMMV, > Karen > -- > Karen Shaeffer > Neuralscape, Palo Alto, Ca. 94306 > shaeffer at neuralscape.com http://www.neuralscape.com > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From alvin at Mail.Linux-Consulting.com Tue Jan 18 13:26:23 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 18 Jan 2005 13:26:23 -0800 (PST) Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <6.0.1.1.0.20050118140238.01c796c8@arrhenius.chem.kuleuven.ac.be> Message-ID: hi ya luc On Tue, 18 Jan 2005, Luc Vereecken wrote: > The first summer I had a failure rate of over 60%. Some motherboards the normal failure rate is say 5% or so for first 30 days or first year.. - if you lose too much more systems, than it's a vendor parts problem ( where you or they get their parts to build systems ) > failed, plenty of powersupplies failed, I had 10 brandnew disks that ran so > hot at times i couldn't put my hand on them at these ambient temperatures. > 5 of them failed in the first 6 months, the other 5 a few months later. the disks should be coool to the touch ... say no more than 30C for its operating temp ( hddtemp seems to be good measure ) - silly things like a $3 or $15 fan will keep a disk from failing, and use 2 of um to avoid single fan failure problem yyp.. after an AC failure, lots of disks will die within 2-3 months if some died during the ac failure c ya alvin From lindahl at pathscale.com Tue Jan 18 13:33:56 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Tue, 18 Jan 2005 13:33:56 -0800 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <20050118185930.GA30223@synapse.neuralscape.com> <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> Message-ID: <20050118213356.GC2652@greglaptop.internal.keyresearch.com> On Tue, Jan 18, 2005 at 12:52:50PM -0800, Jim Lux wrote: > There's also the prospect, not much explored in clusters, but certainly used > in modern laptops, etc. of dynamically changing computation rate according > to the environment. This is already done; the Pentium 4 dynamically freezes for 1 microsecond at a time when it is too hot. I've also got some AMD Athlon boxes that run 10% slower after I've been running them hard. This is apparently controlled by the BIOS, though, so I don't think it's that configurable enough to be very useful. You could rig something up with a temperature sensor (perhaps lm_sensors) and cpufreq. 
-- greg From james.p.lux at jpl.nasa.gov Tue Jan 18 14:17:46 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Tue, 18 Jan 2005 14:17:46 -0800 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <20050118185930.GA30223@synapse.neuralscape.com> <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> <20050118213356.GC2652@greglaptop.internal.keyresearch.com> Message-ID: <002a01c4fdab$8a483740$42f29580@LAPTOP152422> ----- Original Message ----- From: "Greg Lindahl" To: Sent: Tuesday, January 18, 2005 1:33 PM Subject: Re: another radical concept...Re: [Beowulf] Cooling vs HW replacement > On Tue, Jan 18, 2005 at 12:52:50PM -0800, Jim Lux wrote: > > > There's also the prospect, not much explored in clusters, but certainly used > > in modern laptops, etc. of dynamically changing computation rate according > > to the environment. > > This is already done; the Pentium 4 dynamically freezes for 1 > microsecond at a time when it is too hot. I've also got some AMD > Athlon boxes that run 10% slower after I've been running them hard. > > This is apparently controlled by the BIOS, though, so I don't think > it's that configurable enough to be very useful. You could rig > something up with a temperature sensor (perhaps lm_sensors) and > cpufreq. > > -- greg I was thinking about doing the load scheduling at a higher "cluster" level, rather than at the micro "single processor" level... That way you could manage thermal issues in a more "holistic" way.. (like also watching disk drive temps, etc.) Jim From hahn at physics.mcmaster.ca Tue Jan 18 15:50:30 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 18 Jan 2005 18:50:30 -0500 (EST) Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: <1105998051.41ec30e306be0@home.staff.uni-marburg.de> Message-ID: > I don't see a contradiction to use both: NFS for the home directories (on some > sort of master node with an attached hardware RAID), PVFS2 for a shared scratch > space (in case the applications need a shared scratch space across the nodes). that's certainly attractive. has anyone tried PVFS2 in a *parallel* cluster? that is, for tight-coupled parallel applications, it's quite critical avoid stealing cycles from the MPI worker threads. I'd be curious to know how much of a problem PVFS2 would cause this way. thanks, mark hahn. From rgb at phy.duke.edu Tue Jan 18 16:45:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue, 18 Jan 2005 19:45:17 -0500 (EST) Subject: [Beowulf] Mosix In-Reply-To: <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> Message-ID: On Tue, 18 Jan 2005, Rajesh Bhairampally wrote: > Hi, > > I am newbee to cluster computing; so if my question sounds stupid, please > excuse me. > > i am wondering when we have something like mosix (distributed OS available > at www.mosix.org ), why we should still develop parallel programs and > strugle with PVM/MPI etc. Tough i never used either mosix or PVM/MPI, I am > genunely puzzled about it. Can someone kindly educate me? The best education is in the list archives, as this has be discussed and described many times before. In a nutshell -- MOSIX takes jobs submitted on any host in a LAN and distributes them on other LAN hosts to keep global load balanced across the LAN (which might or might not be a classical "cluster"). 
For relatively simple jobs that have open files and sockets, it creates a virtual network layer for forwarding their I/O back to the originating host so if you open a file on your primary host and then the job migrates, it doesn't crash. It is one of several tools that is suitable for running lots of embarrassingly parallel jobs at once on a pool of systems (others being batch managers like e.g. SGE and/or policy tools like condor, or any of a variety of fairly new gridware). PVM or MPI are parallel communications (message passing) libraries. They facilitate writing truly parallel jobs that are intended a priori to run on a cluster, managing the passing of messages between what amount to task threads running on different nodes. The jobs thus created may well NOT be embarrassingly parallel -- in fact generally they won't be, as PVM is overkill for EP tasks although I've used it for such in the past. That is, MOSIX doesn't parallelize a serial job, it just runs lots of serial jobs (independently) in parallel. PVM or MPI run truly parallel jobs with lots of NON-independent subtasks advancing a computation, with nontrivial communications between the subtasks. Clear? rgb > > thanks, > rajesh > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From atp at piskorski.com Tue Jan 18 22:41:45 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Wed, 19 Jan 2005 01:41:45 -0500 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050117063322.GA26246@sta.local> References: <20050117063322.GA26246@sta.local> Message-ID: <20050119064145.GB28329@piskorski.com> On Mon, Jan 17, 2005 at 01:33:22AM -0500, George Georgalis wrote: > Use a SAN/NAS (nfs) and keep the disks in a separate room than the CPUs. > Disk drives generate a lot of heat, and compared to on board components > don't really need cooling, circulated air should largely cover them. This is an oxymoron. If disks generate a lot of heat, then that heat needs to be removed. If you have a small room stuffed with hundreds of those disks, all that heat has to go somewhere... -- Andrew Piskorski http://www.piskorski.com/ From john.hearns at streamline-computing.com Wed Jan 19 04:13:59 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Wed, 19 Jan 2005 12:13:59 -0000 (GMT) Subject: [Beowulf] Mosix In-Reply-To: <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> Message-ID: <10768.81.137.240.21.1106136839.squirrel@webmail.streamline-computing.com> > Hi, > > I am newbee to cluster computing; so if my question sounds stupid, please > excuse me. > > i am wondering when we have something like mosix (distributed OS available > at www.mosix.org ), why we should still develop parallel programs and > strugle with PVM/MPI etc. Tough i never used either mosix or PVM/MPI, I am > genunely puzzled about it. Can someone kindly educate me? > Rajesh, I apologise for a short answer, but I am busy right at the moment! Mosix deals with process migration. A compute-intensive process is stopped, and checkpointed, on the master node of the cluster. 
It is then transferred and restarted on one of the cluster nodes. So Mosix is good for the situation where you have programs which use a lot of CPU resource on one machine. For the process to transfer, all the machines must be running the same kernel version. PVM/MPI are parallel programming libraries. You can run MPI programs on different types of machines - the intent was that parallel codes could be ported from (say) Crays to IBM machines. If you want to run programs which can split their computation between lots and lots of CPUs then MPI is a good choice. For instance, I am at the moment preparing a system which has over 500 dual-CPU systems to run a big benchmark. MOSIX would not be suitable for getting that number of parallel processes up, running and comunicating. I should make clear that MOSIX has its place - making it simple and easy to run compute intensive jobs, say perhaps bioinformatics searches, codes which do not need inter-node communication. Hope others will add/correct my reply. John Hearns From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Jan 19 05:35:55 2005 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Wed, 19 Jan 2005 14:35:55 +0100 (CET) Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> Message-ID: On Tue, 18 Jan 2005, Jim Lux wrote: > If the temperature goes up, maybe you can slow down the computations > (reducing the heat load per processor) or just turn off some > processors (reducing the total heat load of the cluster). I've had the second part (turning off nodes) working about 4 years ago... Back when APM was a reliable way of turning off the power and ACPI was not yet supported in the kernel. At that time also the network drivers were not poluted with hooks for power management, so using ether-wake was also easy to set up (of course, if the BIOS was any good, but then I used to pick the mainboard carefully). The reason for turning off the nodes was also overheating of the computer room. While with those nodes we did not have so many problems as with the dual-Athlons that followed shortly afterwards, I acted on the same principle that was mentioned in this thread: it's better to have things running at 5 degrees (Celsius) lower. At that time we did not have any scheduling system, so the "power management" could not be done very tightly. I have set up the nodes to shut down after 24 hours of being idle; I did not want to have too many down/up cycles as these are just as (or maybe ever more) disturbing to some components. Something obvious, but maybe worth mentioning: the nodes would log somewhere that they shutdown themselves for being idle for too long; I wanted to know when that happened and, even more, to be able to make a difference between nodes that simply crashed/were unplugged/etc. and those that did a graceful shutdown and should be able to wake up in good shape. I did not have the chance to use this too much as we had a sudden increase in computational requirements that lasted several months, then the dual-Athlons came without APM and I wasn't able to reliably control the shutdown and so the whole setup became useless... Things are in better shape today, as IPMI has become more widespread and it can reliably take care of both the shutdown and the wake-up. Too bad that it is still present only on $erver-grade mainboards and even then is most often only an option. 
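A minimal sketch of that idle-shutdown policy in Python; the one-minute load average test, the 24-hour window, the log location and the shutdown command are illustrative assumptions:

    import subprocess, time

    IDLE_HOURS  = 24
    CHECK_EVERY = 300                  # seconds between checks
    MARKER      = "/var/log/idle-shutdown.log"

    idle_since = time.time()
    while True:
        load1 = float(open("/proc/loadavg").read().split()[0])
        if load1 > 0.05:               # node is busy: reset the idle clock
            idle_since = time.time()
        elif time.time() - idle_since > IDLE_HOURS * 3600:
            # leave a marker so a graceful shutdown can be told apart from a crash
            with open(MARKER, "a") as f:
                f.write("idle shutdown at %s\n" % time.ctime())
            subprocess.call(["/sbin/shutdown", "-h", "now"])
            break
        time.sleep(CHECK_EVERY)
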
You might have noticed that the original message said "just turn off some processors" while I started with "turning off nodes". I would indeed like to be able to shutdown individual CPUs from a SMP node, but this was impossible several years ago; I don't know what the status of hotswap CPU support in the kernel is now. I only hope that the hardware manufacturers will allow future multi-core CPUs to have some cores in standby/low power modes and be able to wake up without disturbing the running cores; and all these with a nice control interface - doing it "automatically" by the CPU depending on load is not so useful IMHO. -- Bogdan Costescu IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868 E-mail: Bogdan.Costescu at IWR.Uni-Heidelberg.De From robl at mcs.anl.gov Wed Jan 19 09:08:05 2005 From: robl at mcs.anl.gov (Robert Latham) Date: Wed, 19 Jan 2005 11:08:05 -0600 Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: References: <1105998051.41ec30e306be0@home.staff.uni-marburg.de> Message-ID: <20050119170805.GH4964@mcs.anl.gov> On Tue, Jan 18, 2005 at 06:50:30PM -0500, Mark Hahn wrote: > > I don't see a contradiction to use both: NFS for the home > > directories (on some sort of master node with an attached hardware > > RAID), PVFS2 for a shared scratch space (in case the applications > > need a shared scratch space across the nodes). > > that's certainly attractive. has anyone tried PVFS2 in a *parallel* > cluster? Disclosure: I'm one of the PVFS2 developers. We deliberately designed PVFS2 to perform very well as a fast shared scratch space. Let NFS do what it was designed to do -- serve home directories. To answer your question, I regularly set up PVFS2 volumes on Argonne clusters for benchmarking, testing, and experiments. Ohio Supercomputing Center has PVFS2 deployed on a pretty big cluster. People are using PVFS2 and having good results. If you try it out and you find bugs or issues (and what software doesn't?), the mailing lists have quite a few helpful people. If nodes in your cluster are acting as both IO nodes and compute nodes, you will see a performance hit (I think it's small, but that's somewhat objective). We aren't shy about that fact http://www.pvfs.org/pvfs2/pvfs2-faq.html#sec:howmany-servers More PVFS2 information: http://www.pvfs.org/pvfs2 ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Labs, IL USA B29D F333 664A 4280 315B From epaulson at cs.wisc.edu Tue Jan 18 12:09:27 2005 From: epaulson at cs.wisc.edu (Erik Paulson) Date: Tue, 18 Jan 2005 14:09:27 -0600 Subject: [Beowulf] kickstart install using NFS In-Reply-To: References: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> Message-ID: <20050118200927.GA14827@cobalt.cs.wisc.edu> On Tue, Jan 18, 2005 at 12:30:57AM -0500, Robert G. Brown wrote: > > This is even MORE useful than floppies in so many ways. I have a nice > little list of what I can boot on a node. A kickstart install. A > "redhat install" where I can select packages. In principle, a "rescue" > kernel and image, although the interactive install kernel and image can > be used for most rescue purposes if you know what you are doing. A > choice of architectures and revisions. A DOS floppy boot image for > doing certain chores. memtest86. All via PXE. > > Most of how to do this is in HOWTOs on the web. 
> Can you give a pointer to a good memtest86/PXE setup? What I would love to have is a memtest86 (or something similar - maybe PC Doctor) that I could periodically have some of my nodes boot and go into diagonstic mode for a while. I've even got everything on serial console, so I could screen scrape and watch for results. memtest spits out so much ANSI crap that it's kind of a mess to do, so I was hoping someone out there has already done it. Are there good alternatives to memtest - maybe something with an easier-to-parse output? It'd also be nice if there was a way to say "Run for an hour", but if need be we've got Baytech PDUs with remote-control power, so I can script a hard reboot if need be. Right now we periodically run a set of scripts that reads and writes files in /dev/shm to "test" memory - we've been able to find nodes with bad memory by using 'cmp' on those files. Thanks, -Erik From mathog at mendel.bio.caltech.edu Tue Jan 18 12:41:16 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Tue, 18 Jan 2005 12:41:16 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement Message-ID: > "Robert G. Brown" wrote: > > I mean, disk is SO cheap at less than $1/GB. That's certainly true for consumer grade disks. "Enterprise" or "Server" grade disks still cost a lot more than that. For instance Maxtor ultra320 drives and Seagate Cheetah drives are both about $4-5/GB. The Western Digital Raptor SATA disks are also claimed to be reliable, and are again, in the $4-5/GB range. (Ie, it isn't just SCSI that makes server disks expensive.) Sure, you can RAID the cheaper ATA/SATA disks and replace them as they fail, but if you're really working them hard, the word from the storage lists is that they will indeed fail. (Let Google be your friend.) Note that our compute nodes' disks are just Western Digital ATA drives, and we've only had one failure out of 20 in 2 years. But we don't push those drives very hard. Under normal conditions they are only used to boot the OS or occasionaly to a new load a database into memory. The disk server uses SCSI disks and is pushed much harder. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From shaeffer at neuralscape.com Tue Jan 18 16:50:09 2005 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Tue, 18 Jan 2005 16:50:09 -0800 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <20050118185930.GA30223@synapse.neuralscape.com> <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> Message-ID: <20050119005009.GA31750@synapse.neuralscape.com> On Tue, Jan 18, 2005 at 12:52:50PM -0800, Jim Lux wrote: > OK.. we're all agreed that running things hot is bad practice.. BUT, it > seems we're all talking about "office" or "computer room" environments on > problems where a failure in a processor or component has high impact. > > Say you have an application where you don't need long life (maybe you've got > a field site where things have to work for 3 months, and then it can die), > but the ambient temperature is, say, 50C. Maybe some sort of remote > monitoring platform system. > > You've got those Seagate drives with the spec for 30C, and some small number > will fail every month at that temp (about 0.5% will fail in the three > months). But, you'll have to go through all kinds of hassle to cool the 50C > down to 30C. 
> > Maybe your choice is between "sealed box at 50C" and "vented box at 30C", in > a dusty dirty environment, where the reliability impact of sucking in dust > is far greater than the increased failure rate due to running hot. Hi Jim, Of course, you make excellent points here. I don't know, as I don't work on such problems. If I did, just thinking off the top of my head, I would probably want to examine battery backed ramdisk or NAND flash as having far more interesting characteristics for short duty in harsh environments. This is not my field of endeavor, so I defer to you. (smiles ;) Thanks for your comments. Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From sbrenneis at surry.net Tue Jan 18 17:25:12 2005 From: sbrenneis at surry.net (Steve Brenneis) Date: Tue, 18 Jan 2005 20:25:12 -0500 Subject: [Beowulf] Mosix In-Reply-To: <200501181229.48118.mwill@penguincomputing.com> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> <200501181229.48118.mwill@penguincomputing.com> Message-ID: <1106097911.4910.32.camel@localhost.localdomain> On Tue, 2005-01-18 at 15:29, Michael Will wrote: > On Tuesday 18 January 2005 11:31 am, Rajesh Bhairampally wrote: > > i am wondering when we have something like mosix (distributed OS available > > at www.mosix.org ), why we should still develop parallel programs and > > strugle with PVM/MPI etc. > > Because Mosix does not work? > > This of course is not really true, for some applications Mosix might be appropriate, > but what it really does is transparently move processes around in a cluster, not > have them become suddenly parallelized. > > Let's have an example: > > Generally your application is solving a certain problem, like say taking an image and apply > a certain filter to it. You can write a program for it that is not parallel-aware, and does not use > MPI and just solves the problem of creating one filtered image from one original image. > > This serial program might take one hour to run (assuming really large image and really > complicated filter). > > Mosix can help you now run this on a cluster with 4 nodes, which is cool if you have 4 > images and still want to wait 1 hour until you see the first result. > > Now if you want to really filter only one image, but in about 15 minutes, you can program your > application differently so that it only works on a quarter of the image. Mosix could still help you > run your code with different input data in your cluster, but then you have to collect the four pieces > and stitch them together and would be unpleasently surprised because the borders of the filter > will show - there was information missing because you did not have the full image available but just > a quarter of it. Now when you adjust your code to exchange that border-information, you are actually > already on the path to become an MPI programmer, and might as well just run it on a beowulf cluster. > > So the mpi aware solution to this would be a program that splits up the image into the four quadrants, > forks into four pieces that will be placed on four available nodes, communicates the border-data between > the pieces and finally collects the result and writes it out as one final image, all in not much more than > the 15 minutes expected. > > Thats why you want to learn how to do real parallel programming instead of relying on some transparent > mechanism to guess how to solve your problem. 
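The border exchange described in the quoted text is essentially a halo exchange. A minimal sketch using mpi4py, assuming exactly four ranks in a 2x2 layout and leaving the filter itself as a stub:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    quarter = np.random.rand(512, 512)      # stand-in for this rank's image quadrant

    # swap a one-pixel border with the horizontal neighbour (0<->1, 2<->3)
    neighbour = rank ^ 1
    send_col  = quarter[:, -1].copy() if rank % 2 == 0 else quarter[:, 0].copy()
    halo      = np.empty_like(send_col)
    comm.Sendrecv(send_col, dest=neighbour, recvbuf=halo, source=neighbour)

    # the vertical neighbour (rank ^ 2) is handled the same way; the filter then
    # runs on the quadrant padded with its halos, and rank 0 gathers and stitches
    # the four filtered pieces back into one image
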
> > Michael > > Ignoring the inflammatory opening of the above response, I'll just state that its representation of what Mosix does and how it works is neither fair nor accurate. Before message-passing mechanisms arrived, and before the concept of multi-threading was introduced, the favored mechanism for multi-processing and parallelism was the good old fork-join method. That is, a parent process divided the task into small, manageable sub-tasks and then forked child processes off to handle each subtask. When the subtask was complete, the child notified the parent (usually by simply exiting) and the parent joined the results of the sub-tasks into the final task result. This mechanism works quite well on multi-tasking operating systems with various scheduling models. It can be effective on multi-CPU single systems or on clusters of single or multiple CPU systems. Mosix (or at least Open Mosix) handles this kind of parallelism brilliantly in that it will balance the forked child processes around the cluster based on load factors. So your image processing, your Gaussian signal analysis, your fluid dynamics simulations, your parallel software compilations, or your Fibonacci number generations are efficiently distributed while you still maintain programmatic control of the sub-tasking. While the fork-join mechanism is not without a downside (synchronization, for one, as mentioned above), it can be used with a system like Mosix to provide parallelism without the overhead of the message-passing paradigm. Maybe not better, probably not worse, just different. The effect described above in which sub-tasks operate completely independently to produce an erroneous result is really an artifact of poor programming and design skills and cannot be blamed on the task distribution system. Mosix is used regularly to do image processing and other highly parallel tasks. Creating a system like this for Mosix requires no knowledge of a message-passing interface or API, but simply requires a working knowledge of standard multi-processing methods and parallelism in general. One final note: most people consider a Mosix cluster to be a Beowulf as long as it meets the requirements of using commodity hardware and readily available software. Just keeping the record straight. > > > Tough i never used either mosix or PVM/MPI, I am > > genunely puzzled about it. Can someone kindly educate me? > > > > thanks, > > rajesh > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > -- Steve Brenneis From reuti at staff.uni-marburg.de Tue Jan 18 17:53:36 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Wed, 19 Jan 2005 02:53:36 +0100 Subject: [Beowulf] PVFS or NFS in a Beowulf cluster? In-Reply-To: References: Message-ID: <1106099616.41edbda00e494@home.staff.uni-marburg.de> I must admit, that I didn't implement it up to now, because were are still waiting for a new cluster... The idea behind it, is to set aside some of the nodes to be PVFS2 servers only, and leaving the remaining nodes for pure calculations. We have only one application which needs a shared scratch space across the calculation nodes (for which we are at this time using the home directory of the user); others are happy with local scratch space on each calculation node. At the PVFS website is also a description about some speed tests, to get the right amount of PVFS nodes. 
It depends of course on the used application, and (in our case) how often this special application is used in the cluster. http://www.parl.clemson.edu/pvfs/desc.html Anyway, I would suggest to start with some speed tests to get the best amount of PVFS servers. This way the MPI tasks are not slowed down on the calculation only nodes. Cheers - Reuti Quoting Mark Hahn : > > I don't see a contradiction to use both: NFS for the home directories (on > some > > sort of master node with an attached hardware RAID), PVFS2 for a shared > scratch > > space (in case the applications need a shared scratch space across the > nodes). > > that's certainly attractive. has anyone tried PVFS2 in a *parallel* > cluster? > > that is, for tight-coupled parallel applications, it's quite critical avoid > > stealing cycles from the MPI worker threads. I'd be curious to know how > much > of a problem PVFS2 would cause this way. > > thanks, mark hahn. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From richard at hinditron.com Tue Jan 18 20:42:20 2005 From: richard at hinditron.com (Richard Chang) Date: Wed, 19 Jan 2005 10:12:20 +0530 Subject: [Beowulf] Cluster Novice. I want to know more about user space Message-ID: <005a01c4fde1$69a95330$20acd6d2@laptop> Hi all, Here is my situation. I am really new to clusters and I am assingned the task to learn about it. As maintaining one such cluster will become my bread and butter. Let me start. I want to know how is a cluster viewed from the Desktop of a User. I have to maintain a LINUX Cluster and Is it same as the user logs in to a Standalone Linux Box. Will he see all the nodes as a whole or can he see all the nodes individually. Is he going to see only the master node? which perhaps is the only Node connected to the site network and the rest of the Nodes are connected to the master node, thru some internal network not accessable to the site network. When we login to the Cluster, are we connected to the whole setup or just the master node?. What happens to the abundant Hard disk space we have in all the other nodes, Can the user use it?. If yes how, coz he is logging into the master node only and how can he access the other nodes. If the hard disk space is only used for scratch, then why do we need a 72Gig Hard drive for that matter?. These are some of the issues annoying me. Pls forgive me if I am a little boring and I will be glad if someone can really guide me. Cheers, Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From cflau at clc.cuhk.edu.hk Tue Jan 18 23:53:03 2005 From: cflau at clc.cuhk.edu.hk (John Lau) Date: Wed, 19 Jan 2005 15:53:03 +0800 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster Message-ID: <1106121183.17565.90.camel@nuts.clc.cuhk.edu.hk> Hi, Have anyone try running MPI programs with MPICH on heterogeneous cluster with both i386 and x86_64 machines? Can I use a i386 binary on the i386 machines while use a x86_64 binary on the x86_64 machines for the same MPI program? I thought they can communicate before but it seems that I was wrong because I got error in the testing. Have anyone try that before? 
Best regards, John Lau -- John Lau Chi Fai cflau at clc.cuhk.edu.hk Software Engineer Center for Large-Scale Computation From scunacc at yahoo.com Wed Jan 19 03:08:16 2005 From: scunacc at yahoo.com (scunacc) Date: Wed, 19 Jan 2005 06:08:16 -0500 Subject: [Beowulf] Re: distcc on Beowulf? In-Reply-To: <41ECDB49.3030801@cisco.com> References: <1106026581.2863.111.camel@scuna-gate> <41ECDB49.3030801@cisco.com> Message-ID: <1106132895.2863.145.camel@scuna-gate> Dear Joel, > Awesome. Good to hear, Derek. > I'd love to hear your solution. Sounds like you are doing the right thing anyway re: the mounting now. Not sure I could offer more unless you run into other issues. You also want to make sure you are sharing the complete set of libraries you need via the config file libraries section but that's about it. How are you starting your remote distccd's? I found that it was also useful to create /root/.distcc on each node. > I got it running myself, and it's looking hopeful, but having some > problems doing a kernel build. > > Your comments about how Bproc and distcc are orthogonal was right on the > mark.. distcc was having problems because the compute nodes didn't > mount /usr. (Bproc doesn't really come in to the picture at all.) > > By adding a /usr mount point to /etc/exports and /etc/beowulf/fstab I > was able to get distcc running. As I said above... ;-) > However as I hinted above, I seem to be running in to a kernel build > problem using distcc. >... > {standard input}: Assembler messages: > {standard input}:5: Error: file not found: > arch/x86_64/ia32/vsyscall-syscall.so > ... > These files exist and show up with a > bpsh -a ls -l arch/x86_64/ia32/ > > ... > The best I can think is that perhaps something in the make or shell > environment is missing? (PATH?) One thing to check: Where are the intermediate files being created? Is /tmp accessible on all the nodes? Is that being shared (shouldn't be - should be local to each node)? Also, - are the target dirs being mounted r/w on the nodes? If not, then it might be having problems creating the target files. I take it you're doing the kernel build in /usr/src/linux? Is /usr mounted ro or r/w remotely? Just a thought. Kind regards Derek Jones. From rgb at phy.duke.edu Wed Jan 19 17:19:29 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 19 Jan 2005 20:19:29 -0500 (EST) Subject: [Beowulf] kickstart install using NFS In-Reply-To: <20050118200927.GA14827@cobalt.cs.wisc.edu> References: <2F6133743473D04B8685415F8243F4761D3597@madison-msg1.global.avidww.com> <20050118200927.GA14827@cobalt.cs.wisc.edu> Message-ID: On Tue, 18 Jan 2005, Erik Paulson wrote: > On Tue, Jan 18, 2005 at 12:30:57AM -0500, Robert G. Brown wrote: > > > > This is even MORE useful than floppies in so many ways. I have a nice > > little list of what I can boot on a node. A kickstart install. A > > "redhat install" where I can select packages. In principle, a "rescue" > > kernel and image, although the interactive install kernel and image can > > be used for most rescue purposes if you know what you are doing. A > > choice of architectures and revisions. A DOS floppy boot image for > > doing certain chores. memtest86. All via PXE. > > > > Most of how to do this is in HOWTOs on the web. > > > > Can you give a pointer to a good memtest86/PXE setup? What I would love I can probably just send it to you, but I've got to collect it and we are weather-jammed here and I just spent six hours on the road to go 25 miles. Remind me in a day or so if I forget. 
It is pretty easy, and I'm pretty sure I can send you an image and my tftp setup. rgb > to have is a memtest86 (or something similar - maybe PC Doctor) that > I could periodically have some of my nodes boot and go into diagonstic > mode for a while. I've even got everything on serial console, so I > could screen scrape and watch for results. memtest spits out so much > ANSI crap that it's kind of a mess to do, so I was hoping someone > out there has already done it. Are there good alternatives to memtest - > maybe something with an easier-to-parse output? > > It'd also be nice if there was a way to say "Run for an hour", but if > need be we've got Baytech PDUs with remote-control power, so I can > script a hard reboot if need be. > > Right now we periodically run a set of scripts that reads and writes files > in /dev/shm to "test" memory - we've been able to find nodes with bad memory > by using 'cmp' on those files. > > Thanks, > > -Erik > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From csamuel at vpac.org Wed Jan 19 19:54:14 2005 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 20 Jan 2005 14:54:14 +1100 Subject: [Beowulf] 64 bit Xeons? In-Reply-To: <1105980452.3166.6.camel@hpti10.fsl.noaa.gov> References: <20050112000239.GB14480@maybe.org> <1105980452.3166.6.camel@hpti10.fsl.noaa.gov> Message-ID: <200501201454.16524.csamuel@vpac.org> On Tue, 18 Jan 2005 03:47 am, Craig Tierney wrote: > It uses the 64-bit extensions that AMD uses for the Opteron. I believe that Intel tinkered with a couple of the instructions of the original AMD instruction set, so whilst it's almost identical it's not quite. The major one that I'm aware of is that although it reports it supports an address size of 40 bits it can only handle 36 bits of physical memory. There's more information on differences here: http://www.ussg.iu.edu/hypermail/linux/kernel/0402.3/0276.html also here: http://www.redhat.com/docs/manuals/enterprise/RHEL-3-Manual/release-notes/as-amd64/RELEASE-NOTES-U2-x86_64-en.html#id3938207 and here: http://www.extremetech.com/article2/0,3973,1561875,00.asp?kc=ETRSS02129TX1K0000532 cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From csamuel at vpac.org Wed Jan 19 20:05:35 2005 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 20 Jan 2005 15:05:35 +1100 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050118213356.GC2652@greglaptop.internal.keyresearch.com> References: <41E12D14.3000402@fing.edu.uy> <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> <20050118213356.GC2652@greglaptop.internal.keyresearch.com> Message-ID: <200501201505.37583.csamuel@vpac.org> On Wed, 19 Jan 2005 08:33 am, Greg Lindahl wrote: > This is already done; the Pentium 4 dynamically freezes for 1 > microsecond at a time when it is too hot. 
The 2.6 Linux kernel can detect these and logs them if you make sure you've enabled: CONFIG_X86_MCE_P4THERMAL when you build your kernel. See ./arch/i386/kernel/cpu/mcheck/p4.c for more details.. Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From csamuel at vpac.org Wed Jan 19 20:14:07 2005 From: csamuel at vpac.org (Chris Samuel) Date: Thu, 20 Jan 2005 15:14:07 +1100 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <002a01c4fdab$8a483740$42f29580@LAPTOP152422> References: <41E12D14.3000402@fing.edu.uy> <20050118213356.GC2652@greglaptop.internal.keyresearch.com> <002a01c4fdab$8a483740$42f29580@LAPTOP152422> Message-ID: <200501201514.09570.csamuel@vpac.org> On Wed, 19 Jan 2005 09:17 am, Jim Lux wrote: > I was thinking about doing the load scheduling at a higher "cluster" level, > rather than at the micro "single processor" level... That way you could > manage thermal issues in a more "holistic" way.. (like also watching disk > drive temps, etc.) Hmm, well Moab (the next generation of the Maui scheduler from SuperCluster / ClusterResources) supports Ganglia as a source of information, so technically if you could get that temperature stuff into Ganglia (which should be that hard with lm_sensors and gmetric) you could persuade Moab to include that in decisions (such as marking a node that's too hot as 'busy' so as to not put any more jobs there).. Using Moab with Ganglia is documented here: http://www.clusterresources.com/products/mwm/docs/13.5nativerm.shtml There's some information about other possible ways to get that data into Moab here: http://www.clusterresources.com/products/mwm/docs/13.6multirm.shtml The only problem that we've found (which has stopped us using this so far) is that if you're running Ganglia on systems that aren't part of PBS's view of the cluster (such as your login nodes) then Moab starts to put job reservations onto them that can never be fulfilled.. cheers! Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From jkrauska at cisco.com Wed Jan 19 16:15:43 2005 From: jkrauska at cisco.com (Joel Krauska) Date: Wed, 19 Jan 2005 16:15:43 -0800 Subject: [Beowulf] transcode Similar Video Processing on Beowulf? Message-ID: <41EEF82F.3020309@cisco.com> Has anyone gotten transcode or one of it's many variants working on a Beowulf environment? Converting MPEG2 to MPEG4 is a very CPU intensive process. Often taking more than real time to process the Video streams. I read a paper a while back about someone doing this using Condor, but I'm wondering if anyone does this regularly on Beowulf. I'd love to learn more about your setup. Thanks, joel From lindahl at pathscale.com Wed Jan 19 23:01:21 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Wed, 19 Jan 2005 23:01:21 -0800 Subject: [Beowulf] 64 bit Xeons? 
In-Reply-To: <200501201454.16524.csamuel@vpac.org> References: <20050112000239.GB14480@maybe.org> <1105980452.3166.6.camel@hpti10.fsl.noaa.gov> <200501201454.16524.csamuel@vpac.org> Message-ID: <20050120070121.GB1611@greglaptop.greghome.keyresearch.com> On Thu, Jan 20, 2005 at 02:54:14PM +1100, Chris Samuel wrote: > On Tue, 18 Jan 2005 03:47 am, Craig Tierney wrote: > > > It uses the 64-bit extensions that AMD uses for the Opteron. > > I believe that Intel tinkered with a couple of the instructions of the > original AMD instruction set, so whilst it's almost identical it's not quite. > > The major one that I'm aware of is that although it reports it supports an > address size of 40 bits it can only handle 36 bits of physical memory. ... which is because Intel copied it too closely, not because Intel tinkered with it. -- greg From mwill at penguincomputing.com Thu Jan 20 08:18:10 2005 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 20 Jan 2005 08:18:10 -0800 Subject: [Beowulf] Mosix In-Reply-To: <1106097911.4910.32.camel@localhost.localdomain> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> <200501181229.48118.mwill@penguincomputing.com> <1106097911.4910.32.camel@localhost.localdomain> Message-ID: <41EFD9C2.1000405@penguincomputing.com> Steve, his original question was why we still bother with mpi and other parallel programming headaches when instead we could just use Mosix that does things transparently. My response intended to clarify that you still need parallell programming techniques, and your point that you could then also use mosix to have them migrate around (and away from the ressources in the worst case) transparently is true. My point is: There is no automated transparent parallelization of your serial code. My apologies if my answer was not clear enough. Michael Will Steve Brenneis wrote: >On Tue, 2005-01-18 at 15:29, Michael Will wrote: > > >>On Tuesday 18 January 2005 11:31 am, Rajesh Bhairampally wrote: >> >> >>>i am wondering when we have something like mosix (distributed OS available >>>at www.mosix.org ), why we should still develop parallel programs and >>>strugle with PVM/MPI etc. >>> >>> >>Because Mosix does not work? >> >>This of course is not really true, for some applications Mosix might be appropriate, >>but what it really does is transparently move processes around in a cluster, not >>have them become suddenly parallelized. >> >>Let's have an example: >> >>Generally your application is solving a certain problem, like say taking an image and apply >>a certain filter to it. You can write a program for it that is not parallel-aware, and does not use >>MPI and just solves the problem of creating one filtered image from one original image. >> >>This serial program might take one hour to run (assuming really large image and really >>complicated filter). >> >>Mosix can help you now run this on a cluster with 4 nodes, which is cool if you have 4 >>images and still want to wait 1 hour until you see the first result. >> >>Now if you want to really filter only one image, but in about 15 minutes, you can program your >>application differently so that it only works on a quarter of the image. 
Mosix could still help you >>run your code with different input data in your cluster, but then you have to collect the four pieces >>and stitch them together and would be unpleasently surprised because the borders of the filter >>will show - there was information missing because you did not have the full image available but just >>a quarter of it. Now when you adjust your code to exchange that border-information, you are actually >>already on the path to become an MPI programmer, and might as well just run it on a beowulf cluster. >> >>So the mpi aware solution to this would be a program that splits up the image into the four quadrants, >>forks into four pieces that will be placed on four available nodes, communicates the border-data between >>the pieces and finally collects the result and writes it out as one final image, all in not much more than >>the 15 minutes expected. >> >>Thats why you want to learn how to do real parallel programming instead of relying on some transparent >>mechanism to guess how to solve your problem. >> >>Michael >> >> >> >> > >Ignoring the inflammatory opening of the above response, I'll just state >that its representation of what Mosix does and how it works is neither >fair nor accurate. > >Before message-passing mechanisms arrived, and before the concept of >multi-threading was introduced, the favored mechanism for >multi-processing and parallelism was the good old fork-join method. That >is, a parent process divided the task into small, manageable sub-tasks >and then forked child processes off to handle each subtask. When the >subtask was complete, the child notified the parent (usually by simply >exiting) and the parent joined the results of the sub-tasks into the >final task result. This mechanism works quite well on multi-tasking >operating systems with various scheduling models. It can be effective on >multi-CPU single systems or on clusters of single or multiple CPU >systems. > >Mosix (or at least Open Mosix) handles this kind of parallelism >brilliantly in that it will balance the forked child processes around >the cluster based on load factors. So your image processing, your >Gaussian signal analysis, your fluid dynamics simulations, your parallel >software compilations, or your Fibonacci number generations are >efficiently distributed while you still maintain programmatic control of >the sub-tasking. > >While the fork-join mechanism is not without a downside >(synchronization, for one, as mentioned above), it can be used with a >system like Mosix to provide parallelism without the overhead of the >message-passing paradigm. Maybe not better, probably not worse, just >different. > >The effect described above in which sub-tasks operate completely >independently to produce an erroneous result is really an artifact of >poor programming and design skills and cannot be blamed on the task >distribution system. Mosix is used regularly to do image processing and >other highly parallel tasks. Creating a system like this for Mosix >requires no knowledge of a message-passing interface or API, but simply >requires a working knowledge of standard multi-processing methods and >parallelism in general. > >One final note: most people consider a Mosix cluster to be a Beowulf as >long as it meets the requirements of using commodity hardware and >readily available software. > >Just keeping the record straight. > > > >>>Tough i never used either mosix or PVM/MPI, I am >>>genunely puzzled about it. Can someone kindly educate me? 
>>> >>>thanks, >>>rajesh >>> >>>_______________________________________________ >>>Beowulf mailing list, Beowulf at beowulf.org >>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf >>> >>> >>> From eugen at leitl.org Thu Jan 20 09:04:40 2005 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 20 Jan 2005 18:04:40 +0100 Subject: [Beowulf] Re: Mac Mini Monster (fwd from kstaats@terrasoftsolutions.com) Message-ID: <20050120170440.GB9221@leitl.org> ----- Forwarded message from Kai Staats ----- From: Kai Staats Date: Thu, 20 Jan 2005 09:48:50 -0700 To: Kit Plummer Cc: Apple Scitech Mailing List Subject: Re: Mac Mini Monster Organization: Terra Soft Solutions, Inc. User-Agent: KMail/1.7 Reply-To: kstaats at terrasoftsolutions.com Kit, > Kind of reminds of that brick you guys used to have... Indeed. I was thinking the same thing. The briQ still exists (from its OEM, Total Impact) but has been limited (in my opinion) its keeping up w/the CPU performance curve (for HPC). I will assume that if IBM/Apple can eventually move a 970-based CPU into a laptop, the mini too could harbor a G5. In my mind (similar to the Cluster Node vs the Xserve), a stripped-down mini (sans CD, video, local drive) w/2 gig-e and shared power would be interesting. kai _______________________________________________ Do not post admin requests to the list. They will be ignored. Scitech mailing list (Scitech at lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/scitech/eugen%40leitl.org This email sent to eugen at leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From mwill at penguincomputing.com Thu Jan 20 09:24:03 2005 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 20 Jan 2005 09:24:03 -0800 Subject: [Beowulf] transcode Similar Video Processing on Beowulf? In-Reply-To: <41EEF82F.3020309@cisco.com> References: <41EEF82F.3020309@cisco.com> Message-ID: <41EFE933.3050100@penguincomputing.com> That is indeed interesting, and also, have you looked at cinelerra? http://heroinewarrior.com/cinelerra.php3 It does not use MPI but has it's own client/server videorenderingfarm infrastructure, and could probably be adapted to the beowulf model, with the gui running on the headnode and the compute nodes simply encoding part of the video. Michael Joel Krauska wrote: > Has anyone gotten transcode or one of it's many variants working on a > Beowulf environment? > > Converting MPEG2 to MPEG4 is a very CPU intensive process. Often > taking more than real time to process the Video streams. > > I read a paper a while back about someone doing this using Condor, but > I'm wondering if anyone does this regularly on Beowulf. > > I'd love to learn more about your setup. 
> > Thanks, > > joel > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pathscale.com Thu Jan 20 09:42:51 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Thu, 20 Jan 2005 09:42:51 -0800 Subject: [Beowulf] Mosix In-Reply-To: <41EFD9C2.1000405@penguincomputing.com> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> <200501181229.48118.mwill@penguincomputing.com> <1106097911.4910.32.camel@localhost.localdomain> <41EFD9C2.1000405@penguincomputing.com> Message-ID: <20050120174251.GB3734@greglaptop.greghome.keyresearch.com> > >Before message-passing mechanisms arrived, and before the concept of > >multi-threading was introduced, Note that message-passing predates MP. Not that anyone really cares about ancient history, but... > >Just keeping the record straight. Amen. -- greg From mark.westwood at ohmsurveys.com Thu Jan 20 00:26:23 2005 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Thu, 20 Jan 2005 08:26:23 +0000 Subject: [Beowulf] Cluster Novice. I want to know more about user space In-Reply-To: <005a01c4fde1$69a95330$20acd6d2@laptop> References: <005a01c4fde1$69a95330$20acd6d2@laptop> Message-ID: <41EF6B2F.7080001@ohmsurveys.com> Richard, I manage a small cluster here, so my answers are based on experience of one Beowulf. The cluster runs on SuSE Linux, which is probably irrelevant to any of the answers. We use it for running Fortran codes crunching a lot of numbers in parallel - and intended use does have some influence on the configuration of a cluster. Richard Chang wrote: > Hi all, > Here is my situation. > > I am really new to clusters and I am assingned the task to learn about > it. As maintaining one such cluster will become my bread and butter. > > Let me start. I want to know how is a cluster viewed from the Desktop of > a User. I have to maintain a LINUX Cluster and Is it same as the user > logs in to a Standalone Linux Box. Will he see all the nodes as a whole > or can he see all the nodes individually. Regular users log on to the head node which is just another Linux box. We use grid engine for job scheduling, so users submit their jobs to grid engine, which takes care of placing them onto the compute nodes. Regular users never log on directly to the compute nodes - though I guess we could construct an artificial (for us at least) scenario where this would be useful. > > Is he going to see only the master node? which perhaps is the only Node > connected to the site network and the rest of the Nodes are connected to > the master node, thru some internal network not accessable to the site > network. Yes, the regular user only 'sees' the head node. The cluster has a private network not shared with the office network. I guess we could configure it differently and effectively put all the compute nodes on the office network, but that's kind of moving towards Low Performance Computing and we put a lot of effort into extracting High Performance from the cluster. > > When we login to the Cluster, are we connected to the whole setup or > just the master node?. Best think of it as just logging into the cluster. But like the marketeers say, 'the network is the computer' (or is that 'the computer is the network' ?) so the user gets all the cluster services while running an interactive session on the head node. 
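In practice, "submitting a job to grid engine" from the head node is just handing a script to qsub; a minimal Python sketch, in which the job name, the directives and the ./my_solver binary are illustrative:

    import subprocess

    # the '#$' lines are grid engine directives embedded in the job script
    job_script = "\n".join([
        "#!/bin/sh",
        "#$ -N demo_job",          # job name
        "#$ -cwd",                 # run in the submission directory
        "./my_solver input.dat > output.dat",
    ]) + "\n"

    with open("demo_job.sh", "w") as f:
        f.write(job_script)

    # grid engine, not the user, decides which compute node actually runs it
    subprocess.call(["qsub", "demo_job.sh"])
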
> > What happens to the abundant Hard disk space we have in all the other > nodes, Can the user use it?. If yes how, coz he is logging into the > master node only and how can he access the other nodes. If the hard disk > space is only used for scratch, then why do we need a 72Gig Hard drive > for that matter?. The cheap and cheerful IDE disks in the compute nodes store O/S etc. Grendel forbid that they would ever be used for swap space during a computation but it does happen sometimes. All the useful, fast SCSI disks are in a RAID array attached to the head node. But this arrangement is quite use-specific, albeit very common. Our big computations do not do a lot of input / output once the initial data has been read from disk and distributed to the compute node's memory. Our cluster is configured for high performance parallel computing, I suppose a cluster which is built for a web-server farm would have a requirement for much faster i/o on all compute nodes. That's getting outside my area of expertise so I will go no further. > > These are some of the issues annoying me. Pls forgive me if I am a > little boring and I will be glad if someone can really guide me. There's a lot of information about all this out there. I like the book 'Beowulf Cluster Computing with Linux' as a good survey of many / most aspects of cluster computing but there are plenty of others available. Then there's google ... Hope some of this is useful Mark > > Cheers, > > Richard > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Mark Westwood Parallel Programmer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From william.dieter at gmail.com Wed Jan 19 21:59:22 2005 From: william.dieter at gmail.com (William Dieter) Date: Thu, 20 Jan 2005 00:59:22 -0500 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <20050118185930.GA30223@synapse.neuralscape.com> <008f01c4fd9f$ac8fd030$32a8a8c0@LAPTOP152422> Message-ID: On Tue, 18 Jan 2005 12:52:50 -0800, Jim Lux wrote: > There's also the prospect, not much explored in clusters, but certainly used > in modern laptops, etc. of dynamically changing computation rate according > to the environment. If the temperature goes up, maybe you can slow down the > computations (reducing the heat load per processor) or just turn off some > processors (reducing the total heat load of the cluster). Maybe you've got > a cyclical temperature environment (that sealed box out in the dusty > desert), and you can just schedule your computation appropriately (compute > at night, rest during the day). For non-real-time systems, it is a similar problem to the heterogeneous load balancing problem, where the load balancer does not have complete control over the load on the machine (e.g., cycle scavenging systems like Condor where a user can sit down and start running jobs.) The difference is that the capacity of the machine changes when the system scales the frequency and voltage rather than due to newly arriving jobs. 
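A toy sketch of that kind of capacity-aware splitting, assuming each node's current clock frequency is an acceptable proxy for how much work it should get:

    def partition(total_units, node_freqs_mhz):
        # hand out work in proportion to each node's current speed
        total_freq = sum(node_freqs_mhz)
        shares = [total_units * f // total_freq for f in node_freqs_mhz]
        # give any rounding remainder to the fastest node
        shares[node_freqs_mhz.index(max(node_freqs_mhz))] += total_units - sum(shares)
        return shares

    # one node throttled to 1.2 GHz, two running at 2.0 GHz:
    print(partition(1000, [1200, 2000, 2000]))    # -> [230, 386, 384]
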
If the temperature change happens slowly enough, you would probably be able to predict when a rebalancing will be needed before it really is. To work most efficiently the computational task would have to know how to arbitrarily partition the problem, and migrate pieces of work between nodes. A simple approach for applications that are not so easily divided, might be to locally monitor temperature and scale voltage and frequency to keep temperature under a predetermined limit. However, if one machine is warmer than others (maybe it is near a window or far from a vent), it would slow down the entire application. Assuming only one application is running (or at least only one that you care about) there is no point in having some nodes running faster than others. The system could distribute speed scaling information so that all nodes will slow down to the speed of the slowest, keeping them more or less in sync and reducing the global amount of heat generated. With less heat generated the room would cool off, and the warmest node (and all the others) would be able to speed up somewhat. Nodes could also slow down below their advertised speed when they have less work to do. The Transmeta Effcieons already sort of do this. They reduce speed when a certain percentage of the CPU time is idle. Or if the system notices a job is memory bandwidth limited, it could slow down the CPU to match the memory speed. > This kind of "resource limited" scheduling is pretty common practice in > spacecraft, where you've got to trade power, heat, and work to be done and > keep things viable. > > There are very well understood ways to do it autonomously in an "optimal" > fashion, although, as far as I know, nobody is brave enough to try it on a > spacecraft, at least in an autonomous way. > > Say you have a distributed "mesh" of processors (each in a sealed box), but > scattered around, in a varying environment. You could move computational > work among the nodes according to which ones are best capable at a given > time. I imagine a plain with some trees, where the shade is moving around, > and, perhaps, rain falls occasionally to cool things off. You could even > turn on and off nodes in some sort of regular pattern, waiting for them to > cool down in between bursts of work. This would be especially true if each node runs off battery power, for example in a sensor network. In addition to the reliability issues, the amount of energy that can be extracted from the battery is much lower if the battery temperature is too high. Jobs could migrate to a new node just before each node gets too hot, as long as the network is dense enough to still cover the sensed phenomenon. The main limitation would be how fast the nodes can cool off in the low power mode. For example if it takes twice as long for a node to cool as it does to heat up, you would need two idle nodes for each active node. Bill. -- Bill Dieter. Assistant Professor Electrical and Computer Engineering University of Kentucky Lexington, KY 40506-0046 From richard at hinditron.com Thu Jan 20 03:50:48 2005 From: richard at hinditron.com (Richard Chang) Date: Thu, 20 Jan 2005 17:20:48 +0530 Subject: [Beowulf] Cluster Novice. I want to know more about user space References: <005a01c4fde1$69a95330$20acd6d2@laptop> <41EF6B2F.7080001@ohmsurveys.com> Message-ID: <008a01c4fee6$7817dee0$1eadd6d2@laptop> Hi Mark, Thank you for the response and I appreciate your promptness. But Isn't it funny . The guys who sell solutions to the customer are actually non-technical guys. 
In my case, the marketing people who sold the cluster of nodes did not suggest an external RAID box for the Cluster. They were going ga ga .... over the amount of harddisk real estate each node will have. They just say that it is enough . In my case it will be a 96 node cluster, with each node having a mirrored 72Gig drive. As per them it comes upto 6TB of space for the user. Do you think it is possible. If yes, how? As per me, Like I said, I am new to cluster computing and they are old players in this field. So, maybe they are right!!. BTW.. I will also be using SuSE. So would you mind if I ask you for help about SuSE in the future?. Cheers, Richard ----- Original Message ----- From: "Mark Westwood" To: "Richard Chang" Cc: "Beowulf Mail List" Sent: Thursday, January 20, 2005 1:56 PM Subject: Re: [Beowulf] Cluster Novice. I want to know more about user space > Richard, > > I manage a small cluster here, so my answers are based on experience of > one Beowulf. The cluster runs on SuSE Linux, which is probably irrelevant > to any of the answers. We use it for running Fortran codes crunching a > lot of numbers in parallel - and intended use does have some influence on > the configuration of a cluster. > > Richard Chang wrote: >> Hi all, >> Here is my situation. >> I am really new to clusters and I am assingned the task to learn about >> it. As maintaining one such cluster will become my bread and butter. >> Let me start. I want to know how is a cluster viewed from the Desktop of >> a User. I have to maintain a LINUX Cluster and Is it same as the user >> logs in to a Standalone Linux Box. Will he see all the nodes as a whole >> or can he see all the nodes individually. > Regular users log on to the head node which is just another Linux box. We > use grid engine for job scheduling, so users submit their jobs to grid > engine, which takes care of placing them onto the compute nodes. Regular > users never log on directly to the compute nodes - though I guess we could > construct an artificial (for us at least) scenario where this would be > useful. >> Is he going to see only the master node? which perhaps is the only Node >> connected to the site network and the rest of the Nodes are connected to >> the master node, thru some internal network not accessable to the site >> network. > Yes, the regular user only 'sees' the head node. The cluster has a > private network not shared with the office network. I guess we could > configure it differently and effectively put all the compute nodes on the > office network, but that's kind of moving towards Low Performance > Computing and we put a lot of effort into extracting High Performance from > the cluster. > >> When we login to the Cluster, are we connected to the whole setup or >> just the master node?. > Best think of it as just logging into the cluster. But like the > marketeers say, 'the network is the computer' (or is that 'the computer is > the network' ?) so the user gets all the cluster services while running an > interactive session on the head node. > >> What happens to the abundant Hard disk space we have in all the other >> nodes, Can the user use it?. If yes how, coz he is logging into the >> master node only and how can he access the other nodes. If the hard disk >> space is only used for scratch, then why do we need a 72Gig Hard drive >> for that matter?. > The cheap and cheerful IDE disks in the compute nodes store O/S etc. > Grendel forbid that they would ever be used for swap space during a > computation but it does happen sometimes. 
All the useful, fast SCSI disks > are in a RAID array attached to the head node. But this arrangement is > quite use-specific, albeit very common. Our big computations do not do a > lot of input / output once the initial data has been read from disk and > distributed to the compute node's memory. Our cluster is configured for > high performance parallel computing, I suppose a cluster which is built > for a web-server farm would have a requirement for much faster i/o on all > compute nodes. That's getting outside my area of expertise so I will go > no further. > >> These are some of the issues annoying me. Pls forgive me if I am a >> little boring and I will be glad if someone can really guide me. > There's a lot of information about all this out there. I like the book > 'Beowulf Cluster Computing with Linux' as a good survey of many / most > aspects of cluster computing but there are plenty of others available. > Then there's google ... > > Hope some of this is useful > > Mark > >> Cheers, >> Richard >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > -- > Mark Westwood > Parallel Programmer > OHM Ltd > The Technology Centre > Offshore Technology Park > Claymore Drive > Aberdeen > AB23 8GD > United Kingdom > > +44 (0)870 429 6586 > www.ohmsurveys.com > From tc at cs.bath.ac.uk Thu Jan 20 09:09:05 2005 From: tc at cs.bath.ac.uk (Tom Crick) Date: Thu, 20 Jan 2005 17:09:05 +0000 Subject: [Beowulf] Writing MPICH2 programs Message-ID: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> Hi, Are there any resources for writing MPICH2 programs? I've found the MPICH2 User's Guide (Argonne National Laboratory), but haven't been able to find any decent material detailing the approaches and methods to writing programs for MPICH2. Thanks and regards, Tom Crick tc at cs.bath.ac.uk http://www.cs.bath.ac.uk/~tc From john.hearns at streamline-computing.com Thu Jan 20 10:29:59 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Thu, 20 Jan 2005 18:29:59 -0000 (GMT) Subject: [Beowulf] transcode Similar Video Processing on Beowulf? In-Reply-To: <41EFE933.3050100@penguincomputing.com> References: <41EEF82F.3020309@cisco.com> <41EFE933.3050100@penguincomputing.com> Message-ID: <10746.81.137.240.21.1106245799.squirrel@webmail.streamline-computing.com> > That is indeed interesting, and also, have you looked at cinelerra? > > http://heroinewarrior.com/cinelerra.php3 > > It does not use MPI but has it's own client/server videorenderingfarm > infrastructure, and could probably be adapted to the beowulf model, with > the gui running on the headnode and the compute nodes simply encoding > part of the video. > Also have a look at Dyny:bolic http://www.dynebolic.org/ This is a bootable-CD distro, which aims at setting up a suite of media production PCs from a bootable CD - so non-technical types can do the setup. It includes (dare I say it!) OpenMOSIX, which should be useful for applications like that. I'd guess that the first PC to be booted up elects itself as the DHCP server (easy script - did I get a DCHP? No? Well then..) and Mosix master node. 
The website is a little bit TOO funky however, but it should give you some ideas From lusk at mcs.anl.gov Thu Jan 20 10:46:00 2005 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Thu, 20 Jan 2005 12:46:00 -0600 (CST) Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> Message-ID: <20050120.124600.56681754.lusk@localhost> From: Tom Crick > Are there any resources for writing MPICH2 programs? I've found the MPICH2 > User's Guide (Argonne National Laboratory), but haven't been able to find any > decent material detailing the approaches and methods to writing programs for > MPICH2. MPICH2 is one implementation of MPI, which is an API for writing parallel programs. I think what you want is help with writing MPI programs, which then will run on any MPI implementation, including MPICH2. Our own contribution in this area (plug alert!) is the pair of books, "Using MPI" (Gropp, Lusk, and Skjellum), and "Using MPI-2" (Gropp, Lusk, Thakur), There are other good books and book chapters around as well. - Rusty Lusk From rgb at phy.duke.edu Thu Jan 20 11:19:32 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Thu, 20 Jan 2005 14:19:32 -0500 (EST) Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> Message-ID: On Thu, 20 Jan 2005, Tom Crick wrote: > Hi, > > Are there any resources for writing MPICH2 programs? I've found the MPICH2 > User's Guide (Argonne National Laboratory), but haven't been able to find any > decent material detailing the approaches and methods to writing programs for > MPICH2. Magazine columns -- both linux world and cluster world magazines -- have columns where they regularly/often guide you in the construction of MPI programs of all sorts. Some of these are archived online for free -- google might help you there. The other thing to look for is books. MIT press, in particular, has a fairly nice set of books on using MPI. I don't know if they are MPICH2 specific, though. I also don't know if it matters -- one would expect MPI to be MPI, for the most part, possibly augmented or updated, but with the same basic core and approach to programming. rgb > > Thanks and regards, > > Tom Crick > tc at cs.bath.ac.uk > http://www.cs.bath.ac.uk/~tc > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From hahn at physics.mcmaster.ca Thu Jan 20 16:34:51 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Thu, 20 Jan 2005 19:34:51 -0500 (EST) Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050119064145.GB28329@piskorski.com> Message-ID: > > Use a SAN/NAS (nfs) and keep the disks in a separate room than the CPUs. > > Disk drives generate a lot of heat, and compared to on board components > > don't really need cooling, circulated air should largely cover them. > > This is an oxymoron. If disks generate a lot of heat, then that heat > needs to be removed. If you have a small room stuffed with hundreds > of those disks, all that heat has to go somewhere... 
not only is it an oxymoron, but it's also not true ;) disks do not generate a lot of heat. aside: Maxtor, in particular, has gradually stopped putting ANY useful data into their so-called datasheets. they used to at least give you idle and seek current, for instance. now it's only idle, which seems deliberately perverse ;) Seagate, on the other hand, has improved their documentation - you can actually get 100 pages of spec on the Cheetah, for instance. the quality reminds me of thorough specs that IBM used to publish about their disks. from that doc 12W idle, peak 18W active. that's 10K, and a 2G FC model, which is noticably higher than a SCSI model, and a lot higher than a SATA model, especially 7.2K. 18W for a large, maximally hot disk. disks are usually packed at densities of around 4/U, so a full rack (no space for controllers, etc) is about 3.1KW. as compared to maybe 14KW for a fairly agressive compute rack. (yes, I know there are some vendors who are at least talking about 8-10 disks/U...) more realistically, just use nice modern SATA disks at half that power, and twice the GB/disk (or 1/4 the KW/TB). and as your storage needs grow, the importance of power-savings increases, since you are less likley to keep 2 TB busy than 1 TB. for this particular Cheetah, it looks like you go down to under 5W if you spin down. oh, cool! HGST is still doing very nice specs, 271 pages for 7k400: sleep .7W standby 1 low rpm idle 4.4/6.8/9.0 (depends on load/unload and "low rpm idle" mode random r/w 13W regards, mark hahn. From tc at cs.bath.ac.uk Fri Jan 21 01:52:28 2005 From: tc at cs.bath.ac.uk (Tom Crick) Date: Fri, 21 Jan 2005 09:52:28 +0000 Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <20050120.124600.56681754.lusk@localhost> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> <20050120.124600.56681754.lusk@localhost> Message-ID: <1106301148.13613.24.camel@tcr.cs.bath.ac.uk> On Thu, 2005-01-20 at 18:46, Rusty Lusk wrote: > From: Tom Crick > > > Are there any resources for writing MPICH2 programs? I've found the MPICH2 > > User's Guide (Argonne National Laboratory), but haven't been able to find any > > decent material detailing the approaches and methods to writing programs for > > MPICH2. > > MPICH2 is one implementation of MPI, which is an API for writing > parallel programs. I think what you want is help with writing MPI > programs, which then will run on any MPI implementation, including > MPICH2. Ah ok, I had misunderstood the relevance of MPI and MPICH. Yes, I need help in writing MPI programs! I don't want to write a normal C program for a task and then convert it to work with MPI; it makes sense to design for MPI from the start. > Our own contribution in this area (plug alert!) is the pair of books, > "Using MPI" (Gropp, Lusk, and Skjellum), and "Using MPI-2" (Gropp, Lusk, > Thakur), There are other good books and book chapters around as well. Thanks for the recommendation (and from Robert Brown) - I'd found these books on Amazon when I was searching for resources yesterday, so I'll check them out. Has anyone written any nice MPI test programs that I could use to test my cluster? I've been using the ones given with the MPICH2 distribution, but it'd be good to try some others. 
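The sort of thing I have in mind is only a little bigger than the bundled examples - say, a minimal ring-pass along these lines (just a sketch: build with mpicc, run with mpiexec across at least two ranks; the token value and the timing are arbitrary):

/* ring.c - sketch of a small MPI sanity test: pass a token around
 * every rank and time the round trip. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, next, prev, token;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }

    t1 = MPI_Wtime();
    if (rank == 0)
        printf("token back at rank 0 after %d hops in %g s\n", size, t1 - t0);

    MPI_Finalize();
    return 0;
}

Something like that exercises process startup plus a point-to-point transfer over every link in the ring, which seems like most of what a first cluster checkout needs.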
Thanks, Tom Crick tc at cs.bath.ac.uk http://www.cs.bath.ac.uk/~tc From eugen at leitl.org Fri Jan 21 03:41:43 2005 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 21 Jan 2005 12:41:43 +0100 Subject: [Beowulf] Cell Architecture Explained Message-ID: <20050121114143.GV9221@leitl.org> A bit sensationalist, but nevertheless interesting. Of course, PS2 CPU was also touted as a scientific workstation back then. Never happened. Link: http://slashdot.org/article.pl?sid=05/01/21/022226 Posted by: CowboyNeal, on 2005-01-21 08:37:00 from the closer-looks dept. IdiotOnMyLeft writes "OSNews features an article written by Nicholas Blachford about the new processor developed by IBM and Sony for their Playstation 3 console. The article goes [1]deep inside the Cell architecture and describes why it is a revolutionary step forwards in technology and until now, the most serious threat to x86. '5 dual core Opterons directly connected via HyperTransport should be able to achieve a similar level of performance in stream processing - as a single Cell. The PlayStation 3 is expected to have have 4 Cells.'" [2]Click Here References 1. http://www.blachford.info/computer/Cells/Cell0.html ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From john.hearns at streamline-computing.com Fri Jan 21 03:46:59 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri, 21 Jan 2005 11:46:59 -0000 (GMT) Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <1106301148.13613.24.camel@tcr.cs.bath.ac.uk> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> <20050120.124600.56681754.lusk@localhost> <1106301148.13613.24.camel@tcr.cs.bath.ac.uk> Message-ID: <10044.81.137.240.21.1106308019.squirrel@webmail.streamline-computing.com> > On Thu, 2005-01-20 at 18:46, Rusty Lusk wrote: >> > Has anyone written any nice MPI test programs that I could use to test > my cluster? I've been using the ones given with the MPICH2 distribution, > but it'd be good to try some others. > For a cluster checkout, I normally run the cpi example, then cpi altered to keep looping forever. Then the Pallas benchmark and hpl running on all nodes. I guess though you are looking for something more interesting. Have a Google for Mandelbrot MPI programs maybe. From hahn at physics.mcmaster.ca Fri Jan 21 06:19:14 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri, 21 Jan 2005 09:19:14 -0500 (EST) Subject: [Beowulf] Mosix In-Reply-To: <1106097911.4910.32.camel@localhost.localdomain> Message-ID: > Before message-passing mechanisms arrived, and before the concept of > multi-threading was introduced, the favored mechanism for > multi-processing and parallelism was the good old fork-join method. That nah, threads have been around a very long time - after all separate processes assume you have separate address spaces and thus something like an MMU. it's not like MMU's were on the first computers! > Mosix (or at least Open Mosix) handles this kind of parallelism > brilliantly in that it will balance the forked child processes around sure. Mosix is great. 
it just doesn't do everything, especially it doesn't introduce parallelism to a serial application, and it provides only one fairly restrictive mechanism for parallelism. stretching the latter boundary obviously leads to inefficiency. > system like Mosix to provide parallelism without the overhead of the > message-passing paradigm. Maybe not better, probably not worse, just "overhead" of message passing? how strange! look, either you have certain communication needs or you don't. Mosix permits a certain kind of communication (in terms of looseness and granularity), which may actually work well for your application. for instance, if your level of parallelism is approximately the same as your number of CPUs, and have parallel chunks which do almost no communication, and they run for long enough, then by all means, fork em off, fire and forget, and Mosix is your best buddy. otoh, many people reagard "real" parallelism to be much more tightly coupled than that. for instance, suppose you're doing a gravity simulation where each star in your virtual cosmos influences the motion of each other star. MPI is what you want, though you can also do it using shared memory (OpenMP). the point though is that you absolutely must think in terms of message passing no matter how your parallelism is implemented, because you have so much communication. message passing is not an overhead, but rather a consequence of what data your problem needs to exchange. if you have a lot of data exchange, and do not think in terms of discrete packets of data collected and sent where needed, your performance will SUCK. if you do not have serious communication, there are other paradigms which may suit you, and which have implementations which may well work efficiently. for instance, some applications expose parallelism in streams, which transform data, usually in a digraph. a regular and pipelinable communication pattern like this just *begs* for an implementation which is tuned for it (ie something involving producer/consumer/buffer models). if your communication is so sparse that you can literally for my $problem (@problems) { exec("application $problem") if (fork() == 0); } well, good for you! what you need is dead easy, and can be done nicely using Mosix (actually, almost any cluster will work well, since even a network of scavenged workstations can do: for my $problem (@problems) { exec("submittoqueue application $problem") if (fork() == 0); } personally, I believe that Mosix is mostly interesting only where communication is minimal, but parallelism is also extremely dynamic. after all, parallelism isn't wildly varying, then any old queue-manager can create a good load-balance (without bothing with migration). ("another level of indirection solves any problem") > distribution system. Mosix is used regularly to do image processing and > other highly parallel tasks. Creating a system like this for Mosix "highly parallel" here means lots of loosely-coupled parallelism. it's really a good idea to distinguish between the *amount* of parallelism and how coupled it is. > One final note: most people consider a Mosix cluster to be a Beowulf as > long as it meets the requirements of using commodity hardware and > readily available software. sort of. purely from the load-balance perspective, Mosix dynamically balances load through migration, and most "normal" beowulfs don't. (though Scyld does, in a very useful sense.) 
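to make the gravity example above concrete: in an all-pairs code the communication really is a collective exchange of positions every timestep. a sketch only (equal masses, G = 1, sizes illustrative, nothing tuned):

/* nbody_step.c - sketch of the tightly coupled exchange in an
 * all-pairs gravity code: every rank needs every rank's positions
 * every timestep. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>

#define NLOCAL 1024   /* particles per rank - illustrative */

void timestep(MPI_Comm comm, double pos[][3], double acc[][3])
{
    int size, i, j, k;
    MPI_Comm_size(comm, &size);

    /* the whole communication step: every rank gathers every position */
    double (*all)[3] = malloc((size_t)size * NLOCAL * sizeof *all);
    MPI_Allgather(pos, 3 * NLOCAL, MPI_DOUBLE,
                  all, 3 * NLOCAL, MPI_DOUBLE, comm);

    /* local all-pairs force loop; the 1e-9 softening keeps close pairs
     * (and the zero-distance self term) finite */
    for (i = 0; i < NLOCAL; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (j = 0; j < size * NLOCAL; j++) {
            double d[3], r2 = 1e-9, inv;
            for (k = 0; k < 3; k++) {
                d[k] = all[j][k] - pos[i][k];
                r2 += d[k] * d[k];
            }
            inv = 1.0 / (r2 * sqrt(r2));
            for (k = 0; k < 3; k++)
                acc[i][k] += d[k] * inv;
        }
    }
    free(all);
}

int main(int argc, char **argv)
{
    static double pos[NLOCAL][3], acc[NLOCAL][3];   /* zeroed; fill with real data */
    int step;

    MPI_Init(&argc, &argv);
    for (step = 0; step < 10; step++)
        timestep(MPI_COMM_WORLD, pos, acc);
    MPI_Finalize();
    return 0;
}

the Allgather is exactly the packet of data that has to be collected and sent where needed; transparent process migration doesn't make it go away.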
but the real point is that if you use Mosix, and therefore eschew MPI, you are restricting the set of problems which you can efficiently handle. regards, mark hahn. From eugen at leitl.org Fri Jan 21 06:57:06 2005 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 21 Jan 2005 15:57:06 +0100 Subject: [Beowulf] Mosix In-Reply-To: References: <1106097911.4910.32.camel@localhost.localdomain> Message-ID: <20050121145706.GZ9221@leitl.org> On Fri, Jan 21, 2005 at 09:19:14AM -0500, Mark Hahn wrote: > otoh, many people reagard "real" parallelism to be much more tightly coupled > than that. for instance, suppose you're doing a gravity simulation where > each star in your virtual cosmos influences the motion of each other star. > MPI is what you want, though you can also do it using shared memory (OpenMP). > the point though is that you absolutely must think in terms of message > passing no matter how your parallelism is implemented, because you have so > much communication. On a mildly lunatic note, message-passing fits the constraints of the computational physics of this universe very nicely. When I ask memory for a word, I send the address message, and receive a contents message. That's a minimal overhead in terms of signals propagating and gates switching required, and whether you see it, or not, depends on your requirement profile. SCI internode can be in principle very close to accessing physical memory. The latency would be largely relativistic, and depend on the distance. If I have a physical system, there's no way how a part of it can influence simultaneously all others (in a meanigful way, that we can tell apart from random way) in a relativistic universe. Depending on your temporal step size, and the size of the system broadcast or a locally-coupled communication pattern might be appropriate. > message passing is not an overhead, but rather a consequence of what data > your problem needs to exchange. if you have a lot of data exchange, and > do not think in terms of discrete packets of data collected and sent where > needed, your performance will SUCK. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From hahn at physics.mcmaster.ca Fri Jan 21 07:28:01 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri, 21 Jan 2005 10:28:01 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: Message-ID: > > I mean, disk is SO cheap at less than $1/GB. > > That's certainly true for consumer grade disks. "Enterprise" it also includes midline/nearline disks, which are certainly not "consumer-grade" (whatever that means). > or "Server" grade disks still cost a lot more than that. For this is a very traditional, glass-house outlook. it's the same one that justifies a "server" at $50K being qualitatively different from a commodity 1U dual at $5K. there's no question that there are differences - the only question is whether the price justifies those differences. > instance Maxtor ultra320 drives and Seagate Cheetah drives > are both about $4-5/GB. The Western Digital Raptor > SATA disks are also claimed to be reliable, and are > again, in the $4-5/GB range. (Ie, it isn't just SCSI pricewatch says $2-3/GB for Raptors. 
I'd question whether Raptors are anything other than a boutique product. certainly they do not represent a paradigm shift (enterprise-but-sata-not-scsi). > that makes server disks expensive.) the real question is whether "server" disks make sense in your application. what are the advantages? 1. longer warranty - 5yrs vs typical 3ys for commodity disks. this rule is currently being broken by Seagate. the main caveat is whether you will want that disk (and/or server) in 3-5 years. 2. higher reliability - typically 1.2-1.4M hours, and usually specified under higher load. this is a very fuzzy area, since commodity disks often quote 1Mhr under "lower" load. 3. very narrow recording band, higher RPM, lower track density. these are all features that optimize for low and relatively consistent seek performance. in fact, the highest RPM disks actually *don't* have the highest sustained bandwidth - "consumer" disks are lower RPM, but have higher recording density and bandwidth. 4. SCSI or FC. always has been and apparently always will be significantly more expensive infrastructure than PATA was or SATA is. so really, you have to work to imagine the application that perfectly suits a "server" disk. for instance, you can obtain whatever level of reliability you want from raid, rather than ultra-premium-spec disks. is your data access pattern really one which requires a disk optimized for seeks? > Sure, you can RAID the cheaper ATA/SATA disks and > replace them as they fail, but if you're really > working them hard, the word from the storage lists is that > they will indeed fail. (Let Google be your friend.) under what circumstances will you have a 100% duty cycle? first, you need some server that really is 24x7 (let's imagine that all visa card transactions worldwide update a DB on your server). OK, you might well do that using a "server" disk, since DB logs have fairly uniform wear, and constant activity. but Visa would and does use many distributed/ replicated/raided servers. > load a database into memory. The disk server uses SCSI disks > and is pushed much harder. I've looked at the duty cycle of our servers, and am impressed how low it is. even, for instance, on a server that is head node and sole fileserver for a cluster of 70ish diskless duals, the duty cycle is quite low. I believe that people overestimate the duty cycle of their servers. I've also seen servers whose duty cycle got far more reasonable when, for instance, filesystems were mounted with noatime. and for that matter, some mail packages configure their spool directories for all synchronous operation without noticing that the filesystem journals. in summary: there is a place for super-premium disks, but it's just plain silly to assume that if you have a server, it therefore needs SCSI/FC. you need to look at your workload, and design the disk system based on that, using raid for sure, and probably most of your space on 5-10x cheaper SATA-based storage. regards, mark hahn. From josip at lanl.gov Fri Jan 21 09:03:18 2005 From: josip at lanl.gov (Josip Loncaric) Date: Fri, 21 Jan 2005 10:03:18 -0700 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: References: Message-ID: <41F135D6.9020700@lanl.gov> Mark Hahn wrote: > > aside: Maxtor, in particular, has gradually stopped putting ANY > useful data into their so-called datasheets. I noticed the same thing recently about Maxtor. If this trend continues, we'll be getting less technical information about some commodity computer parts than we normally get about bottled water. 
Western Digital still supplies technical data on their drives, and Hitachi (formerly IBM) drives advertise particularly low power dissipation. I haven't checked Seagate lately, so Mark's comment is welcome: > Seagate, on the other hand, has improved their documentation - > you can actually get 100 pages of spec on the Cheetah, for instance. Sincerely, Josip From atp at piskorski.com Fri Jan 21 10:38:30 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Fri, 21 Jan 2005 13:38:30 -0500 Subject: [Beowulf] Cell Architecture Explained In-Reply-To: <20050121114143.GV9221@leitl.org> References: <20050121114143.GV9221@leitl.org> Message-ID: <20050121183830.GA64426@piskorski.com> On Fri, Jan 21, 2005 at 12:41:43PM +0100, Eugen Leitl wrote: > > A bit sensationalist, but nevertheless interesting. Of course, PS2 CPU was > also touted as a scientific workstation back then. Never happened. > > Link: http://slashdot.org/article.pl?sid=05/01/21/022226 > 1. http://www.blachford.info/computer/Cells/Cell0.html Making the attempt to figure out and explain the available Cell info is perhaps admirable, but as far as I can tell given my own very modest understanding of processor architectures, design, and fabrication, the author of that report is no expert. And I wouldn't trust anyone BUT an expert to give a hardware analysis of a brand new, un-released chip based on sketchy and/or purposely obfuscated data. Now, if the folks doing the MIT RAW chip cared to write an analysis of Cell, THEN I'd pay attention: http://cag.csail.mit.edu/raw/ -- Andrew Piskorski http://www.piskorski.com/ From mathog at mendel.bio.caltech.edu Fri Jan 21 11:09:41 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Fri, 21 Jan 2005 11:09:41 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement Message-ID: > > > or "Server" grade disks still cost a lot more than that. For > > this is a very traditional, glass-house outlook. it's the same one > that justifies a "server" at $50K being qualitatively different > from a commodity 1U dual at $5K. there's no question that there > are differences - the only question is whether the price justifies > those differences. The MTBF rates quoted by the manufacturers are one indicator of disk reliability, but from a practical point of view the number of years of warranty coverage on the disk is a more useful metric. The manufacturer has an incentive to be sure that those disks with a 5 year warranty really will last 5 years. Unclear to me what their incentive is to support the MTBF rates since only a sustained and careful testing regimen over many, many disks could challenge the manufacturer's figures. And who would run such an analysis??? Buy the 5 year disk and you'll have a working disk, or a replacement for it, for 5 years. In some uses it would clearly be cheaper to use (S)ATA disks and replace them as they fail, so long as they don't fail 4x faster than the Cheetahs. Google around for "disk reliability" though and you'll find some real horror stories about disk failure rates in, for instance, SCSI -> ATA RAID arrays. > > the real question is whether "server" disks make sense in your application. > what are the advantages? > > 1. longer warranty - 5yrs vs typical 3ys for commodity disks. > this rule is currently being broken by Seagate. the main caveat > is whether you will want that disk (and/or server) in 3-5 years. Generally yes, we do want that disk to still be working at 5 years. Cannot predict whether or not the hardware will have been replaced before then. > > 2. 
higher reliability - typically 1.2-1.4M hours, and usually > specified under higher load. this is a very fuzzy area, since > commodity disks often quote 1Mhr under "lower" load. Exactly. It's very, very hard to figure out just how much reliability one is trading for the lower price. Anecdotally, for heavy disk usage, it's apparently a lot. Anecdotally, for low disk usuage, ATA disks aren't all that reliable either. > > 3. very narrow recording band, higher RPM, lower track density. > these are all features that optimize for low and relatively > consistent seek performance. in fact, the highest RPM disks actually > *don't* have the highest sustained bandwidth - "consumer" disks are > lower RPM, but have higher recording density and bandwidth. Right. On the other hand, anecdotal evidence suggests that an application like, for instance, a busy Oracle database running on top of RAID - ATA storage will result in a very high rate of disk failure, whereas the equivalent RAID - SCSI/FC Cheetah solution will not suffer an equivalent disk failure rate. Again, from Google results, not personal experience. Well, not much personal experience, we do have a 4 disk FC Raid in one Sun server and have not lost a disk yet (coming up on 2 years). My personal experience with ATA disks in servers has been limited. A smallish Solaris server configured with "cutting edge, large capacity" ATA disks failed an IBM and the replacement Western Digital in 1 month each. Backing way off on the capacity and going to older 40Gb IBM ATA disks did the trick, with no further disk failures in 3 years. > > 4. SCSI or FC. always has been and apparently always will be > significantly more expensive infrastructure than PATA was > or SATA is. Agreed. I'd be perfectly happy to buy SATA or PATA disks _IF_ they were as reliable as the more expensive SCSI or FC disks. It would help a lot to have some objective measure of that. When Seagate starts selling 5 year SATA disks I'll consider buying them. > > so really, you have to work to imagine the application that > perfectly suits a "server" disk. for instance, you can > obtain whatever level of reliability > you want from raid, rather than ultra-premium-spec disks. In theory. In practice local experience (another lab) was that the RAID - ATA solution failed, twice, and was unable to rebuild from what was left, with all data lost. Maybe that was the controller or just a really bad set of disks. I wasn't there to witness the teeth gnashing and finger pointing. This wasn't a tier one storage vendor (Sun, EMC, HP, etc.) so they saved some money. Or did they??? There's also a school of thought that RAID arrays should be "disk scrubbed" frequently (all blocks on all disks read) to force hardware failures and block remapping to occur early enough so that the redundant information present in the array can rebuild from what's left. As opposed to a worst case where the data is written once, not touched for a year, and then fails unrecoverably when a read hits multiple bad blocks. > is your data > access pattern really one which requires a disk optimized for seeks? On the beowulf not so much. Most of the workload has been configured so that the compute nodes have their data cached in memory and only read the disks hard when booting up and the first time they read their databases. On the Sun Oracle server, much more so. > > under what circumstances will you have a 100% duty cycle? Probably never? 
But where in between 100% and 0% is the cutover point where increased disk failure rate costs just equal the savings from using cheaper disks? > > in summary: there is a place for super-premium disks, but it's just plain > silly to assume that if you have a server, it therefore needs SCSI/FC. > you need to look at your workload, and design the disk system based on > that, using raid for sure, and probably most of your space on 5-10x > cheaper SATA-based storage. I'd be a lot more comfortable buying the cheaper disks if there was some objective measure for an accurate prediction of their actual longevity. I tend to look at it from the other direction. A disk failure on the head node is a much bigger deal than a disk failure on the compute nodes. Also the number of disks involved is likely to be less for the former than the latter. That is, one might have 10 disks in a RAID on the head node but 70 ATA disks out on the compute nodes. So it might cost a couple of thousand more to use the most reliable disks available on the head node, but it's most likely worth it to avoid having to replace those critical components. Conversely, the number of compute nodes isn't usually critical so there's not as much reason to pay for more expensive disks there. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From hahn at physics.mcmaster.ca Fri Jan 21 12:04:24 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri, 21 Jan 2005 15:04:24 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: Message-ID: > an analysis??? Buy the 5 year disk and you'll have a working > disk, or a replacement for it, for 5 years. and my real point was that everyone should ask whether they really want/need that before "paying for quality". one reason not to is that 3 or 5 years from now, disks will be much better. when the improvement curve is steep, it's not to your advantage to "invest" in a more long-lived product. remember, the cost difference is not just a few percent, but at least 4x. so you get 4x as much storage, even if it only lasts 60% as long. so bank some of it, and you can replace all your disks every ~2 years or so (*and* appreciate the improvements in disks.) or just use raid, appreciate the same or higher reliability, more space, and probably higher performance. > they don't fail 4x faster than the Cheetahs. Google around > for "disk reliability" though and you'll find some real horror > stories about disk failure rates in, for instance, > SCSI -> ATA RAID arrays. unfortunately, anecdotal evidence is nearly useless outside of sociology ;) > Generally yes, we do want that disk to still be working at 5 years. interesting. I don't mind if things survive past 3 years, but don't generally plan to use them, at least not in their original purpose. there's just too much to be gained by upgrading after 3 years. > My personal experience with ATA disks in servers has been limited. hah! this *is* the only really reliable part of anecdotal evidence - the demographic of respondents ;) > help a lot to have some objective measure of that. When Seagate > starts selling 5 year SATA disks I'll consider buying them. http://info.seagate.com/mk/get/AMER_WARRANTY_0704_JUMP > There's also a school of thought that RAID arrays should be > "disk scrubbed" frequently (all blocks on all disks read) > to force hardware failures and block remapping to occur early > enough so that the redundant information present > in the array can rebuild from what's left. 
As opposed to a worst > case where the data is written once, not touched for a year, > and then fails unrecoverably when a read hits multiple bad blocks. it's all about what failure modes you're expecting, with what probability. if you want scrubbing, it means you're expecting some sort of silent media degredation. that's not unreasonable, and it might even be sane to expect that more on commodity disks rather than premium ones. (IMO mainly because commodity densities are so much higher, and premium disks are clearly designed to trade poor density for higher robustness.) but maybe you should scrub premium disks as well, since if you really haven't touched some part of the disk, you really don't have any data on the particular disks's reliability. it would be quite interesting if one could obtain some sort of quality measure from the disk while in use. I notice that ATA supports a read-long command, for instance, which claims to actually give you the raw block *and* the ecc associated. SMART also provides some numbers (as well as self-tests) that might be useful here. but silently crumbling media is not the only failure mode! I'm not even sure it's a common one. I see more temperature and vibration-related troubles, for instance. > On the Sun Oracle server, much more so. out of curiosity, is the machine otherwise well-configured? for instance, does it have a sane amount of ram (1GB/cpu is surely minimal these days, and for a DB, lots more is often a good idea.) or is the DB actually quite small, but incredibly update-heavy? > > under what circumstances will you have a 100% duty cycle? > > Probably never? But where in between 100% and 0% is the cutover > point where increased disk failure rate costs just equal the > savings from using cheaper disks? complicated, for sure. but since premium disks are ~5x more expensive and certainly not 5x more reliable, it's probably worth pondering... > the most reliable disks available on the head node, but it's most > likely worth it to avoid having to replace those critical components. > Conversely, the number of compute nodes isn't usually critical > so there's not as much reason to pay for more expensive disks there. except that raid of commodity disks can trivially match the reliability of un-raided premium disks. overall, I don't criticise people who use high-quality disks in critical and heavily-loaded choke-points. indeed, I do it myself. what bothers me the often unquestioned assumption that using premium disks is always better. it's not for nodes. it's not for big storage. it's probably not even for user-level filesystems (/home and the like). but for a non-replicated fileserver that provides PXE/kernel/rootfs for 1000 diskless nodes, well, duh! the real point is that raid and server replication make it easy to design around critical-and-overloaded hotspots. regards, mark hahn. From rgb at phy.duke.edu Fri Jan 21 12:10:31 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 21 Jan 2005 15:10:31 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: On Fri, 21 Jan 2005, David Mathog wrote: > > 2. higher reliability - typically 1.2-1.4M hours, and usually > > specified under higher load. this is a very fuzzy area, since > > commodity disks often quote 1Mhr under "lower" load. > > Exactly. It's very, very hard to figure out just how much reliability > one is trading for the lower price. Anecdotally, for heavy disk > usage, it's apparently a lot. 
Anecdotally, for low disk usuage, > ATA disks aren't all that reliable either. Has anyone observed that a megahour is 114 years? Has anyone observed that this is so ludicrous a figure as to be totally meaningless? Show me a single disk on the planet that will run, under load, for a mere two decades and I'll bow down before it and start sacrificing chickens. Humans don't live a megahour MTBF. Disks damn sure don't. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From hahn at physics.mcmaster.ca Fri Jan 21 12:51:12 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Fri, 21 Jan 2005 15:51:12 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: Message-ID: > Humans don't live a megahour MTBF. Disks damn sure don't. that's an attractive analogy, but I think it misses the fact that a disk is mostly in a steady-state. yes, there's a modest process of wear, and even some more exotic things like electromigration. but humans, by contrast, are always teetering on the edge of failure. I'm tiping back in my chair right now, courting a broken neck. I'm about to go out for my 4pm latte, which requires crossing a street. none of my disks are doing foolish and risky things like this - most of them are just sitting there, some not even spinning, most occasionally stirring themselves to glide a head across the disk. I at least, think of a seek as about as stressful as taking a breath (which is not to deny that my breaths and a disks seeks are both, eventually, going to come to an end...) one of my clusters has 96 nodes, each with a commodity disk in it. 10^6/(24*365.2425) = 114.07945862452115147242 years for each disk, and 1.18832769400542866117 years for the whole cluster. since the cluster has good cooling, and the disks not much used, I only expect about 1.2 failures per year. we're about to buy a cluster with 1536 nodes; assuming the new machineroom being built for it works out, we should expect about 1 failure per month. fortunately, I favor disk-free booting (PXE, NFS root), so even if we have 10x the failure rate, and it takes a week to replace each disk, we shouldn't have any kind of problem. another new facility will be 200TB of nearline storage. if we did it with 1.4e6 hr, 147GB SCSI disks, I'd expect to go 1022 hrs between failures. I'd prefer to use 500 GB SATA disks, even if they're 1e6 hrs, since that will let me go 2500 hours between failures (not to mention saving around 5KW of power!) regards, mark hahn. From jerry at oban.biosc.lsu.edu Fri Jan 21 12:19:14 2005 From: jerry at oban.biosc.lsu.edu (Jerry Xu) Date: 21 Jan 2005 14:19:14 -0600 Subject: [Beowulf] send back output from local node Message-ID: <1106338754.15340.14.camel@strathmill.biosc.lsu.edu> Hi, People: I have a small question regarding PBS in beowulf , I setup my walltime and cput time both 12:00 Hours for my queue. My program itself can run more than 12 hours and output results periodically. All my nodes are exact same. But I met situation that some nodes send back results much much more output than others.Thus it makes my programs very inefficient, say, I used 32 nodes, but 16 nodes will give me back enough data but 16 nodes only feedback very little results, I know that each computing node actually save the output in local storage and send back the output to the master node later according to some protocols (that i donot know). 
Since these nodes are same, I assume some nodes hold the results in local and did not send them back. My question is, how can I make sure that all the computing nodes can send the output that stored in their local storage back to the master when walltime or cput time is reached...... Is there any people ever met the similar situation and provide some suggestion? Thanks, Jerry From mathog at mendel.bio.caltech.edu Fri Jan 21 13:26:20 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Fri, 21 Jan 2005 13:26:20 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement Message-ID: > On Fri, 21 Jan 2005, Robert G. Brown wrote > > > > 2. higher reliability - typically 1.2-1.4M hours, and usually > > > specified under higher load. > Has anyone observed that a megahour is 114 years? Has anyone observed > that this is so ludicrous a figure as to be totally meaningless? Show > me a single disk on the planet that will run, under load, for a mere two > decades and I'll bow down before it and start sacrificing chickens. ROTFLMAO David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From josip at lanl.gov Fri Jan 21 14:06:07 2005 From: josip at lanl.gov (Josip Loncaric) Date: Fri, 21 Jan 2005 15:06:07 -0700 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <41F17CCF.4080701@lanl.gov> Robert G. Brown wrote: > >>> 2. higher reliability - typically 1.2-1.4M hours, and usually >>> specified under higher load. this is a very fuzzy area, since >>> commodity disks often quote 1Mhr under "lower" load. > > Has anyone observed that a megahour is 114 years? Has anyone observed > that this is so ludicrous a figure as to be totally meaningless? Show > me a single disk on the planet that will run, under load, for a mere two > decades and I'll bow down before it and start sacrificing chickens. > > Humans don't live a megahour MTBF. Disks damn sure don't. All of the above is true on the "per sample" basis. Moreover, with the product cycles measured in months rather than years, none of the MTBF figures could possibly be based on actual MTBF measurements. Instead, manufacturers use composite statistics, computed from mid-life component failure rates, then quote MTBF as the reciprocal of this number. This practice results in good MTBF numbers, but it amounts to stating that the life expectancy of a 10-year-old kid is 5000 years based on the 99.98% probability that the kid will survive the next year (these numbers are quoted from IEEE Spectrum, Sept. 2004, see http://www.spectrum.ieee.org/WEBONLY/publicfeature/sep04/0904age.html). Both humans and machines fall apart at higher rates in infancy, as well as with age, when built-in redundancy wears thin due to accumulated damage. The disk drive MTBF number does not apply to drives that fail fairly quickly, nor to failure rates of old/heavily used drives. If, somewhat questionably, human life expectancy is taken as a guide, disk manufacturers' MTBF numbers ought to be de-rated by about a factor of 50-70 to make practical sense (e.g. an 1.4M hour MTBF drive might last some 25,000 hours) -- but even this applies only under nominal conditions, where the above-mentioned statistical MTBF estimate is not wildly inaccurate. In other words, a drive may last several years at 20 deg. C ambient temperature. Still, this says nothing about its durability at 40+ deg. C. Given that in many systems failure rates increase exponentially with temperature, e.g. 
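The arithmetic is easy to make concrete. A small sketch, assuming the constant failure rate that the quoted figures implicitly assume, and using the 1.4M hour number and the 96-disk cluster mentioned earlier (all purely illustrative):

/* mtbf.c - sketch: what a mid-life MTBF figure does (and does not) say.
 * Assumes a constant failure rate, which only holds mid-life. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double hours_per_year = 24 * 365.2425;
    double mtbf_hours = 1.4e6;                  /* a quoted "server" disk figure */
    double rate = 1.0 / mtbf_hours;             /* failures per disk-hour */

    /* chance that one disk fails within a year of continuous operation */
    double p_year = 1.0 - exp(-rate * hours_per_year);
    printf("one disk, one year : %.2f%% chance of failure\n", 100.0 * p_year);

    /* expected failures per year across a 96-disk cluster */
    printf("96 disks, one year : %.2f expected failures\n",
           96.0 * rate * hours_per_year);

    /* the 10-year-old-kid analogy: a 99.98% one-year survival rate,
       treated as constant, 'implies' an absurd mean lifetime */
    double yearly_rate = -log(0.9998);          /* failures per kid-year */
    printf("99.98%% yearly survival -> 'MTBF' of about %.0f years\n",
           1.0 / yearly_rate);

    return 0;
}

The headline MTBF only ever describes the flat part of a drive's life, scaled across however many drives you actually run.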
doubling for every 10 degree increase, I would avoid baking a drive unless it was specifically designed for high temperature operation (if such drives even exist). Sincerely, Josip From lindahl at pathscale.com Fri Jan 21 14:15:06 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Fri, 21 Jan 2005 14:15:06 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <20050121221506.GE3454@greglaptop.internal.keyresearch.com> On Fri, Jan 21, 2005 at 03:10:31PM -0500, Robert G. Brown wrote: > Has anyone observed that a megahour is 114 years? Has anyone observed > that this is so ludicrous a figure as to be totally meaningless? Show > me a single disk on the planet that will run, under load, for a mere two > decades and I'll bow down before it and start sacrificing chickens. > > Humans don't live a megahour MTBF. Disks damn sure don't. That's not what MTBF means. A device has 3 phases in its life: infant mortality, middle age, and old age. If you draw the failure rate, it looks like a bathtub: F R \ / a a \ / i t \ / l e \_______________________________/ infant middle-age old-age The MTBF comes from the failure rate in middle age. It does not say when old age starts. The MTBF is usually much longer than the start of old age, because most disks survive to old age. And yes, a megahour is the right scale for MTBF: that just means that 1 in 1400 disks dies per month in middle age. If middle age lasts 3 years, then 2.6% of disks will fail in middle age. -- greg From James.P.Lux at jpl.nasa.gov Fri Jan 21 14:58:47 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Fri, 21 Jan 2005 14:58:47 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <6.1.1.1.2.20050121144754.0432d650@mail.jpl.nasa.gov> At 11:09 AM 1/21/2005, David Mathog wrote: > > > > > or "Server" grade disks still cost a lot more than that. For > > > > this is a very traditional, glass-house outlook. it's the same one > > that justifies a "server" at $50K being qualitatively different > > from a commodity 1U dual at $5K. there's no question that there > > are differences - the only question is whether the price justifies > > those differences. > >The MTBF rates quoted by the manufacturers are one indicator >of disk reliability, but from a practical point of view the number >of years of warranty coverage on the disk is a more useful metric. > >The manufacturer has an incentive to be sure that those disks >with a 5 year warranty really will last 5 years. Unclear >to me what their incentive is to support the MTBF rates since only >a sustained and careful testing regimen over many, many disks could >challenge the manufacturer's figures. And who would run such >an analysis??? Buy the 5 year disk and you'll have a working >disk, or a replacement for it, for 5 years. While MTBFs of the disk may seem unrealistic (as was pointed out, nobody is likely to run a single disk for 100+ years), but they are a "common currency" in the reliability calculation world, as are "Failures in Time" (FIT) which is the number of failures in a billion (1E9) hours of operation. What would be very useful (and is something that does get analysis for some customers, who care greatly about this stuff) is to compare the MTBF of a widget determined by calculation and analysis (look up the component reliabilities, calculate the probability of failure for the ensemble) with the MTBF of the same widget determined by test (run 1000 disk drives for months). 
Especially if you run what are called "accelerated life tests" at elevated temperatures or higher duty factors. MTBFs are also used because they're easier to understand and handle than things like "reliability", which winds up being .999999999, or failure rates per unit time, which wind up being very tiny numbers (unless "unit time" is a billion hours). And, if I were asked to estimate the reliability of a PC, I'd want to get the MTBF numbers for all the assemblies, and then I could calculate a composite MTBF, which might be surprisingly short. If I then had to calculate how many PC failures I'd get in a cluster of 1000 computers, it would be appallingly short. To a first order, an ensemble of 1000 units, each with an MTBF of 1E6 hours will have an MTBF of only 1000 hours, which isn't all that long....and if the MTBF of those units is only 1E5 hours, because you're running them 25 degrees hotter than expected, only a few days will go by before you get your first failure. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From eugen at leitl.org Fri Jan 21 21:27:45 2005 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 22 Jan 2005 06:27:45 +0100 Subject: [Beowulf] Fwd: turing cluster (fwd from jmnemonic@gmail.com) Message-ID: <20050122052744.GY9221@leitl.org> ----- Forwarded message from Michael Dinsmore ----- From: Michael Dinsmore Date: Fri, 21 Jan 2005 18:52:35 -0500 To: hpc at lists.apple.com Subject: Fwd: turing cluster Reply-To: Michael Dinsmore A friend put me on to this; I haven't seen any public announcements, but thought it'd be of interest. Not many details, but: 640 Xserve cluster, connected via Myrinet. http://www.cse.uiuc.edu/turing/ -- jmnemonic at gmail.com _______________________________________________ Do not post admin requests to the list. They will be ignored. Hpc mailing list (Hpc at lists.apple.com) Help/Unsubscribe/Update your Subscription: http://lists.apple.com/mailman/options/hpc/eugen%40leitl.org This email sent to eugen at leitl.org ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From rgb at phy.duke.edu Sun Jan 23 08:30:30 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sun, 23 Jan 2005 11:30:30 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: On Fri, 21 Jan 2005, Mark Hahn wrote: > > Humans don't live a megahour MTBF. Disks damn sure don't. > > that's an attractive analogy, but I think it misses the fact that > a disk is mostly in a steady-state. yes, there's a modest process > of wear, and even some more exotic things like electromigration. > but humans, by contrast, are always teetering on the edge of failure. > I'm tiping back in my chair right now, courting a broken neck. > I'm about to go out for my 4pm latte, which requires crossing a street. > none of my disks are doing foolish and risky things like this - > most of them are just sitting there, some not even spinning, most > occasionally stirring themselves to glide a head across the disk. 
> I at least, think of a seek as about as stressful as taking a breath > (which is not to deny that my breaths and a disks seeks are both, > eventually, going to come to an end...) > > one of my clusters has 96 nodes, each with a commodity disk in it. > 10^6/(24*365.2425) = 114.07945862452115147242 years for each disk, > and 1.18832769400542866117 years for the whole cluster. since the > cluster has good cooling, and the disks not much used, I only expect > about 1.2 failures per year. > > we're about to buy a cluster with 1536 nodes; assuming the new machineroom > being built for it works out, we should expect about 1 failure per month. Let's examine your point, and mine, seriously. I was quite serious about sacrificing chickens to 20 year old disks. There aren't any (even in environments where people try to run them this long, which are admittedly quite rare). I ran a 10 MB IBM disk (one of the best money could buy -- IBM built arguably the best/most reliable disks in the world at the time) in my original IBM PC for close to a decade before it died, but die it did. I've run a handful (six or seven) of disks out to maybe eight years out of maybe a hundred that I've tried to keep in service that long. Less than 10%. Those disk do endure a certain amount of wear at a fairly predictable, fairly steady rate, and, like the Deacon's One-Hoss Shay, at some point whether or not they break down, they wear out. So what I was really addressing is that the "rate of failure" measured at some point in the disk's lifetime is a lousy predictor of estimated lifetime. In fact, it is NOT a predictor of a disk's expected lifetime in any sense of the term derived from lifetime statistics. One has to know the distribution of failures, not the rate of failure at some point, to determine the mean lifetime. It's just calculus -- the rate is the slope of the function we are interested in evaluated at some specific point (across some specific delta, given that it is a rate). At best this yields a linear approximation of the function in a Taylor series -- probably one that optimistically omits the number of disks that die "immediately" (the constant term at the beginning of said Taylor series) at that. So let's think again about humans. In the USA, humans have a mean lifetime in the ballpark of 70+ years, a nice human-long time because humans are actually amazing stable, self-repairing dynamical entities. Damn few mechanical constructs in nature retain individual, functional form for this long, including constructs engineered by humans using the best of current technology. However, this datum alone is not that useful to e.g. insurance actuarialists. Instead they look at the distribution. Humans are initially quite likely to die before they are born -- lots of eggs fail to implant, lots of pregnancies terminate in miscarriage. Humans are relatively likely to die in your first two years, when their immune (self-repair) system is weak and any defects in their manufacture process are exposed to a hostile world. Then they enter a stretch where the probability of failure is quite low overall, with modest peaks around the teens followed by a long period where it is very low indeed (death pretty much only by accident -- "failures" are sufficiently rare to be considered tragedies and not at all to be expected) until the internal cellular repair mechanisms themselves start to age and actual failures start to occur more often than accidents, around age 40-something. 
There is then a gradual ramping up of failure rate until (eventually) nobody gets out alive and damn few humans live to see their 100th year. One human in a hundred million might live to 114 years (a Megahour). Note that even this statistical picture isn't detailed enough to be useful to actuaries. If you use drugs in your veins, are in a military platoon serving in a little village in the most hostile part of Iraq, are poor, are rich, have access to good health care or not -- all of these change your risk of failure. A human that gets just the right amount of exercise, has the best medical care, doesn't smoke, drinks just the right amount, follows a Mediterranean diet, has good genes, and avoids risks can expect (on average) to outlive one that does the opposite of all of the above by a good long time (pass me them french fries to go with my beer:-). Now during their safest years, if you examine the number of humans that fail per year over some nice short baseline you might find that they, too, have a good deal more than "1 million hours MTBF", especially if you specifically exclude accidental death (which is the most likely cause of death after you are perhaps 2 until you are in your 40's). This is comforting -- it is why I don't expect to see ANY failures of the humans in my physics classes, in spite of the fact that there are hundreds of them, where I absolutely expect to see failures among the hundreds of disks in my cluster. In fact, if we saw as many failures among the humans of my acquaintance (who tend to be in the sweet spot of their expected lifetimes) as we all do in disks, we'd be screaming our heads off about epidemics, war, and mayhem and would live trembling in fear. The human race would be at risk of not living long enough to perpetuate itself -- how many disks make it to 13 years (age of puberty)? Enough so that two disks could cuddle together and produce a mess of little floppy disklets to replace the hundreds of disks that died well before then? I doubt it, unless they produce litters of them with a short gestational period... So I reiterate -- MTBF for hard disks, as reported by the manufacturer, is a nearly useless number. What matters isn't a rate determined under controlled conditions during a particularly favorable period in a disk's lifetime, one that more or less excludes birth defects, accidental death, and the tremendous variability of load and environmental conditions (where a machine room that has a transient failure of AC can be thought of as being sent to Iraq in the aforementioned infantry platoon). This is especially true when it is perfectly obvious that the MTBF of disks averaged over their ENTIRE lifetime is NOT 1 Mhour, which would imply either that roughly 1/2 the disks make it to 114 years still operating or that the distribution is highly skewed so that some disks last for a millennium or ten while the rest die young. However, manufacturers (for obvious reasons) do not present us with a graph of actual observed failure from all causes (which we could use to do a true risk assessment). They present us with an obviously globally false number that is almost unbearably optimistic and cheery. Almost makes me want to be a disk... I personally think that the more useful statistic is the true actuarial one implicit in the following observation. It used to be that nearly all hard disks on the planet had one of two warranties.
"Server" class SCSI disks (this is descriptive, not judgemental or intended to provoke a flame:-) carried five year warranties, presumably because manufacturers subjected them to a more rigorous in-house quality assessment before selling them, effectively removing more of the ones with birth defects from the population before sending them forth. "Consumer" class IDE disks carried three year warranties, because they sold them with less testing and hence there were more DOAs and first six week failures. A year or two ago, consumer disk warranties were dropped to a year by nearly all the disk manufacturers. If you wanted a three year disk you had to pay a premium price for it and select a "special edition" disk. Now I personally think that what has happened is obvious. Disks are one of the only components in a computer that carried a 3 year warranty or better, and it gets harder and harder to engineer a high quality/reliable disk as density etc keeps ramping up. Everything gets smaller, there are more points of failure, the net data load goes up, the average heat generated goes up (not linearly, but up). Even though the failure rate implied by their optimistic "MTBF" assessment methodology remains low, the actual probability of failure from all causes is embarrassingly high. Now in >>my<< opinion what this really means is that the probability of current consumer disks getting to 3 years of actual lifetime under load has gone down to the point where they simply cannot make money on the margin they charge per disk if they have to replace all the failures. If you look at the marginal difference in cost of the "special edition" versions (perhaps 10% of retail) and compare it to the cost of a warranty replacement to the manufacturer (perhaps 50% of retail), you can guess that they are anticipating that something in the ballpark of one disk in five fails between the end of year one and the end of year three. Some unknown number will also fail in year one -- perhaps enough to bring the total three year failure rate to 1/4. That's a believable number to me, based on my personal anecdotal experience. In my direct experience with consumer disks, I see roughly 50% failures within five to six years. I've already experienced one admittedly anecdotal disk failure out of three put into my household RAID within its first year of operation, and that IS a special edition 3 year warranty disk -- it is sitting boxed downstairs ready to ship back. I've also experienced two failures over three years (out of three disks) in this RAID's predecessor and something like five disk failures (out of ten disks over five years) in the household's various workstations. Anecdotal sure, but I'll bet they are not atypical. These workstation disks are nearly idle -- they do some work when the system spins up, then just sit. My household RAID isn't exactly hammered by its five whole users, either, even with me as one of the users. The disks are exposed to power failures, cosmic rays, etc. So it goes. However, I do experience relatively few failures of disks (that aren't DOA during burn-in) for the first six months or even year of operation -- the recent RAID disk (failed at about eight months) is an exception, not the rule. Maybe it would even average out to 1 million hours MTBF (over the first three months of post burn-in operation), who knows?
> I'd prefer to use 500 GB SATA disks, even if they're 1e6 hrs, since that > will let me go 2500 hours between failures (not to mention saving around > 5KW of power!) And >>I'd<< expect that you can go 1022 hours between failures in the first three months of operation, maybe 900 hours between failures in the second three months of operation, maybe 800 hours between failures in the third three months of operation, and downhill from there... Or some other curve -- I don't know what the decay curve is, and I doubt that the manufacturers will tell you (or that it would be real-world accurate if they did tell you). At a guess it is somewhat s-shaped with an initial spike, a relatively flat period and then a more rapid exponential starting in a year or two. We're only seeing "MTBF rates" reported as the most optimistic slope of that initial flat period. The one kind of operation that COULD tell you very accurately indeed what the curve looks like would be somebody like Dell that offers standard three (or more) year onsite service on entire systems, including disks, for a fee. Disk insurance salesmen, in other words. Their databases would let you determine the curve quite precisely, at least for their choice of hardware manufacturer(s). In fact, their databases are doubtless accurate enough that they can very deliberately choose the best manufacturers in the specific sense that they cost the least (for a given storage size) integrated over all warranty and service obligations -- they MUST be accurate enough that they recover costs and make a profit, or they'll fire their actuarial database folks and start over. rgb > > regards, mark hahn. > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From lindahl at pathscale.com Sun Jan 23 14:35:48 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sun, 23 Jan 2005 14:35:48 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> On Sun, Jan 23, 2005 at 11:30:30AM -0500, Robert G. Brown wrote: > So I reiterate -- MTBF for hard disks, as reported by the manufacturer, > is a nearly useless number. It is useful if you use it for what it's meant to be used for: the failure rate in the bottom of the bathtub. I don't know why you were thinking of using it for anything else, like disk lifetime, or infant mortality. I have found that my actual failure rates have been 2X-3X the manufacturer's number, but you always have to worry about dust, power surges, and excess heat incidents in real machine rooms. MTBF for just about everything is computed the same way, and most gizmos have the same bathtub-shaped failure curve. > They present us with an obviously globally false number that is > almost unbearably optimistic and cheery. Operator error, I'm afraid. -- greg From rgb at phy.duke.edu Sun Jan 23 22:57:16 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Mon, 24 Jan 2005 01:57:16 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> Message-ID: On Sun, 23 Jan 2005, Greg Lindahl wrote: > On Sun, Jan 23, 2005 at 11:30:30AM -0500, Robert G. Brown wrote: > > > So I reiterate -- MTBF for hard disks, as reported by the manufacturer, > > is a nearly useless number. 
> > It is useful if you use it for what it's meant to be used for: the > failure rate in the bottom of the bathtub. I don't know why you were > thinking of using it for anything else, like disk lifetime, or infant > mortality. I have found that my actual failure rates have been 2X-3X > the manufacturer's number, but you always have to worry about dust, > power surges, and excess heat incidents in real machine rooms. I think >>everybody<< finds that actual failure rates are (at least) 2x-3x the mfr number, and finds that it varies wildly in time and with environmental conditions and with plain old luck. That's why (and what I mean by stating that) mfr MTBF quotations are optimistic and cheery. If you've developed Kentucky Windage for their numbers that makes them useful to you, that's great, but you've got a LOT of experience on which to base that correction, and can still get burned by the fact that actual failures are (at best) not terribly uniformly distributed -- the "lemon" phenomenon of manufacturing, also known as "the box of disks that fell from the truck during shipping". Otherwise, what I was basically doing is describing the bathtub (which might, in fact, be more of a kitchen sink with a quite small flat region, given that the testing cannot, obviously, take long enough to define a proper tub floor). That is, we don't really know much about the bathtub size or shape for any drive except (perhaps) for whatever we can infer from the mfr warranty on the particular drive in question, and even THAT is bent out of ideal shape by the actual conditions (such as the particular case it is mounted in and how good its ventilation is and the temperature of the ambient air and how hard it is being run). As was pointed out by Karen (and I agree) the mfr warranty period is perhaps a better number for most people to pay attention to than MTBF as it is the only number that actually costs mfrs money when a disk "prematurely" fails and the only number that does you any good if you buy a hundred disks -- or even just one -- that turn out to be from a "bad batch". Being a cynic, I cannot keep from thinking of the dozens of ways an overgood MTBF number could be "cooked" by a mfr, the near certainty that nobody will ever do anything like a study that could refute it if they pulled it out of thin air, and the lack of financial incentives to make it pessimistic or even acurate. Maybe they are all perfectly honest and drive failure rates are really just 1%/year or thereabouts (on the bathtub floor) and I just never noticed it, or was unlucky, or beat the disks to death by using them in actual computers that only rarely used the disks at all;-) With a warranty, though, while I still care I care less -- I still have to hassle with the replacement but I don't have to buy the disk over again, even if it is just one drive in 100 in a year. Even the warranty period and marginal cost is a less than perfect predictor. I'll bet that in the consumer marketplace they don't actually have to make good on more than two potential warranty claims out of three for three year drives -- RMA is a PITA and probably daunts many a should-be claimant after a 1 year system warranty expires, or they are sold the systems and never told that the drives have a three year warranty. Dropping the warranty on most OTC disks to 1 year sends a pretty negative signal to me, at least, as does the explicit marginal cost of adding back the missing two years. 
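A back-of-the-envelope version of the special-edition pricing argument made a little earlier in the thread; the 10% premium and 50% replacement-cost figures are the ballpark guesses quoted there, not numbers any manufacturer publishes:

    # If the extra two years of coverage are priced near their expected cost, then
    #   premium ~= P(failure in the covered window) * replacement cost to the mfr.
    # Inputs are fractions of the drive's retail price (ballpark guesses from above).
    premium_fraction = 0.10           # "special edition" 3-year drive vs 1-year drive
    replacement_cost_fraction = 0.50  # assumed cost of a warranty replacement to the mfr

    implied_years_2_to_3 = premium_fraction / replacement_cost_fraction
    assumed_year_one = 0.05           # assumed, chosen so the total matches the ~1/4 guess above

    print("implied failures in years 2-3: ~%.0f%%" % (100 * implied_years_2_to_3))
    print("with year-one failures added:  ~%.0f%%" % (100 * (implied_years_2_to_3 + assumed_year_one)))

That prints roughly 20% and 25% -- the "one disk in five" and "total three year failure rate of 1/4" figures in the argument above.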
The dollar amounts imply that the MANUFACTURERS are expecting a whole lot more than 1% of ANY batch of disks to fail per year, even out there on the bathtub floor. Who should I believe -- the MTBF or the money? > MTBF for just about everything is computed the same way, and most gizmos > have the same bathtub-shaped failure curve. I'm reminded of a line in a statistics book I once read (I can't remember which one, alas) in which the author had just done a lengthy analysis of failure rates and probability and arrived at a mathematically proven and statistically sound conclusion based on the observations and premises, who then ended up his argument with "but everybody >>knows<< that things go wrong more often than >>that<<" or something similar. His point (I think) was that statistics are lovely but use your gut and your head as well -- reality check time. I tend to think more in terms of warranties and Murphy than in mfr's MTBF, especially when MTBF is a number with absolutely no financial penalty attached to it derived from measurements that (necessarily) are not in the actual context of most usage. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From lindahl at pathscale.com Sun Jan 23 23:14:14 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sun, 23 Jan 2005 23:14:14 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> Message-ID: <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> On Mon, Jan 24, 2005 at 01:57:16AM -0500, Robert G. Brown wrote: > Otherwise, what I was basically doing is describing the bathtub Didn't look like that to me, but I just read your rants, I wasn't the guy who wrote them. > As was pointed out by Karen (and I agree) the mfr warranty period is > perhaps a better number for most people to pay attention to than MTBF I disagree. The warranty period tells you about disk lifetime. The MTBF tells you about the failure rate in the bottom of the bathtub. These are nearly independent quantities; I already pointed out that the fraction of disks which fail in the bottom of the bathtub is small, even if you multiply it by 2X or 3X. So the major factor in the price and length of that warranty is the lifetime. Lifetime and MTBF are simply different measures. Depending on what you are thinking about, you pay attention to one or the other or neither. > His point (I think) was that statistics are lovely > but use your gut and your head as well -- reality check time. Yes. And I have yet to see anything in your complaint that is anything but misinterpretation on your part. Reality check time, indeed. You can't use MTBF by itself as a measure of quality, period, so complaining that it isn't a good single item to measure disk quality is, well, operator error. -- greg From jimlux at earthlink.net Mon Jan 24 06:58:33 2005 From: jimlux at earthlink.net (Jim Lux) Date: Mon, 24 Jan 2005 06:58:33 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> Message-ID: <004701c50225$2d76e7e0$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: "Robert G. 
Brown" To: "Greg Lindahl" Cc: Sent: Sunday, January 23, 2005 10:57 PM Subject: Re: [Beowulf] Re: Cooling vs HW replacement > On Sun, 23 Jan 2005, Greg Lindahl wrote: > > I think >>everybody<< finds that actual failure rates are (at least) > 2x-3x the mfr number, and finds that it varies wildly in time and with > environmental conditions and with plain old luck. That's why (and what > I mean by stating that) mfr MTBF quotations are optimistic and cheery. Such may be the case, but as Greg pointed out, they are a standard measure of reliability of a device. Actually, I'd trust the MTBF and other reliability data more than the warranty, and here's why: 1) Warranty terms are economics and marketing driven. They're set based (partially) on what the manufacturer thinks is a reasonable expenditure for warranty replacements. If Brand A offers a 2 year warranty and Brand B offers a 3 year warranty, on the same drive, at the same price, people will buy Brand B, improving Brand B's short term revenue (potentially at some downstream cost a couple years from now). The "lemon" phenomenon is probably more model based than serial number based, given the consistency of modern manufacturing processes. (ISO 9000 compliance means that you'll produce the same icky piece of hardware in exactly the same way each time.) 2) A warranty is a mere contractual detail, subject to negotiation between vendor and customer. Of course, we, as retail customers, tend not to have much negotiating power here, but I'd imagine that the sales agreement between, for instance, Dell and Seagate, has very different warranty terms, even if the drives are identical. 3) Most people never collect on warranties, even if the equipment fails. The sellers are well aware of this. Otherwise why would there be "lifetime" warranties on things, which clearly will fail or become useless eventually? "Extended service contracts" are a huge money maker for just this reason. 4) If a mfr sells a product that, for some reason, has problems, they don't usually adjust the warranty. If it's an expensive item (like a car), they may have a perverse incentive not to acknowledge that the problem exists, because doing so would trigger a flood of warranty requests from purchasers whose units haven't failed yet. This sometimes results in huge class-action lawsuits and things like lemon laws. Google for "1.8T Passat Sludge". 5) An MTBF specification is a testable, verifiable number. If I put out a procurement for disk drives, and I require an MTBF of, say, 1,000,000 hours in a particular environment, the vendor has to meet that requirement, and demonstrate that it has done so in some way (in this case: by a combination of "similarity", "analysis", and "test", since they obviously couldn't test the delivered article to death). At some point, the vendor is going to sign a piece of paper that says that "this shipment meets all requirements as specified in ...". > > As was pointed out by Karen (and I agree) the mfr warranty period is > perhaps a better number for most people to pay attention to than MTBF as > it is the only number that actually costs mfrs money when a disk > "prematurely" fails and the only number that does you any good if you > buy a hundred disks -- or even just one -- that turn out to be from a > "bad batch".
Being a cynic, I cannot keep from thinking of the dozens > of ways an overgood MTBF number could be "cooked" by a mfr, the near > certainty that nobody will ever do anything like a study that could > refute it if they pulled it out of thin air, and the lack of financial > incentives to make it pessimistic or even acurate. The financial incentive is that if they deliver a product stated to meet a particular MTBF spec, and they don't, they are committing a fraud, which has substantial penalties (not just financial) associated with it. Putting an unrealistic warranty on something is mere marketing, and the only penalty is a possible financial one, the size of which is determined by many things other than failure rates, and further, which is far into the future, long after this year's (quarter's) bonuses related to shareholder return and revenue have been distributed. Maybe they are all > perfectly honest and drive failure rates are really just 1%/year or > thereabouts (on the bathtub floor) and I just never noticed it, or was > unlucky, or beat the disks to death by using them in actual computers > that only rarely used the disks at all;-) With a warranty, though, > while I still care I care less -- I still have to hassle with the > replacement but I don't have to buy the disk over again, even if it is > just one drive in 100 in a year. > If you really are concerned about failure rates, then get a reliability engineer to look at their data and make a "real life" assessment. Properly evaluating the data and interpreting it is non-trivial. Especially if you're buying hundreds or thousands of disks, you should be putting hard requirements in your procurement spec for reliability. You could even offer to buy them with NO warranty at a discount, since the majority of the cost of a failure falls on you anyway (time and hassle). This is definitely where buying your hardware at the corner computer store (or at a "big-box" store) isn't a good thing. These retailers don't have the skill set, nor the incentive, to properly assess what are fairly arcane things. Jim Lux From josip at lanl.gov Mon Jan 24 08:58:47 2005 From: josip at lanl.gov (Josip Loncaric) Date: Mon, 24 Jan 2005 09:58:47 -0700 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <004701c50225$2d76e7e0$32a8a8c0@LAPTOP152422> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <004701c50225$2d76e7e0$32a8a8c0@LAPTOP152422> Message-ID: <41F52947.70507@lanl.gov> Jim Lux wrote: > > Actually, I'd trust the MTBF and other reliability data more than the > warranty, and here's why: I agree -- but I wished I had two more numbers: percentage lost to infant mortality, and possibly the overall life expectancy. This would describe the "bathtub" failure rate graph in a way that I can apply in practice, while MTBF alone is only a partial description. Life expectancy for today's drives is probably longer than the useful life of a computer cluster (3-4 years, but see below). Therefore, midlife MTBF numbers should be a good guide of how many disk replacements the cluster may need annually. However, infant mortality can be a *serious* problem. Once you install a bad batch of drives and 40% of them start to go bad within months, you've got an expensive problem to fix (in terms of the manpower required), regardless of what the warranty says. Manufacturers are starting to address this concern, but in ways that are very difficult to compare. 
For example, Maxtor advertises "annualized return rate <1%" which presumably relates to the number of drives returned for warranty service, but comparing Maxtor's numbers to anyone else's is mere guesswork. Even if manufacturers were to truthfully report their overall warranty return experience, this would not prevent them from releasing a bad batch of drives every now and then. Only those manufacturers that routinely fail to meet industry's typical reliability get reputations bad enough to erode their financial position -- so I suspect that average warranty return percentages (for surviving manufacturers) would turn out to be virtually identical -- and thus not very significant for cluster design decisions. Until a better solution is found, we can only make educated guesses -- and share anecdotal stories about bad batches to avoid... Sincerely, Josip P.S. Drives are designed for particular markets: expensive server drives (->SCSI) are designed to be worked hard 24/7 and rarely spun down; cheap desktop drives (->ATA) are designed for light workloads 10-12 hr/day and more start/stop cycles. Their respective MTBF figures assume these different workloads. Moreover, target component lifespan for cheap drives is 5 years minimum, so this should describe their life expectancy -- assuming that a particular batch does not have a design defect creating high infant mortality. If a cluster is good for 3-4 years and its drives for 5, there will be some rise in the number of drive replacements needed towards the end, but probably still within reason. This is as it should be: it makes no economic sense to overdesign components which will be replaced after 3-4 years anyway. Mature consumer products usually reach this balance of component reliabilities. We all know what happens with cars: they work for years with modest maintenance, but then all seems to go wrong at once, and it's time to get a new one. From James.P.Lux at jpl.nasa.gov Mon Jan 24 10:09:50 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Mon, 24 Jan 2005 10:09:50 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <41F52947.70507@lanl.gov> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <004701c50225$2d76e7e0$32a8a8c0@LAPTOP152422> <41F52947.70507@lanl.gov> Message-ID: <6.1.1.1.2.20050124100047.04160488@mail.jpl.nasa.gov> At 08:58 AM 1/24/2005, Josip Loncaric wrote: >Jim Lux wrote: >>Actually, I'd trust the MTBF and other reliability data more than the >>warranty, and here's why: > >I agree -- but I wished I had two more numbers: percentage lost to infant >mortality, and possibly the overall life expectancy. This would describe >the "bathtub" failure rate graph in a way that I can apply in practice, >while MTBF alone is only a partial description. > >Life expectancy for today's drives is probably longer than the useful life >of a computer cluster (3-4 years, but see below). Therefore, midlife MTBF >numbers should be a good guide of how many disk replacements the cluster >may need annually. > >However, infant mortality can be a *serious* problem. Once you install a >bad batch of drives and 40% of them start to go bad within months, you've >got an expensive problem to fix (in terms of the manpower required), >regardless of what the warranty says. The Seagate documentation actually had some charts in there with expected failure rates, by month, for the first few months. >Manufacturers are starting to address this concern, but in ways that are >very difficult to compare. 
For example, Maxtor advertises "annualized >return rate <1%" which presumably relates to the number of drives returned >for warranty service, but comparing Maxtor's numbers to anyone else's is >mere guesswork. Indeed.. annualized return rate is an economic planning number, good if you're a retailer or consumer manufacturer trying to estimate how much to allow for, but hardly a testable specification. (I would imagine, though, that they can, if necessary, back up their <1% return rate with documentation...). If I were a HP making millions of consumer computers, that's the spec I'd really want to see. If I were a cluster builder or server farm operator with concerns about failure rates over a 3 year design life, then I'd want to see real reliability data. >Even if manufacturers were to truthfully report their overall warranty >return experience, this would not prevent them from releasing a bad batch >of drives every now and then. Only those manufacturers that routinely >fail to meet industry's typical reliability get reputations bad enough to >erode their financial position -- so I suspect that average warranty >return percentages (for surviving manufacturers) would turn out to be >virtually identical -- and thus not very significant for cluster design >decisions. Precisely true... >Until a better solution is found, we can only make educated guesses -- and >share anecdotal stories about bad batches to avoid... > Or, spend some time with the full reliability data and make a "calculated" guess. This is kind of what separates the big companies from the small ones. The big ones have the resources to do meaningful tests (i.e. pull 1000 units off the line and life test them), the small ones don't. It's interesting.. to a certain extent this discussion reflects the change in Beowulfery.. from making use of commodity consumer equipment (because it's cheap and living within the limitations.. interconnect bandwidth, etc.) to far more specialized cluster computing, where you're looking at the details of node reliability, infrastructure issues, etc. Part of it is the scale of clusters has increased. It used to be 4 or 8 computers in the typical cluster, and even with kind of crummy reliability, it worked ok. The failures weren't so common that you couldn't run a big job, and the impact of having to replace a machine wasn't so huge. Now, though, with 1000 nodes, the reliability becomes much more important, because the failure rate is multiplied by 1000, instead of 8. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From seth at integratedsolutions.org Thu Jan 20 10:36:48 2005 From: seth at integratedsolutions.org (Seth Bardash) Date: Thu, 20 Jan 2005 11:36:48 -0700 Subject: another radical concept...Re: [Beowulf] Cooling vs HW replacement In-Reply-To: Message-ID: <200501201836.j0KIar911133@integratedsolutions.org> Here's another radical concept: (Not meant as an advertisement just an explanation of good engineering practices) Purchase, build or upgrade your systems so they can run properly at full load in the ambient condition you presently have. We design all our system with 15 degrees C headroom at FULL LOAD at 25 degrees C ambient. We then test and burn-in every system we build to make sure that ALL OF THEM meet this spec. 
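As a sketch of the pass/fail check that sort of headroom spec implies -- the log format, sensor names, and per-sensor limits here are made-up placeholders rather than values from any particular board or vendor:

    import csv

    HEADROOM_C = 15.0   # required margin at full load (readings assumed taken at ~25 C ambient)
    LIMITS_C = {"cpu0": 70.0, "cpu1": 70.0, "mb": 60.0}   # assumed per-sensor maximums

    def check_burn_in(log_path):
        # Log is assumed to be CSV rows of (seconds, sensor, temp_c) sampled during burn-in.
        peak = {}
        with open(log_path) as f:
            for _seconds, sensor, temp in csv.reader(f):
                t = float(temp)
                peak[sensor] = max(peak.get(sensor, t), t)
        passed = True
        for sensor in sorted(peak):
            margin = LIMITS_C[sensor] - peak[sensor]
            if margin < HEADROOM_C:
                passed = False
            print("%s: peak %.1f C, margin %.1f C" % (sensor, peak[sensor], margin))
        return passed

    # check_burn_in("burnin-node042.csv")   # hypothetical log file name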
We test every machine using cpuburn: http://pages.sbcglobal.net/redelm/ (run 2 copies simultaneously for duals) and make sure they operate properly and run the processors and MB's at reasonable temperatures. We install the latest i2c and lm sensors code and modify the sensors.conf to give easily read output so we can monitor temperatures while the machine is under full load. We monitor the temp every 10 seconds for a minimum of 4 hours to allow the systems to stabilize and then take readings to make sure the systems meet spec. We ran into a problem with a set of dual Opteron 246's in one machine out of 84 we were building for a cluster that had excellent cooling. We spoke at length with AMD and they provided some spare Opteron 246's for testing and took back the pair that was running 10 degrees hotter than all the other processors we were installing. It turned out that the processors that were running hot had heat-spreaders on them that were not exactly flat (they were slightly concave). We would not have discovered the problem if we had not done all the up front work required to install and run the software for temperature testing. In existing machines you can look on the web for better heatsink-fan assemblies than those presently installed and extend the temperature range of an existing machine. Although this requires being careful when upgrading a machine, the pay-off can eliminate most full-load temperature problems. Some of the best heatsink fan assemblies can be had from: http://www.selectcool.com http://www.micforg.co.jp/en/index.html http://www.swiftnets.com/ http://www.zalman.co.kr/ Make sure when upgrading a machine you use the best available thermal compound between the CPU and the HSF. We only use Arctic Silver 5 and the results are well worth the extra $0.25 per CPU. The total cost to fix a poorly cooled system with a better HSF is usually about $20 to $30 per CPU. You also might want to change the cooling fans in the system with higher volume fans to get the heat out of the box. Load throttling should never be necessary on a well designed and well built machine. Just my 2 cents..... Seth Bardash Integrated Solutions and Systems 1510 North Gate Road Colorado Springs, CO 80921 719-495-5866 719-495-5870 Fax 719-337-4779 Cell http://www.integratedsolutions.org Failure can not cope with perseverance! -- No virus found in this outgoing message. Checked by AVG Anti-Virus. Version: 7.0.300 / Virus Database: 265.7.1 - Release Date: 1/19/2005 From agrajag at dragaera.net Fri Jan 21 06:41:46 2005 From: agrajag at dragaera.net (Sean Dilda) Date: Fri, 21 Jan 2005 09:41:46 -0500 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster In-Reply-To: <1106121183.17565.90.camel@nuts.clc.cuhk.edu.hk> References: <1106121183.17565.90.camel@nuts.clc.cuhk.edu.hk> Message-ID: <41F114AA.50002@dragaera.net> John Lau wrote: > Hi, > > Has anyone tried running MPI programs with MPICH on a heterogeneous cluster > with both i386 and x86_64 machines? Can I use an i386 binary on the i386 > machines while using an x86_64 binary on the x86_64 machines for the same > MPI program? I thought they could communicate, but it seems that I > was wrong because I got errors in testing. > > Has anyone tried that before? I've not tried it, but I can think of a few good reasons why you'd want to avoid it. Let's say you want to send some data that's stored in a long from the x86_64 box to the x86 box. Well, on the x86_64 box, a long takes up 8 bytes. But on the x86 box, it only takes 4 bytes.
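One way to see that mismatch without even involving MPI is to ask the platform what its C long looks like; this is only a sketch of the size difference, not a statement about how MPICH itself packs data, and the fixed-width format at the end is the usual way hand-rolled serialization sidesteps it:

    import ctypes, struct

    # Prints 8 on an x86_64 Linux build and 4 on an i386 build.
    print("sizeof(C long):", ctypes.sizeof(ctypes.c_long))

    # Native packing inherits the same ambiguity ...
    print("native 'l':", struct.calcsize("l"), "bytes")

    # ... while an explicit fixed-width, fixed-endian format is the same everywhere.
    print("explicit '<q':", struct.calcsize("<q"), "bytes")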
So, chances are some Bad Stuff(tm) is going to happen if you try to span an MPI program across architectures like that. On the other hand, the x86_64 box will run x86 code without a problem. So i suggest running x86 binaries (and mpich) libraries on all of the boxes. While I haven't tested it myself, I can't think of any reason why that wouldn't work. From george at galis.org Thu Jan 20 12:38:54 2005 From: george at galis.org (George Georgalis) Date: Thu, 20 Jan 2005 15:38:54 -0500 Subject: Contractors Re: [Beowulf] Cooling vs HW replacement In-Reply-To: <41E12D14.3000402@fing.edu.uy> References: <41E12D14.3000402@fing.edu.uy> Message-ID: <20050120203854.GA30894@sta.local> Thanks all for this interesting thread. I'd like to bring it to the next level, construction. How does one find and select an AC architect? Can anyone provide referrals? I also welcome contractor solicitation from vendors who can service a data center (center of building) and small office, near Hartford CT. Thanks, // George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george at galis.org From csamuel at vpac.org Thu Jan 20 15:45:46 2005 From: csamuel at vpac.org (Chris Samuel) Date: Fri, 21 Jan 2005 10:45:46 +1100 Subject: [Beowulf] 64 bit Xeons? In-Reply-To: <20050120070121.GB1611@greglaptop.greghome.keyresearch.com> References: <20050112000239.GB14480@maybe.org> <200501201454.16524.csamuel@vpac.org> <20050120070121.GB1611@greglaptop.greghome.keyresearch.com> Message-ID: <200501211045.49241.csamuel@vpac.org> On Thu, 20 Jan 2005 06:01 pm, Greg Lindahl wrote: > ... which is because Intel copied it too closely, not because Intel > tinkered with it. Aha - thanks for the clarification! There doesn't seem to be any definitive place that says what the differences are, I had to google around quite a bit to find those bits. :-( This is why the Beowulf list is so good, there are so many knowledgeable folks here willing to help. Thanks Donald (et. al?) for running it! cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Systems & Network Admin Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available URL: From sbrenneis at surry.net Thu Jan 20 17:08:14 2005 From: sbrenneis at surry.net (Steve Brenneis) Date: Thu, 20 Jan 2005 20:08:14 -0500 Subject: [Beowulf] Mosix In-Reply-To: <20050120174251.GB3734@greglaptop.greghome.keyresearch.com> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> <200501181229.48118.mwill@penguincomputing.com> <1106097911.4910.32.camel@localhost.localdomain> <41EFD9C2.1000405@penguincomputing.com> <20050120174251.GB3734@greglaptop.greghome.keyresearch.com> Message-ID: <1106269693.4976.20.camel@localhost.localdomain> On Thu, 2005-01-20 at 12:42, Greg Lindahl wrote: > > >Before message-passing mechanisms arrived, and before the concept of > > >multi-threading was introduced, > > Note that message-passing predates MP. Not that anyone really cares > about ancient history, but... I was actually unaware of that, so in the interest of straight record-keeping, thanks. Multi-processing goes back to the early sixties (I go back to the late sixties, but that's another story). The fact that message-passing predates that is pretty amazing. 
> > > >Just keeping the record straight. > > Amen. > > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Steve Brenneis From sbrenneis at surry.net Thu Jan 20 17:43:32 2005 From: sbrenneis at surry.net (Steve Brenneis) Date: Thu, 20 Jan 2005 20:43:32 -0500 Subject: [Beowulf] Mosix In-Reply-To: <41EFD9C2.1000405@penguincomputing.com> References: <41E12D14.3000402@fing.edu.uy> <41ED398B.9080306@lanl.gov> <026f01c4fd94$48f32d20$5c00a8c0@rajeshdesktop> <200501181229.48118.mwill@penguincomputing.com> <1106097911.4910.32.camel@localhost.localdomain> <41EFD9C2.1000405@penguincomputing.com> Message-ID: <1106271811.4976.23.camel@localhost.localdomain> On Thu, 2005-01-20 at 11:18, Michael Will wrote: > Steve, his original question was why we still bother with mpi and other > parallel programming > headaches when instead we could just use Mosix that does things > transparently. My response > intended to clarify that you still need parallell programming > techniques, and your point that > you could then also use mosix to have them migrate around (and away from > the ressources > in the worst case) transparently is true. > > My point is: There is no automated transparent parallelization of your > serial code. > > My apologies if my answer was not clear enough. > > Michael Will > Nothing personal, friend. Your point is well taken. > Steve Brenneis wrote: > > >On Tue, 2005-01-18 at 15:29, Michael Will wrote: > > > > > >>On Tuesday 18 January 2005 11:31 am, Rajesh Bhairampally wrote: > >> > >> > >>>i am wondering when we have something like mosix (distributed OS available > >>>at www.mosix.org ), why we should still develop parallel programs and > >>>strugle with PVM/MPI etc. > >>> > >>> > >>Because Mosix does not work? > >> > >>This of course is not really true, for some applications Mosix might be appropriate, > >>but what it really does is transparently move processes around in a cluster, not > >>have them become suddenly parallelized. > >> > >>Let's have an example: > >> > >>Generally your application is solving a certain problem, like say taking an image and apply > >>a certain filter to it. You can write a program for it that is not parallel-aware, and does not use > >>MPI and just solves the problem of creating one filtered image from one original image. > >> > >>This serial program might take one hour to run (assuming really large image and really > >>complicated filter). > >> > >>Mosix can help you now run this on a cluster with 4 nodes, which is cool if you have 4 > >>images and still want to wait 1 hour until you see the first result. > >> > >>Now if you want to really filter only one image, but in about 15 minutes, you can program your > >>application differently so that it only works on a quarter of the image. Mosix could still help you > >>run your code with different input data in your cluster, but then you have to collect the four pieces > >>and stitch them together and would be unpleasently surprised because the borders of the filter > >>will show - there was information missing because you did not have the full image available but just > >>a quarter of it. Now when you adjust your code to exchange that border-information, you are actually > >>already on the path to become an MPI programmer, and might as well just run it on a beowulf cluster. 
> >> > >>So the mpi aware solution to this would be a program that splits up the image into the four quadrants, > >>forks into four pieces that will be placed on four available nodes, communicates the border-data between > >>the pieces and finally collects the result and writes it out as one final image, all in not much more than > >>the 15 minutes expected. > >> > >>Thats why you want to learn how to do real parallel programming instead of relying on some transparent > >>mechanism to guess how to solve your problem. > >> > >>Michael > >> > >> > >> > >> > > > >Ignoring the inflammatory opening of the above response, I'll just state > >that its representation of what Mosix does and how it works is neither > >fair nor accurate. > > > >Before message-passing mechanisms arrived, and before the concept of > >multi-threading was introduced, the favored mechanism for > >multi-processing and parallelism was the good old fork-join method. That > >is, a parent process divided the task into small, manageable sub-tasks > >and then forked child processes off to handle each subtask. When the > >subtask was complete, the child notified the parent (usually by simply > >exiting) and the parent joined the results of the sub-tasks into the > >final task result. This mechanism works quite well on multi-tasking > >operating systems with various scheduling models. It can be effective on > >multi-CPU single systems or on clusters of single or multiple CPU > >systems. > > > >Mosix (or at least Open Mosix) handles this kind of parallelism > >brilliantly in that it will balance the forked child processes around > >the cluster based on load factors. So your image processing, your > >Gaussian signal analysis, your fluid dynamics simulations, your parallel > >software compilations, or your Fibonacci number generations are > >efficiently distributed while you still maintain programmatic control of > >the sub-tasking. > > > >While the fork-join mechanism is not without a downside > >(synchronization, for one, as mentioned above), it can be used with a > >system like Mosix to provide parallelism without the overhead of the > >message-passing paradigm. Maybe not better, probably not worse, just > >different. > > > >The effect described above in which sub-tasks operate completely > >independently to produce an erroneous result is really an artifact of > >poor programming and design skills and cannot be blamed on the task > >distribution system. Mosix is used regularly to do image processing and > >other highly parallel tasks. Creating a system like this for Mosix > >requires no knowledge of a message-passing interface or API, but simply > >requires a working knowledge of standard multi-processing methods and > >parallelism in general. > > > >One final note: most people consider a Mosix cluster to be a Beowulf as > >long as it meets the requirements of using commodity hardware and > >readily available software. > > > >Just keeping the record straight. > > > > > > > >>>Tough i never used either mosix or PVM/MPI, I am > >>>genunely puzzled about it. Can someone kindly educate me? 
> >>> > >>>thanks, > >>>rajesh > >>> > >>>_______________________________________________ > >>>Beowulf mailing list, Beowulf at beowulf.org > >>>To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > >>> > >>> > >>> > -- Steve Brenneis From michael.gauckler at gmx.ch Fri Jan 21 04:00:27 2005 From: michael.gauckler at gmx.ch (Michael Gauckler) Date: Fri, 21 Jan 2005 13:00:27 +0100 Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <1106301148.13613.24.camel@tcr.cs.bath.ac.uk> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> <20050120.124600.56681754.lusk@localhost> <1106301148.13613.24.camel@tcr.cs.bath.ac.uk> Message-ID: <1106308827.5113.2.camel@localhost.localdomain> > Has anyone written any nice MPI test programs that I could use to test > my cluster? I've been using the ones given with the MPICH2 distribution, > but it'd be good to try some others. A nice demo for a cluster is the parallel version of a raytracer. Google for "mpi povray". With the graphics version you can see the blocks which the slaves return, which is quite impressive. Chers, Michael From asabigue at fing.edu.uy Fri Jan 21 13:06:45 2005 From: asabigue at fing.edu.uy (Ariel Sabiguero) Date: Fri, 21 Jan 2005 22:06:45 +0100 Subject: Now anecdotical and off topic - was Re: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <41F16EE5.2030905@fing.edu.uy> Robert G. Brown escribi?: >Has anyone observed that a megahour is 114 years? Has anyone observed >that this is so ludicrous a figure as to be totally meaningless? Show >me a single disk on the planet that will run, under load, for a mere two >decades and I'll bow down before it and start sacrificing chickens. > > The longest lasting drives I have ever installed are still spinning since 1995 (I think back in june). Those disks implement a RAID-1 on a Novell 3.12 and had been working 24x7 since then. I am aware that it is only 83K - 84K hours, but those NEC, 524 MB disks still astonishes me. I thought that they would last at most 5 years..... I even wrote a letter by 2000 suggesting urgent replacement.... They survived a couple of power source explosions and the death of one motherboard. They were installed when PC BIOS did not address more than 504MB and motherboard bios upgrade was too expensive... The stepper bearings tolerated betwen 1,7x10^10 and 1,8x10^10 platter spins.... (if you check my figures, remember 3600 rpm, not 5400 and even less 7200 ;-) ). Now I tend to believe that this company will continue adding servers, but never retire the old Novell.... (at least they retired the PCMOS one!) In ten years now I will try to remember this and download the "R.G.B. Unplugged" video with the promised ritual. Yours. Ariel PS: by the way, no cooling on those disks! Just environment temperature, which can reach up to 30?C during summer in that office (don't make me remember what is like working there....). >Humans don't live a megahour MTBF. Disks damn sure don't. > > rgb > > > From shaeffer at neuralscape.com Fri Jan 21 13:13:52 2005 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Fri, 21 Jan 2005 13:13:52 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: <20050121211352.GA19722@synapse.neuralscape.com> On Fri, Jan 21, 2005 at 03:10:31PM -0500, Robert G. Brown wrote: > > Has anyone observed that a megahour is 114 years? Has anyone observed > that this is so ludicrous a figure as to be totally meaningless? 
Show > me a single disk on the planet that will run, under load, for a mere two > decades and I'll bow down before it and start sacrificing chickens. > > Humans don't live a megahour MTBF. Disks damn sure don't. Yes. Well put. The only number that has any meaning to a disk drive manufacturer is the warranty time. The disk drive business is highly competitive and technology intensive. Profit margins are razor thin. A minimal uptick in warranted failure rates will have a significant impact on a drive manufacturer's bottom line. They go to great pains to ensure the product works for as long as the warranty, under the prescribed operating conditions. Those MTBF numbers are projections based on sophisticated thermal cycling methods employed within the set of QA processes used on disk drives. They run those drives in ovens, cycling them through a wide array of temperature variances and durations, leading to the projections. Internally, those disk drive companies only care that the drive lasts until the warranty expires. Thanks, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From joelja at darkwing.uoregon.edu Fri Jan 21 14:26:54 2005 From: joelja at darkwing.uoregon.edu (Joel Jaeggli) Date: Fri, 21 Jan 2005 14:26:54 -0800 (PST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: References: Message-ID: On Fri, 21 Jan 2005, Robert G. Brown wrote: > On Fri, 21 Jan 2005, David Mathog wrote: > >>> 2. higher reliability - typically 1.2-1.4M hours, and usually >>> specified under higher load. this is a very fuzzy area, since >>> commodity disks often quote 1Mhr under "lower" load. >> >> Exactly. It's very, very hard to figure out just how much reliability >> one is trading for the lower price. Anecdotally, for heavy disk >> usage, it's apparently a lot. Anecdotally, for low disk usage, >> ATA disks aren't all that reliable either. > > Has anyone observed that a megahour is 114 years? Has anyone observed > that this is so ludicrous a figure as to be totally meaningless? Show > me a single disk on the planet that will run, under load, for a mere two > decades and I'll bow down before it and start sacrificing chickens. mtbf is an estimate of how often failure should occur in the disks if you run them for their service life and replace them. since disk vendors don't publish service life (one assumes it's shorter than the warranty) it's kind of a meaningless number. I guess if you have a million disks you should lose about 1 per hour. > Humans don't live a megahour MTBF. Disks damn sure don't. > > rgb > > -- -------------------------------------------------------------------------- Joel Jaeggli Unix Consulting joelja at darkwing.uoregon.edu GPG Key Fingerprint: 5C6E 0104 BAF0 40B0 5BD3 C38B F000 35AB B67F 56B2 From srgadmin at cs.hku.hk Sat Jan 22 01:50:04 2005 From: srgadmin at cs.hku.hk (srg-admin) Date: Sat, 22 Jan 2005 17:50:04 +0800 Subject: [Beowulf] Preliminary CFP : HPC-Asia 2005 Message-ID: <41F221CC.9020302@cs.hku.hk> Preliminary Call for Papers : HPC-Asia 2005 The 8th International Conference and Exhibition on High-Performance Computing in Asia-Pacific Region November 30-December 3, 2005, Beijing, China http://www.ict.ac.cn/hpcasia2005 HPC ASIA is an international conference series held every 18 months on an Asia-Pacific regional site. It provides a forum for HPC researchers, developers, and users throughout the world to exchange ideas, case studies, and research results related to all issues of high performance computing.
The last seven conferences were held in Taipei in 1995, Seoul in 1997, Singapore in 1998, Beijing in 2000, Australia in 2001, India in 2002, and Tokyo in 2004. In addition to contributed technical papers, HPC Asia 2005 will include keynote addresses, invited/plenary speeches, panel discussions, workshops, poster presentations and industrial track presentations. A commercial/research exhibition will be the highlight of HPC Asia 2005. Authors are invited to submit manuscripts of original unpublished research work in all areas of high performance computing, including the development of experimental and commercial systems. Relevant topics include (but are not limited to) the following: - System Architecture and Models for HPC - System Software for HPC - Algorithms and Applications for HPC - Performance Evaluation and Productivity - Grid Computing Technical Paper Submission Instructions: Authors are invited to submit papers of no more than 8 pages of double column text using single spaced 10 point size type on 8.5x11 inch paper, as per IEEE 8.5x11 manuscript guidelines (see http://www.computer.org/cspress/instruct.htm). Authors should submit a PDF file that will print on a PostScript printer using 8.5 x 11 inch size (letter size) paper. Detailed paper submission instructions will be placed on the official conference website. The results presented in the paper must be original. One author of each accepted paper will be expected to present the paper at the conference. It is expected that the proceedings will be published by the IEEE Computer Society Press. Important Dates: Technical Paper Submission Deadline May 8, 2005 Notification of Acceptance/Rejection July 1, 2005 Camera-Ready Paper Deadline August 1, 2005 General Co-Chairs: Kai Li, Princeton University, USA Satoshi Sekiguchi, AIST, Japan Program Committee Co-Chairs: Jianping Fan, ICT, China Jysoo Lee, KISTI, Korea Steering Committee Chair David Kahaner, ATIP, USA Publicity Chair Cho-Li Wang, University of Hong Kong, China From srgadmin at cs.hku.hk Sat Jan 22 01:39:07 2005 From: srgadmin at cs.hku.hk (srg-admin) Date: Sat, 22 Jan 2005 17:39:07 +0800 Subject: [Beowulf] Preliminary Call For Paper: APPT2005 Message-ID: <41F21F3B.4080206@cs.hku.hk> Preliminary Call For Papers Sixth International Workshop on Advanced Parallel Processing Technologies (APPT 2005) 27-28 Oct. 2005, Hong Kong, China http://www.comp.polyu.edu.hk/APPT05 APPT is a biennial workshop on parallel and distributed processing. Its scope covers all aspects of parallel and distributed computing technologies, including architectures, software systems and tools, algorithms, and applications. APPT originated from collaborations between researchers from China and Germany and has evolved into an international workshop. The past five workshops were held in Beijing, Koblenz, Changsha, Ilmenau, and Xiamen, respectively. APPT'05 will be the sixth in the series. Following the success of the last five workshops, APPT'05 provides a forum for scientists and engineers in academia and industry to exchange and discuss their experiences, new ideas, and results about research in the areas related to parallel and distributed processing. Papers are solicited.
Topics of particular interest include, but are not limited to: - Parallel / Distributed System Architectures - Advanced Microprocessor Architecture, - Middleware, Software Tools and Environments - Parallelizing Compilers, - Software Engineering issues - Interconnection Networks - Network Protocols, - Task Scheduling and Load Balancing - Grid Computing, Cluster Computing, Peer-to-Peer Computing - Internet & Web computing - Pervasive and Mobile Computing - Security in networks and distributed systems - Fault tolerance and dependability - Image Generation and Processing: Rendering Techniques, Virtual Reality, Visualization, Graphic Processing, etc. SUBMISSION GUIDELINES Prospective authors are invited to submit a full paper in English (should not exceed 10 pages (12 pt) in length) presenting original and unpublished research results and experience. Papers will be selected based on their originality, timeliness, significance, relevance, and clarity of presentation. Papers must be submitted in PDF (preferably) or Postscript that is interpretable by Ghostscript. Submissions will be carried out electronically on the Web via a link found at the conference web page http://www.comp.polyu.edu.hk/APPT05/. Submissions imply the willingness of at least one author to register, attend the conference, and present the paper. There will be two best student paper awards to recognise distinguished student research. PUBLICATION The proceedings of the symposium will be published in Springer's Lecture Notes in Computer Science (Pending). Important Dates: Paper submission: 1 April 2005 Acceptance notification: 1 June 2005 Camera-ready due: 1 July 2005 APPT'05 workshop: 27-28 Oct 2005 Sponsored by Architecture Professional Committee of China Computer Federation. Organized by Department of Computing, Hong Kong Polytechnic University Supported by IEEE HK, ACM HK ORGANIZING COMMITTEE General Chair Xingming Zhou, Member of the Chinese Academy of Sciences, National Lab for Parallel and Distributed Processing, China Vice General Co-Chairs Xiaodong Zhang, College of William and Mary, USA David A. Bader, Univ. of New Mexico, USA, Program Co-Chairs Jiannong Cao, Hong Kong Polytechnic Univ, H.K. Wolfgang Nejdl, Univ. of Hannover, Germany Publicity Chair Cho-Li Wang, Univ. of Hong Kong, H.K. Publication Chair TBD Local Organisation Chair Allan K.Y. Wong, H.K. PolyU, H.K. Finance / Registration Chair Ming Xu, National Lab for Parallel and Distributed Processing, China PROGRAM COMMITTEE Srinivas Aluru, Iowa State University, USA Jose Nelson Amaral, University of Alberta, Canada Nancy Amato, Texas A&M University, USA Wentong Cai, Nanyang Technological Univ. Singapore Y. K. Chan, City Univ. of Hong Kong, Hong Kong John Feo, Cray Inc., USA Tarek El-Ghazawi, George Washington University, USA Ananth Grama, Purdue University, USA Binxing Fang, Harbin Institute of Technology, China Guang Gao, University of Delaware, USA Manfred Hauswirth, EPFL, Switzerland Bruce Hendrickson, Sandia National Lab., USA Zhenzhou Ji, Harbin Institute of Technology, China Mehdi Jazayeri, Technical University of Vienna, Austria Ashfaq Khokhar, University of Illinois, Chicago, USA Ajay Kshemkalyani, Univ.
of Illinois, Chicago, USA Xiaoming Li, Peking University, China Francis Lau, University of Hong Kong, China Xinsong Liu, Electronical Sciences University, China Yunhao Liu, Hong Kong University of Science and Technology, China Xinda Lu, Shanghai Jiao Tong University, China Siwei Luo, Northern Jiao Tong University, China Beth Plale, Indiana University, USA Bernhard Plattner, Swiss Federal Institute of Tech., Switzerland Sartaj Sahni, University of Florida, USA Nahid Shahmehri, Linköpings universitet, Sweden Chengzheng Sun, Griffith University, Australia Zhimin Tang, Institute of Computing, CAS, China Bernard Traversat, Sun Microsystems Peter Triantafillou, University of Patras, Greece Lars Wolf, Tech. Universität Braunschweig, Germany Jie Wu, Florida Atlantic University, USA Li Xiao, Michigan State Univ., USA Cheng-Zhong Xu, Wayne State University, USA Weimin Zheng, Tsinghua University, China From toon at moene.indiv.nluug.nl Fri Jan 21 14:38:47 2005 From: toon at moene.indiv.nluug.nl (Toon Moene) Date: Fri, 21 Jan 2005 23:38:47 +0100 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <23253044.1106345938615.JavaMail.root@dtm1eusosrv72.dtm.ops.eu.uu.net> References: <23253044.1106345938615.JavaMail.root@dtm1eusosrv72.dtm.ops.eu.uu.net> Message-ID: <41F18477.2090407@moene.indiv.nluug.nl> David Mathog wrote: >>On Fri, 21 Jan 2005, Robert G. Brown wrote >>Has anyone observed that a megahour is 114 years? Has anyone observed >>that this is so ludicrous a figure as to be totally meaningless? Show >>me a single disk on the planet that will run, under load, for a mere two >>decades and I'll bow down before it and start sacrificing chickens. > > ROTFLMAO It's not nearly as bad as a manager at my Institute (KNMI - the Dutch Weather Service) noting that a particular piece of measuring equipment had 98% uptime - which meant that it would be down a week per year, and most probably *the* week of the year when it was most needed (i.e., a temperature sensor in winter close to 0 degrees Centigrade), because ... sensors are weather-dependent equipment just as well. :-) -- Toon Moene - e-mail: toon at moene.indiv.nluug.nl - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands Maintainer, GNU Fortran 77: http://gcc.gnu.org/onlinedocs/g77_news.html A maintainer of GNU Fortran 95: http://gcc.gnu.org/fortran/ From rockwell at pa.msu.edu Thu Jan 20 10:22:57 2005 From: rockwell at pa.msu.edu (Tom Rockwell) Date: Thu, 20 Jan 2005 13:22:57 -0500 Subject: [Beowulf] cheap 48 port gigabit ethernet switch w/ jumbo frames? In-Reply-To: <200501201700.j0KH0PfQ032360@bluewest.scyld.com> References: <200501201700.j0KH0PfQ032360@bluewest.scyld.com> Message-ID: <41EFF701.60905@pa.msu.edu> Hi, I'm looking for a switch that will be used for NFS traffic on a cluster of about 40 nodes. The nodes will have Broadcom 5704 ethernet. From what I've read, jumbo frame support is important for getting the best NFS performance over gigabit ethernet. D-Link and Netgear have newer 48 port switches priced below managed switches. The D-Link is model DGS-1248T http://dlink.com/products/?sec=2&pid=367 and the Netgear is model GS748T http://netgear.com/products/details/GS748T.php. Each is about $1200 or so. I'm unable to find info on their websites specifying whether these switches support jumbo frames. Anyone know? 
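(Whichever switch is chosen, the node NICs also have to be configured for a larger MTU before NFS sees any benefit from jumbo frames. As a quick sanity check, a minimal C sketch along the following lines reports the MTU an interface is currently using; the interface name "eth0" is only an assumption, and ifconfig shows the same number.)

/* Minimal sketch: query the MTU currently set on a network interface,
 * to confirm that jumbo frames are really in effect on a node.
 * The interface name "eth0" is an assumption; adjust as needed. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>

int main(void)
{
    struct ifreq ifr;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket will do for the ioctl */

    if (fd < 0) {
        perror("socket");
        return 1;
    }
    memset(&ifr, 0, sizeof(ifr));
    strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

    if (ioctl(fd, SIOCGIFMTU, &ifr) < 0) {     /* ask the kernel for the MTU */
        perror("SIOCGIFMTU");
        close(fd);
        return 1;
    }
    printf("%s MTU = %d\n", ifr.ifr_name, ifr.ifr_mtu);

    close(fd);
    return 0;
}

(Compile with gcc and run it after raising the MTU to 9000 on a node to verify that the setting took.)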
Thanks, Tom Rockwell Michigan State University From reuti at staff.uni-marburg.de Thu Jan 20 11:10:56 2005 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 20 Jan 2005 20:10:56 +0100 Subject: [Beowulf] Writing MPICH2 programs In-Reply-To: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> References: <1106240945.41efe5b13d0de@webmail.bath.ac.uk> Message-ID: <1106248256.41f00240414d9@home.staff.uni-marburg.de> You can download the MPI documentation at netlib.org as .ps: http://www.netlib.org/utk/papers/mpi-book/mpi-book.html or look at some tutorials: ftp://math.usfca.edu/pub/MPI/mpi/guide.ps http://www.science.uva.nl/research/scs/edu/pscs/guide.pdf http://www.science.uva.nl/research/ scs/edu/distr/guide_to_the_practical_work.pdf Cheers - Reuti Quoting Tom Crick : > Hi, > > Are there any resources for writing MPICH2 programs? I've found the MPICH2 > User's Guide (Argonne National Laboratory), but haven't been able to find > any > decent material detailing the approaches and methods to writing programs > for > MPICH2. > > Thanks and regards, > > Tom Crick > tc at cs.bath.ac.uk > http://www.cs.bath.ac.uk/~tc > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From szapytow at kent.edu Thu Jan 20 12:20:57 2005 From: szapytow at kent.edu (Steve Zapytowski) Date: Thu, 20 Jan 2005 15:20:57 -0500 Subject: [Beowulf] Need advice on cluster hardware Message-ID: <20050120202057.70C75269C6D@smtp.kent.edu> I wish to know if all machines in a Beowulf Cluster must be identical. Can you please answer this question for me? -------------- next part -------------- An HTML attachment was scrubbed... URL: From george at galis.org Thu Jan 20 12:28:19 2005 From: george at galis.org (George Georgalis) Date: Thu, 20 Jan 2005 15:28:19 -0500 Subject: [Beowulf] Cooling vs HW replacement In-Reply-To: <20050119064145.GB28329@piskorski.com> References: <20050117063322.GA26246@sta.local> <20050119064145.GB28329@piskorski.com> Message-ID: <20050120202819.GA30757@sta.local> On Wed, Jan 19, 2005 at 01:41:45AM -0500, Andrew Piskorski wrote: >On Mon, Jan 17, 2005 at 01:33:22AM -0500, George Georgalis wrote: > >> Use a SAN/NAS (nfs) and keep the disks in a separate room than the CPUs. >> Disk drives generate a lot of heat, and compared to on board components >> don't really need cooling, circulated air should largely cover them. > >This is an oxymoron. If disks generate a lot of heat, then that heat >needs to be removed. If you have a small room stuffed with hundreds >of those disks, all that heat has to go somewhere... I think you are absolutely right, especially about the heat delta. Per Dr Brown's excellent response, that focused on the issue, the components generate heat which must be removed, I still say a hot room, cold room approach will be effective, and also point out disks can tolerate a hotter ambient temperature than the other components. So a savings can be made keeping memory/cpu/etc at lower ambient than disk storage. of course if you only have one or two raid units and no other disks, then there is no need to house them separately. 
// George -- George Georgalis, systems architect, administrator Linux BSD IXOYE http://galis.org/george/ cell:646-331-2027 mailto:george at galis.org From ashley at quadrics.com Mon Jan 24 10:49:52 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Mon, 24 Jan 2005 18:49:52 +0000 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster In-Reply-To: <41F114AA.50002@dragaera.net> References: <1106121183.17565.90.camel@nuts.clc.cuhk.edu.hk> <41F114AA.50002@dragaera.net> Message-ID: <1106592592.7512.68.camel@localhost.localdomain> On Fri, 2005-01-21 at 09:41 -0500, Sean Dilda wrote: > John Lau wrote: > > Hi, > > > > Have anyone try running MPI programs with MPICH on heterogeneous cluster > > with both i386 and x86_64 machines? Can I use a i386 binary on the i386 > > machines while use a x86_64 binary on the x86_64 machines for the same > > MPI program? I thought they can communicate before but it seems that I > > was wrong because I got error in the testing. > > > > Have anyone try that before? > > I've not tried it, but I can think of a few good reasons why you'd want > to avoid it. Lets say you want to send some data that's stored in a > long from the x86_64 box to the x86 box. Well, on the x86_64 box, a > long takes up 8 bytes. But on the x86 box, it only takes 4 bytes. So, > chances are some Bad Stuff(tm) is going to happen if you try to span an > MPI program across architectures like that. I've done it with MPI across ia32 and ia64 machines, purely as a demonstration though, it's a headache to get it right and hard to see why you'd want to except in very special circumstances. As to whether MPICH can do this though it another matter (I don't know the answer to this) but I really think you should try and find another solution. Running a ia32 binary on mixed ia64 and ia32 machines works much better however, I've even done this between ia64 and x86_64, just keep everything 32bit and it should all work. I do know of people however who run a parallel application across alphas and nVidia graphics cards plugged into p4's but this isn't using MPI. Ashley, From egan at sense.net Mon Jan 24 10:52:34 2005 From: egan at sense.net (Egan Ford) Date: Mon, 24 Jan 2005 11:52:34 -0700 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster In-Reply-To: <41F114AA.50002@dragaera.net> Message-ID: <00d701c50245$e024e340$0183a8c0@oberon> I have tried it and it did not work (i.e. i686 + x86_64). I also did not spend a lot of time trying to figure it out. I know that this method is sound, it works great with hybrid ia64 and x86_64 clusters. Below is a .pbs script to automate running xhpl with multiple arch. Each xhpl binary must have a .$(uname -m) suffix. This was done with Myrinet. The resulting pgfile will look like this (node14 really has 2 procs, but since mpirun started from node14 it already has one processor assigned to rank 0, so the pgfile only needs to describe the rest of the processors). node14 1 /home/egan/bench/hpl/bin/xhpl.x86_64 node10 2 /home/egan/bench/hpl/bin/xhpl.ia64 node13 2 /home/egan/bench/hpl/bin/xhpl.x86_64 node9 2 /home/egan/bench/hpl/bin/xhpl.ia64 Script: #PBS -l nodes=4:compute:ppn=2,walltime=10:00:00 #PBS -N xhpl # prog name PROG=xhpl.$(uname -m) PROGARGS="" NODES=$PBS_NODEFILE # How many proc do I have? 
NP=$(wc -l $NODES | awk '{print $1}') # create pgfile with rank 0 node with one less # process because it gets one by default ME=$(hostname -s) N=$(egrep "^$ME\$" $NODES | wc -l | awk '{print $1}') N=$(($N - 1)) if [ "$N" = "0" ] then >pgfile else echo "$ME $N $PWD/$PROG" >pgfile fi # add other nodes to pgfile for i in $(cat $NODES | egrep -v "^$ME\$" | sort | uniq) do N=$(egrep "^$i\$" $NODES | wc -l | awk '{print $1}') ARCH=$(ssh $i uname -m) echo "$i $N $PWD/xhpl.$ARCH" done >>pgfile # MPICH path # mpirun is a script, no worries MPICH=/usr/local/mpich/1.2.6..13/gm/x86_64/smp/pgi64/ssh/bin PATH=$MPICH/bin:$PATH export LD_LIBRARY_PATH=/usr/local/goto/lib set -x # cd into the directory where I typed qsub if [ "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ] then mpirun.ch_gm \ -v \ -pg pgfile \ --gm-kill 5 \ --gm-no-shmem \ LD_LIBRARY_PATH=/usr/local/goto/lib \ $PROG $PROGARGS else cd $PBS_O_WORKDIR cat $PBS_NODEFILE >hpl.$PBS_JOBID mpirun.ch_gm \ -pg pgfile \ --gm-kill 5 \ --gm-no-shmem \ LD_LIBRARY_PATH=/usr/local/goto/lib \ $PROG $PROGARGS >>hpl.$PBS_JOBID fi exit 0 > -----Original Message----- > From: beowulf-bounces at beowulf.org > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Sean Dilda > Sent: Friday, January 21, 2005 7:42 AM > To: cflau at clc.cuhk.edu.hk > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster > > > John Lau wrote: > > Hi, > > > > Have anyone try running MPI programs with MPICH on > heterogeneous cluster > > with both i386 and x86_64 machines? Can I use a i386 binary > on the i386 > > machines while use a x86_64 binary on the x86_64 machines > for the same > > MPI program? I thought they can communicate before but it > seems that I > > was wrong because I got error in the testing. > > > > Have anyone try that before? > > I've not tried it, but I can think of a few good reasons why > you'd want > to avoid it. Lets say you want to send some data that's stored in a > long from the x86_64 box to the x86 box. Well, on the x86_64 box, a > long takes up 8 bytes. But on the x86 box, it only takes 4 > bytes. So, > chances are some Bad Stuff(tm) is going to happen if you try > to span an > MPI program across architectures like that. > > On the other hand, the x86_64 box will run x86 code without a > problem. > So i suggest running x86 binaries (and mpich) libraries on all of the > boxes. While I haven't tested it myself, I can't think of any reason > why that wouldn't work. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From josip at lanl.gov Mon Jan 24 11:09:06 2005 From: josip at lanl.gov (Josip Loncaric) Date: Mon, 24 Jan 2005 12:09:06 -0700 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <6.1.1.1.2.20050124100047.04160488@mail.jpl.nasa.gov> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <004701c50225$2d76e7e0$32a8a8c0@LAPTOP152422> <41F52947.70507@lanl.gov> <6.1.1.1.2.20050124100047.04160488@mail.jpl.nasa.gov> Message-ID: <41F547D2.4010808@lanl.gov> Jim Lux wrote: > At 08:58 AM 1/24/2005, Josip Loncaric wrote: > >> However, infant mortality can be a *serious* problem. Once you >> install a bad batch of drives and 40% of them start to go bad within >> months, you've got an expensive problem to fix (in terms of the >> manpower required), regardless of what the warranty says. 
> > > The Seagate documentation actually had some charts in there with > expected failure rates, by month, for the first few months. > > [...] > >> Until a better solution is found, we can only make educated guesses -- >> and share anecdotal stories about bad batches to avoid... >> > > Or, spend some time with the full reliability data and make a > "calculated" guess. ...but a "calculated" guess would not prevent one from getting hurt by a bad batch of drives. I doubt that Seagate expected 40% infant mortality, yet this is precisely what I experienced with first-generation 7200rpm Seagate drives in my first cluster. Any new design could have unexpected flaws, regardless of what the manufacturer's advertised reliability expectations are. This is why actual reliability experience is so important -- and building community experience takes time (e.g. 6-12 months). The good news is that by then formerly new products are considered mature and are priced more competitively. So, we're back to the well-established rule: Staying a step behind the bleeding edge allows one to avoid design flaws in brand new products, and have more confidence in guesses calculated on the basis of manufacturer's reliability expectations. Sincerely, Josip From ctierney at HPTI.com Mon Jan 24 11:19:25 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Mon, 24 Jan 2005 12:19:25 -0700 Subject: [Beowulf] Need advice on cluster hardware In-Reply-To: <20050120202057.70C75269C6D@smtp.kent.edu> References: <20050120202057.70C75269C6D@smtp.kent.edu> Message-ID: <1106594365.3699.84.camel@localhost.localdomain> On Thu, 2005-01-20 at 13:20, Steve Zapytowski wrote: > I wish to know if all machines in a Beowulf Cluster must be > identical. Can you please answer this question for me? > Short answer: no. Longer answer: no, but it may be much harder to program efficiently depending on your application. If you want to build a Beowulf cluster with nodes of different speeds (or architectures) it will more difficult to break the problem up so that you are maximizing cpu usage across all nodes. If you have a program that does a lot of repetitive steps and the size of each piece is very small compared to the overall program and there isn't a lot of interprocess (inter-node) communication then you can use a heterogeneous cluster quite efficiently. Programs that are written as master/slaves may take advantage of this type of system (eg. ray-tracing and some geophysics applications). If you are running programs that have dependencies between the nodes, like inter-node communication, it can be more difficult to make the model run efficiently. Weather models (MM5, WRF) could work, but will run as slow as the slowest node. If you are writing your own software you can treat the problem similar to if you are addressing load balancing issues to better use the different systems in the cluster. Craig From egan at sense.net Mon Jan 24 11:35:00 2005 From: egan at sense.net (Egan Ford) Date: Mon, 24 Jan 2005 12:35:00 -0700 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster In-Reply-To: <00d701c50245$e024e340$0183a8c0@oberon> Message-ID: <00ed01c5024b$cbdd2e00$0183a8c0@oberon> I should have added that xhpl.x86_64 and xhpl.ia64 are native 64 bit binaries for each platform using the native 64-bit Goto libraries. 
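For reference, the word-size pitfall raised earlier in this thread can also be sidestepped inside the application by exchanging fixed-width values described with an MPI datatype whose size matches on both architectures: int and MPI_INT are 4 bytes on both i386 and x86_64, whereas long and MPI_LONG are not. A minimal sketch, assuming MPICH's C bindings and at least two ranks (this is illustrative only, not code from any of the posts above):

/* Minimal sketch: avoid sizeof(long) surprises between 32-bit and 64-bit
 * ranks by sending a fixed-width value with a datatype whose size matches
 * on both sides.  Compile with mpicc. */
#include <stdio.h>
#include <stdint.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    int32_t value = 0;                 /* 4 bytes on i386 and on x86_64 */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* long is 4 bytes on i386 but 8 bytes on x86_64 -- print it per rank */
    printf("rank %d: sizeof(long) = %lu\n", rank, (unsigned long) sizeof(long));

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", (int) value);
    }

    MPI_Finalize();
    return 0;
}

Launching one rank on each architecture prints the differing sizeof(long) values while the value itself is still exchanged correctly, because both sides agree on a 4-byte representation.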
> -----Original Message----- > From: beowulf-bounces at beowulf.org > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Egan Ford > Sent: Monday, January 24, 2005 11:53 AM > To: 'Sean Dilda'; cflau at clc.cuhk.edu.hk > Cc: beowulf at beowulf.org > Subject: RE: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster > > > I have tried it and it did not work (i.e. i686 + x86_64). I > also did not > spend a lot of time trying to figure it out. I know that > this method is > sound, it works great with hybrid ia64 and x86_64 clusters. > > Below is a .pbs script to automate running xhpl with multiple > arch. Each > xhpl binary must have a .$(uname -m) suffix. This was done > with Myrinet. > > The resulting pgfile will look like this (node14 really has 2 > procs, but > since mpirun started from node14 it already has one processor > assigned to > rank 0, so the pgfile only needs to describe the rest of the > processors). > > node14 1 /home/egan/bench/hpl/bin/xhpl.x86_64 > node10 2 /home/egan/bench/hpl/bin/xhpl.ia64 > node13 2 /home/egan/bench/hpl/bin/xhpl.x86_64 > node9 2 /home/egan/bench/hpl/bin/xhpl.ia64 > > Script: > > #PBS -l nodes=4:compute:ppn=2,walltime=10:00:00 > #PBS -N xhpl > > # prog name > PROG=xhpl.$(uname -m) > PROGARGS="" > > NODES=$PBS_NODEFILE > > # How many proc do I have? > NP=$(wc -l $NODES | awk '{print $1}') > > # create pgfile with rank 0 node with one less > # process because it gets one by default > ME=$(hostname -s) > N=$(egrep "^$ME\$" $NODES | wc -l | awk '{print $1}') > N=$(($N - 1)) > if [ "$N" = "0" ] > then > >pgfile > else > echo "$ME $N $PWD/$PROG" >pgfile > fi > > # add other nodes to pgfile > for i in $(cat $NODES | egrep -v "^$ME\$" | sort | uniq) > do > N=$(egrep "^$i\$" $NODES | wc -l | awk '{print $1}') > ARCH=$(ssh $i uname -m) > echo "$i $N $PWD/xhpl.$ARCH" > done >>pgfile > > # MPICH path > # mpirun is a script, no worries > MPICH=/usr/local/mpich/1.2.6..13/gm/x86_64/smp/pgi64/ssh/bin > PATH=$MPICH/bin:$PATH > > export LD_LIBRARY_PATH=/usr/local/goto/lib > > set -x > > # cd into the directory where I typed qsub > if [ "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ] > then > mpirun.ch_gm \ > -v \ > -pg pgfile \ > --gm-kill 5 \ > --gm-no-shmem \ > LD_LIBRARY_PATH=/usr/local/goto/lib \ > $PROG $PROGARGS > else > cd $PBS_O_WORKDIR > cat $PBS_NODEFILE >hpl.$PBS_JOBID > > mpirun.ch_gm \ > -pg pgfile \ > --gm-kill 5 \ > --gm-no-shmem \ > LD_LIBRARY_PATH=/usr/local/goto/lib \ > $PROG $PROGARGS >>hpl.$PBS_JOBID > fi > > exit 0 > > > -----Original Message----- > > From: beowulf-bounces at beowulf.org > > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Sean Dilda > > Sent: Friday, January 21, 2005 7:42 AM > > To: cflau at clc.cuhk.edu.hk > > Cc: beowulf at beowulf.org > > Subject: Re: [Beowulf] MPICH on heterogeneous (i386 + > x86_64) cluster > > > > > > John Lau wrote: > > > Hi, > > > > > > Have anyone try running MPI programs with MPICH on > > heterogeneous cluster > > > with both i386 and x86_64 machines? Can I use a i386 binary > > on the i386 > > > machines while use a x86_64 binary on the x86_64 machines > > for the same > > > MPI program? I thought they can communicate before but it > > seems that I > > > was wrong because I got error in the testing. > > > > > > Have anyone try that before? > > > > I've not tried it, but I can think of a few good reasons why > > you'd want > > to avoid it. Lets say you want to send some data that's > stored in a > > long from the x86_64 box to the x86 box. Well, on the > x86_64 box, a > > long takes up 8 bytes. 
But on the x86 box, it only takes 4 > > bytes. So, > > chances are some Bad Stuff(tm) is going to happen if you try > > to span an > > MPI program across architectures like that. > > > > On the other hand, the x86_64 box will run x86 code without a > > problem. > > So i suggest running x86 binaries (and mpich) libraries on > all of the > > boxes. While I haven't tested it myself, I can't think of > any reason > > why that wouldn't work. > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) > > visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From mathog at mendel.bio.caltech.edu Mon Jan 24 14:04:21 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Mon, 24 Jan 2005 14:04:21 -0800 Subject: [Beowulf] Writing MPICH2 programs Message-ID: > >A nice demo for a cluster is the parallel version of a raytracer. Google >for "mpi povray". With the graphics version you can see the blocks which >the slaves return, which is quite impressive. Even more impressive (assuming 20 nodes) run 20 jobs sequentially through the MPI version and then 20 single jobs, one per node (using SGE or MOSIX, for instance) on the compute nodes in parallel. Last time I tried that with POVray the total time to complete the 20 single jobs in parallel was something like 30% less than that for the 20 parallel jobs in order. Note that it was important to render to local storage on the compute nodes (/tmp, so it never actually hit disk there) and then copy the results back to the final NFS directory. That moves data in large chunks and since the jobs tend not to finish all at the same time it does a pretty fair job of keeping the network running efficiently. In another test where each node wrote results on the fly back to the common NFS directory performance wasn't nearly so good. The network went nuts trying to handle all of the smallish packets. (Only 100BaseT, maybe less of a problem on Myrinet or 1000BaseT.) Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From Dave.Shepherd at Emulex.Com Mon Jan 24 14:08:03 2005 From: Dave.Shepherd at Emulex.Com (Dave.Shepherd at Emulex.Com) Date: Mon, 24 Jan 2005 14:08:03 -0800 Subject: [Beowulf] Platform LSF vs. Beowulf Message-ID: <81A95FFFB8332F468D948BBD7AD9B9603C1661@xbt.ad.emulex.com> Hello all, I've been working to convince that powers that be to implementing a Beowulf Cluster as an addition to LSF. Our company is heavy into engineering tools like Synopsys & Cadence that are single threaded applications. Currently an engineer submits a job to LSF. LSF schedules the job and passes it to another system for execution. The job runs to completion on another system. This model requires many separate Linux systems configured and maintained similarly. Different engineers each submit there jobs to the LSF scheduler. I believe (and perhaps I am in error), that even with single threaded applications of the type above, that a Beowulf Cluster would schedule each separate single threaded application submitted by different engineers to separate system within the Beowulf cluster. 
In this model, the jobs would complete in about the same time as with LSF, but from a management point of view, I would have to only maintain the Cluster heads and not many separate independent hosts. Is there something that I'm not seeing here? Can someone tell me if there would or would not be any advantage to implementing a Beowulf over LSF in this environment? Is there anyone that might have a similar environment that could pass some advice? Platform also makes an HPC version. But I don't know if that add any benefit either. See http://www.platform.com/products/HPC/ Thank you _____ Dave Shepherd Network & Systems Engineer -------------- next part -------------- An HTML attachment was scrubbed... URL: From becker at scyld.com Mon Jan 24 20:35:38 2005 From: becker at scyld.com (Donald Becker) Date: Mon, 24 Jan 2005 20:35:38 -0800 (PST) Subject: [Beowulf] BWBUG meeting January 25, 2005 -- Greenbelt MD Message-ID: --- Special Notes: - This months meeting will be held in Greenbelt MD, not in Virginia - Both the starting time (2pm) and week are different this month. - See http://www.bwbug.org/ for full information and any corrections Date: Tuesday January 25, 2004 Time: 2:00 PM - 5:00 PM Location: Northrop Grumman IT, Greenbelt MD Titles: HP's Linux Cluster Offerings Oracle 10g on HP COTS servers Abstract: An overview of HP's Linux Cluster offerings specifically targeted around our newly announce Unified Cluster Portfolio (UPC) offerings covering entry to large scale cluster implementations with both HP developed software as well as HP strategic Partner driven solutions. Speakers: Dan Cox, HP, et al ____ This month's meeting will be in our Maryland venue, the Northrop Grumman Information Technology Offices at 7501 Greenway Center Drive, Suite 1200 Greenbelt, MD, 20770 See http://www.bwbug.org/ web page for directions. Registration on the web site is highly encourage to speed sign-in. As usual there will be door prizes and refreshments. Essential questions: Need to be a member?: No, and guests are welcome. Parking and parking fees: Free surface lot parking is readily available Ease of access: 30 seconds from the D.C. beltway Also as usual, the organizer and host for the meeting is T. Michael Fitzmaurice, Jr. 8110 Gatehouse Road, Suite 400W Falls Church, VA 22042 703-205-3132 office 240-475-7877 cell mail michael.fitzmaurice at ngc.com From mark.westwood at ohmsurveys.com Tue Jan 25 00:20:47 2005 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Tue, 25 Jan 2005 08:20:47 +0000 Subject: [Beowulf] Need advice on cluster hardware In-Reply-To: <20050120202057.70C75269C6D@smtp.kent.edu> References: <20050120202057.70C75269C6D@smtp.kent.edu> Message-ID: <41F6015F.80700@ohmsurveys.com> Steve Yes I can, no they need not be identical. If they don't all run the same O/S then your cluster might not meet some definitions of Beowulf, but that's another matter. Mark Steve Zapytowski wrote: > I wish to know if all machines in a Beowulf Cluster must be identical. > Can you please answer this question for me? 
> > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Mark Westwood Parallel Programmer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From rene at renestorm.de Tue Jan 25 02:45:33 2005 From: rene at renestorm.de (rene) Date: Tue, 25 Jan 2005 11:45:33 +0100 Subject: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster In-Reply-To: <00ed01c5024b$cbdd2e00$0183a8c0@oberon> References: <00ed01c5024b$cbdd2e00$0183a8c0@oberon> Message-ID: <200501251145.34022.rene@renestorm.de> Hi, for me it seems easier to add platform directories to the home and nfs dirs. We've done that for a heterogen cluster. starting a job would look something like: /opt//mpich/bin mpirun -np $NUMPROCS /home/user/program//binary But if your code sends something like sizeof(double) and that differs on the archs, you could ran into a problem. First solution sould be: Compile the x86_64 stuff on i686 too. It should run as well and works fine for us. The only reason why you shouldn't do that is, if you got code which uses different size of memory on the nodes in one rank (for example 3GB on i686 and 6GB x86_64) Or try to change the code. Normally you wouldn't send any MPI call which addicted to sizeof(long). You can send the length of the var first, if you don`t know how long it is on execution time. I do that while creating MPI_Comms Regards Rene Am Montag 24 Januar 2005 20:35 schrieb Egan Ford: > I should have added that xhpl.x86_64 and xhpl.ia64 are native 64 bit > binaries for each platform using the native 64-bit Goto libraries. > > > -----Original Message----- > > From: beowulf-bounces at beowulf.org > > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Egan Ford > > Sent: Monday, January 24, 2005 11:53 AM > > To: 'Sean Dilda'; cflau at clc.cuhk.edu.hk > > Cc: beowulf at beowulf.org > > Subject: RE: [Beowulf] MPICH on heterogeneous (i386 + x86_64) cluster > > > > > > I have tried it and it did not work (i.e. i686 + x86_64). I > > also did not > > spend a lot of time trying to figure it out. I know that > > this method is > > sound, it works great with hybrid ia64 and x86_64 clusters. > > > > Below is a .pbs script to automate running xhpl with multiple > > arch. Each > > xhpl binary must have a .$(uname -m) suffix. This was done > > with Myrinet. > > > > The resulting pgfile will look like this (node14 really has 2 > > procs, but > > since mpirun started from node14 it already has one processor > > assigned to > > rank 0, so the pgfile only needs to describe the rest of the > > processors). > > > > node14 1 /home/egan/bench/hpl/bin/xhpl.x86_64 > > node10 2 /home/egan/bench/hpl/bin/xhpl.ia64 > > node13 2 /home/egan/bench/hpl/bin/xhpl.x86_64 > > node9 2 /home/egan/bench/hpl/bin/xhpl.ia64 > > > > Script: > > > > #PBS -l nodes=4:compute:ppn=2,walltime=10:00:00 > > #PBS -N xhpl > > > > # prog name > > PROG=xhpl.$(uname -m) > > PROGARGS="" > > > > NODES=$PBS_NODEFILE > > > > # How many proc do I have? 
> > NP=$(wc -l $NODES | awk '{print $1}') > > > > # create pgfile with rank 0 node with one less > > # process because it gets one by default > > ME=$(hostname -s) > > N=$(egrep "^$ME\$" $NODES | wc -l | awk '{print $1}') > > N=$(($N - 1)) > > if [ "$N" = "0" ] > > then > > > > >pgfile > > > > else > > echo "$ME $N $PWD/$PROG" >pgfile > > fi > > > > # add other nodes to pgfile > > for i in $(cat $NODES | egrep -v "^$ME\$" | sort | uniq) > > do > > N=$(egrep "^$i\$" $NODES | wc -l | awk '{print $1}') > > ARCH=$(ssh $i uname -m) > > echo "$i $N $PWD/xhpl.$ARCH" > > done >>pgfile > > > > # MPICH path > > # mpirun is a script, no worries > > MPICH=/usr/local/mpich/1.2.6..13/gm/x86_64/smp/pgi64/ssh/bin > > PATH=$MPICH/bin:$PATH > > > > export LD_LIBRARY_PATH=/usr/local/goto/lib > > > > set -x > > > > # cd into the directory where I typed qsub > > if [ "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ] > > then > > mpirun.ch_gm \ > > -v \ > > -pg pgfile \ > > --gm-kill 5 \ > > --gm-no-shmem \ > > LD_LIBRARY_PATH=/usr/local/goto/lib \ > > $PROG $PROGARGS > > else > > cd $PBS_O_WORKDIR > > cat $PBS_NODEFILE >hpl.$PBS_JOBID > > > > mpirun.ch_gm \ > > -pg pgfile \ > > --gm-kill 5 \ > > --gm-no-shmem \ > > LD_LIBRARY_PATH=/usr/local/goto/lib \ > > $PROG $PROGARGS >>hpl.$PBS_JOBID > > fi > > > > exit 0 > > > > > -----Original Message----- > > > From: beowulf-bounces at beowulf.org > > > [mailto:beowulf-bounces at beowulf.org] On Behalf Of Sean Dilda > > > Sent: Friday, January 21, 2005 7:42 AM > > > To: cflau at clc.cuhk.edu.hk > > > Cc: beowulf at beowulf.org > > > Subject: Re: [Beowulf] MPICH on heterogeneous (i386 + > > > > x86_64) cluster > > > > > John Lau wrote: > > > > Hi, > > > > > > > > Have anyone try running MPI programs with MPICH on > > > > > > heterogeneous cluster > > > > > > > with both i386 and x86_64 machines? Can I use a i386 binary > > > > > > on the i386 > > > > > > > machines while use a x86_64 binary on the x86_64 machines > > > > > > for the same > > > > > > > MPI program? I thought they can communicate before but it > > > > > > seems that I > > > > > > > was wrong because I got error in the testing. > > > > > > > > Have anyone try that before? > > > > > > I've not tried it, but I can think of a few good reasons why > > > you'd want > > > to avoid it. Lets say you want to send some data that's > > > > stored in a > > > > > long from the x86_64 box to the x86 box. Well, on the > > > > x86_64 box, a > > > > > long takes up 8 bytes. But on the x86 box, it only takes 4 > > > bytes. So, > > > chances are some Bad Stuff(tm) is going to happen if you try > > > to span an > > > MPI program across architectures like that. > > > > > > On the other hand, the x86_64 box will run x86 code without a > > > problem. > > > So i suggest running x86 binaries (and mpich) libraries on > > > > all of the > > > > > boxes. While I haven't tested it myself, I can't think of > > > > any reason > > > > > why that wouldn't work. 
> > > _______________________________________________ > > > Beowulf mailing list, Beowulf at beowulf.org > > > To change your subscription (digest mode or unsubscribe) > > > visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org > > To change your subscription (digest mode or unsubscribe) > > visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Rene Storm @Cluster From kinghorn at pqs-chem.com Tue Jan 25 08:16:33 2005 From: kinghorn at pqs-chem.com (Donald Kinghorn) Date: Tue, 25 Jan 2005 10:16:33 -0600 Subject: [Beowulf] real hard drive failures Message-ID: <200501251016.33047.kinghorn@pqs-chem.com> I'm only partially interested in the thread "Cooling vs HW replacement" but the problem with drive failures is a real pain for me. So, I thought I'd share some of my experience. Background: I do clusters for computational chemistry and every node has two drives raid striped for scratch since some comp chem procedures require huge amounts of scratch space. Our older systems were typical rack mounts but overt the last year and a half we have used a custom chassis with better cooling ... We have used mostly Western Digital (WD) drives for > 4 years. We use the higher rpm and larger cache varieties ... We also used IBM 60GB drives for a while and some of you will have experienced that mess ... approx. 80% failure over 1 year time frame! Observations on WD drive failures: (estimates) WD 20, 40, 60 GB drives in the field for 3+ years, [~600 drives] very few, ( <1%) failures most machines have been retired. WD 80GB drives in the field for 1+ years, [~500 drives] "ARRRRGGGG!" ~15% failure and increasing. I send out 3-5 replacement drives every month. WD 120 and 200GB SATA in the field <1 year, [~400 drives] one failure so far. I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time is not) and considering changing to Seagate SATA drives anyone care to offer opinions or more anecdotes? :-) Best wishes to all -Don -- Dr. Donald B. Kinghorn Parallel Quantum Solutions LLC http://www.pqs-chem.com From alvin at Mail.Linux-Consulting.com Tue Jan 25 13:42:05 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Tue, 25 Jan 2005 13:42:05 -0800 (PST) Subject: [Beowulf] real hard drive failures In-Reply-To: <200501251016.33047.kinghorn@pqs-chem.com> Message-ID: hi ya donald On Tue, 25 Jan 2005, Donald Kinghorn wrote: > I'm only partially interested in the thread "Cooling vs HW replacement" but > the problem with drive failures is a real pain for me. So, I thought I'd > share some of my experience. i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm disks if the warranty is 1Yr... your disks might start to die at about 1.5yr so if the warranty is 3yrs, your disks "might" start to die at about 3.5yrs - or just a day after warranty expired ( from the day it arrived ) > We have used mostly Western Digital (WD) drives for > 4 years. We use the > higher rpm and larger cache varieties ... 8MB cache versions tend to be better > We also used IBM 60GB drives for a while and some of you will have experienced > that mess ... approx. 80% failure over 1 year time frame! 
80% failure is way way ( 15x) too high, but if its deskstar ( from thailand) than, those disks are known to be bad if it's not the deskstar, than you probably have a vendor problem of the folks that sold those disks to you > WD 20, 40, 60 GB drives in the field for 3+ years, [~600 drives] very few, ( > <1%) failures most machines have been retired. good .. normal ... > WD 80GB drives in the field for 1+ years, [~500 drives] "ARRRRGGGG!" ~15% > failure and increasing. I send out 3-5 replacement drives every month. probably running too hot ... needs fans cooling the disks - get those "disk coolers with 2 fans on it ) > WD 120 and 200GB SATA in the field <1 year, [~400 drives] one failure so far. very good .. but too early to tell ... > I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time > is not) and considering changing to Seagate SATA drives anyone care to offer > opinions or more anecdotes? :-) == using 4 drive raid is better ... but is NOT the solution == - configuring raid is NOT cheap ... - fixing raid is expensive time ... (due to mirroring and syncing) - if downtime is important, and should be avoidable, than raid is the worst thing, since it's 4x slower to bring back up than a single disk failure - raid will NOT prevent your downtime, as that raid box will have to be shutdown sooner or later ( shutting down sooner ( asap ) prevents data loss ) - if you want the system to keep working while you move data to another node .. than raid did what it supposed to in keeping your box up after a drive failure, but, that failed disk still need to be replaced asap - if downtime is not exceptable... ( high availability is what you'd want) have 2 nodes that supports the same data ( data is mirrored ( manually or rsync ) on 2 different nodes ) you see it as one system .. ( like www.any-huge-domain.com ) ( just one "www" even if lots of machines behind it ) c ya alvin From mwill at penguincomputing.com Tue Jan 25 13:59:27 2005 From: mwill at penguincomputing.com (Michael Will) Date: Tue, 25 Jan 2005 13:59:27 -0800 Subject: [Beowulf] real hard drive failures In-Reply-To: References: Message-ID: <200501251359.27731.mwill@penguincomputing.com> On Tuesday 25 January 2005 01:42 pm, Alvin Oga wrote: > > I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time > > is not) and considering changing to Seagate SATA drives anyone care to offer > > opinions or more anecdotes? :-) > > == using 4 drive raid is better ... but is NOT the solution == > > - configuring raid is NOT cheap ... Depends on what controller you use. 3ware escalade can be scripted with tw_cli. > - fixing raid is expensive time ... (due to mirroring and syncing) Lower performance while reconstructing and degraded, but back to normal after the replacing drive has been updated. > - if downtime is important, and should be avoidable, than raid > is the worst thing, since it's 4x slower to bring back up than > a single disk failure You are talking about double disk failure that brings down the raid5? > - raid will NOT prevent your downtime, as that raid box > will have to be shutdown sooner or later > ( shutting down sooner ( asap ) prevents data loss ) Hot-swappable drive bays should be the standard nowadays, and do not require shutdown of the server. Michael -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at LinuxWorld 2005! 
Hynes Convention Center, Boston, MA February 15th-17th, 2005 Booth 609 From hahn at physics.mcmaster.ca Tue Jan 25 14:26:36 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Tue, 25 Jan 2005 17:26:36 -0500 (EST) Subject: [Beowulf] real hard drive failures In-Reply-To: Message-ID: > > I'm only partially interested in the thread "Cooling vs HW replacement" but > > the problem with drive failures is a real pain for me. So, I thought I'd > > share some of my experience. > > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm > disks I'm pretty dubious of this: adding two 50Khour moving parts to improve the airflow around a 1Mhour moving part which only dissipates 10W in the first place? designing the chassis for proper airflow with minimum fanage is obviously smarter and probably safer. > - if downtime is important, and should be avoidable, than raid > is the worst thing, since it's 4x slower to bring back up than > a single disk failure eh? you have a raid which is not operational while rebuilding? > - raid will NOT prevent your downtime, as that raid box > will have to be shutdown sooner or later > ( shutting down sooner ( asap ) prevents data loss ) huh? hotspares+hotplug=zero downtime. but yes, treating whole servers as your hotspare+hotplug element is a nice optimization, since hotplug ethernet is pretty cheap vs $50 hotplug caddies for each and every disk ;) From shaeffer at neuralscape.com Tue Jan 25 11:21:21 2005 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Tue, 25 Jan 2005 11:21:21 -0800 Subject: [Beowulf] real hard drive failures In-Reply-To: <200501251016.33047.kinghorn@pqs-chem.com> References: <200501251016.33047.kinghorn@pqs-chem.com> Message-ID: <20050125192121.GA19159@synapse.neuralscape.com> On Tue, Jan 25, 2005 at 10:16:33AM -0600, Donald Kinghorn wrote: > WD 120 and 200GB SATA in the field <1 year, [~400 drives] one failure so far. > > I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time > is not) and considering changing to Seagate SATA drives anyone care to offer > opinions or more anecdotes? :-) Hi Don, For 250 GB SATA drives, I would recommend either Hitachi or Seagate. Hitachi drives may exhibit nominally better performance numbers. For 400GB SATA drives, I don't have enough experience to recommend anything, but I have used 400 GB Hitachi SATA drives recently and all is going well at this time. It's too early to draw conclusions. Thanks, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From rhodas at gmail.com Tue Jan 25 18:07:27 2005 From: rhodas at gmail.com (Rolando Espinoza La Fuente) Date: Tue, 25 Jan 2005 22:07:27 -0400 Subject: [Beowulf] python & Lush on a cluster (Newbie question) Message-ID: <94b391ae05012518073d3592bd@mail.gmail.com> Hi :) (my english isn't very good...) I'll build a basic beowulf cluster for numerical "research" (i hope), what do you think about using python and lush for programming on the cluster? Anybody has comments about lush? Better way (language... than C/Fortran) for programming (numerical apps) on the cluster? Thanks in advance. PD: If you don't know about lush: Lush is an object-oriented programming language designed for researchers, experimenters, and engineers interested in large-scale numerical and graphic applications. Lush's main features includes: * A very clean, simple, and easy to learn Lisp-like syntax. 
* A compiler that produces very efficient C code and relies on the C compiler to produce efficient native code (no inefficient bytecode or virtual machine) . .... more info: http://lush.sourceforge.net/ -- (c) RHODAS: Robotic Humanoid Optimized for Destruction and Accurate Sabotage (w) http://darkstar.fcyt.umss.edu.bo/~rolando From rgb at phy.duke.edu Tue Jan 25 22:52:32 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 26 Jan 2005 01:52:32 -0500 (EST) Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: <94b391ae05012518073d3592bd@mail.gmail.com> References: <94b391ae05012518073d3592bd@mail.gmail.com> Message-ID: On Tue, 25 Jan 2005, Rolando Espinoza La Fuente wrote: > Hi :) > > (my english isn't very good...) > > I'll build a basic beowulf cluster for numerical "research" (i hope), > what do you think about using python and lush for programming on the > cluster? > > Anybody has comments about lush? > > Better way (language... than C/Fortran) for programming (numerical > apps) on the cluster? Boy, you don't know the risks you run asking for advice on languages on this list. Folks have, um, "strong" is too weak a word -- opinions. As a general rule, though, if you are serious enough about numerical research to build a cluster in the first place to speed up the computations, you will USUALLY be better off using a proper compiled language such as C, C++, or Fortran than using any sort of interpreted language. I will avoid endorsing any one of these three at the expense of the other two, but well-written code in any of them will generally blow away equally well written code in an interpreted language (with a few possible exceptions). You'd have to run tests to get some idea of the difference, but at a guess an interpreter will be around an order of magnitude slower. This means you'd need some ten nodes running in efficient parallel just to break even. Additionally, "real" parallel numerical programming requires library support that is generally not available for anything but real compiled languages. To use e.g. MPI or PVM to write a distributed program, you'll pretty much need one of these three. Some of the advanced commercial compilers have a certain amount of built-in support for parallel programs as well. Numerical libraries, e.g. the GSL, are only likely to be available for and run efficiently within compiled code, although I've heard rumors of ports into some interpreted languages as well. > Thanks in advance. > > PD: If you don't know about lush: > > Lush is an object-oriented programming language designed for > researchers, experimenters, and engineers interested in large-scale > numerical and graphic applications. > Lush's main features includes: > * A very clean, simple, and easy to learn Lisp-like syntax. > * A compiler that produces very efficient C code and relies > on the C compiler to produce efficient native code > (no inefficient bytecode or virtual machine) > . .... > more info: http://lush.sourceforge.net/ You already know more about lush than I do, obviously. Here I cannot help you. Perhaps this is a possible solution, but as a generally cynical person about translation engines I doubt it. In particular, I'd want to see those claims demonstrated in real benchmark code. Getting maximal performance out of a system often requires some fairly subtle tricks, tricks that translators are unlikely to be able to figure out and implement safely. My experiences with tools "like" this (e.g. 
f2c) are that they perform a sort of linear translation of the language elements into e.g. C code fragments and assemble them into something that is logically identical but often impossible to read (on the translated side) and not terribly efficiently translated. Now fortran is of course an upper-level compiled language not THAT dissimilar to C, and yet f2c produces illegible code with all sorts of wrapper-enclosed subroutine calls to do the translation of various fortran functions into something that can be called in C, with automatically generated variables, with goto statements and crude loop structure. f2c is (was) a fairly mature product and has been around a long time. This is the basis of my cynicism. Would a translator of a "lisp like" upper level language that is very DISsimilar to C be capable of doing a decent job of producing legible C code that isn't heavily instrumented with black box calls and perverse loop and conditional constructs? What does it do for arrays and pointers? Is it likely to do better than the ugly job done by f2c, when fortran actually CAN be translated to legible C pretty easily by humans but not, apparently, by computers? If you decide to forge on ahead with lush, report back in a few months and tell us how things are going -- I for one am very curious as to how you'll do. If possible, run some comparative benchmarks in native C and in lush-based C. See if you can make lush create PVM or MPI-based parallel applications. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From eugen at leitl.org Wed Jan 26 01:32:48 2005 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 26 Jan 2005 10:32:48 +0100 Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: References: <94b391ae05012518073d3592bd@mail.gmail.com> Message-ID: <20050126093248.GA1404@leitl.org> On Wed, Jan 26, 2005 at 01:52:32AM -0500, Robert G. Brown wrote: > As a general rule, though, if you are serious enough about numerical > research to build a cluster in the first place to speed up the > computations, you will USUALLY be better off using a proper compiled > language such as C, C++, or Fortran than using any sort of interpreted You can have both, actually. It's easy to extend Python (via SWIG, or other means), see e.g. Konrad Hinsen's MMTK or Warren DeLano's PyMol. Python works with MPI as well, but this looks a bit experimental in places. e.g. https://geodoc.uchicago.edu/climatewiki/DiscussPythonMPI ... import pypar # The Python-MPI interface numproc = pypar.size() # Number of processes as specified by mpirun myid = pypar.rank() # Id of of this process (myid in [0, numproc-1]) node = pypar.get_processor_name() # Host name on which current process is running print "I am proc %d of %d on node %s" %(myid, numproc, node) if myid == 0: # Actions for process 0 msg = "P0" pypar.send(msg, destination=1) # Send message to proces 1 (right hand neighbour) msg = pypar.receive(source=numproc-1) # Receive message from last process print 'Processor 0 received message "%s" from processor %d' %(msg, numproc-1) else: # Actions for all other processes source = myid-1 destination = (myid+1)%numproc msg = pypar.receive(source) msg = msg + 'P' + str(myid) # Update message pypar.send(msg, destination) pypar.finalize() > language. 
I will avoid endorsing any one of these three at the expense > of the other two, but well-written code in any of them will generally > blow away equally well written code in an interpreted language (with a > few possible exceptions). You'd have to run tests to get some idea of > the difference, but at a guess an interpreter will be around an order of > magnitude slower. This means you'd need some ten nodes running in > efficient parallel just to break even. > > Additionally, "real" parallel numerical programming requires library > support that is generally not available for anything but real compiled > languages. To use e.g. MPI or PVM to write a distributed program, > you'll pretty much need one of these three. Some of the advanced > commercial compilers have a certain amount of built-in support for > parallel programs as well. Numerical libraries, e.g. the GSL, are > only likely to be available for and run efficiently within compiled > code, although I've heard rumors of ports into some interpreted > languages as well. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From alvin at Mail.Linux-Consulting.com Wed Jan 26 01:50:27 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Wed, 26 Jan 2005 01:50:27 -0800 (PST) Subject: [Beowulf] real hard drive failures In-Reply-To: Message-ID: hi ya mark On Tue, 25 Jan 2005, Mark Hahn wrote: > > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm > > disks > > I'm pretty dubious of this: adding two 50Khour moving parts to > improve the airflow around a 1Mhour moving part which only dissipates > 10W in the first place? designing the chassis for proper airflow > with minimum fanage is obviously smarter and probably safer. the purpose of the fan is to keep the hdd temp down, low as possible - while the disks have a 1M hr MTBF, those disks still fail - most often, fans fails before anything else, and creates a chain reaction in that the item it was cooling will be next to fail - you can detect the fan failure ( tach signal ) and replace the fan before the hard disk fails - a disk that runs 10C cooler will allow that disk to live 2x as long before it dies, given the same operating conditions - there is very very few chassis with proper airflow - those silly carboard aroudn the cpu heatink is silly, in that if that one fan dies, the cpu will die - if you have 2 or 3 separate fans around it, than it will not matter tha one fan died - proper airflow has always been the trick to keeping the system running for a year or threee and "good parts vendors" and "good parts selection" makes all the difference in the world > > - if downtime is important, and should be avoidable, than raid > > is the worst thing, since it's 4x slower to bring back up than > > a single disk failure > > eh? you have a raid which is not operational while rebuilding? if the raid is in degraded mode ... you do NOT have "raid" if it's resyncing ... you do NOT have raid .. if another disk dies while its operating in degraded mode or during resync ... 
you have a very high possibility that the whole raid array is toast it'd just depends on why and how it failed > > - raid will NOT prevent your downtime, as that raid box > > will have to be shutdown sooner or later > > ( shutting down sooner ( asap ) prevents data loss ) > > huh? hotspares+hotplug=zero downtime. you're assuming that "hot plug" work as its supposed to - i usually get the phone calls after the raid didnt do its magic for some odd reason hotspare should work by shutting down (hotremove) the failed disks and hotadding the previously idle/unused hotspare or in the case of hw raid.. jsut pull the disk and plug in a new one > but yes, treating whole servers as your hotspare+hotplug element is > a nice optimization, since hotplug ethernet is pretty cheap vs > $50 hotplug caddies for each and every disk ;) i like/require redundnacy of the entire system .. not just a disk - a complete 2nd independent system ( 2nd pw, 2nd mb, 2nd cpu, 2nd memory, 2nd disks, 2nd nic ) - but if the t1/router/switch does down .. oh well, but that too is cheap to get a 2nd backup ( its cheap compared to downtime, where its important ) i think it/these, covers michael's replies too c ya alvin From mark.westwood at ohmsurveys.com Wed Jan 26 01:56:59 2005 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Wed, 26 Jan 2005 09:56:59 +0000 Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: <94b391ae05012518073d3592bd@mail.gmail.com> References: <94b391ae05012518073d3592bd@mail.gmail.com> Message-ID: <41F7696B.4010508@ohmsurveys.com> Rolando I know nothing about Lush (or Python for that matter). Here's my off-the-cuff opinion: Numbers + Clusters => Fortran One of the problems you might have using Python and Lush is interfacing to MPI. If your objective is numerical research you might want to avoid learning too much about this sort of issue. With Fortran / C / C++ you can get all the components you need off the shelf. Good luck with your research. Mark PS Your English seems pretty good to me :-) Rolando Espinoza La Fuente wrote: > Hi :) > > (my english isn't very good...) > > I'll build a basic beowulf cluster for numerical "research" (i hope), > what do you think about using python and lush for programming on the > cluster? > > Anybody has comments about lush? > > Better way (language... than C/Fortran) for programming (numerical > apps) on the cluster? > > Thanks in advance. > > PD: If you don't know about lush: > > Lush is an object-oriented programming language designed for > researchers, experimenters, and engineers interested in large-scale > numerical and graphic applications. > Lush's main features includes: > * A very clean, simple, and easy to learn Lisp-like syntax. > * A compiler that produces very efficient C code and relies > on the C compiler to produce efficient native code > (no inefficient bytecode or virtual machine) > . .... 
> more info: http://lush.sourceforge.net/ > -- Mark Westwood Parallel Programmer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From srgadmin at cs.hku.hk Wed Jan 26 01:48:12 2005 From: srgadmin at cs.hku.hk (srg-admin) Date: Wed, 26 Jan 2005 17:48:12 +0800 Subject: [Beowulf] Call for Papers: Grid2005 Message-ID: <41F7675C.40102@cs.hku.hk> Call for Papers: Grid 2005 - 6th IEEE/ACM International Workshop on Grid Computing (held in conjunction with SC05) November 14, 2005, Seattle, Washington, USA http://pat.jpl.nasa.gov/public/grid2005/ http://www.gridcomputing.org/ ******************************************************************* General Information In the last few years, the Grid community has been growing very rapidly and many new technologies and components have been proposed. This, along with the growing popularity of web-based technologies, and the availability of cheap commodity components is changing the way we do computing and business. There are now many ongoing grid projects with research and production-oriented goals. Grid 2005 is an international meeting that brings together a community of researchers, developers, practitioners, and users involved with the Grid. The objective of Grid 2005 is to serve as a forum to present current and emerging work as well as to exchange research ideas in this field. The previous events in this series were: Grid 2000, Bangalore, India; Grid 2001, Denver; Grid 2002, Baltimore; Grid 2003, Phoenix; and Grid 2004, Pittsburgh. All of these events have been successful in attracting high quality papers and a wide international participation. Last year's event attracted about 400 registered participants. The proceedings of the first three workshops were published by Springer-Verlag, and the proceedings of the two most recent workshops were published by the IEEE Computer Society Press. We expect this year's proceedings will join those of the last two years in the IEEE Computer Society's Digital Library. 
Grid 2005 topics of interest (in no particular order) include, but are not limited to: * Internet-based Computing Models * Grid Applications, including eScience and eBusiness Applications * Data Grids, including Distributed Data Access and Management * Grid Middleware and Toolkits * Grid Monitoring, Management and Organization Tools * Resource Management and Scheduling * Networking * Virtual Instrumentation * Grid Object Metadata and Schemas * Creation and Management of Virtual Enterprises and Organizations * Grid Architectures and Fabrics * Grid Information Services * Grid Security Issues * Programming Models, Tools, and Environments * Grid Economy * Autonomic and Utility Computing on Global Grids * Performance Evaluation and Modeling * Cluster and Grid Integration Issues * Scientific, Industrial and Social Implications ******************************************************************* Important Dates 27 May 2005 Full paper submission due 29 July 2005 Acceptance notification 19 August 2005 Camera-ready copy due 14 November 2005 Workshop 27 January 2006 Extended versions of best 6-8 papers due Fall 2006 Grid 2005 special issue of IJHPCN (Issue 3 of 2006) to appear ******************************************************************* Paper Submission and Publication Grid 2005 invites authors to submit original and unpublished work (also not submitted elsewhere for review) reporting solid and innovative results in any aspect of grid computing and its applications. Papers should not exceed 8 single-spaced pages of text using 10-point size type on 8.5 x 11 inch paper (see IEEE author instructions at http://www.computer.org/cspress/instruct.htm). All bibliographical references, tables, and figures must be included in these 8 pages.Submissions that exceed the 8-page limit will not be reviewed. Authors should submit a PDF file that will print on a PostScript printer. Electronic submission is required. The URL of the site for submission will be listed on the workshop website. Submission implies the willingness of at least one of the authors to register and present the paper. Proceedings: All papers selected for this workshop are peer-reviewed and will be published as a separate proceedings. After the event, the papers will be published in IEEE Xplore and in the CS digital library. Special Issue: The best 6 to 8 papers from the workshop will be selected for journal length extension and their publication in a special issue of International Journal of High Performance Computing and Networking (IJHPCN). The special issue is expected to be published in Fall 2006 as issue 3 of 2006. ******************************************************************* Conference Organization General Chair: Wolfgang Gentzsch, MCNC, USA Program Chair: Daniel S. Katz, JPL/Caltech, USA Program Vice Chairs: * Applications: Alan Sussman, University of Maryland, USA * Data grids: Heinz Stockinger, University of Vienna, Austria * Networking/Security/Infrastructure: Olle Mulmo, Royal Institute of Technology (KTH), Sweden * Scheduling/Resource management: Henri Casanova, UCSD, USA * Tools/Software/Middleware: Jennifer Schopf, National e-Science Centre/Argonne National Lab, UK/US Publicity Chair: Cho-Li Wang, University of Hong Kong, China Proceedings Chair: Joseph C. 
Jacob, JPL/Caltech, USA Program Committee: being determined, see web site for updates Steering Committee: * Mark Baker, University of Portsmouth, UK * Rajkumar Buyya, University of Melbourne, Australia * Craig Lee, Aerospace Corp., USA * Manish Parashar, Rutgers University, USA * Heinz Stockinger, University of Vienna, Austria Contact For further information on Grid 2005, please contact the Program Chair: Daniel S. Katz - d.katz at ieee.org ***************************************************** We apologize if this information is not of your interest. In that case please send e-mail to: majordomo at cs.hku.hk, with Content: "unsubscribe srg-CFP" ****************************************************** From mathog at mendel.bio.caltech.edu Wed Jan 26 08:25:23 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Wed, 26 Jan 2005 08:25:23 -0800 Subject: [Beowulf] RE: real hard drive failures Message-ID: > > > - raid will NOT prevent your downtime, as that raid box > > > will have to be shutdown sooner or later > > > ( shutting down sooner ( asap ) prevents data loss ) > > > > huh? hotspares+hotplug=zero downtime. > > you're assuming that "hot plug" work as its supposed to > - i usually get the phone calls after the raid didnt > do its magic for some odd reason My impression, based solely on web research and not personal experience, is that RAIDs that don't rebuild are often suffering from "latent lost block" syndrome. That is, a block on disk 1 has gone bad, but hasn't been read yet, so that bad block is "latent". Then disk 2 fails. The RAID tries to rebuild and now tries to read the bad block on disk 1, gets a read error, and that's pretty much all she wrote for that chunk of data. The "fix" is to disk scrub, forcing reads of every block on every disk periodically, and so converting the "latent" bad blocks into "known" bad blocks at a time when the RAID still has sufficient information to rebuild a lost disk block. Also use SMART to keep track of disks which have started to swap out blocks and replace them before they fail totally. Deciding how many bad blocks is too many on a drive seems like it might be a fairly complex decision in a storage environment involving hundreds or thousands of disks. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From agrajag at dragaera.net Wed Jan 26 05:21:41 2005 From: agrajag at dragaera.net (Sean Dilda) Date: Wed, 26 Jan 2005 08:21:41 -0500 Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: <94b391ae05012518073d3592bd@mail.gmail.com> References: <94b391ae05012518073d3592bd@mail.gmail.com> Message-ID: <41F79965.30603@dragaera.net> Rolando Espinoza La Fuente wrote: >Hi :) > >(my english isn't very good...) > >I'll build a basic beowulf cluster for numerical "research" (i hope), >what do you think about using python and lush for programming on the >cluster? > > > A number of my users use python for their jobs. However, the real number crunching isn't done by python. Instead they have 3rd party python modules like SciPy or numarray which do the heavy lifting. And those python modules are usually compiled C code. This way they get some of the speedups of running compiled C code combined with the ease of structuring their program in python. I haven't benchmarked or anything. But I'm betting that python with compiled C modules still isn't as fast as pure C code. On the other hand, using python may cause your researchers to code and debug the program much quicker. 
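A minimal sketch of the speedup Sean describes, with NumPy standing in for the numarray/SciPy modules mentioned in the thread (an assumption -- the exact module and workload are not given). The same reduction is done once in an interpreted Python loop and once inside the compiled C extension:

import time
import numpy as np

N = 2_000_000
data = np.random.rand(N)

# Pure-Python loop: every addition goes through the interpreter.
t0 = time.perf_counter()
total = 0.0
for x in data:
    total += x
t_loop = time.perf_counter() - t0

# Vectorized call: the loop runs in compiled C inside the extension module.
t0 = time.perf_counter()
total_vec = data.sum()
t_vec = time.perf_counter() - t0

print(f"interpreted loop: {t_loop:.3f}s   compiled extension: {t_vec:.3f}s")

The program structure and debugging still happen in Python; only the inner loop is delegated to compiled code, which is exactly the trade-off described above.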
>Anybody has comments about lush? > >Better way (language... than C/Fortran) for programming (numerical >apps) on the cluster? > > Traditionally the HPC community is a big user of Fortran. Although I've never coded in it, I'm told its wonderful for dealing with large matrices. C is also quite popular. From nicas at freemail.it Wed Jan 26 08:09:51 2005 From: nicas at freemail.it (nicas at freemail.it) Date: 26 Jan 2005 16:09:51 -0000 Subject: [Beowulf] Problem with `mpi_init_' in MM5 MPP Message-ID: <20050126160951.24196.qmail@mail.supereva.it> Hi, I'm Nicola Nicastro, an italian student. I ask you if you can help me I'm using MM5 in MPP mode on LINUX CLUSTER, I have been trying to get it compiled but I have run into some problems, about `mpi_init_'. I write the last lines of terminal after the "make mpp" ....................... /opt/mpich-1.2.5.10-ch_p4-gcc/bin/mpicc -c -I../../MPP -I../../MPP/RSL -I../../pick -I../../MPP/debug -I../../MPP/RSL/RSL -I/opt/mpich-1.2.5.10-ch_p4-gcc/include -DMPP1 -DIOR=2 -DIWORDSIZE=4 -DRWORDSIZE=4 -DLWORDSIZE=4 -DASSUME_HOMOGENEOUS_ENVIRONMENT=1 -DMPI -I/opt/mpich-1.2.5.10-ch_p4-gcc/include milliclock.c /opt/mpich-1.2.5.10-ch_p4-gcc/bin/mpif77 -o mm5.mpp addall.o addrx1c.o addrx1n.o bdyin.o bdyrst.o bdyten.o bdyval.o cadjmx.o coef_diffu.o condload.o consat.o convad.o couple.o date.o dcpl3d.o dcpl3dwnd.o decouple.o define_comms.o diffu.o dm_io.o dots.o dtfrz.o fillcrs.o fkill_model.o gamma.o gauss.o hadv.o init.o initsav.o initts.o kfbmdata.o kill_model.o lb_alg.o lbdyin.o mhz.o mm5.o mp_equate.o mp_initdomain.o mp_shemi.o mparrcopy.o mpaspect.o nconvp.o nudge.o output.o outsav.o outtap.o outts.o outts_c.o param.o paramr.o rdinit.o rho_mlt.o savread.o settbl.o setvegfr.o sfcrad.o shutdo.o slab.o solar1.o solve.o sound.o subch.o trans.o transm.o upshot_mm5.o vadv.o vecgath.o write_big_header.o write_fieldrec.o write_flag.o exmoiss.o cup.o cupara3.o maximi.o minimi.o mrfpbl.o tridi2.o initnest.o chknst.o nstlev1.o nstlev2.o nstlev3.o mp_stotndt.o smt2.o bcast_size.o merge_size.o mp_feedbk.o rdter.o lwrad.o swrad.o milliclock.o ../../MPP/RSL/RSL/librsl.a -O2 -Mcray=pointer -tp p6 -pc 32 -Mnoframe -byteswapio -L/opt/mpich-1.2.5.10-ch_p4-gcc/lib -lfmpich -lmpich ../../MPP/RSL/RSL/librsl.a(rsl_mpi_compat.o)(.text+0x51): In function `rslMPIInit': : undefined reference to `mpi_init_' make[1]: [all] Error 2 (ignored) /bin/mv mm5.mpp ../../Run/mm5.mpp /bin/mv: can't stat source mm5.mpp make[1]: [all] Error 1 (ignored) make[1]: Leaving directory `/home/nnicastro/MM5V3/MM5/MPP/build' [nnicastro at kali MM5]$ Have you ever had the same problem? My configure.user is: RUNTIME_SYSTEM = "linux" MPP_TARGET=$(RUNTIME_SYSTEM) # edit the following definition for your system LINUX_MPIHOME = /opt/mpich-1.2.5.10-ch_p4-gcc MFC = $(LINUX_MPIHOME)/bin/mpif77 MCC = $(LINUX_MPIHOME)/bin/mpicc MLD = $(LINUX_MPIHOME)/bin/mpif77 FCFLAGS = -O2 -Mcray=pointer -tp p6 -pc 32 -Mnoframe -byteswapio LDOPTIONS = -O2 -Mcray=pointer -tp p6 -pc 32 -Mnoframe -byteswapio LOCAL_LIBRARIES = -L$(LINUX_MPIHOME)/lib -lfmpich -lmpich MAKE = make -i -r AWK = awk SED = sed CAT = cat CUT = cut EXPAND = expand M4 = m4 CPP = /lib/cpp -C -P -traditional CPPFLAGS = -DMPI -Dlinux -DSYSTEM_CALL_OK CFLAGS = -DMPI -I$(LINUX_MPIHOME)/include ARCH_OBJS = milliclock.o IWORDSIZE = 4 RWORDSIZE = 4 LWORDSIZE = 4 Can you tell me if I have to change or set something? Thank you. Nick. --------------------------------------------------------------- Scegli il tuo dominio preferito e attiva la tua email! 
Da oggi l'eMail di superEva e' ancora piu' veloce e ricca di funzioni! http://webmail.supereva.it/new/ --------------------------------------------------------------- From Roverite3 at aol.com Wed Jan 26 11:33:46 2005 From: Roverite3 at aol.com (Roverite3 at aol.com) Date: Wed, 26 Jan 2005 14:33:46 EST Subject: [Beowulf] help Message-ID: <12d.54a62532.2f294a9a@aol.com> I am running mpi and getting the following message please help. mpirun - v - npz test _mpi.mpich running/root/test_mpi.mpich on 2 linux ch_p4 processors created/root/pi16256 p0_16414: p4_error:child process exists while making connection to remote process on node1.enterprise.net:0 /user/bin/mpirun: line1: 16414 broken pipe /root/test_mpi.mpich - p4pg.root.pi16256 - p4wd root its for my hons project terry -------------- next part -------------- An HTML attachment was scrubbed... URL: From lusk at mcs.anl.gov Wed Jan 26 17:39:35 2005 From: lusk at mcs.anl.gov (Rusty Lusk) Date: Wed, 26 Jan 2005 19:39:35 -0600 (CST) Subject: [Beowulf] help In-Reply-To: <12d.54a62532.2f294a9a@aol.com> References: <12d.54a62532.2f294a9a@aol.com> Message-ID: <20050126.193935.85407144.lusk@localhost> I would suggest starting by using MPICH2 instead of the very-much-older MPICH. Many things have been improved, including diagnostic messages (which, however, are still not perfect). Regards, Rusty Lusk From: Roverite3 at aol.com Subject: [Beowulf] help Date: Wed, 26 Jan 2005 14:33:46 EST > I am running mpi and getting the following message please help. > mpirun - v - npz test _mpi.mpich > running/root/test_mpi.mpich on 2 linux ch_p4 > processors > created/root/pi16256 > p0_16414: p4_error:child process exists while making connection to remote > process on node1.enterprise.net:0 > /user/bin/mpirun: line1: 16414 broken pipe > /root/test_mpi.mpich - p4pg.root.pi16256 - > p4wd root > > > its for my hons project > terry From shaeffer at neuralscape.com Thu Jan 27 00:48:44 2005 From: shaeffer at neuralscape.com (Karen Shaeffer) Date: Thu, 27 Jan 2005 00:48:44 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> Message-ID: <20050127084844.GA26065@synapse.neuralscape.com> On Sun, Jan 23, 2005 at 11:14:14PM -0800, Greg Lindahl wrote: > On Mon, Jan 24, 2005 at 01:57:16AM -0500, Robert G. Brown wrote: > > > Otherwise, what I was basically doing is describing the bathtub > > Didn't look like that to me, but I just read your rants, I wasn't the > guy who wrote them. > > > As was pointed out by Karen (and I agree) the mfr warranty period is > > perhaps a better number for most people to pay attention to than MTBF > > I disagree. The warranty period tells you about disk lifetime. The > MTBF tells you about the failure rate in the bottom of the > bathtub. These are nearly independent quantities; I already pointed > out that the fraction of disks which fail in the bottom of the bathtub > is small, even if you multiply it by 2X or 3X. So the major factor in > the price and length of that warranty is the lifetime. > > Lifetime and MTBF are simply different measures. Depending on what you > are thinking about, you pay attention to one or the other or neither. These numbers are defined by their collective usage in the industry. I accept your assertion about their definitions. 
But the MTBF number has no consequential significance to a disk drive manufacturer, and thus has a poor confidence associated with it -- and I am going to explain why. As stated previously, the disk drive business is an extremely high volume, low margin, technology intensive business. Product cycles last about 6 months. A typical disk drive comes out of development and ramps up from zero units to several million units within about 6 weeks. This is an operational miracle in of its self, but it is standard buisness in this industry. Now, I assert the warranty period and the integration of the failure rate during the six month production run is the only issue the disk drive manufacturer (DDM) cares about. During this period, as production is in progress, the failure rate is dominated by the infant mortality of recently sold drives from this production run. Even the first batch of production drives sold are only 6 months into their lifecycle at the end of the production run. (Let's define the infant mortality time window to be a weighted 6 week period. This is the left wall of the bathtub.) The rate of failure, from the drives that make it through the infant mortality time period, is a very small component compared to the drives that are failing during infant mortality. In other words, with several million drives a month being sold, this infant mortality of recently sold drives is the dominating term in the equation during the life of the production run. Now, failed drives are classified in numerous ways by the DDM, but the most important issue is how long it lived in production. Was it an infant mortality rate death? If it was, then the DDM is very interested in it. If it survived the infant mortality failure time window, then the DDM has very little (if any) interest in it during the production run. (Someone asserted the DDM will take failed drives and determine the failure mode. This is only true of failures during infant mortality. Drives that fail in the bottom of the bathtub are generally thrown away and simply replaced. You would need to exceed the standard deviation for the MTBF significantly, before the DDM would start analyzing the failure modes in this case. On the other hand, any perceived aberation in the expected infant mortality rate would start a fire drill.) The point is, as in ALL mass produced products, and especially in semiconductors and disk drives, early detection of statistically significant failure modes is ABSOLUTELY ESSENTIAL to the profitability of the firm. If you have a production problem and don't figure it out until a million drives are out there in the market, then you have just lost a huge amount of money. I'm talking about a whole quarter's profits or worse. If you have a role in such a disaster -- then your career in the DDM industry just ended, which is why everyone is keenly focused on the issues that matter. In summary, for a specific disk drive, the bulk of the failed drives that the DDM will replace based on the warranty will have in fact failed during their infant mortality window immediately after being put into production. If you integrate these over the product's production and then calculate the drives that fail based on MTBF numbers associated with the bottom of the bathtub, you will find this is a minor term in total number of failed drives during the warranted time. (This is clearly true in the normal case.) The DDM is entirely focused on these infant mortality failures, because they provide early warning for preventing large scale problems. 
(This includes the case where actual midlife failure rates would far exceed the projections one could expect based on MTBF. This is an essential element in this discussion. Please keep it in mind.) The scrutiny of these infant mortality failures is intense. Any aberations of the expected numbers causes the whole production and engineering teams associated with the particular disk drive product to become available resources to resolve the problem. The time to resolving problems is counted in hours. Its that intense. Do the math. Several million units produced per month translates to 27,397 drives produced per hour. If you have a problem, it can become a disaster quickly. (Note the DDM actually puts several thousand drives in a QA lab about 6 weeks prior to shipping the first drive. So they actually have the initial statistical results of the infant mortality before shipping any drives.) Now, once production ends, then the infant mortality deaths drop off as soon as the supply of drives in the channel are all sold. At this point, there is nothing the DDM can do. The drives are out there. Whether or not the total number of warranted drives that fail will have exceeded expectations is already known in almost all cases. All resources are now turned to the next product release. Nobody at the DDM gives a whoot about the MTBF numbers. They are not even discussed within the internal workings of the DDM business. The POINT IS, even if the MTBF numbers are not accurate, and the drives fail at a much higher rate, there is nothing the DDM can actually do about it. Production is over. It is this reality that relegates these numbers to be nothing more than window dressing on marketing literature. And there are numerous products where MTBF rates have been wildly understated WRT the actual midlife failure rates -- where the DDM took a big loss. But the reaction, after the fact, would all be focused on why the early detection and component QA processes failed. It would not even consider how the MTBF numbers were derived. Because they need to catch the problem early or it is not helpful. So, now that we know what interests the DDM with respect to failed drives, let's consider the MTBF numbers that are published. As Greg and others have pointed out, these numbers speak of the rate of failure at the bottom of the bathtub. The definition is not in question here. The question is the confidence you can place on those published numbers. I and others have asserted you cannot place much confidence in these numbers, because they have no financial consequence to the DDM. (Except of course if they are wildly wrong -- which brings with it the particular problem of being too late to do anything about it.) I have explained why this is so. I have also explained how the DDM assigns all it's resouces to the critical problems, as the rate of production is so high, time is the essence in protecting profits. Once production ends, all resources are reassigned to the next product to be released. It is my understanding that these MTBF numbers are derived from thermal cycling in ovens as part of the QA process. All the likely failure modes in a disk drive are quite sensitive to thermal conditions. These are the media, the heads, the spindle, bearings, lubricants, etc comprising the critical mechanical structure, the temperature dependence on band gaps and other calibrating circuitry within the electronics, nominal currents within the microeclectronics and espectially the power mosfet arrays, the servo system cailibration, etc. 
As the thermal cycling QA processes proceed, defects in these systems can be forced to manifest during the testing, and the normal state characteristics and stability of these subsystems can also be extracted from the experiments. These results are then rigorously integrated within the observed profiles and characteristics of drives failing within the infant mortality window. It is all highly integrated within statistical models for expectations. MTBF numbers are also extrapolated from the results. In effect, the MTBF numbers become the long term projections that are extrapolated from this data. But the primary focus and optimization of processess is intended to create the statistical underpinning from which to analyze infant mortality drive failures. The uncertainty in these numbers naturally increases for the MTBF extrapolations. It's all perfectly logical. > Yes. And I have yet to see anything in your complaint that is anything > but misinterpretation on your part. Reality check time, indeed. You > can't use MTBF by itself as a measure of quality, period, so complaining > that it isn't a good single item to measure disk quality is, well, > operator error. > > -- greg I think the problem with the logic you and others have embraced, is that it is not well correlated with the operational priorities of the DDM industry. As with all industries, competitors publish normalized metrics for customers to compare. (and of course, they want you to think these metrics are really important! They want you to buy their product.) I believe the MTBF number is more of a marketing number than something the DDM goes to great lengths to formulate. On the other hand, in a well executed production run, where everything goes as planned, the MTBF numbers are likely to be accurate. After all, MTBF and infant mortality rates clearly share dependencies in the normal case -- and in fact, the MTBF numbers are derived from the processes optimized for anticipating the expected infant mortality rate during the production run. If DDMs were interested in helping customers discriminate based on the actual expected lifetime of drives, they would all publish running infant mortality rates, updated weekly, during the production run of their disk drives. Afterall, this is the one metric the entire organization is focused on during production. But, what they hand out is this MTBF number to prospective customers. A number they pay no attention to internally. HTH, Karen -- Karen Shaeffer Neuralscape, Palo Alto, Ca. 94306 shaeffer at neuralscape.com http://www.neuralscape.com From rene at renestorm.de Wed Jan 26 15:42:28 2005 From: rene at renestorm.de (rene) Date: Thu, 27 Jan 2005 00:42:28 +0100 Subject: [Beowulf] help In-Reply-To: <12d.54a62532.2f294a9a@aol.com> References: <12d.54a62532.2f294a9a@aol.com> Message-ID: <200501270042.28097.rene@renestorm.de> Hi, look if your child node could open the binary. I dont think you export your root dir via nfs. BTW: help is not a nice subject ;o) Cya > I am running mpi and getting the following message please help. 
> mpirun - v - npz test _mpi.mpich > running/root/test_mpi.mpich on 2 linux ch_p4 > processors > created/root/pi16256 > p0_16414: p4_error:child process exists while making connection to remote > process on node1.enterprise.net:0 > /user/bin/mpirun: line1: 16414 broken pipe > /root/test_mpi.mpich - p4pg.root.pi16256 - > p4wd root > > > its for my hons project > terry -- Rene Storm @Cluster From josip at lanl.gov Thu Jan 27 09:26:36 2005 From: josip at lanl.gov (Josip Loncaric) Date: Thu, 27 Jan 2005 10:26:36 -0700 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <20050127084844.GA26065@synapse.neuralscape.com> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> <20050127084844.GA26065@synapse.neuralscape.com> Message-ID: <41F9244C.602@lanl.gov> Karen Shaeffer wrote: > [...] > > If DDMs were interested in helping customers discriminate based on the > actual expected lifetime of drives, they would all publish running infant > mortality rates, updated weekly, during the production run of their disk > drives. Afterall, this is the one metric the entire organization is focused > on during production. But, what they hand out is this MTBF number to > prospective customers. A number they pay no attention to internally. Karen's excellent introduction to the logic of disk drive manufacturing (DDM) is well worth reading -- particularly since the same factors drive other computer manufacturers: rapid product cycles, insane time pressures, thin profit margins, limited opportunity to prevent financially ruinous mistakes, etc. I'd just like to offer my personal guesses of what manufacturers of commodity disk drives want to achieve: product lifespan of about 5 years, infant mortality under 1%, and competitive MTBF numbers. While MTBF claims are indeed soft, they are the only published data that relates to the mid-life failure rate, i.e. the period of peak interest to cluster users. Infant mortality <1% is probably acceptable to cluster builders, but as Karen pointed out, things can go wrong. Although DDMs will try to fix problems before too many bad drives are sold, the basic fact is that a bad batch of drives is something that neither DDMs nor their customers could have predicted (otherwise, DDMs would not put themselves at financial risk -- and they have more information than we do). Therefore, deciding which drive model (or other component) to use fits under the topic of optimal decision making under uncertainty -- which is a standard part of game theory, often used in operations research, etc. Making rational choices, which can withstand scrutiny even when things unexpectedly go wrong, is not just an art. There is theory to build on. Sincerely, Josip From jkrauska at cisco.com Thu Jan 27 09:47:58 2005 From: jkrauska at cisco.com (Joel Krauska) Date: Thu, 27 Jan 2005 09:47:58 -0800 Subject: [Beowulf] OSDL Clusters SIG Message-ID: <41F9294E.9090902@cisco.com> (thought there'd be some interest on this list, sorry for the spam) OSDL is starting a public Clusters Special Interest Group. It's meant to be a place for developers to discuss changes to linux to better support clustering features. http://developer.osdl.org/dev/clusters/ --joel From laytonjb at charter.net Thu Jan 27 10:00:12 2005 From: laytonjb at charter.net (Jeffrey B. 
Layton) Date: Thu, 27 Jan 2005 13:00:12 -0500 Subject: [Beowulf] OSDL Clusters SIG In-Reply-To: <41F9294E.9090902@cisco.com> References: <41F9294E.9090902@cisco.com> Message-ID: <41F92C2C.9070405@charter.net> I quick glance at the website leads me to believe it's more of an HA cluster group than an HPC group. Some of the documents are about Carrier Grade Linux. Jeff > > (thought there'd be some interest on this list, sorry for the spam) > > OSDL is starting a public Clusters Special Interest Group. > > It's meant to be a place for developers to discuss changes to linux to > better support clustering features. > > http://developer.osdl.org/dev/clusters/ > > --joel > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From jkrauska at cisco.com Thu Jan 27 10:05:04 2005 From: jkrauska at cisco.com (Joel Krauska) Date: Thu, 27 Jan 2005 10:05:04 -0800 Subject: [Beowulf] OSDL Clusters SIG In-Reply-To: <41F92C2C.9070405@charter.net> References: <41F9294E.9090902@cisco.com> <41F92C2C.9070405@charter.net> Message-ID: <41F92D50.7010700@cisco.com> Jeffrey B. Layton wrote: > I quick glance at the website leads me to believe it's more > of an HA cluster group than an HPC group. Some of the > documents are about Carrier Grade Linux. That's mostly true, but I would say that there's some worthwhile overlap. One of the projects supposedly involved is OpenSSI.org, who's work has some similarities to Beowulf capabilities. --joel From James.P.Lux at jpl.nasa.gov Thu Jan 27 10:41:31 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Thu, 27 Jan 2005 10:41:31 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <20050127084844.GA26065@synapse.neuralscape.com> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> <20050127084844.GA26065@synapse.neuralscape.com> Message-ID: <6.1.1.1.2.20050127103031.04242340@mail.jpl.nasa.gov> At 12:48 AM 1/27/2005, Karen Shaeffer wrote: >On Sun, Jan 23, 2005 at 11:14:14PM -0800, Greg Lindahl wrote: > > >These numbers are defined by their collective usage in the industry. I >accept your assertion about their definitions. But the MTBF number has >no consequential significance to a disk drive manufacturer, and thus >has a poor confidence associated with it -- and I am going to explain >why. > >As stated previously, the disk drive business is an extremely high >volume, low margin, technology intensive business. Product cycles last >about 6 months. A typical disk drive comes out of development and >ramps up from zero units to several million units within about 6 weeks. >This is an operational miracle in of its self, but it is standard buisness >in this industry. >I and others have asserted you cannot place much confidence in these >numbers, because they have no financial consequence to the DDM. (Except >of course if they are wildly wrong -- which brings with it the particular >problem of being too late to do anything about it.) I have explained why >this is so. I have also explained how the DDM assigns all it's resouces to >the critical problems, as the rate of production is so high, time is the >essence in protecting profits. Once production ends, all resources are >reassigned to the next product to be released. > >It is my understanding that these MTBF numbers are derived from thermal >cycling in ovens as part of the QA process. 
All the likely failure modes >in a disk drive are quite sensitive to thermal conditions. These are the >media, the heads, the spindle, bearings, lubricants, etc comprising the >critical mechanical structure, the temperature dependence on band gaps and >other calibrating circuitry within the electronics, nominal currents within >the microeclectronics and espectially the power mosfet arrays, the servo >system cailibration, etc. As the thermal cycling QA processes proceed, >defects in these systems can be forced to manifest during the testing, and >the normal state characteristics and stability of these subsystems can also >be extracted from the experiments. These results are then rigorously >integrated within the observed profiles and characteristics of drives >failing within the infant mortality window. It is all highly integrated >within statistical models for expectations. MTBF numbers are also >extrapolated from the results. In effect, the MTBF numbers become the long >term projections that are extrapolated from this data. But the primary >focus and optimization of processess is intended to create the statistical >underpinning from which to analyze infant mortality drive failures. The >uncertainty in these numbers naturally increases for the MTBF >extrapolations. > >It's all perfectly logical. > I can see where this process would be typical for quick turnaround consumer oriented drives. However, maybe there are product lines which seem to be much longer lived.. call them "professional" grade. Maybe they aren't really the same drive, just the same "model name", but then, it seems that there are customers (i.e. Defense Department, etc.) who expect to be able to buy "exactly the same drive" for an extended period of time (several years, at least), and that the manufacturers would accomodate them. If I'm making, for instance, high end video editing systems that cost a million dollars, I'm probably not interested in saving a few bucks on the drives, but I AM interested in drives that last a long time, and that can be replaced easily with the same drive. (I don't build these systems, maybe that's not their market model...) The fast turnaround in modern electronics is a huge curse to us developing systems with long lead times. By the time the component is tested and qualified (heck, even breadboarded to see if it's the "right" component), it's obsolete and unavailable. not just disk drives, but things like RAM, microprocessors, data converters, RF ICs, etc. As far as warranties go... Here's an interesting quote from Seagate's website: (note the identical product gets 1yr in Americas and 2yrs in EMEA countries) " What products are excluded from the new 5-year warranty? The only products that are excluded are our retail external hard drives (external retail products, pocket drives, portable & compact flash drives). They are treated much more like a storage appliance and are used in very different operating environments. We have a competitive one-year warranty on external drives in the Americas, and a two-year warranty in the EMEA countries. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rgb at phy.duke.edu Thu Jan 27 11:07:51 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Thu, 27 Jan 2005 14:07:51 -0500 (EST) Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <41F9244C.602@lanl.gov> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> <20050127084844.GA26065@synapse.neuralscape.com> <41F9244C.602@lanl.gov> Message-ID: On Thu, 27 Jan 2005, Josip Loncaric wrote: > Karen Shaeffer wrote: > > [...] > > > > If DDMs were interested in helping customers discriminate based on the > > actual expected lifetime of drives, they would all publish running infant > > mortality rates, updated weekly, during the production run of their disk > > drives. Afterall, this is the one metric the entire organization is focused > > on during production. But, what they hand out is this MTBF number to > > prospective customers. A number they pay no attention to internally. > > Karen's excellent introduction to the logic of disk drive manufacturing > (DDM) is well worth reading -- particularly since the same factors drive > other computer manufacturers: rapid product cycles, insane time > pressures, thin profit margins, limited opportunity to prevent > financially ruinous mistakes, etc. Agreed. Been there, been burned (which is why I keep urging caution about using published numbers from the mfr as a sound basis for engineering without a grain of salt, especially to people with relatively little experience in this arena -- cluster newbies). > Therefore, deciding which drive model (or other component) to use fits > under the topic of optimal decision making under uncertainty -- which is > a standard part of game theory, often used in operations research, etc. > > Making rational choices, which can withstand scrutiny even when things > unexpectedly go wrong, is not just an art. There is theory to build on. This is also an excellent contribution to the discussion. In particular, I'd urge people building large clusters to consider the benefits of insuring some of the risks, which is what humans generally do when confronted with the same problem in the arena of human affairs. In a large cluster, the economic consequences of a massive component failure (however common or rare that might be) can be devastating to the project, to careers, to productivity. This is a classic component of game theory applied to real life and is the fundamental raison d'etre for the insurance industry (and why I keep referring to actuarial data). One piece of "insurance" is obviously the base warranty of each component, but this generally protects you only partially from the actual cost of the replacement hardware itself if a component fails. You still take a major hit in productivity and diversion of opportunity cost labor associated with downtime and repair. Whether or not this additional cost is affordable depends to a certain extent on luck, to a certain extent on the "value" of your project. 
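A back-of-the-envelope sketch of the self-insure-versus-extended-warranty trade-off rgb is describing. Every number here is a hypothetical placeholder, not a figure from the thread; the model assumes a constant bathtub-bottom failure rate, so the expected failure count is just node-hours divided by the claimed MTBF:

HOURS_PER_YEAR = 8766

def expected_failures(n_nodes, mtbf_hours, years):
    # Expected failures under a constant failure rate of 1/MTBF.
    return n_nodes * HOURS_PER_YEAR * years / mtbf_hours

n_nodes      = 128        # hypothetical cluster size
node_cost    = 2000.0     # dollars per node (hypothetical)
mtbf_hours   = 400_000    # vendor MTBF claim (hypothetical)
years        = 3
repair_cost  = 300.0      # parts + labour + lost time per failure (hypothetical)
warranty_pct = 0.10       # ~10% of purchase price over 3 years, the ballpark given below

fails       = expected_failures(n_nodes, mtbf_hours, years)
self_insure = fails * repair_cost
warranty    = warranty_pct * n_nodes * node_cost

print(f"expected failures over {years} years: {fails:.1f}")
print(f"expected cost if self-insuring:  ${self_insure:,.0f}")
print(f"cost of the extended warranty:   ${warranty:,.0f}")

With these placeholder numbers self-insuring wins on expectation, which is the usual result; the warranty's value is that it caps the loss if the MTBF claim turns out to be wildly optimistic or a bad batch fails en masse -- precisely the tail risk discussed in this message.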
Speaking from bitter personal experience with the Tyan 2460 and 2566 motherboards (as well as anecdotal experiences with various other system components such as drives, riser cards, cases, case fans, CPUs and CPU fans (OEM AMD Athlon MP fans in particular) things DO break in mass "catastrophic" bursts a lot more often than MTBF numbers or even warranties would lead you to expect, and this cost can be quite high and can drain resources and energy for years (until the hardware is finally aged out and replaced) or require an immediate infusion of much money for immediate replacements, or in our case (where the replacements were themselves a problem, albeit a lesser one), both. Practically speaking, hardware "insurance" often means considering extended and/or onsite warranties -- effectively betting someone that your systems will break for some percentage of their original cost (generally ballpark 10% for 3 years). Extended service has two valuable purposes -- one is that it obviously directly protects you from bearing the brunt of the cost of anything from the normal patter bathtub-bottom failures during the normal lifespan up to mass failures or higher than expected normal-lifespan failures during the period that the cluster is expected to be productive. Other forms of insurance against catstrophic failure (such as fire or theft insurance and surge protection and door locks) exist as well, although they tend to be purchased outside of the engineering/operations loop. Insurance via extended warranty addresses the paradox of mass failure (one that might "kill" you or at any rate your project). Even though it is often (or even generally) cheaper in terms of expectation value of the total cost to build a DIY cluster and self insure, excessive (unlucky) failures are far more likely to be "fatal". One major complaint against the HMO industry is that capitation (giving a physician X dollars per head for a group of patients up front while obligating the physician to treat all of that group who get sick) is that it exposes those physicians to the risk of catastrophy in the event that a plague comes along and strikes the group. It is anti-insurance (the passing of risks back to small groups for which the fluctuations can be fatal rather than assuming the risks spread out over a large group where it is more predictiable). It costs you a bit more to insure with a larger group (even with its more predictable risks), but the benefit you gain is that you'll stay "alive" no matter what if you can afford the insurance itself in the first place. Practically speaking, since additional cost means fewer nodes, you can choose to definitely get 10% less work done over the lifetime of your project but ensure that you have a very small chance of getting only 50% or 30% of the work done (or face massive out of pocket costs) due to catastrophic failure downtime. If things go well of course you lose -- maybe over the same interval you only lose 5% of your nodes -- and MOST of the time, one expects things to go well, which encourages people to assume the risk and gamble that things will go well. The second is that hardware backed by an onsite service contract and a company that assumes much of the risk is more likely not to fail in the first place. 
That company has a strong incentive to protect >>their<< risk in the venture by passing as much as possible back to the mfrs (even at an additional cost) and to perform additional testing and system engineering without a disincentive to uncover "bad" components after (as Karen points out) it is more or less too late to do anything other than sell them off as best you can and take your lumps. The company also typically has some actual clout with the manufacturers and can dicker out deals that further minimize their (and your by proxy) risk, both in terms of getting a premium selection of hardware and of getting better warranty terms per dollar spent. Deciding your optimum comfort level of risk taking is not easy -- partly it is subjective, partly it can be made objective if you can assign a dollar "value" to your time and the up time of your cluster. Even humans (with their relatively low failure rate during their "prime years") tend to buy insurance during this period because even if failure rates are low, the consequences to your family and loved ones of a failure are very high. rgb > > Sincerely, > Josip > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From bill at cse.ucdavis.edu Thu Jan 27 14:00:49 2005 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Thu, 27 Jan 2005 14:00:49 -0800 Subject: [Beowulf] Re: Cooling vs HW replacement In-Reply-To: <41F9244C.602@lanl.gov> References: <20050123223548.GA19760@greglaptop.greghome.keyresearch.com> <20050124071413.GA1493@greglaptop.greghome.keyresearch.com> <20050127084844.GA26065@synapse.neuralscape.com> <41F9244C.602@lanl.gov> Message-ID: <20050127220049.GA30816@cse.ucdavis.edu> On Thu, Jan 27, 2005 at 10:26:36AM -0700, Josip Loncaric wrote: > I'd just like to offer my personal guesses of what manufacturers of > commodity disk drives want to achieve: product lifespan of about 5 > years, infant mortality under 1%. I was looking at drive spec sheets and found the Maxtor Diamondmax 10 300GB drive. From the spec sheet: Start/Stop cycles (min) >50,000 Component Design Live (min) 5 years Annualized Return Rate (ARR) < 1% I don't have any personal experience with these drives, but it does look like particularly useful numbers for a spec sheet, especially when compared to MTBF. To make MTBF numbers even worse many posted numbers don't include the duty cycle. Sony bragged about higher MTBF with AIT vs DLT, only in the back of a 50 page document did they mention a 10% duty cycle vs the 100% duty cycle for the DLT numbers. -- Bill Broadley Computational Science and Engineering UC Davis From asabigue at fing.edu.uy Thu Jan 27 12:09:34 2005 From: asabigue at fing.edu.uy (Ariel Sabiguero) Date: Thu, 27 Jan 2005 21:09:34 +0100 Subject: [Beowulf] OSDL Clusters SIG In-Reply-To: <41F92D50.7010700@cisco.com> References: <41F9294E.9090902@cisco.com> <41F92C2C.9070405@charter.net> <41F92D50.7010700@cisco.com> Message-ID: <41F94A7E.7020901@fing.edu.uy> OSDL were some of the bigger sponsors of kernel 2.6. They pushed several "required" changes for carrier grade environments, like J. Layton mentions. 
They have also been working adding NUMA functionality, processor hotplugging and really interesting figures that allowed SGI (and others) to move to Linux, and also to move Linux to other areas. Their work made linux mature enough to be considered for high demanding bussines.... in the end, making COTS stronger and giving us better performance and HW. As it was mentioned before, in the list: "Beowulf is about performance, not reliability", but redesigns in 2.6 kernel increased latency (and reduced jitter) in ethernet TCP/IP communications. That is one of the first things that comes to my mind... Also, to avoid netfilter complexity might reduce latency, as QoS & Firewalling is only overhead for Beowulf private networks. It might be worth for performance to "unload" netfilter complexity... At least these are some things that I can think now to "suggest" to OSDL. I know that it might seem quite crazy to plug/unplug netfilter. This might avoid the movement towards raw-ethernet (even though we cannot avoid CRC calculation [unless we can disable it!{other suggestion?}]) so as to minimize latency. Sorry for the disorder in the ideas, it resembles my mind. Ariel Joel Krauska escribi?: > Jeffrey B. Layton wrote: > >> I quick glance at the website leads me to believe it's more >> of an HA cluster group than an HPC group. Some of the >> documents are about Carrier Grade Linux. > > > That's mostly true, but I would say that there's some worthwhile > overlap. One of the projects supposedly involved is OpenSSI.org, > who's work has some similarities to Beowulf capabilities. > > --joel > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From eugen at leitl.org Sat Jan 29 13:33:08 2005 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 29 Jan 2005 22:33:08 +0100 Subject: [Beowulf] Simulating the Universe with a zBox Message-ID: <20050129213308.GA1404@leitl.org> Link: http://slashdot.org/article.pl?sid=05/01/29/1754237 Posted by: michael, on 2005-01-29 19:32:00 from the simgalaxy dept. An anonymous reader writes "Scientists at the University of Zurich predict that our galaxy is filled with a quadrillion clouds of dark matter with the mass of the Earth and size of the solar system. The results in this weeks journal [1]Nature, also covered in [2]Astronomy magazine, were made using a six month calculation on hundreds of processors of a self-built supercomputer, [3]the zBox. This novel machine is a high density cube of processors cooled by a central airflow system. I like the initial [4]back of an envelope design. Apparently, one of these ghostly dark matter haloes passes through the solar system every few thousand years leaving a trail of high energy gamma ray photons." References 1. http://www.nature.com/news/2005/050124/full/050124-9.html 2. http://www.astronomy.com/default.aspx?c=a&id=2840 3. http://krone.physik.unizh.ch/~stadel/zBox/ 4. http://krone.physik.unizh.ch/~stadel/zBox/story.html ----- End forwarded message ----- -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: From john.hearns at streamline-computing.com Sun Jan 30 00:25:59 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sun, 30 Jan 2005 08:25:59 +0000 Subject: [Beowulf] Simulating the Universe with a zBox In-Reply-To: <20050129213308.GA1404@leitl.org> References: <20050129213308.GA1404@leitl.org> Message-ID: <1107073560.6075.1.camel@Vigor11> On Sat, 2005-01-29 at 22:33 +0100, Eugen Leitl wrote: > Link: http://slashdot.org/article.pl?sid=05/01/29/1754237 > Posted by: michael, on 2005-01-29 19:32:00 > > from the simgalaxy dept. Neat! I like the 'three dimensional' temperature chart - that's a good idea! I suppose for scientists used to plotting millions of stars on 3D plots this wasn't difficult. From steve_heaton at ozemail.com.au Sat Jan 29 21:26:48 2005 From: steve_heaton at ozemail.com.au (Fringe Dweller) Date: Sun, 30 Jan 2005 16:26:48 +1100 Subject: [Beowulf] New toys In-Reply-To: <200501272000.j0RK056F012643@bluewest.scyld.com> References: <200501272000.j0RK056F012643@bluewest.scyld.com> Message-ID: <41FC7018.1050203@ozemail.com.au> I caught these articles on the good things that continue to brew with AMD Operton. Weird in a way... since immersion in the Beowulf environ I not longer get excited about RAID, SCSI or USB whatevers... just give me the grunt and the net IO! ;) Nvida extensions: http://www.linuxhardware.org/article.pl?sid=05/01/26/2240240&mode=thread Tyan mobos: http://www.linuxhardware.org/article.pl?sid=05/01/26/2240235&mode=thread Cheers Stevo From maurice at harddata.com Sun Jan 30 09:28:42 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Sun, 30 Jan 2005 10:28:42 -0700 Subject: [Beowulf] Re: real hard drive failures In-Reply-To: <200501261008.j0QA80Go025885@bluewest.scyld.com> References: <200501261008.j0QA80Go025885@bluewest.scyld.com> Message-ID: <41FD194A.60303@harddata.com> Some observations: >Date: Tue, 25 Jan 2005 13:42:05 -0800 (PST) >From: Alvin Oga > > >i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm >disks > > Adding fans makes some assumptions: 1) There is inadequate chassis cooling in the first place. If that is the case, one should consider a better chassis. If the drives are not being cooled, then what else is also not properly cooled? 2) To add a fan effectively, one must have sufficient input of outside air, and sufficient exhaust capacity in the chassis to move out the heated air. In my experience the biggest deficiency in most chassis is in the latter example. Simply adding fans on the front input side, without sufficient exhaust capacity adds little real air flow. Think of most chassis as a funnel. You can only push in as much air as there is capacity for it to escape at the back. More fans do not add much more flow, unless the fans are capable of increasing the pressure inside the case sufficiently to force more air out of the back. You average small axial fan generates extremely small pressure. In effect the air flow will be stalled, in most cases. 3) Adding fans requires some place to mount them so that the airflow passes over the hard disks. Most chassis used in clusters do not provide that space and location. 4) Adding fans often creates some additional maintenance issues and failure points. Typical small fans have generally high spin rates, and correspondingly high failure rates. 
If the survival of a hard disk depends on the fan, and the fan has a short life what are you gaining in terms of lifespan? A fan with a 1 year lifespan to cool a hard disk with a 5 year lifespan is a waste of time, or, at best, a huge maintenance burden. >>We have used mostly Western Digital (WD) drives for > 4 years. We use the >>higher rpm and larger cache varieties ... >> >> > >8MB cache versions tend to be better > > > True, which is why WD sells those as their "Special Edition" (JB) variant with 3 year warranty, and the 2MB (BB) variants with 1 year. >>We also used IBM 60GB drives for a while and some of you will have experienced >>that mess ... approx. 80% failure over 1 year time frame! >> >> > >80% failure is way way ( 15x) too high, but if its deskstar ( from >thailand) than, those disks are known to be bad > > > The "bad drives" mainly came from their now defunct Hungarian plant. The Thailand plant products had few problems. >if it's not the deskstar, than you probably have a vendor problem >of the folks that sold those disks to you > > > Maxtor drives have had very high failure rates in recent (3) years. That probably prompted them to lead the rush to 1 year warranties 2.5 years ago. WD did very well in the market by keeping the 3 year Special Edition drives available, and recently Seagate, then Maxtor came back to add longer warranties, now generally 5 years. What is telling is that their product does not seem to have been improved in design reliability. This is ALL about marketing. What is also worth considering is the question of will the company will be around in 5 years to honor that warranty. With Seagate and Maxtor on a diet of steady losses for at least 3 years it is worth considering. WD, OTOH, have been making profit while selling 3-5 year warranty drives. >>WD 80GB drives in the field for 1+ years, [~500 drives] "ARRRRGGGG!" ~15% >>failure and increasing. I send out 3-5 replacement drives every month. >> >> > >probably running too hot ... needs fans cooling the disks > - get those "disk coolers with 2 fans on it ) > > Agreed ( but see comment above), also he probably has the "cheaper" BB model rather than the better "JB" on those 80's > > > > >>I'm moving to a 3 drive raid5 setup on each node (drives are cheap, down time >>is not) and considering changing to Seagate SATA drives anyone care to offer >>opinions or more anecdotes? :-) >> >> Average. WD are slightly more reliable in our experience ( we sell several thousand drives a year). As long as you stick to JB, JD, or SD models. Hitachi and Seagate tie for 2nd, Maxtor are last. BTW, Hitachi took over the IBM drive business, but most of the product line is new, so these are not the same as the older infamous "deathstar" drives. >== using 4 drive raid is better ... but is NOT the solution == > > - configuring raid is NOT cheap ... > > Why? Most modern boards support 4 IDE devices and 4 S-ATA devices. Using mdadm to configure and maintain a RAID is trivial. Onboard "RAID" on integrated controllers is not standardized, and is usually limited to RAID 0 and 1, whereas software RAID allows RAID 5, 6, and mixed RAID types on the same disks. Configuring RAID10 on a system entails twice as many drives, but provides much greater reliability of data, while costing virtually no overhead or performance loss. > - fixing raid is expensive time ... 
(due to mirroring and syncing) > > - if downtime is important, and should be avoidable, than raid > is the worst thing, since it's 4x slower to bring back up than > a single disk failure > > I disagree. You have no downtime on a RAID if you incorporate a redundant RAID scheme. If the interface supports swapping out disks you need never shut down to deal with a failed disk. If you have to change drives immediately when they fail, maybe you do need a better controller. OTOH, shutdown time to change a disk on a decent chassis is under 1 minute. Depends on your needs. > - raid will NOT prevent your downtime, as that raid box > will have to be shutdown sooner or later > > Simply not true. As long as the controller supports removing and adding devices, and as long as your chassis has disk trays to support hot-swap, there is ZERO downtime. If you have redundant RAID you can delay the shutdown until a time that is convenient to you. You have to shut down for some form of scheduled maintenance at least once in a while. The price penalty is fairly light. For example, our 1U cluster node chassis have 4 hotswap S-ATA or SCSI trays, redundant disk cooling fans, and you can add a 4 port 3Ware controller and you pay a price premium of only $280. Not including extra disks, of course. What downtime is worth to you is the main question YOU have to answer. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice at harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 -------------- next part -------------- An HTML attachment was scrubbed... URL: From gerson.sapac at gawab.com Sun Jan 30 14:46:45 2005 From: gerson.sapac at gawab.com (Gerson Galang) Date: Mon, 31 Jan 2005 09:16:45 +1030 Subject: [Beowulf] PBS and Globus gatekeeper node In-Reply-To: <41F985B8.70509@gawab.com> References: <41F9294E.9090902@cisco.com> <41F92C2C.9070405@charter.net> <41F92D50.7010700@cisco.com> <41F985B8.70509@gawab.com> Message-ID: <41FD63D5.7000107@gawab.com> We have a few clusters on our site and they all run PBS/Torque. What we want is to set up a gatekeeper node so that we will only have to install Globus on the gatekeeper node and not on all the head nodes of each cluster. I tried implementing this before and have not been successful because PBS's or Torque's routing queue functionality does not work as documented. I don't have any problems with (Torque's) routing of jobs from one queue to another, BUT this is only if the other queue is also local to the machine I submitted the job to.

create queue router
set queue router queue_type = Route
set queue router route_destinations = batch at localhost
set queue router enabled = True
set queue router started = True

create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodect = 1
set queue batch resources_default.nodes = 1
set queue batch enabled = True
set queue batch started = True

Assuming I have "router" as a routing queue and "batch" as an execution queue, if I submit a job to router, batch will still end up executing the job. But if I set "router at localhost" to route jobs to a queue on another machine, "batch at anothermachine.mydomain.com", PBS won't run the jobs anymore. PBS will tell me "Jobs rejected by all possible destinations" without me even seeing it try to contact anothermachine.mydomain.com.
# a queue setup on my local machine create queue router set queue router queue_type = Route set queue router route_destinations = batch at anothermachine.mydomain.com set queue router enabled = True set queue router started = True # a queue setup on anothermachine.mydomain.com create queue batch set queue batch queue_type = Execution set queue batch resources_default.nodect = 1 set queue batch resources_default.nodes = 1 set queue batch enabled = True set queue batch started = True Has anybody in this list already got this functionality *(of routing jobs from a queue in your local machine to a queue in a remote machine)* working before? Thanks, Gerson From alvin at Mail.Linux-Consulting.com Sun Jan 30 19:18:03 2005 From: alvin at Mail.Linux-Consulting.com (Alvin Oga) Date: Sun, 30 Jan 2005 19:18:03 -0800 (PST) Subject: [Beowulf] Re: real hard drive failures In-Reply-To: <41FD194A.60303@harddata.com> Message-ID: On Sun, 30 Jan 2005, Maurice Hilarius wrote: > Some observations: yup .. and raid is fun and easy .. and i agree with all as long as the assumptions is the same, which it is .. > >Date: Tue, 25 Jan 2005 13:42:05 -0800 (PST) > >From: Alvin Oga > > > >i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm > >disks > > > > > Adding fans makes some assumptions: > 1) There is inadequate chassis cooling in the first place. If that is > the case, one should consider a better chassis. there are very few "better" chassis than the ones we use and the other view point is "fans" is the insurance policy that the disks will last longer than if it didnt have the fans > If the drives are not being cooled, then what else is also not properly > cooled? obviously the cpu and air etc.etc..etc... - most 1U, 2U, midtower, full tower chassis fail our tests > 2) To add a fan effectively, one must have sufficient input of outside > air, and sufficient exhaust capacity in the chassis to move out the > heated air. In my experience the biggest deficiency in most chassis is > in the latter example. exactly ... and preferrably cooler air ... some of our custoemrs have a closed system and the ambient temp is 150F .. why they designed silly video systems like that is little whacky ... > Simply adding fans on the front input side, without sufficient exhaust > capacity adds little real air flow. there must be more exhasust hoels than intake holes - the chassis should be COLD(cool) to the touch > Think of most chassis as a funnel. You can only push in as much air as > there is capacity for it to escape at the back. funnel with blockages and bends and 90 degree change in directions > More fans do not add much more flow, depends on your chassis design and the fan and intake and exhaust and position of the fans... blah .. blah > 3) Adding fans requires some place to mount them so that the airflow > passes over the hard disks. > Most chassis used in clusters do not provide that space and location. and looking at how people mount hard disks ... they probably dont care about the life fo the data on the disks ... so we avoid those vendors/cases/chassis ... > 4) Adding fans often creates some additional maintenance issues and > failure points. fans failing is cheap compared to disks dying and even better, buy better fans .... we dont see as many fan failures compared to the average bear ( the machines at colo's all have 50% or 80% fan failures ... usually the stuff we dont use .. 
good to see reinforcment that we wont use those cheap fans ) > Maxtor drives have had very high failure rates in recent (3) years. That > probably prompted them to lead the rush to 1 year warranties 2.5 years > ago. yup > WD did very well in the market by keeping the 3 year Special > Edition drives available, and recently Seagate, then Maxtor came back to > add longer warranties, now generally 5 years. competition is good ... if people looked at warranty period beore buying > What is telling is that their product does not seem to have been > improved in design reliability. This is ALL about marketing. marketing or gambling that the disk will outlive the "warranty" period and/or that the costs of the warranty replacement disks that dies that have to be replaced will be cheaper than the loss of market share > What is also worth considering is the question of will the company will > be around in 5 years to honor that warranty. With Seagate and Maxtor on > a diet of steady losses for at least 3 years it is worth considering. > WD, OTOH, have been making profit while selling 3-5 year warranty drives. "spinning didks" is a dead market ... - remember, ibm sold off that entire division to hitachi so their days are number ... which also obvious from watching how cheap 1GB and 2GB compact flash is and it's just a matter of time before it's 100GB CFs but is it fast enuff .. 1GB/sec sustained data transfer > Average. WD are slightly more reliable in our experience ( we sell > several thousand drives a year). > As long as you stick to JB, JD, or SD models. the 8MB versions ... the WD w/ 2MB versions made us buy seagate/maxtor/quantum insted and good thing we gave the 8MB buffers a try :-) > Hitachi and Seagate tie for 2nd, Maxtor are last. > BTW, Hitachi took over the IBM drive business, but most of the product > line is new, so these are not the same as the older infamous "deathstar" > drives. :-) > >== using 4 drive raid is better ... but is NOT the solution == > > > > - configuring raid is NOT cheap ... > > > > > Why? takes people time to properly config raid ... - most raids that are self built are misconfigured ( these are hands off tests, other than pull the disks ) - i expect raid to be able to boot with any disk pulled out - i expect raid to resync automagically > Most modern boards support 4 IDE devices and 4 S-ATA devices. i haven't found a mb raid system that works right .. so we ise sw raid instead for those that dont want to use ide-raid cards ( more $$$ ) > Using mdadm to configure and maintain a RAID is trivial. yes is it trivial .. to build and configure and setup .. - just takes time to test and if something went bonkers you have to rebuild and re-test b4 shipping - 1 day testing is worthless ... 7day or a month of burnin testing is a good thing to make sure they don't lose 2TB of data ( and always have 2 or 3 independently backuped data storage ) > Onboard "RAID" on integrated controllers is not standardized, and is > usually limited to RAID 0 and 1, whereas software RAID allows RAID 5, 6, > and mixed RAID types on the same disks. yup > I disagree. You have no downtime on a RAID if you incorporate a > redundant RAID scheme. If the interface supports swapping out disks you > need never shut down to deal with a failed disk. and that the drives is hotswappable if its not hotswap, you will have to shutdown to replace the dead disk and if downtime is important, they will have 2 or 3 systems up 24x7x365 with redundant data sync'd and saved in NY,LA,Miami .. 
or wherever > If you have to change drives immediately when they fail, maybe you do > need a better controller. or better disks ... :-) and disks should NOT fail before the power supply or fans .. ------------------------------------------------------------ > OTOH, shutdown time to change a disk on a decent chassis is under 1 minute. > Depends on your needs. and to resync that disk takes time ... and if during resync, a 2nd disks decides to go on vacation, you would be in a heap a trouble > > - raid will NOT prevent your downtime, as that raid box > > will have to be shutdown sooner or later > > > > > Simply not true. As long as the controller supports removing and adding > devices, and as long as your chassis has disk trays to support hot-swap, > there is ZERO downtime. as long as those "additional" constraints are met, and that other assumptions are also intact, yeah .. zero downtime is possible but i get the calls when those raids die that they bought elsewhere and there's nothing i can do, sicne they don't have backups either > If you have redundant RAID you can delay the shutdown until the time > that is convenient to you. You have to shut down for some form of > scheduled maintenance at least once in a while. exactly .. about redundancy ... in the server and multiple servers > Price penalty is fairly light. yes ... very inexpensive > For example, our 1U cluster node chassis have 4 hotswap S-ATA or SCSI > trays, redundant disk cooling fans, and you can add a 4 port 3Ware > controller and you pay a price premium of only $280. Not including extra > disks, of course. > What is downtime worth to you is the main question YOU have to answer.. that is the problem.. some think that all that "extra" prevention and precautionary measures is not worth it to them .. until afterward == summary ... = = good fans makes all the difference in the world in a good chasis = good disks and good vendors and good suppliers helps even more = = in my book, ide disks will not die (within reason) ... and if it = does, usually, there's a bigger problem = have fun alvin From hahn at physics.mcmaster.ca Mon Jan 31 11:14:43 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Mon, 31 Jan 2005 14:14:43 -0500 (EST) Subject: [Beowulf] Re: real hard drive failures In-Reply-To: Message-ID: > > and the other view point is "fans" is the insurance policy that > the disks will last longer than if it didnt have the fans > sounds like putting mudflaps and a cattle bar on a city-SUV. I just don't see disks dissipating enough heat to make them worthy of their own fan(s), considering that this could reduce the system MTBF by a good chunk (say, 1/3) and that it's easy to put disks into the chassis's airflow. > fans failing is cheap compared to disks dying but fans failing will *cause* disks to die. that's really the crux of the argument against more fans. > "spinning didks" is a dead market ... > - remember, ibm sold off that entire division to hitachi > so their days are number ... which also obvious from watching > how cheap 1GB and 2GB compact flash is and it's just a matter > of time before it's 100GB CFs but is it fast enuff .. > 1GB/sec sustained data transfer yeah, and the world needs 5 computers. flash and disks are both on a 2d-shrink curve, with no end in sight for either one. it's hard to see why flash would suddenly chage its curve (even if there are 16 300mm fabs under construction...) from a quick look at pricewatch, 2G flash costs $108 (there's a listing for 2.2G at $89, but it's a microdisk!) 
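(Mark's flash-versus-disk price arithmetic continues just below.) To put something concrete behind the software-RAID exchange above -- Maurice's point that mdadm is trivial to drive, and Alvin's worry about resync time -- here is a minimal sketch of a 3-disk RAID5 build and a disk replacement. The device names and partition layout are assumptions, not anything from the original posts, and on a non-hot-swap chassis the physical swap still needs a shutdown.

  # build a 3-disk software RAID5 (device names are hypothetical)
  mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1

  # when a member dies: mark it failed, pull it from the array,
  # physically swap the disk, partition it the same way, then re-add it
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1
  mdadm /dev/md0 --add /dev/sdb1

  # the rebuild runs in the background; this is where the
  # "resync takes time" point bites -- watch progress with:
  cat /proc/mdstat

The array stays online (degraded) through the whole procedure, which is the sense in which "zero downtime" holds; the window of risk is the rebuild itself, when a second failure is fatal.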
that's $54/GB, versus $0.4/GB for disk ($64/160G). I don't think flash is going to improve by a factor of 100 any time soon. on that note, though - does anyone have comments about booting machines from flash? > and disks should NOT fail before the power supply or fans .. right, disks "should" fail shortly after the fans ;) From gpadmanabhan at gmail.com Mon Jan 31 00:05:14 2005 From: gpadmanabhan at gmail.com (Ganesh) Date: Mon, 31 Jan 2005 13:35:14 +0530 Subject: [Beowulf] Beowulf Newbie Message-ID: Hello All, I am beginner in Beowulf and am experimenting with 3 nodes (1 EM64T and 2 true AMD-64 machines). Is there a free Clustering Software that I can start off with? Is there a popular one with a mailing list also? Thanks in advance, Ganesh From zs03 at aub.edu.lb Mon Jan 31 02:46:50 2005 From: zs03 at aub.edu.lb (Ziad Shaaban) Date: Mon, 31 Jan 2005 12:46:50 +0200 Subject: [Beowulf] Information Reseach Lab Message-ID: Dear All, I am planning to have an information lab in our faculty built of: Dell, Linux, Oracle and GIS. Can I use Beowulf to analyze GIS Data and display them on the web using ArcIMS, all three vendors said yes, but can I use Beowulf? *********************************************** Ziad Shaaban Faculty of Engineering and Architecture Information Technology Unit ************************************************ E-mail: ziad.shaaban at aub.edu.lb Pager: 0844 Phone: 961 1 374374 ext 3436 Fax: 961 1 744462 ************************************************ From tegner at nada.kth.se Mon Jan 31 05:20:44 2005 From: tegner at nada.kth.se (tegner at nada.kth.se) Date: Mon, 31 Jan 2005 14:20:44 +0100 (MET) Subject: [Beowulf] tg3 or bcm5700? Options? Message-ID: <18084.150.227.16.253.1107177644.squirrel@webmail.nada.kth.se> Hi all, We have some trouble with the parallel performance of our cluster of dual opterons MSI-9245 in IBM E325. We are using one of the internal Broadcom nics (eth0) as a "nfs-network", and the other (eth1) as a "computational network". We have tried both the tg3-driver and Broadcoms bcm5700. The tg3-driver seems to deliver good network performance over eth1 (we have used netpipe to test). The bcm5700-driver gives higher latency, but on "speedup" tests (i.e. checking the speedup of a cfd problem of fixed size on different numbers of processors) the bcm5700-driver gives significantly better results. By loading the bcm5700-driver with the options options bcm5700 adaptive_coalesce=0,0 rx_coalesce_ticks=1,1 \ rx_max_coalesce_frames=1,1 tx_coalesce_ticks=1,1 \ tx_max_coalesce_frames=1,1 the latency is improved, BUT the "speedup" performance is somewhat degraded. Question is if anyone of you have experienced these kinds of issues, and if you have suggestions on how the network performance can be "optimized" (e.g. what options to use in modprobe.conf). Thanks in advance, /jon From cjtan at OptimaNumerics.com Mon Jan 31 04:18:17 2005 From: cjtan at OptimaNumerics.com (C J Kenneth Tan -- OptimaNumerics) Date: Mon, 31 Jan 2005 12:18:17 +0000 (UTC) Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: <41F7696B.4010508@ohmsurveys.com> References: <94b391ae05012518073d3592bd@mail.gmail.com> <41F7696B.4010508@ohmsurveys.com> Message-ID: Rolando, I would say the something similar to what Mark Westwood has said. But I may also add that nowadays, you will also find C being used quite extensively. Kenneth Tan ----------------------------------------------------------------------- News: OptimaNumerics Libraries 3.0: Over 200% Faster! 
----------------------------------------------------------------------- C. J. Kenneth Tan, Ph.D. OptimaNumerics Ltd. E-mail: cjtan at OptimaNumerics.com Telephone: +44 798 941 7838 Web: http://www.OptimaNumerics.com Facsimile: +44 289 066 3015 ----------------------------------------------------------------------- On 2005-01-26 09:56 -0000 Mark Westwood (mark.westwood at ohmsurveys.com) wrote: > Date: Wed, 26 Jan 2005 09:56:59 +0000 > From: Mark Westwood > To: Rolando Espinoza La Fuente > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] python & Lush on a cluster (Newbie question) > > Rolando > > I know nothing about Lush (or Python for that matter). Here's my off-the-cuff > opinion: > > Numbers + Clusters => Fortran > > One of the problems you might have using Python and Lush is interfacing to > MPI. If your objective is numerical research you might want to avoid learning > too much about this sort of issue. With Fortran / C / C++ you can get all the > components you need off the shelf. > > Good luck with your research. > > Mark > > PS Your English seems pretty good to me :-) > > Rolando Espinoza La Fuente wrote: > > Hi :) > > > > (my english isn't very good...) > > > > I'll build a basic beowulf cluster for numerical "research" (i hope), > > what do you think about using python and lush for programming on the > > cluster? > > > > Anybody has comments about lush? > > > > Better way (language... than C/Fortran) for programming (numerical > > apps) on the cluster? > > > > Thanks in advance. > > > > PD: If you don't know about lush: > > > > Lush is an object-oriented programming language designed for > > researchers, experimenters, and engineers interested in large-scale > > numerical and graphic applications. > > Lush's main features includes: > > * A very clean, simple, and easy to learn Lisp-like syntax. > > * A compiler that produces very efficient C code and relies > > on the C compiler to produce efficient native code > > (no inefficient bytecode or virtual machine) > > . .... > > more info: http://lush.sourceforge.net/ > > > > -- > Mark Westwood > Parallel Programmer > OHM Ltd > The Technology Centre > Offshore Technology Park > Claymore Drive > Aberdeen > AB23 8GD > United Kingdom > > +44 (0)870 429 6586 > www.ohmsurveys.com > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From rhodas at gmail.com Mon Jan 31 19:57:36 2005 From: rhodas at gmail.com (Rolando Espinoza La Fuente) Date: Mon, 31 Jan 2005 23:57:36 -0400 Subject: [Beowulf] python & Lush on a cluster (Newbie question) In-Reply-To: References: <94b391ae05012518073d3592bd@mail.gmail.com> <41F7696B.4010508@ohmsurveys.com> Message-ID: <94b391ae0501311957305708b2@mail.gmail.com> Hi, thanks everyone. Now i have clear. I'll use (learn) Fortran or C instead, and will use python for gui and other things. Greets. On Mon, 31 Jan 2005 12:18:17 +0000 (UTC), C J Kenneth Tan -- OptimaNumerics wrote: > Rolando, > > I would say the something similar to what Mark Westwood has said. But > I may also add that nowadays, you will also find C being used quite > extensively. 
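Since the advice above keeps coming back to Fortran or C with MPI, here is a minimal C/MPI sketch of the sort of thing that is "off the shelf": every rank computes a partial result and rank 0 collects the sum. This is only an illustration of the API, not code from any of the posts, and the build/launch commands depend on which MPI implementation is installed.

  /* minimal MPI example in C: sum one contribution per rank on rank 0 */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      double local, total = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I */
      MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many ranks */

      local = (double)(rank + 1);             /* this rank's partial result */
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("%d ranks, total = %g\n", size, total);

      MPI_Finalize();
      return 0;
  }

  compile and run (command names vary with the MPI install):
      mpicc sum.c -o sum
      mpirun -np 4 ./sum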
> > Kenneth Tan -- (c) RHODAS: Robotic Humanoid Optimized for Destruction and Accurate Sabotage (w) http://darkstar.fcyt.umss.edu.bo/~rolando From bvanhaer at sckcen.be Mon Jan 31 23:48:19 2005 From: bvanhaer at sckcen.be (Ben Vanhaeren) Date: Tue, 1 Feb 2005 08:48:19 +0100 Subject: [Beowulf] Information Reseach Lab In-Reply-To: References: Message-ID: <200502010848.19789.bvanhaer@sckcen.be> On Monday 31 January 2005 11:46, Ziad Shaaban wrote: > Dear All, > > I am planning to have an information lab in our faculty built of: Dell, > Linux, Oracle and GIS. > > Can I use Beowulf to analyze GIS Data and display them on the web using > ArcIMS, all three vendors said yes, but can I use Beowulf? > I think you should read the Beowulf FAQ: http://www.beowulf.org/overview/faq.html#1 Beowulf is a concept not a piece of software. I don't think you are going to need a beowulf cluster for the kind of application you want to run (analyzing GIS data). If you want to guarantee availability of your GIS data or do loadbalancing (distribute the load to several servers) you should take a look at linux HA project: http://www.linux-ha.org/ Apache loadbalancing with mod_backhand: http://www.backhand.org/ApacheCon2001/US/backhand_course_notes.pdf and Oracle Real Application Clusters (RAC). -- Ben Vanhaeren System Administrator RF&M Dept SCK-CEN http://www.sckcen.be ---------- SCK-CEN Disclaimer --------- http://www.sckcen.be/emaildisclaimer.html
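One loose end from earlier in this digest, Jon's tg3 versus bcm5700 question: with the tg3 driver the interrupt-coalescing knobs are normally reached through ethtool rather than module options. This is a sketch only -- eth1 is assumed to be the computational network, and older tg3 builds may reject some of these settings ("Operation not supported"), in which case the bcm5700 module options quoted in Jon's post remain the way to do it.

  # show current interrupt-coalescing settings on the MPI network NIC
  ethtool -c eth1

  # favour latency: no adaptive coalescing, interrupt (almost) per frame;
  # expect higher CPU load and possibly lower streaming bandwidth
  ethtool -C eth1 adaptive-rx off adaptive-tx off \
          rx-usecs 1 rx-frames 1 tx-usecs 1 tx-frames 1

Re-running the NetPIPE latency test and the fixed-size CFD speedup test after each change is the only way to tell which side of the latency/throughput trade-off a given code actually sits on.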