From greg.lindahl at qlogic.com Fri Mar 2 12:09:06 2007
From: greg.lindahl at qlogic.com (Greg Lindahl)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] Companies contributing to the Linux kernel
Message-ID: <20070302200906.GA18803@dhcp-2-204.internal.keyresearch.com>

LWN recently did an article entitled "Who wrote 2.6.20?" It was a
response to a Time magazine article which claimed that Linux was
written by volunteers, when most of us know that most Linux kernel
development is done by paid developers.

http://lwn.net/Articles/222773/

In one of the charts he looked at all the changes to the kernel in the
last year, and summed them up by company. The top companies were
(drumroll please):

(Unknown)   740990   29.5%
Red Hat     361539   14.4%
(None)      239888    9.6%
IBM         200473    8.0%
QLogic       91834    3.7%
Novell       91594    3.6%
Intel        78041    3.1%

... and we didn't even do our own distro! Hee hee.

-- greg

From peter.st.john at gmail.com Fri Mar 2 13:11:29 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] Companies contributing to the Linux kernel
In-Reply-To: <20070302200906.GA18803@dhcp-2-204.internal.keyresearch.com>
References: <20070302200906.GA18803@dhcp-2-204.internal.keyresearch.com>
Message-ID:

Greg,

I'd just want to point out there is a difference between "Linux" and
"[recent] Linux kernel development". That said, thanks so much for your
substantial contributions to ongoing kernel development; that's
important :-) and gratz on beating out Novell and Intel.

Peter

From mathog at caltech.edu Fri Mar 2 13:38:32 2007
From: mathog at caltech.edu (David Mathog)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] extreme dynamic underclocking and undervolting
Message-ID:

Long ago I started keeping my notes and a bit of editorial content on
idle power consumption for various computers and related here:

http://saf.bio.caltech.edu/saving_power.html

A few days ago I realized that there was no Intel Core information in
there. Since I don't have any myself, I hunted down an iMac and a PC
and found, much to my surprise, that while the Core processors were
quite efficient when idling, there was apparently no way to adjust the
power consumption downward any further via Enhanced SpeedStep (or
whatever Intel calls it today). I assume that the CPUs in these two
boxes supported this capability, but the BIOS (or its equivalent on a
Mac) apparently didn't enable this feature. Sure, it's a small sample,
but in this day and age I really expected it to be enabled by default
pretty much everywhere.

Anyway, that got me thinking about idle power consumption on clusters.
Many of you have machines that run at 100% CPU 24/7, and for those
systems the following discussion is irrelevant.
But there are other clusters around that tend to sit for long periods
of time between jobs, and whatever power they are using while waiting
for a job is pretty close to a total waste. This is even more common
on regular PCs, where CPU usage is extremely "bursty".

The thing is, on pretty much every machine I've seen (exception: some
laptops) there is a gaping hole between the lowest power level on a
running machine and the power level when it goes to sleep. Putting
idle nodes all the way to sleep would save the most power, but it is a
nightmare in terms of waking them back up again. Besides the issue of
disks that might not spin back up, there is the problem of the (many)
network protocols which are going to time out and break connections.
Also, returning nodes from sleep tends to be relatively slow, taking
many seconds to many minutes, depending on a whole lot of variables.

So it would be nice if the range of underclocking / undervolting
adjustments provided on compute nodes extended quite a bit further
towards the lower end than it currently does. Typically idle is
something like 70-80W at the lowest clock speed and sleep is 2-4W.
There's a lot of room in there to work with. Why is there not a system
that can slow down far enough to use only 15W and still run, albeit
very slowly? On a diskless node 7-10W might even be possible. Machines
running in these modes would be alive enough to keep network
connections open, and would be a whole lot easier to get back up to
full speed than the equivalent machine in a sleep state. Assuming the
transition speed is similar to Cool'n'Quiet, we're talking much less
than a second to speed back up again.

There are a lot of articles around about statically underclocked
machines, which proves that running modern hardware slowly is
possible, but the statically underclocked machines cannot be sped up
again - they start slow, and stay slow.
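
To put rough numbers on the gap between "idle" and "asleep" described
above, here is a minimal sketch. The wattages are the ones quoted in
this message (75W and 3W are simply the midpoints of the 70-80W and
2-4W ranges); the node count and the fraction of time spent idle are
made-up illustration values, not measurements from any real cluster.

```python
HOURS_PER_YEAR = 24 * 365


def idle_energy_kwh(nodes, idle_watts, idle_fraction):
    """Energy a cluster uses per year while idle, in kWh.

    nodes         -- number of compute nodes
    idle_watts    -- power drawn by one node while idle
    idle_fraction -- share of the year the nodes sit idle (0.0 to 1.0)
    """
    return nodes * idle_watts * idle_fraction * HOURS_PER_YEAR / 1000.0


# Hypothetical cluster: 64 nodes that wait for jobs half of the time.
NODES = 64
IDLE_FRACTION = 0.5

# Per-node idle power levels taken from the message above.
levels = [
    ("lowest clock speed today (70-80W)", 75.0),
    ("hoped-for deep idle (15W)", 15.0),
    ("sleep (2-4W)", 3.0),
]

for label, watts in levels:
    kwh = idle_energy_kwh(NODES, watts, IDLE_FRACTION)
    print(f"{label:36s} {kwh:10.1f} kWh/year")
```

Under those assumptions the 15W mode would cut idle energy to a fifth
of today's figure while keeping the nodes on the network, which is the
trade-off being asked for here.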
VIA sells some processors like the C7 which will operate over a very
wide power range, but unfortunately the fastest those will crunch
isn't anywhere near the speed of an Opteron or Core.

Big iron SMP machines often have the ability to shut off CPUs while
the machine is running, well, except for the last one obviously. With
quad cores pretty much here, and octo cores on the horizon, one might
imagine large power savings at idle could be achieved the same way on
these chips. Can any of the high-core-count Opterons or Core CPUs
power down unused cores now?

In closing, does anybody currently make a rack mountable compute node
with a really, really, really low idle power mode, and also
competitive performance when running at 100%?

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From wavelet at iutlecreusot.u-bourgogne.fr Thu Mar 1 01:01:35 2007
From: wavelet at iutlecreusot.u-bourgogne.fr (Wavelet colloque)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] Call for papers : Wavelet Applications in Industrial Processing V
Message-ID:

*** Call for Papers and Announcement ***

Wavelet Applications in Industrial Processing V (SA109)

Part of SPIE's International Symposium on Optics East 2007
9-12 September 2007 - Seaport World Trade Center - Boston, MA, USA

--- Abstract Due Date (deadline extended): 4 March 2007 ---
--- Manuscript Due Date: 13 August 2007 ---

Web site
http://spie.org/Conferences/Calls/07/oe/submitAbstract/index.cfm?fuseaction=SA109

ABSTRACT TEXT Approximately 500 words.

Conference Chairs: Frédéric Truchetet, Univ. de Bourgogne (France);
Olivier Laligant, Univ. de Bourgogne (France)

Program Committee: Patrice Abry, École Normale Supérieure de Lyon
(France); Radu V. Balan, Siemens Corporate Research; Atilla M.
Baskurt, Univ. Claude Bernard Lyon 1 (France); Amel Benazza-Benyahia,
Ecole Supérieure des Communications de Tunis (Tunisia); Albert
Bijaoui, Observatoire de la Côte d'Azur (France); Seiji Hata, Kagawa
Univ. (Japan); Henk J. A. M. Heijmans, Ctr. for Mathematics and
Computer Science (Netherlands); William S. Hortos, Associates in
Communication Engineering Research and Technology; Jacques Lewalle,
Syracuse Univ.; Wilfried R. Philips, Univ. Gent (Belgium); Alexandra
Pizurica, Univ. Gent (Belgium); Guoping Qiu, The Univ. of Nottingham
(United Kingdom); Hamed Sari-Sarraf, Texas Tech Univ.; Peter
Schelkens, Vrije Univ. Brussel (Belgium); Paul Scheunders, Univ.
Antwerpen (Belgium); Kenneth W. Tobin, Jr., Oak Ridge National Lab.;
Günther K. G. Wernicke, Humboldt-Univ. zu Berlin (Germany); Gerald
Zauner, Fachhochschule Wels (Austria)

The wavelet transform, multiresolution analysis, and other
space-frequency or space-scale approaches are now considered standard
tools by researchers in image and signal processing. Promising
practical results in machine vision and sensors for industrial
applications and non destructive testing have been obtained, and a lot
of ideas can be applied to industrial imaging projects.

This conference is intended to bring together practitioners,
researchers, and technologists in machine vision, sensors, non
destructive testing, signal and image processing to share recent
developments in wavelet and multiresolution approaches. Papers
emphasizing fundamental methods that are widely applicable to
industrial inspection and other industrial applications are especially
welcome.

Papers are solicited in, but not limited to, the following areas:

o New trends in wavelet and multiresolution approach, frame and
  overcomplete representations, Gabor transform, space-scale and
  space-frequency analysis, multiwavelets, directional wavelets,
  lifting scheme for:
  - sensors
  - signal and image denoising, enhancement, segmentation, image
    deblurring
  - texture analysis
  - pattern recognition
  - shape recognition
  - 3D surface analysis, characterization, compression
  - acoustical signal processing
  - stochastic signal analysis
  - seismic data analysis
  - real-time implementation
  - image compression
  - hardware, wavelet chips.

o Applications:
  - machine vision
  - aspect inspection
  - character recognition
  - speech enhancement
  - robot vision
  - image databases
  - image indexing or retrieval
  - data hiding
  - image watermarking
  - non destructive evaluation
  - metrology
  - real-time inspection.

o Applications in microelectronics manufacturing, web and paper
  products, glass, plastic, steel, inspection, power production,
  chemical process, food and agriculture, pharmaceuticals, petroleum
  industry.

All submissions will be peer reviewed. Please note that abstracts must
be at least 500 words in length in order to receive full
consideration.

------ Submission of Abstracts for Optics East 2007 Symposium ------

Abstract Due Date (deadline extended): 4 March 2007
Manuscript Due Date: 13 August 2007

Abstracts, if accepted, will be distributed at the meeting.

* IMPORTANT!

- Submissions imply the intent of at least one author to register,
  attend the symposium, present the paper (either orally or in poster
  format), and submit a full-length manuscript for publication in the
  conference Proceedings.
- By submitting your abstract, you warrant that all clearances and
  permissions have been obtained, and authorize SPIE to circulate your
  abstract to conference committee members for review and selection
  purposes and, if it is accepted, to publish your abstract in
  conference announcements and publicity.
- All authors (including invited or solicited speakers), program
  committee members, and session chairs are responsible for
  registering and paying the reduced author, session chair, program
  committee registration fee. (Current SPIE Members receive a discount
  on the registration fee.)

* Instructions for Submitting Abstracts via Web

- You are STRONGLY ENCOURAGED to submit abstracts using the "submit an
  abstract" link at: http://spie.org/events/oe
- Submitting directly on the Web ensures that your abstract will be
  immediately accessible by the conference chair for review through
  MySPIE, SPIE's author/chair web site.
- Please note! When submitting your abstract you must provide contact
  information for all authors, summarize your paper, and identify the
  contact author who will receive correspondence about the submission
  and who must submit the manuscript and all revisions. Please have
  this information available before you begin the submission process.
- First-time users of MySPIE can create a new account by clicking on
  the create new account link. You can simplify account creation by
  using your SPIE ID# which is found on SPIE membership cards or the
  label of any SPIE mailing.
- If you do not have web access, you may E-MAIL each abstract
  separately to: abstracts@spie.org in ASCII text (not encoded)
  format. There will be a time delay for abstracts submitted via
  e-mail as they will not be immediately processed for chair review.

  IMPORTANT! To ensure proper processing of your abstract, the SUBJECT
  line must include only:
  SUBJECT: SA109, TRUCHETET, LALIGANT

- Your abstract submission must include all of the following:

  1. PAPER TITLE
  2. AUTHORS (principal author first)
     For each author:
     o First (given) Name (initials not acceptable)
     o Last (family) Name
     o Affiliation
     o Mailing Address
     o Telephone Number
     o Fax Number
     o Email Address
  3. PRESENTATION PREFERENCE
     "Oral Presentation" or "Poster Presentation."
  4. PRINCIPAL AUTHOR'S BIOGRAPHY
     Approximately 50 words.
  5. ABSTRACT TEXT
     Approximately 500 words. Accepted abstracts for this conference
     will be included in the abstract CD-ROM which will be available
     at the meeting. Please submit only 500-word abstracts that are
     suitable for publication.
  6. KEYWORDS
     Maximum of five keywords.

* Conditions of Acceptance

- Authors are expected to secure funding for registration fees,
  travel, and accommodations, independent of SPIE, through their
  sponsoring organizations before submitting abstracts.
- Only original material should be submitted.
- Commercial papers, papers with no new research/development content,
  and papers where supporting data or a technical description cannot
  be given for proprietary reasons will not be accepted for
  presentation in this symposium.
- Abstracts should contain enough detail to clearly convey the
  approach and the results of the research.
- Government and company clearance to present and publish should be
  final at the time of submittal. If you are a DoD contractor, allow
  at least 60 days for clearance. Authors are required to warrant to
  SPIE in advance of publication of the Proceedings that all necessary
  permissions and clearances have been obtained, and that submitting
  authors are authorized to transfer copyright of the paper to SPIE.

* Review, Notification, Program Placement

- To ensure a high-quality conference, all abstracts and Proceedings
  manuscripts will be reviewed by the Conference Chair/Editor for
  technical merit and suitability of content. Conference
  Chair/Editors may require manuscript revision before approving
  publication, and reserve the right to reject for presentation or
  publication any paper that does not meet content or presentation
  expectations. SPIE's decision on whether to accept a presentation or
  publish a manuscript is final.
- Applicants will be notified of abstract acceptance and sent
  manuscript instructions by e-mail no later than 7 May 2007.
  Notification of acceptance will be placed on SPIE Web the week of
  4 June 2007 at http://spie.org/events/oe
- Final placement in an oral or poster session is subject to the
  Chairs' discretion. Instructions for oral and poster presentations
  will be sent to you by e-mail. All oral and poster presentations
  require presentation at the meeting and submission of a manuscript
  to be included in the Proceedings of SPIE.

* Proceedings of SPIE

- These conferences will result in full-manuscript
  Chairs/Editor-reviewed volumes published in the Proceedings of SPIE
  and in the SPIE Digital Library.
- Correctly formatted, ready-to-print manuscripts submitted in English
  are required for all accepted oral and poster presentations.
  Electronic submissions are recommended, and result in higher quality
  reproduction. Submission must be provided in PostScript created with
  a printer driver compatible with SPIE's online Electronic Manuscript
  Submission system. Instructions are included in the author kit and
  from the "Author Info" link at the conference website.
- Authors are required to transfer copyright of the manuscript to SPIE
  or to provide a suitable publication license.
- Papers published are indexed in leading scientific databases
  including INSPEC, Ei Compendex, Chemical Abstracts, International
  Aerospace Abstracts, Index to Scientific and Technical Proceedings
  and NASA Astrophysical Data System, and are searchable in the SPIE
  Digital Library. Full manuscripts are available to Digital Library
  subscribers.
- Late manuscripts may not be published in the conference Proceedings
  and SPIE Digital Library, whether the conference volume will be
  published before or after the meeting. The objective of this policy
  is to better serve the conference participants as well as the
  technical community at large, by enabling timely publication of the
  Proceedings.
- Papers not presented at the meeting will not be published in the
  conference Proceedings, except in the case of exceptional
  circumstances at the discretion of SPIE and the Conference
  Chairs/Editors.

From jaime.perea at gmail.com Thu Mar 1 06:50:49 2007
From: jaime.perea at gmail.com (jaime.perea@gmail.com)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] network filesystem
Message-ID: <200703011550.49618.jaime.perea@gmail.com>

Hi,

I have a small cluster (16 dual Xeon machines). We are going to add an
additional machine which is only going to serve a big filesystem via a
gigabit interface.

Does anybody know which is better for a cluster of this size:
exporting the filesystem via NFS, or using an alternative such as a
cluster filesystem like GFS or OCFS?

Thanks in advance

--
Jaime D. Perea Duarte.
Linux registered user #10472

Dep. Astrofisica Extragalactica.
Instituto de Astrofisica de Andalucia (CSIC)
Apdo. 3004, 18080 Granada, Spain.

From dkondo at lri.fr Thu Mar 1 08:27:37 2007
From: dkondo at lri.fr (Derrick Kondo)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] [PCGrid 2007] call for participation: workshop on desktop grids
Message-ID: <60ec14620703010827n365b803fke625dbe84dd13a6@mail.gmail.com>

CALL FOR PARTICIPATION
(see advance program below)

WORKSHOP ON LARGE-SCALE, VOLATILE DESKTOP GRIDS (PCGRID 2007)
held in conjunction with the
IEEE International Parallel & Distributed Processing Symposium (IPDPS)
March 30, 2007
Long Beach, California U.S.A.
http://pcgrid07.lri.fr

Desktop grids utilize the free resources available in Intranet or
Internet environments for supporting large-scale computation and
storage. For over a decade, desktop grids have been one of the largest
and most powerful distributed computing systems in the world, offering
a high return on investment for applications from a wide range of
scientific domains (including computational biology, climate
prediction, and high-energy physics). While desktop grids sustain up
to Teraflops/second of computing power from hundreds of thousands to
millions of resources, fully leveraging the platform's computational
power is still a major challenge because of the immense scale, high
volatility, and extreme heterogeneity of such systems.

The purpose of the workshop is to provide a forum for discussing
recent advances and identifying open issues for the development of
scalable, fault-tolerant, and secure desktop grid systems. The
workshop seeks to bring desktop grid researchers together from
theoretical, system, and application areas to identify plausible
approaches for supporting applications with a range of complexity and
requirements on desktop environments.

#####################################################################
ADVANCE PROGRAM

(In each session below, the following list of papers will be
presented. For the detailed schedule, see
http://pcgrid07.lri.fr/program.html)

---------------------------------------------------------------------
KEYNOTE SPEAKER:
David P. Anderson, Director of BOINC and SETI@home,
University of California at Berkeley

---------------------------------------------------------------------
SESSION I: SYSTEMS

Invited Paper: Open Internet-based Sharing for Desktop Grids in iShare
  Xiaojuan Ren, Purdue University, U.S.A.
  Ayon Basumallik, Purdue University, U.S.A.
  Zhelong Pan, VMWare, Inc., U.S.A.
  Rudolf Eigenmann, Purdue University, U.S.A.

Invited Paper: Decentralized Dynamic Host Configuration in Wide-area
Overlay Networks of Virtual Workstations
  Arijit Ganguly, University of Florida, U.S.A.
  David I. Wolinsky, University of Florida, U.S.A.
  P. Oscar Boykin, University of Florida, U.S.A.
  Renato J. Figueiredo, University of Florida, U.S.A.

SZTAKI Desktop Grid: a Modular and Scalable Way of Building Large
Computing Grids
  Zoltan Balaton, MTA SZTAKI Research Institute, Hungary
  Gabor Gombas, MTA SZTAKI Research Institute, Hungary
  Peter Kacsuk, MTA SZTAKI Research Institute, Hungary
  Adam Kornafeld, MTA SZTAKI Research Institute, Hungary
  Jozsef Kovacs, MTA SZTAKI Research Institute, Hungary
  Attila Csaba Marosi, MTA SZTAKI Research Institute, Hungary
  Gabor Vida, MTA SZTAKI Research Institute, Hungary
  Norbert Podhorszki, UC Davis, U.S.A.
  Tamas Kiss, University of Westminster, U.K.

Direct Execution of Linux Binary on Windows for Grid RPC Workers
  Yoshifumi Uemura, University of Tsukuba, Japan
  Yoshihiro Nakajima, University of Tsukuba, Japan
  Mitsuhisa Sato, University of Tsukuba, Japan

---------------------------------------------------------------------
SESSION II: SCHEDULING AND RESOURCE MANAGEMENT

Local Scheduling for Volunteer Computing
  David Anderson, UC Berkeley, U.S.A.
  John McLeod VII, Sybase, Inc., U.S.A.

Moving Volunteer Computing towards Knowledge-Constructed,
Dynamically-Adaptive Modeling and Scheduling
  Michela Taufer, University of Texas at El Paso, U.S.A.
  Andre Kerstens, University of Texas at El Paso, U.S.A.
  Trilce Estrada, University of Texas at El Paso, U.S.A.
  David Flores, University of Texas at El Paso, U.S.A.
  Richard Zamudio, University of Texas at El Paso, U.S.A.
  Patricia Teller, University of Texas at El Paso, U.S.A.
  Roger Armen, The Scripps Research Institute, U.S.A.
  Charles L. Brooks III, The Scripps Research Institute, U.S.A.

Proxy-based Grid Information Dissemination
  Deger Erdil, State University of New York at Binghamton, U.S.A.
  Michael Lewis, State University of New York at Binghamton, U.S.A.
  Nael Abu-Ghazaleh, State University of New York at Binghamton, U.S.A.

---------------------------------------------------------------------
SESSION III: DATA-INTENSIVE APPLICATIONS AND DISTRIBUTED STORAGE

Challenges in Executing Data Intensive Biometric Workloads on a
Desktop Grid
  Christopher Moretti, University of Notre Dame, U.S.A.
  Timothy Faltemier, University of Notre Dame, U.S.A.
  Douglas Thain, University of Notre Dame, U.S.A.
  Patrick Flynn, University of Notre Dame, U.S.A.

Invited Paper: Storage@home: Petascale Distributed Storage
  Adam L. Beberg, Stanford University, U.S.A.
  Vijay Pande, Stanford University, U.S.A.

---------------------------------------------------------------------
SESSION IV: THEORY

Applying IC-Scheduling Theory to Familiar Classes of Computations
  Gennaro Cordasco, University of Salerno, Italy
  Grzegorz Malewicz, Google, Inc., U.S.A.
  Arnold Rosenberg, University of Massachusetts at Amherst, U.S.A.

Invited Paper: A Combinatorial Model for Self-Organizing Networks
  Yuri Dimitrov, Ohio State University, U.S.A.
  Gennaro Mango, Ohio State University, U.S.A.
  Carlo Giovine, Ohio State University, U.S.A.
  Mario Lauria, Ohio State University, U.S.A.

Invited Paper: Towards Contracts & SLA in Large Scale Clusters &
Desktops Grids
  Denis Caromel, INRIA, France
  Francoise Baude, INRIA, France
  Alexandre di Costanzo, INRIA, France
  Christian Delbe, INRIA, France
  Mario Leyton, INRIA, France

#####################################################################
ORGANIZATION

General Chairs
  Derrick Kondo, INRIA Futurs, France
  Franck Cappello, INRIA Futurs, France

Program Chair
  Gilles Fedak, INRIA Futurs, France

From Michael.Frese at NumerEx.com Thu Mar 1 10:51:25 2007
From: Michael.Frese at NumerEx.com (Michael H. Frese)
Date: Fri May 9 01:05:42 2008
Subject: [Beowulf] Load Balance Shifts During Run of Fixed Balance Application
Message-ID: <6.1.0.6.2.20070301113903.049f2688@mail.swcp.com>

We have a parallel problem that shifts its load balance while
executing even though we are certain that it shouldn't. The following
will describe our experience level, our clusters, our application, and
the problem.

Our Experience

We are the developers of an MPI parallel application -- a 2-d
time-dependent multiphysics code -- with all the intimate knowledge of
its architecture and implementation that implies. We are presently
using the Portland Group Fortran and C compilers and MPICH-1 version
1.2.7. We have had success building and using other parallel
applications on HPC systems and clusters of workstations, though in
those cases the physics was 3-d. We have plenty of Linux workstation
sysadmin experience.

Our House-Built Clusters

We have built a few small, generally heterogeneous clusters of
workstations around AMD processors, Netgear GA311 NICs, and different
switches. We used Red Hat 8 and 9 for our 32-bit processors, and have
shifted to Fedora for our recent systems including our few ventures
into 64-bit land. Some of our nodes have dual processors. We have not
tuned the OSs at all, other than to be sure that our NICs have
appropriate drivers.
Some of our switches give us 80-90% of Gb speed as measured by
NetPIPE, both TCP/IP and MPI, and others give us 30%. In the case
described here, the switch is a slower one, but the application's
performance is determined by the latency since the messages are
relatively small. Our only performance tools are the Linux utility top
and a stopwatch.

Our Application Architecture and Performance Expectation

During execution, the application takes thousands of steps that each
advance simulation time. The processors advance through the different
physics packages and parts thereof in lock step from one MPI_Waitall
to the next, with limited amounts of work being done between the
barriers. We use MPI_Allreduce to do maximums, minimums, and sums of
various quantities.

The application uses a domain decomposition that does not change
during each run. Each time step is roughly the same amount of work as
previous ones, though the number of iterations in the implicit
solution methods changes. However, all processors are taking the same
number of iterations in each time step. Thus we expect that the
relative load on a processor will remain roughly the same as the
relative size of the domain it is assigned in the decomposition. The
problem is that it doesn't.

There is one exception to our expectation, in that intermittently,
after some number of time steps or interval of simulation time, the
application does output. Each processor writes some dump files
identified with its node number to a problem directory, and a single
processor combines those files into one while all the other processors
wait. By controlling the frequency of the output, we keep the total
time lost in this wait relatively small. In addition, every ten
cycles, the output processor writes a brief summary of the problem
state to the terminal output.

One more thing before we get to the problem. We don't use mpirun; our
application reads a processor group file and starts the remote
processes itself.
Thus, there is one processor that is distinguished from the others: it
was directly invoked from the command line of a shell -- usually tcsh,
but never mind that religious war.

The Problem

We have observed unexpected and extreme load-balance shifts during
both two- and four-processor runs. In the following, our focus will be
on the four-processor run.

We observe the load balance by monitoring CPU usage on each of the
processors with separate xterm-invoked tops from a non-cluster
machine. Our primary observable is %CPU; as a secondary observable, we
monitor the wall time interval between the 10-cycle terminal edits.

The load balance starts out looking like the relative sizes of the
domains we assigned to the various processors, just as we expect. The
processor on which the run was started has the smallest domain to
handle, and its %CPU is initially around 50%, while the others are
around 90%. After a few hundred time steps or so the CPU usage of the
processor on which the job was started begins to increase and the
others begin to fall. After a thousand time steps or so, the CPU usage
is nearly 90% for the originating process, and less than 20% for the
remote processes. Not surprisingly, the wall time between 10-cycle
terminal edits goes up by a factor of 4 over the same period.

By observation, no other task ever consumes more than a few tenths of
a percent of the CPU. The originating processor is the output
processor, but only the terminal output is happening during this
period, and we observe no significant change in the CPU usage during
the cycles when that output is produced. Top is updating its output
every 5 seconds and in this run our application is taking one time
step every 2 seconds.

The message count and size of the messages imply that two processors
are spending about 30% of their time in system time for message
startup and about a tenth that much actually transmitting data.
There are about 6,000 messages sent and received in each time step on
those processors, though it varies slightly from time step to time
step. The other two processors -- one of which is the originating
processor -- have about half that many messages to send and receive,
and spend correspondingly less time doing it.

Though we have shuffled the originating processor and the processors
in the group, the results are always similar. In one case we ran with
four identical nodes except that one had Red Hat 8 while the others
were Red Hat 9. In another case we ran four Red Hat 9 machines with
slightly different AMD processor speeds (2.08 vs 2.16 GHz). The 9.0
kernels are 2.4.20, while the 8.0 has been upgraded to 2.4.18.

Here is a final bit of data. To prove that the shift was not
determined by the state of the problem being simulated, we restarted
the simulation from a restart dump made by our application when the
load had shifted to the originating processor. The load balance
immediately after the restart again reflected the domain size as it
had in the beginning of the unrestarted simulation. After a thousand
cycles in the restarted problem, the load had shifted back to the
originating processor.

Conclusion/Hypothesis

Our tentative conclusion is that either MPICH or the operating system
is eating an increasing amount of time on the originating processor as
the number of time steps accumulates. It is probable that the
accumulated number of messages transmitted is the problem. It acts
like a leak, but of processor CPU time rather than memory. Top does
not show any increase in resident set size (RSS) during the run.

Does anyone have any ideas what this behavior might be, how we can
test for it, and what we can do to fix it?

Thanks for any help in advance.
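
The figures quoted above imply a per-message startup cost, and it can
be useful to have that number written down when comparing against
latency measurements. The sketch below only redoes that arithmetic; it
assumes the 2-second time step, the 30% system-time share, and the
6,000 messages per step all describe the same phase of the run, which
the message does not state explicitly. The function name is made up
for illustration.

```python
def startup_cost_per_message_us(step_seconds, startup_fraction, messages_per_step):
    """Average message-startup time per message, in microseconds.

    step_seconds      -- wall time of one time step
    startup_fraction  -- share of that time spent in message startup
    messages_per_step -- messages sent and received per time step
    """
    startup_seconds = step_seconds * startup_fraction
    return startup_seconds / messages_per_step * 1e6


# Values quoted in the message above; treating them as simultaneous
# is an assumption.
cost = startup_cost_per_message_us(
    step_seconds=2.0,
    startup_fraction=0.30,
    messages_per_step=6000,
)
print(f"about {cost:.0f} microseconds of startup time per message")
```

If the per-message cost on the originating processor could be measured
early and late in a run, a growing value would point at the
accumulated-message hypothesis rather than at the physics.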
Mike From danapa2000 at gmail.com Fri Mar 2 14:26:46 2007 From: danapa2000 at gmail.com (Daniel Navas-Parejo Alonso) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703011550.49618.jaime.perea@gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: <171130500703021426x10b06212jce260de3e6f07ba6@mail.gmail.com> Jaime, In my humble opinion, I think you can start with NFS (but using at least NFSv4), and see what happens, for instance, what's the real disk access pattern of your cluster, in terms of amount of IOPS, average BW usage, and read/write pattern. It could be interesting to know not only what's gonna be the file server, but what's the underlying storage subsystem, and the network requirements. I mean, the server could be big, the network sth like IB or Myrinet, and then you can be serving files that reside on a copule of internal disks so performance would sink... Anyway, I've seen some NFS file servers in that cluster size, I'd suggest try it and check if this is enough for your applications. Hope this helps, Daniel. 2007/3/1, jaime.perea@gmail.com : > > Hi, > > I have a small (16 dual xeon machines) cluster. We are going to add > an additional machine which is only going to serve a big filesystem via > a gigabit interface. > > Does anybody knows what is better for a cluster of this size, exporting > the > filesystem via NFS or use another alternative such as a cluster filesystem > like GFS or OCFS? > > Thanks in advance > > -- > > Jaime D. Perea Duarte. > Linux registered user #10472 > > Dep. Astrofisica Extragalactica. > Instituto de Astrofisica de Andalucia (CSIC) > Apdo. 3004, 18080 Granada, Spain. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070302/36d86e0c/attachment.html From hahn at mcmaster.ca Fri Mar 2 14:46:01 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703011550.49618.jaime.perea@gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: > I have a small (16 dual xeon machines) cluster. We are going to add > an additional machine which is only going to serve a big filesystem via > a gigabit interface. are you comfortable with the expected performance of this design? that is, without any tuning/tweaking, you should achieve ~70 MB/s assuming the server's local disk(s) can manage, etc. > Does anybody knows what is better for a cluster of this size, exporting the > filesystem via NFS or use another alternative such as a cluster filesystem > like GFS or OCFS? NFS is really easy. for a small cluster and only Gb, I wouldn't even consider anything else. once you get into hundreds of nodes (or perhaps fewer very IO-intensive ones), alternatives are probably necessary. mostly, I'd decide on aggregate bandwidth requirements, though I'm sure NFS overhead eventually becomes a problem (thousands of nodes). I think your cluster would be very happy with 1-2 Gb links from the server, fed with a nice fast md-based raid array. regards, mark hahn. From csamuel at vpac.org Sat Mar 3 22:39:30 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <171130500703021426x10b06212jce260de3e6f07ba6@mail.gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> <171130500703021426x10b06212jce260de3e6f07ba6@mail.gmail.com> Message-ID: <200703041739.30987.csamuel@vpac.org> On Sat, 3 Mar 2007, Daniel Navas-Parejo Alonso wrote: > In my humble opinion, I think you can start with NFS (but using at least > NFSv4) How stable/usable is the Linux NFSv4 implementation these days ?
It's been a while since I followed the Linux v4 mailing lists and I'm way out of touch with how they're getting on.. cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From wt at atmos.colostate.edu Sun Mar 4 00:20:40 2007 From: wt at atmos.colostate.edu (Warren Turkal) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703041739.30987.csamuel@vpac.org> References: <200703011550.49618.jaime.perea@gmail.com> <171130500703021426x10b06212jce260de3e6f07ba6@mail.gmail.com> <200703041739.30987.csamuel@vpac.org> Message-ID: <200703040120.40403.wt@atmos.colostate.edu> On Saturday 03 March 2007 23:39, Chris Samuel wrote: > How stable/usable is the Linux NFSv4 implementation these days ? It sucks if you are using CentOS 4.x. With Debian Etch, it seems work pretty well. wt -- Warren Turkal From hahn at mcmaster.ca Sun Mar 4 09:38:26 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703041739.30987.csamuel@vpac.org> References: <200703011550.49618.jaime.perea@gmail.com> <171130500703021426x10b06212jce260de3e6f07ba6@mail.gmail.com> <200703041739.30987.csamuel@vpac.org> Message-ID: >> In my humble opinion, I think you can start with NFS (but using at least >> NFSv4) > > How stable/usable is the Linux NFSv4 implementation these days ? why V4? - security. within a cluster, I don't see the point to, say, kerberos. - compound rpcs. probably provides somewhat better efficiency. - open/close, byte-range locking. I don't see much demand. - client caching/delegation/leases - could be valuable for efficiency. I find that nfsv3 works fine for moderate IO on O(100) clients. 
I would be very interested to know whether others have observed performance benefits for v4, and whether it's an easy upgrade, such as no new/onerous security framework ('framework' is always a danger sign for me ;) From csamuel at vpac.org Sun Mar 4 14:02:58 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> Message-ID: <200703050902.58468.csamuel@vpac.org> On Mon, 5 Mar 2007, Mark Hahn wrote: > why V4? > - security. within a cluster, I don't see the point to, say, kerberos. Agreed, not to mention all the pain of trying to get Kerberos tickets passed through the queueing system and the fact that if you're running a 3 month job it's going to be quite hard to persuade your Kerberos admin to let you be able to create a ticket that lasts that long.. > - compound rpcs. probably provides somewhat better efficiency. Yup, should ease the burden of a lot of stat()'s. > - open/close, byte-range locking. I don't see much demand. Pass. :-) > - client caching/delegation/leases - could be valuable for efficiency. Indeed, this to me is the most useful part of it, especially for those people who are running code that should use local scratch but doesn't (either due to lack of coding experience or not having access to the source).. > I find that nfsv3 works fine for moderate IO on O(100) clients. Likewise, though we do get the occasional user who is able to generate a pathological case.. > I would be very interested to know whether others have observed > performance benefits for v4, and whether it's an easy upgrade, > such as no new/onerous security framework ('framework' is always > a danger sign for me ;) When I last played with it you could still use AUTH_SYS (as in v3) rather than having to use Kerberos. cheers!
Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From john.hearns at streamline-computing.com Sun Mar 4 14:59:09 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703050902.58468.csamuel@vpac.org> References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: <45EB4F3D.3060709@streamline-computing.com> Chris Samuel wrote: > On Mon, 5 Mar 2007, Mark Hahn wrote: > >> why V4? >> - security. within a cluster, I don't see the point to, say, kerberos. > > Agreed, not to mention all the pain of trying to get Kerberos tickets passed > through the queueing system and the fact that if you're running a 3 month job > it's going to be quite hard to persuade your Kerberos admin to let you be > able to create a ticket that lasts that long.. > Purely as a point of interest, since high energy physics labs use AFS (and hence kerberos) they have already faced this one. The ticket is extended when a batch job is submitted: http://services.web.cern.ch/services/afs/arc.html#SECTION00040000000000000000 From csamuel at vpac.org Sun Mar 4 23:55:05 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] extreme dynamic underclocking and undervolting In-Reply-To: References: Message-ID: <200703051855.05386.csamuel@vpac.org> On Sat, 3 Mar 2007, David Mathog wrote: > So it would be nice if the range of underclocking / undervolting > adjustments provided on compute nodes extended quite a bit further > towards the lower end than it currently does. FWIW 2.6.21 looks like it will include i386 support for the clockevents and dyntick patches that have been developed out in the real time Linux world.
Apparently they have AMD64 and ARM patches too, but these haven't been merged as of yet. There's a nice LWN article that describes this work: http://lwn.net/Articles/223185/ All of this is an improvement, but there is still one thing which could be better: there is no real need for a periodic tick in the system. That is especially true when the processor is idle. An idle CPU can save quite a bit of power, but waking that CPU up 100 times (or more) per second will hurt those power savings considerably. With a flexible timer infrastructure, there is no point in turning the CPU back on until it has something to do. So, when the (i386) kernel goes into its idle loop, it checks the next pending timer event. If that event is further away than the next tick, the periodic tick is turned off altogether; instead, the timer is is programmed to fire when the next event comes due. The CPU can then rest unharrassed until that time - unless an interrupt comes in first. Once the processor goes out of the idle state, the periodic tick is restored. [...] It quotes the developers saying: The implementation leaves room for further development like full tickless systems, where the time slice is controlled by the scheduler, variable frequency profiling, and a complete removal of jiffies in the future. -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From sdm900 at gmail.com Mon Mar 5 01:20:55 2007 From: sdm900 at gmail.com (Stu Midgley) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703011550.49618.jaime.perea@gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: I'd strongly recommend Lustre. It will work perfectly well from a single server node and give much higher bandwidths than NFS. 
If you have two nic's you can also serve up the file system over both and see around 150MB/s total bandwidth. Also, if you need more storage in the future, you can just add more servers... and get linear scaling of bandwidth. Stu. On 3/1/07, jaime.perea@gmail.com wrote: > Hi, > > I have a small (16 dual xeon machines) cluster. We are going to add > an additional machine which is only going to serve a big filesystem via > a gigabit interface. > > Does anybody knows what is better for a cluster of this size, exporting the > filesystem via NFS or use another alternative such as a cluster filesystem > like GFS or OCFS? > > Thanks in advance > > -- > > Jaime D. Perea Duarte. > Linux registered user #10472 > > Dep. Astrofisica Extragalactica. > Instituto de Astrofisica de Andalucia (CSIC) > Apdo. 3004, 18080 Granada, Spain. > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Dr Stuart Midgley sdm900@gmail.com From andrew.robbie at gmail.com Mon Mar 5 05:06:07 2007 From: andrew.robbie at gmail.com (Andrew Robbie (GMail)) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] IB switches: managed or not? Message-ID: Hi, I am building a small (~16) node cluster with an IB interconnect. I need to decide whether I will buy a cheaper, dumb switch and run OpenSM, or get a more expensive switch with a built in subnet manager. The largest this system would every grow is 32 nodes (two 24 port switches). Various vendors (integrators, not switch OEMs) have stated to me that managed switches are the go, and that OpenSM is (a) buggy, and (b) very time consuming to set up. But, a managed name brand switch seems to cost a lot more than a non-managed one using the Mellanox reference design kit (rebadged, but I suspect made by Flextronics...). My other query is about diagnostic software. 
With an ethernet switch it is pretty easy to fire up Ethereal (sorry Wireshark, but it is such a silly name) or Etherape and get a look at what is going on. If I buy a Cisco or Voltaire etc do they come with tools that let me get accurate representations of what is going on? Or are their tools really for large IB networks? Regards, Andrew [v2] -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070306/c28ef7d0/attachment.html From walid.shaari at gmail.com Mon Mar 5 07:41:58 2007 From: walid.shaari at gmail.com (Walid) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703050902.58468.csamuel@vpac.org> References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: On 3/5/07, Chris Samuel wrote: > > > - compound rpcs. probably provides somewhat better efficiency. > > Yup, should ease the burden of a lot stat()'s. > > > - client caching/delegation/leases - could be valuable for efficiency. > > Indeed, this to me is the most useful part of it, especially for those > people > who are running code that should use local scratch but doesn't (either due > to > lack of coding experience or not having access to the source).. Our developers had that issue of inconsistent file system view in RHEL based systems, some of it is solved by disabling dir list caching, another by using noac, what the other was doing was writing simultaneously to the same file partitioned over several nodes, I told this is probably not the right way to do file writing. apparently he used to do it in Sun Solaris and it worked flawlessly. NFSv4 brings to the table standard client implementation. unfortunately Red Hat recommends RHEL5 which should be out soon now for NFSv4 Walid. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070305/ea9feb32/attachment.html From hahn at mcmaster.ca Mon Mar 5 08:08:28 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: > Our developers had that issue of inconsistent file system view in RHEL > based systems, some of it is solved by disabling dir list caching, another > by using noac, well, developers should be smart enough to know what FS they're using, and how it's intended to behave. turning off AC is a nice option, but smarter is to leave it on and not try to cause race conditions. (I expect that such race-friendly behavior will fail on some other non-NFS filesystems, though probably harder to trigger.) > what the other was doing was writing simultaneously to the > same file partitioned over several nodes, I told this is probably not the > right way to do file writing. apparently he used to do it in Sun Solaris and > it worked flawlessly. I would spank any developer who said "but it works on platform X"! developers must be aware of the spec, not merely what they can get away with somewhere, sometime. of course, this is the thinking behind apps having "supported" platforms - just a fancy way of saying "no, we don't know what standards-conformance we need, or how we violate the standard, but here's a few places we haven't yet noticed any bad-enough bugs". writing to different sections of a file is probably wrong on any networked FS, since there will inherently be obscure interactions with the size and alignment of the writes vs client pagecache, network transport, actual network FS, server pagecache and underlying server/disk FS. 
in my experience, people who expect it to "just work" have an incredibly naive model of how a network FS works (ie, write() produces an RPC direct to the server) From john.hearns at streamline-computing.com Mon Mar 5 08:26:28 2007 From: john.hearns at streamline-computing.com (John Hearns) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: <45EC44B4.5060009@streamline-computing.com> Walid wrote: > > > > > Our developers had that issue of inconsistent file system view in > RHEL based systems, some of it is solved by disabling dir list caching, > another by using noac, what the other was doing was writing > simultaneously to the same file partitioned over several nodes, I told > this is probably not the right way to do file writing. apparently he > used to do it in Sun Solaris and it worked flawlessly. That leads me to a damn stupid question - how do NFSv4 and ROMIO interoperate then? Anyone got experience of that, or is it signed "There be Dragons" ? From Michael.Frese at NumerEx.com Mon Mar 5 10:38:30 2007 From: Michael.Frese at NumerEx.com (Michael H. Frese) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] Load Balance Shifts During Run of Fixed Balance Application [RESOLVED] Message-ID: <6.1.0.6.2.20070305110359.049df958@mail.swcp.com> Thanks to those who took the time to consider my original description of our problem. It has now been resolved and the simulation load balance is remaining fixed over thousands of time steps. The problem, not surprisingly, was in our application code, specifically in our use of MPI in one particular place. We had posted some receives on the originating processor -- which was also the output processor -- for messages that were never sent. 
We failed to detect the error because -- in another error -- we had failed to do a WaitAll on the receive message queue for those messages. The result was that the originating/output processor had an ever increasing receive queue to hunt through while pairing up receives and arriving messages, and so took increasingly longer with each successive timestep. We also sent some messages to processors that did not exist, though I think this was less of a problem. We found the problem by looking for one of a related kind. We built and ran a test code, and found accidentally that failing to post receives caused processors to have to hunt through an increasing queue of received but unprocessed messages. Thanks again. Mike From ctierney at hypermall.net Mon Mar 5 13:47:50 2007 From: ctierney at hypermall.net (Craig Tierney) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: <45EC9006.50403@hypermall.net> Stu Midgley wrote: > I'd strongly recommend Lustre. It will work perfectly well from a > single server node and give much higher bandwidths than NFS. If you > have two nic's you can also serve up the file system over both and see > around 150MB/s total bandwidth. > > Also, if you need more storage in the future, you can just add more > servers... and get linear scaling of bandwidth. How much do you use Lustre? Yes, you can get that bandwidth, but if your code doesn't do large-streaming I/O, your performance will be worse than NFS. Also, I would like to hear someone speak up that uses Lustre in a PRODUCTION environment that doesn't have a kernel hacker on staff. Also, Lustre metadata doesn't scale (yet). You can add another server, but that won't improve the metadata. Using Lustre also requires you to re-patch your kernel every security update, then get the bugs out again. Lustre is the right answer for some, but if you aren't going to have that many compute nodes.
It doesn't sound like it here. Craig > > Stu. > > > On 3/1/07, jaime.perea@gmail.com wrote: >> Hi, >> >> I have a small (16 dual xeon machines) cluster. We are going to add >> an additional machine which is only going to serve a big filesystem via >> a gigabit interface. >> >> Does anybody knows what is better for a cluster of this size, >> exporting the >> filesystem via NFS or use another alternative such as a cluster >> filesystem >> like GFS or OCFS? >> >> Thanks in advance >> >> -- >> >> Jaime D. Perea Duarte. >> Linux registered user #10472 >> >> Dep. Astrofisica Extragalactica. >> Instituto de Astrofisica de Andalucia (CSIC) >> Apdo. 3004, 18080 Granada, Spain. >> _______________________________________________ >> Beowulf mailing list, Beowulf@beowulf.org >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > From greg.lindahl at qlogic.com Mon Mar 5 14:50:35 2007 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45EC9006.50403@hypermall.net> References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: <20070305225035.GH7056@localhost.localdomain> On Mon, Mar 05, 2007 at 02:47:50PM -0700, Craig Tierney wrote: > Also, I would like to hear someone > speakup that uses Lustre in a PRODUCTION environment that > doesn't have a kernel hacker on staff. One of our Oil & Gas customers does this, no kernel hacker, but they are paying CFS for support. Which is almost the same thing. -- greg From sdm900 at gmail.com Mon Mar 5 15:17:32 2007 From: sdm900 at gmail.com (Stu Midgley) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45EC9006.50403@hypermall.net> References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: Actually I run it in production and I'm not a kernel hacker.
We currently have 6 OSS's with software raid5 to 6 internal SATA disks. We see about 190MB/s per OSS out of the disks and around 150MB/s via the dual network interfaces. I can't think of any benchmark you care to mention that a single lustre OSS/MDS won't outperform NFS. Especially, if you configure your systems to use both NIC's (most motherboards now come with dual interfaces) and I don't mean trunking the ports. Just configure portals to know that it can speak to the OSS's via both nics and it will handle the rest for you. Lustre's meta data performance is WAY better than NFS. I'd almost say its WAY better than any global FS I've played with. Certainly, you have to use Lustre kernels... all we do is run Centos as our clients/servers and then we just grab the pre-build/supported kernels from CFS. Its all pretty easy. The current Lustre V1.4 is very very nice and we have found it to be very robust. Nearly all the problems we experience turn out to be flakey hardware or kernel issues. Not Lustre at all. You can also checkout the FUSE implementation of a Lustre client I posted to CFS's website a few weeks back https://mail.clusterfs.com/wikis/lustre/fuse while it needs a LOT of work to give decent performance, it does work. Oh, and if someone ports liblustre to macosx, I could also run it on my mac :) Stu. > > How much do you use Lustre? Yes, you can get that bandwidth, > but if you code doesn't do large-streaming I/O, you performance > will be worse than NFS. Also, I would like to hear someone > speakup that uses Lustre in a PRODUCTION environment that > doesn't have a kernel hacker on staff. > > Also, Lustre metadata doesn't scale (yet). You can add > another server, but that won't improve the metadata. > > Using Lustre also requires you to re-patch your kernel every security > update, then get the bugs out again. > > Lustre is the right answer for some, but if you aren't going > to have that many compute nodes. It doesn't sound like > it here. 
> > Craig -- Dr Stuart Midgley sdm900@gmail.com From csamuel at vpac.org Mon Mar 5 19:43:30 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703011550.49618.jaime.perea@gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: <200703061443.30970.csamuel@vpac.org> On Fri, 2 Mar 2007, jaime.perea@gmail.com wrote: > I have a small (16 dual xeon machines) cluster. [...] > Does anybody knows what is better for a cluster of this size, exporting the > filesystem via NFS FWIW we run two NFS servers (dual 2.0GHz Opteron 240's) with users split across the two and they cope with 3 clusters, two with ~180 CPUs and one with ~30 CPUs, all run at an average of 83% utilisation over the last 12 months (one at around 92% utilisation for the last 3 months). So yes, NFS should be fine. Just don't try and run Gaussian on it. :-) cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070306/fedd3479/attachment.bin From csamuel at vpac.org Mon Mar 5 19:46:51 2007 From: csamuel at vpac.org (Chris Samuel) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45EB4F3D.3060709@streamline-computing.com> References: <200703011550.49618.jaime.perea@gmail.com> <200703050902.58468.csamuel@vpac.org> <45EB4F3D.3060709@streamline-computing.com> Message-ID: <200703061446.52247.csamuel@vpac.org> On Mon, 5 Mar 2007, John Hearns wrote: > Purely as a point of interest, since high energy physics labs use AFS > (and hence kerberos) they have already faced this one. 
Interesting, though it's not clear from that whether it can cope with, say, automatically renewing expiring tickets for running jobs where the job lifetime is longer than the maximum allowed lifetime of a Kerberos ticket. NB: I've never used Kerberos in anger, so be gentle. :-) cheers, Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070306/768ce825/attachment.bin From landman at scalableinformatics.com Mon Mar 5 20:48:48 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: <45ECF2B0.5060101@scalableinformatics.com> Mark Hahn wrote: >> Our developers had that issue of inconsistent file system view in >> RHEL >> based systems, some of it is solved by disabling dir list caching, >> another >> by using noac, > > well, developers should be smart enough to know what FS they're using, > and how it's intended to behave. turning off AC is a nice option, but > smarter is to leave it on and not try to cause race conditions. ... or to catch them and fix them ... > (I expect that such race-friendly behavior will fail on some other > non-NFS filesystems, though probably harder to trigger.) > >> what the other was doing was writing simultaneously to the >> same file partitioned over several nodes, I told this is probably not the >> right way to do file writing. apparently he used to do it in Sun >> Solaris and >> it worked flawlessly. 
> > I would spank any developer who said "but it works on platform X"! > developers must be aware of the spec, not merely what they can get away > with somewhere, sometime. of course, this is the thinking behind apps > having "supported" platforms - just a fancy way of saying > "no, we don't know what standards-conformance we need, or how we violate > the standard, but here's a few places we haven't yet noticed any > bad-enough bugs". *sigh* If only we could "spank" them. In ISV circles, there is a meme running about that Linux == RHEL*. So they code everything to that, and not to the LSB. Note: this is one thing that the Windows folks sorta kinda do right. There is a "spec" to some degree, and everyone can kinda sorta write to it. Then again, when you completely dominate something, you can dictate to users. We have a long standing IE problem with rendering forms in tables, everyone else can do it right, and the code checks out as w3c compliant ... oh, never mind, not worth trying to get IE to talk standards. Standards only work when all players follow them. They also need to be simple enough to follow. Making a standard impossible to follow helps no end user. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From stephen.hocking at gmail.com Mon Mar 5 14:46:28 2007 From: stephen.hocking at gmail.com (Stephen Hocking) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem Message-ID: <6300771b0703051446g777b5fach6b3bbfa338389fd1@mail.gmail.com> > > > Our developers had that issue of inconsistent file system view in RHEL > > based systems, some of it is solved by disabling dir list caching, another > > by using noac, > > well, developers should be smart enough to know what FS they're using, > and how it's intended to behave. 
turning off AC is a nice option, > but smarter is to leave it on and not try to cause race conditions. > (I expect that such race-friendly behavior will fail on some other > non-NFS filesystems, though probably harder to trigger.) I recall doing a port of a former employer's seismic processing code to Linux, which was used to having GPFS or PIOFS around. The only distributed(!) filesystem of any sort that I could afford was NFS, which wasn't too bad, except that various programs insisted on having multiple nodes (say about 200+) appending to the same file simultaneously. After much trial & error, I discovered noac and also that opening the file on the clients with O_SYNC would send each write off to the NFS server immediately. Not an elegant solution. All this was happening over 100base-T, and the NFS server, if we were lucky, had a GB connection to the switch. We discovered an interesting race condition in one of the ethernet drivers along the way. They tell me that they're using GPFS under Linux now. Stephen From ballen at gravity.phys.uwm.edu Tue Mar 6 00:57:30 2007 From: ballen at gravity.phys.uwm.edu (Bruce Allen) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: Hi Stu, > Actually I run it (Lustre) in production and I'm not a kernel hacker. Thank you for this snapshot of 'real world' Lustre use. At the risk of hijacking this thread (or borrowing it...) could I ask you a question about Lustre? I've always been interested in Lustre but never used it. Like everyone in this mailing list I am interested in a distributed filesystem whose bandwidth and speed are commensurate with the total raw hardware IO performance of the disks and the network speed and intersection bandwidth. But there are two additional features that I also think would be very desirable: (1) RAID-across-nodes. For example every ten nodes form a redundant RAID set. 
The disappearance of any one of these nodes causes no data loss, service loss, or corruption at the user level. The total redundant storage available from the ten nodes is 90% of the available raw storage. (2) Symmetry: all nodes have identical behavior and features. There are no specialized IO or metadata nodes, which act as filesystem bottlenecks and which are single points of failure. Am I correct that Lustre does not offer either of these features? Do you (or does someone else) know if there is an open-source or commercial distributed (posix) filesystem with these features? Cheers, Bruce From landman at scalableinformatics.com Tue Mar 6 04:42:56 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: <45ED61D0.9000205@scalableinformatics.com> Bruce Allen wrote: > (1) RAID-across-nodes. For example every ten nodes form a redundant > RAID set. The disappearance of any one of these nodes causes no data > loss, service loss, or corruption at the user level. The total > redundant storage available from the ten nodes is 90% of the available > raw storage. Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6 across boxes using the linux MD device. We have proposed this to some financial customers using our JackRabbit unit. > (2) Symmetry: all nodes have identical behavior and features. There are > no specialized IO or metadata nodes, which act as filesystem bottlenecks > and which are single points of failure. For this, we use a set/pair/triple of thin HA servers with stonith running. You can run them in active-passive, active-active (requires some sort of CFS then). The nice part about this is that the metadata resides within the FS, and you look at each machine as a big block o disks. > Am I correct that Lustre does not offer either of these features? 
Lustre is an object data storage system. Breaks apart metadata from data. > > Do you (or does someone else) know if there is an open-source or > commercial distributed (posix) filesystem with these features? If you use our idea above (iSCSI targets/initiators), you could run active-passive/STONITH mode using xfs/jfs . We have proposed this at a number of places when they need very fast cutover, and downtime of any sort means significant loss. > > Cheers, > Bruce Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From charliep at cs.earlham.edu Tue Mar 6 05:00:23 2007 From: charliep at cs.earlham.edu (Charlie Peck) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703061443.30970.csamuel@vpac.org> References: <200703011550.49618.jaime.perea@gmail.com> <200703061443.30970.csamuel@vpac.org> Message-ID: <8ECE0815-E096-431E-B94D-40437949C7D4@cs.earlham.edu> On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote: > So yes, NFS should be fine. Just don't try and run Gaussian on > it. :-) Ok, I'll bite. We're just starting to support Gaussian on a couple of small clusters (32 and 64 cores respectively) and we don't have a lot of experience with it. It looks like there are 3 primary directories, the software root, the tmp dir, and the molecular system/ output files. Which subset of these shouldn't be accessed via NFS? 
thanks, charlie Charlie Peck Computer Science, Earlham College http://cs.earlham.edu hhtp://cluster.earlham.edu From landman at scalableinformatics.com Tue Mar 6 05:26:40 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <8ECE0815-E096-431E-B94D-40437949C7D4@cs.earlham.edu> References: <200703011550.49618.jaime.perea@gmail.com> <200703061443.30970.csamuel@vpac.org> <8ECE0815-E096-431E-B94D-40437949C7D4@cs.earlham.edu> Message-ID: <45ED6C10.4020603@scalableinformatics.com> Charlie Peck wrote: > On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote: > >> So yes, NFS should be fine. Just don't try and run Gaussian on it. :-) > > Ok, I'll bite. We're just starting to support Gaussian on a couple of > small clusters (32 and 64 cores respectively) and we don't have a lot of > experience with it. It looks like there are 3 primary directories, the > software root, the tmp dir, and the molecular system/output files. > Which subset of these shouldn't be accessed via NFS? Charlie: Depending upon which links are run, Gaussian can do a fairly good job of consuming all your I/O bandwidth, and then some. Not all links are like this, the DFT links seem to be non-IO bound. As soon as you start spilling integrals to disk, you will see what we mean. 
Joe > > thanks, > charlie > > Charlie Peck > Computer Science, Earlham College > http://cs.earlham.edu > hhtp://cluster.earlham.edu > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From reuti at staff.uni-marburg.de Tue Mar 6 06:32:54 2007 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <8ECE0815-E096-431E-B94D-40437949C7D4@cs.earlham.edu> References: <200703011550.49618.jaime.perea@gmail.com> <200703061443.30970.csamuel@vpac.org> <8ECE0815-E096-431E-B94D-40437949C7D4@cs.earlham.edu> Message-ID: <9CCDAF56-531C-4529-80DF-C9FD342DC0E4@staff.uni-marburg.de> Hi, Am 06.03.2007 um 14:00 schrieb Charlie Peck: > On Mar 5, 2007, at 10:43 PM, Chris Samuel wrote: > >> So yes, NFS should be fine. Just don't try and run Gaussian on >> it. :-) > > Ok, I'll bite. We're just starting to support Gaussian on a couple > of small clusters (32 and 64 cores respectively) and we don't have > a lot of experience with it. It looks like there are 3 primary > directories, the software root, the tmp dir, and the molecular > system/output files. Which subset of these shouldn't be accessed > via NFS? for Gaussian all scratch files can be in the local $TMPDIR on the nodes - though the program itself is distributed via NFS for convenience. Even for parallel runs with Linda. We copy necessary files after the job back to the directory of the user, if s/he wishes to access them. Only the default output is written directly to the final location during execution time. 
Just don't set GAUSS_SCRDIR but make a cd to the batch system supplied $TMPDIR before the program call. Some hints you can find on the SGE list: http://gridengine.sunsource.net/servlets/ReadMsg? listName=users&msgNo=14600 -- Reuti > thanks, > charlie > > Charlie Peck > Computer Science, Earlham College > http://cs.earlham.edu > hhtp://cluster.earlham.edu > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From sdm900 at gmail.com Tue Mar 6 07:13:46 2007 From: sdm900 at gmail.com (Stu Midgley) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: Evening > Thank you for this snapshot of 'real world' Lustre use. At the risk of > hijacking this thread (or borrowing it...) could I ask you a question > about Lustre? I've always been interested in Lustre but never used it. I strongly suggest you grab a few old boxes and play with it, it really is very good. 1.6betas are a little unstable, but 1.4 is very very solid. A tad fiddly to setup, but not really that hard. 1.6 definitely is nice to setup and use. > > Like everyone in this mailing list I am interested in a distributed > filesystem whose bandwidth and speed are commensurate with the total raw > hardware IO performance of the disks and the network speed and > intersection bandwidth. But there are two additional features that I also > think would be very desirable: > > (1) RAID-across-nodes. For example every ten nodes form a redundant RAID > set. The disappearance of any one of these nodes causes no data loss, > service loss, or corruption at the user level. The total redundant > storage available from the ten nodes is 90% of the available raw storage. No, Lustre does not currently support this. 
There are lots of ways you could achieve this (as mentioned in other emails), but they will all reduce bandwidth :) It is definitely on the Lustre road map to deliver RAID across servers, but it isn't there yet. Having said that, there is nothing stopping you raiding the disks within a node. But, as I keep saying, your NFS servers don't do this either ;) > (2) Symmetry: all nodes have identical behavior and features. There are > no specialized IO or metadata nodes, which act as filesystem bottlenecks > and which are single points of failure. No, this is not really what Lustre is trying to achieve. But, it does allow you to have fail over in the servers and meta-data servers. So, if one crashes, another will take over. On the roadmap will be clustered meta-data servers... but again, it's not there yet. Um... did I mention your NFS servers? > Am I correct that Lustre does not offer either of these features? > > Do you (or does someone else) know if there is an open-source or > commercial distributed (posix) filesystem with these features? > > Cheers, > Bruce > I think there are some opensource projects (glusterfs?) that claim to do this, but I suspect their bandwidth is nothing approaching lustre... and, for all they claim, their meta-data performance probably won't match lustre either. With Lustre 1.6 I was seeing 170MB/s sustained from single clients to the lustre storage. That's pretty impressive given two NICs in the client... and I didn't even play with the tuning parameters or jumbo frames etc. That was straight out of the box. With 6 OSSs the aggregate bandwidth with 1.6 was ~1GB/s... and it happily ran with 16 instances of Bonnie++ hammering away on it for a week. 1.4 is slightly down on bandwidth... but stable :) I've since been told that with tuning you can get 1.4 to perform as well as 1.6. I really think people should try Lustre. A lot of people were put off in the early days because there was little documentation...
there were few tools to help configure/mount etc. BUT, with 1.4, it is a very nice product. If you can afford Elan, then you will be in for a very nice experience. Stu. -- Dr Stuart Midgley sdm900@gmail.com From buccaneer at rocketmail.com Tue Mar 6 07:17:59 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45ECF2B0.5060101@scalableinformatics.com> Message-ID: <105755.55110.qm@web30612.mail.mud.yahoo.com> [snip] > *sigh* If only we could "spank" them. In ISV circles, > there is a meme running about that Linux == RHEL*. So > they code everything to that, and not to the LSB. RHEL makes it simple for 3rd party vendors to port their product to Linux because of the much longer support windows. Vendors love it. And large companies like the one I work for love it. And it works very well-outside the cluster. We have the RHEL conversation on a regular basis with our bosses. RH will come in and talk to them about the wonders of RHEL, and they want us to use it in the cluster, specially for infrastructure. We then have to patiently explain that we have tested standard RHEL kernels and do not perform well under our workload-and we point them to our documentation. That usually halts their forward momentum and workflow does not suffer. ____________________________________________________________________________________ Any questions? Get answers on any topic at www.Answers.yahoo.com. Try it now. 
From robl at mcs.anl.gov Tue Mar 6 07:53:41 2007 From: robl at mcs.anl.gov (Robert Latham) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> Message-ID: <20070306155340.GC9998@mcs.anl.gov> On Mon, Mar 05, 2007 at 11:08:28AM -0500, Mark Hahn wrote: > writing to different sections of a file is probably wrong on any > networked FS, since there will inherently be obscure interactions > with the size and alignment of the writes vs client pagecache, I'm rather surprised to see that sentiment on a mailing list for high performance clusters :> I would contend that writing to different sections of a file *must* be supported by any file system deployed on a cluster. How else would you get good performance from MPI-IO? PVFS, GPFS, and Lustre all support simultaneous writes to different sections of a file. > in my experience, people who expect it to "just work" have an > incredibly naive model of how a network FS works (ie, write() > produces an RPC direct to the server) I agree that the POSIX API and consistency semantics make it difficult to achieve high I/O rates for common scientific workloads, and that NFS is probably not the best solution for those truly parallel workloads. Fortunately, there are good alternatives out there.
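To make "writing to different sections of a file" concrete, here is a minimal sketch of the access pattern only, not of MPI-IO or of any particular parallel file system. It assumes a POSIX system (for `os.pwrite`), uses threads on one machine as a stand-in for cluster nodes, and writes to a local temporary file; the names `BLOCK`, `write_region`, `parallel_write` and `regions_intact` are made up for illustration. Whether the same pattern is safe over a network file system depends on that file system's caching and consistency rules, which is the point under debate in this thread.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # bytes owned by each writer (illustrative size, an assumption)


def write_region(path, rank):
    """Write this writer's block at its own offset; no other writer touches it."""
    payload = bytes([rank]) * BLOCK  # rank must be < 256 in this toy example
    offset = rank * BLOCK
    fd = os.open(path, os.O_WRONLY)
    try:
        written = 0
        while written < len(payload):  # pwrite may write fewer bytes than asked
            written += os.pwrite(fd, payload[written:], offset + written)
    finally:
        os.close(fd)


def parallel_write(path, n_writers):
    """Create the file, then let all writers fill their regions concurrently."""
    with open(path, "wb") as f:
        f.truncate(n_writers * BLOCK)
    with ThreadPoolExecutor(max_workers=n_writers) as pool:
        list(pool.map(lambda rank: write_region(path, rank), range(n_writers)))


def regions_intact(path, n_writers):
    """Check that every region holds exactly the bytes its writer put there."""
    with open(path, "rb") as f:
        data = f.read()
    return all(
        data[r * BLOCK:(r + 1) * BLOCK] == bytes([r]) * BLOCK
        for r in range(n_writers)
    )


if __name__ == "__main__":
    shared = os.path.join(tempfile.mkdtemp(), "shared.dat")
    parallel_write(shared, 8)
    print(regions_intact(shared, 8))
```

Each writer computes its offset from its rank alone, so no coordination between writers is needed as long as the regions do not overlap.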
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From robl at mcs.anl.gov Tue Mar 6 07:58:33 2007 From: robl at mcs.anl.gov (Robert Latham) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45EC44B4.5060009@streamline-computing.com> References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> <45EC44B4.5060009@streamline-computing.com> Message-ID: <20070306155833.GD9998@mcs.anl.gov> On Mon, Mar 05, 2007 at 04:26:28PM +0000, John Hearns wrote: > That leads me to a damn stupid question - how do NFSv4 and ROMIO > interoperate then? Anyone got experience of that, > or is it signed "There be Dragons" ? It's not a stupid question at all. It's very important to understand the impact the choice of file system has on the higher levels of the I/O software stack. ROMIO does not have a special "NFSv4" ADIO driver. ROMIO treats it like regular NFSv3. In short, you can use it, but you'll have to disable all caching to make it behave correctly. You'll get rather bad performance for most workloads, but what good is fast I/O if you get garbled data in your file? I know I beat this drum a lot, but do consider a true parallel file system like PVFS for your MPI-IO applications. GPFS and Lustre would be good options too.
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From robl at mcs.anl.gov Tue Mar 6 08:06:54 2007 From: robl at mcs.anl.gov (Robert Latham) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <45EC9006.50403@hypermall.net> Message-ID: <20070306160653.GE9998@mcs.anl.gov> On Tue, Mar 06, 2007 at 08:17:32AM +0900, Stu Midgley wrote: > I can't think of any benchmark you care to mention that a single > lustre OSS/MDS won't outperform NFS. Consider an MPI-IO benchmark where all processes write to different regions of a file. This workload is common in scientific applications, say when all processes need to write an HDF5 element to a datafile. Run that benchmark with one processor and you will get great performance out of Lustre. Lustre does an excellent job of caching data amd making single-processor I/O go really really fast. Run that benchmark with two processors, and the clients will spend a great deal of time revoking each others extent-based locks and expiring entries from their caches. Performance will take a significant hit, but will increase as you add more processes. I don't mean to come across as a Lustre hater. I'm just trying to keep the discussion honest: the discussion of the "right" file system for an application is hard, and lots of factors come into play. 
==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From hahn at mcmaster.ca Tue Mar 6 08:09:18 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: <20070306155340.GC9998@mcs.anl.gov> References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> <20070306155340.GC9998@mcs.anl.gov> Message-ID: >> writing to different sections of a file is probably wrong on any >> networked FS, since there will inherently be obscure interactions >> with the size and alignment of the writes vs client pagecache, > > I'm rather surprised to see that sentiment on a mailing list for high > performance clusters :> smiley noted, but I would suggest that HPC is not about convenience first - simply having each node write to a separate file eliminates any such issue, and is hardly an egregious complication to the code. > I would contend that writing to different sections of a file *must* be > supported by any file system deployed on a cluster. How else would > you get good performance from MPI-IO? who uses MPI-IO? straight question - I don't believe any of our 1500 users do. > PVFS, GPFS, and Lustre all suppoort simultaneous writes to different > sections of a file. NFS certainly does as well. you just have to know the constraints. are you saying you can never get pathological or incorrect results from parallel operations on the same file on any of those FS's? >> in my experience, people who expect it to "just work" have an >> incredibly naive model of how a network FS works (ie, write() >> produces an RPC direct to the server) > > I agree that the POSIX API and consistency semantics make it difficult > to achieve high I/O rates for common scientific workloads, and that > NFS is probably not the best solution for those truly parallel workloads. 
> > Fortunately, there are good alternatives out there. starting with the question: "do you have a good reason to be writing in parallel to the same file?". I'm not saying the answer is never yes. I guess I tend to value portability by obscurity-avoidance. not if it makes life utter hell, of course, but... From buccaneer at rocketmail.com Tue Mar 6 08:48:00 2007 From: buccaneer at rocketmail.com (Buccaneer for Hire.) Date: Fri May 9 01:05:42 2008 Subject: [Beowulf] network filesystem In-Reply-To: Message-ID: <833756.7894.qm@web30607.mail.mud.yahoo.com> > smiley noted, but I would suggest that HPC is not > about convenience first - simply having each node > write to a separate file eliminates any such issue, > and is hardly an egregious complication to the code. In my environment, it is not always up to the system admins to make those decisions. Convenience for the clients is paramount since their ability to process most efficiently directly adds to the bottom line. The new way of processing (as I mentioned a while back) will make the workflows more streamlined and efficient. It reminds me of an experience I had. We have so many nodes that I wrote a spider to gather all the node info and then created a DB so I can query the information I needed. There were some who thought a flat file would be best because... Suppose a piece of the ISS broke off, survived reentry and landed right on my DB server, what then??? From laytonjb at charter.net Tue Mar 6 09:00:24 2007 From: laytonjb at charter.net (Jeffrey B.
Layton) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> <20070306155340.GC9998@mcs.anl.gov> Message-ID: <45ED9E28.60002@charter.net> Mark Hahn wrote: >>> writing to different sections of a file is probably wrong on any >>> networked FS, since there will inherently be obscure interactions >>> with the size and alignment of the writes vs client pagecache, >> >> I'm rather surprised to see that sentiment on a mailing list for high >> performance clusters :> > > smiley noted, but I would suggest that HPC is not about convenience > first - simply having each node write to a separate file eliminates > any such issue, > and is hardly an egregious complication to the code. Actually this can greatly complicate code. If I run a CFD run on n number of processes and they each write the solution to a separate file, then if I run 1.5*n processes, how do I read the n files? I can write some code to take the n files, and then write out a single file or 1.5*n files for instance. To me this is a wasteful use of cycles when something like MPI-IO is so much better and I can stick with a single file. While I don't want to speak for the entire CFD community, but I haven't seen anyone write out n files. That concept was proven to be a huge pain many years ago. Other disciplines may have other opinions of course. >> I would contend that writing to different sections of a file *must* be >> supported by any file system deployed on a cluster. How else would >> you get good performance from MPI-IO? > > who uses MPI-IO? straight question - I don't believe any of our 1500 > users do. I do. I also know that some ISV's are moving rapidly to use MPI-IO. 
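The restart problem described above (a solution written by n processes, then read by 1.5*n) is easier with a single file because each reader can compute its own slice from nothing more than the total element count, its rank, and the current process count. A minimal sketch of that arithmetic follows; `block_range` is a hypothetical helper, and a simple contiguous block decomposition is assumed rather than whatever layout a real CFD code uses.

```python
def block_range(n_items, n_procs, rank):
    """Return (start, count) of the contiguous slice owned by `rank`.

    The first `n_items % n_procs` ranks get one extra item, so the slices
    cover [0, n_items) with no gaps and no overlap.
    """
    base, extra = divmod(n_items, n_procs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, count


if __name__ == "__main__":
    n_items = 1000  # elements in the shared solution file (illustrative)
    for n_procs in (4, 6):  # written by 4 processes, read back by 6
        print(n_procs, [block_range(n_items, n_procs, r) for r in range(n_procs)])
```

Multiplying `start` and `count` by the element size gives the byte offset and length each process would pass to its read or write call; nothing about the earlier run's process count needs to be recorded.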
>>> in my experience, people who expect it to "just work" have an >>> incredibly naive model of how a network FS works (ie, write() >>> produces an RPC direct to the server) >> >> I agree that the POSIX API and consistency semantics make it difficult >> to achieve high I/O rates for common scientific workloads, and that >> NFS is probably not the best solution for those truly parallel >> workloads. >> >> Fortunately, there are good alternatives out there. > > starting with the question: "do you have a good reason to be writing > in parallel to the same file?". I'm not saying the answer is never yes. As Rob mentioned writing in parallel to the same file gets you good performance. I think this is a fundamental underpinning of parallel IO. You can do this with or without MPI-IO. MPI-IO just makes it easier, standard, and portable. Of course you would not have different processes writing to the same region of a file. But if you can have each process write to a distinct region or section of the file without worrying about having another process stepping on that one, then why not write in parallel? It's easy to do using MPI-IO. Take a look at the tutorials on MPI-IO around the web and give them a try. Jeff From robl at mcs.anl.gov Tue Mar 6 10:44:10 2007 From: robl at mcs.anl.gov (Robert Latham) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> <20070306155340.GC9998@mcs.anl.gov> Message-ID: <20070306184410.GG9998@mcs.anl.gov> On Tue, Mar 06, 2007 at 11:09:18AM -0500, Mark Hahn wrote: > >I would contend that writing to different sections of a file *must* be > >supported by any file system deployed on a cluster. How else would > >you get good performance from MPI-IO? > > who uses MPI-IO? straight question - I don't believe any of our 1500 users > do. Excellent question. Direct users? Probably not very many. 
We do find that straight-up MPI-IO isn't a good fit for a lot of scientific applications. The convenience factor you mentioned is indeed important. MPI-IO thinks of data as "stream of bytes", while applications think in terms of "multidimensional typed data" (a slice of upper atmosphere). Libraries like Parallel-HDF5 and Parallel-NetCDF bridge the gap and provide a convenient, familiar API. The app is still using MPI-IO, just not directly. > NFS certainly does as well. you just have to know the constraints. > are you saying you can never get pathological or incorrect results from > parallel operations on the same file on any of those FS's? You observe correctly that file systems offer a set of rules on what to expect from I/O patterns. These consistency semantics are not set in stone: MPI-IO consistency semantics are more relaxed than POSIX, yet generally sufficient for parallel scientific applications. We would consider it a serious bug in PVFS if simultaneous non-overlapping writes corrupted data. If the only file system I had access to was NFS, I'd do one file per process as well. > starting with the question: "do you have a good reason to be writing in > parallel to the same file?". I'm not saying the answer is never yes. > > I guess I tend to value portability by obscurity-avoidance. not if it makes > life utter hell, of course, but... One file per processor falls down on systems like BGL (where even a small run is 1024 processes, and 128k is not unheard of). One file per process also robs the higher layers of the I/O software stack of an opportunity to optimize access patterns. All processes reading a column out of a row-major array is noncontiguous (and generally slow) in file-per-processor, but can be contiguous in single-file after applying data shipping or two-phase collective buffering optimizations. Jeff touched on the data management issues of file-per-processor.
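A small sketch of the column-access arithmetic may help here. It is not an implementation of two-phase I/O; it only shows why an optimization of that kind has something to work with. It assumes a dense row-major array stored in one file with fixed-size elements, and `column_offsets` and `contiguous_runs` are illustrative names, not library calls.

```python
def column_offsets(n_rows, n_cols, col, itemsize):
    """Byte offsets of one column's elements in a row-major array on disk."""
    return [(row * n_cols + col) * itemsize for row in range(n_rows)]


def contiguous_runs(offsets, itemsize):
    """Count how many separate contiguous extents the offsets fall into."""
    ordered = sorted(offsets)
    runs = 1 if ordered else 0
    for prev, cur in zip(ordered, ordered[1:]):
        if cur != prev + itemsize:
            runs += 1
    return runs


if __name__ == "__main__":
    rows, cols, itemsize = 4, 4, 8  # toy 4x4 array of 8-byte values
    one_column = column_offsets(rows, cols, 1, itemsize)
    every_column = [o for c in range(cols)
                    for o in column_offsets(rows, cols, c, itemsize)]
    print(one_column, contiguous_runs(one_column, itemsize))
    print(contiguous_runs(every_column, itemsize))
```

One process asking for one column touches one small extent per row, but the union of all processes' requests covers the file as a single extent. A collective layer that can see every request at once can therefore read large contiguous pieces and redistribute the data between processes afterwards.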
If file-per-processor really is the most portable and convenient way to work on data, well, I can't argue with that. On NFS, that's probably the only way to get correct results. The single-file approach, however, has significant benefits on the modern parallel file systems available today. As I hope you could tell, this kind of discussion is a lot of fun for me. Thanks! ==rob -- Rob Latham Mathematics and Computer Science Division A215 0178 EA2D B059 8CDF Argonne National Lab, IL USA B29D F333 664A 4280 315B From wt at atmos.colostate.edu Tue Mar 6 11:12:06 2007 From: wt at atmos.colostate.edu (Warren Turkal) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <20070306155833.GD9998@mcs.anl.gov> References: <200703011550.49618.jaime.perea@gmail.com> <45EC44B4.5060009@streamline-computing.com> <20070306155833.GD9998@mcs.anl.gov> Message-ID: <200703061212.07250.wt@atmos.colostate.edu> On Tuesday 06 March 2007 08:58, Robert Latham wrote: > I know I beat this drum a lot, but do consider a true parallel file > system like PVFS for your MPI-IO applications. GPFS and Lustre would > be good options too. What about OCFS2? Do you know anything about it? wt -- Warren Turkal, Research Associate III/Systems Administrator Colorado State University, Dept. of Atmospheric Science From wt at atmos.colostate.edu Tue Mar 6 12:23:02 2007 From: wt at atmos.colostate.edu (Warren Turkal) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45ED61D0.9000205@scalableinformatics.com> References: <200703011550.49618.jaime.perea@gmail.com> <45ED61D0.9000205@scalableinformatics.com> Message-ID: <200703061323.02368.wt@atmos.colostate.edu> On Tuesday 06 March 2007 05:42, Joe Landman wrote: > Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6 > across boxes using the linux MD device. We have proposed this to some > financial customers using our JackRabbit unit. Do you have the md device mounted on many systems?
I didn't think the md device was cluster aware. wt -- Warren Turkal, Research Associate III/Systems Administrator Colorado State University, Dept. of Atmospheric Science From landman at scalableinformatics.com Tue Mar 6 12:43:00 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703061323.02368.wt@atmos.colostate.edu> References: <200703011550.49618.jaime.perea@gmail.com> <45ED61D0.9000205@scalableinformatics.com> <200703061323.02368.wt@atmos.colostate.edu> Message-ID: <45EDD254.4090003@scalableinformatics.com> Warren Turkal wrote: > On Tuesday 06 March 2007 05:42, Joe Landman wrote: >> Using iSCSI targets and iSCSI initiators, you could build RAID5 or RAID6 >> across boxes using the linux MD device. We have proposed this to some >> financial customers using our JackRabbit unit. > > Do you have the md device mounted on many systems? I didn't think the md > device was cluster aware. md mounted on one system. Stonith and a HA server pair (triple,...) on the front end doing the md. With GFS/OCFS2/... you can have it active-active. > > wt -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From greg.lindahl at qlogic.com Tue Mar 6 03:34:00 2007 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] IB switches: managed or not? In-Reply-To: References: Message-ID: <20070306113400.GA4734@localhost.localdomain> On Tue, Mar 06, 2007 at 12:06:07AM +1100, Andrew Robbie (GMail) wrote: > But, a managed name brand switch seems to cost a lot > more than a non-managed one using the Mellanox reference design kit > (rebadged, but I suspect made by Flextronics...). 
Andrew, I know of at least 2 "name brand" unmanaged IB switches, one from QLogic (24 ports) and one from Microway (36 ports). I think Cisco resells the QLogic switch, and perhaps Voltaire has something similar. For larger switches the management board is a tiny fraction of the price. -- greg From greg.lindahl at qlogic.com Tue Mar 6 02:33:07 2007 From: greg.lindahl at qlogic.com (Greg Lindahl) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <45ECF2B0.5060101@scalableinformatics.com> References: <200703011550.49618.jaime.perea@gmail.com> <200703041739.30987.csamuel@vpac.org> <200703050902.58468.csamuel@vpac.org> <45ECF2B0.5060101@scalableinformatics.com> Message-ID: <20070306103307.GA4602@localhost.localdomain> >well, developers should be smart enough to know what FS they're using, >and how it's intended to behave. turning off AC is a nice option, but >smarter is to leave it on and not try to cause race conditions. Just yesterday I sat quietly at a customer site while an engineer wasted 30 minutes not understanding that the bizarre behavior he was seeing was due to attribute caching on NFS. The guy was a CFD expert, not a Unix expert. Even experts can have troubles with AC, sometimes. Just Say No to AC. -- greg From walid.shaari at gmail.com Wed Mar 7 08:13:45 2007 From: walid.shaari at gmail.com (Walid) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] IB switches: managed or not? In-Reply-To: References: Message-ID: On 3/5/07, Andrew Robbie (GMail) wrote: > > > My other query is about diagnostic software. With an ethernet switch it is > pretty easy to fire up Ethereal (sorry Wireshark, but it is such a silly > name) or Etherape and get a look at what is going on. If I buy a Cisco or > Voltaire etc do they come with tools that let me get accurate > representations of what is going on? Or are their tools really for large IB > networks? 
doesn't the fabric management software allows you to do some diagnostics and have an overview of the fabric, silverstorm now bought by Qlogic have also some scripts that helps in configuration of the fabric, and cluster regards Walid -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070307/8853323c/attachment.html From mb at gup.jku.at Tue Mar 6 01:48:15 2007 From: mb at gup.jku.at (Markus Baumgartner) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] IB switches: managed or not? In-Reply-To: References: Message-ID: <45ED38DF.9090907@gup.jku.at> Andrew Robbie (GMail) wrote: > Various vendors (integrators, not switch OEMs) have stated to me that > managed switches are the go, and that OpenSM is (a) buggy, and (b) > very time consuming to set up. But, a managed name brand switch seems > to cost a lot more than a non-managed one using the Mellanox reference > design kit (rebadged, but I suspect made by Flextronics...). > We have a small (3 nodes) cluster with an IB interconnect here. The switch is unmanaged. I cannot confirm either (a) nor (b). OpenSM runs without problems and the set-up is not too complicated. On the other hand, we also have a much more expensive shared-memory system from a well-known vendor that also features an IB interconnect. We never had to use any of the extra features of the managed switch we have there. And in contrast to the open-source and unsupported drivers that we use in the small cluster, the commercial driver stack is buggy and causes our machines to crash every now and then (usually under high load). My advice is to take the unmanaged switch. > My other query is about diagnostic software. With an ethernet switch > it is pretty easy to fire up Ethereal (sorry Wireshark, but it is such > a silly name) or Etherape and get a look at what is going on. 
If I buy > a Cisco or Voltaire etc do they come with tools that let me get > accurate representations of what is going on? Or are their tools > really for large IB networks? > If you run "IPoIB" you can use Ethernet monitoring tools to get diagnostics of the emulated ethernet devices. Our managed switch did not come with extra diagnostics software. The switch was shipped with the whole system, though (OEM). I do not know what software you would get if you buy a retail IB switch. Regards, Markus From frank.gruellich at mapsolute.com Tue Mar 6 02:17:26 2007 From: frank.gruellich at mapsolute.com (Frank Gruellich) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] IB switches: managed or not? In-Reply-To: References: Message-ID: <45ED3FB6.2050000@mapsolute.com> Hi, Andrew Robbie (GMail) wrote: > I am building a small (~16) node cluster with an IB interconnect. I need to > decide whether I will buy a cheaper, dumb switch and run OpenSM, or get a > more expensive switch with a built in subnet manager. The largest this > system would every grow is 32 nodes (two 24 port switches). > > Various vendors (integrators, not switch OEMs) have stated to me that > managed switches are the go, and that OpenSM is (a) buggy, and (b) very > time consuming to set up. It's not _that_ buggy and set up is pretty straightforward. But it lacks several features you'd really like in big systems. For 24 nodes or fewer you can go with a simple switch and OpenSM. For 32 nodes you can use 16 nodes per switch and 8 cables for switch interconnect. So you should have 1/2 bisection bandwidth in theory. But OpenSM configures IB forwarding rather statically at startup, never adjusts it to actual usage of links, and copes rather poorly with "hotplug" changes in topology. So it is possible that some links are overused but others not.
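As a rough check of the 1/2 bisection figure above, the arithmetic for two identical switches can be sketched as follows. This assumes every link runs at the same speed and every port not used by a node becomes an inter-switch cable; `two_switch_bisection` is a hypothetical helper, and the result says nothing about how OpenSM actually spreads routes over those cables.

```python
def two_switch_bisection(ports_per_switch, nodes_per_switch):
    """Return (inter_switch_links, bisection_ratio) for two identical switches.

    The ratio is inter-switch links divided by nodes per switch: 1.0 would be
    full bisection bandwidth between the two halves of the cluster.
    """
    if not 0 < nodes_per_switch <= ports_per_switch:
        raise ValueError("nodes_per_switch must be between 1 and ports_per_switch")
    links = ports_per_switch - nodes_per_switch
    return links, links / nodes_per_switch


if __name__ == "__main__":
    # Two 24-port switches with 16 nodes each, as in the 32-node layout above.
    print(two_switch_bisection(24, 16))
```

With 24-port switches and 16 nodes on each, 8 ports are left for inter-switch cables, giving 8/16 = 0.5 of full bisection bandwidth.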
Nevertheless you can still find 24 nodes in your 32 node cluster communicating nonblocking (if the remaining 8 stay silent), but I don't know a simple way to get this information from OpenSM or the switch. You can write a simple MPI program to benchmark it. In addition the versions of OpenSM I know crash silently sometimes (which does not affect anything), so you should monitor it in some way (you can restart it whenever you want). Finally I have to admit that these are all real life experiences without any deep inside knowledge of OpenSM or even Infiniband. So, as a conclusion I would suggest going with a simple 24 port switch and OpenSM for now. If you upgrade to more than 24 nodes you should add a more advanced switch. From my experience you can easily mix Mellanox switches with those formerly known as TopSpin, I don't know about other vendors. As one more hint you should reconsider if you need that many nodes for a job. If you limit your need of nodes for one job to 24 you can easily go with two dumb 24 port switches up to 48 nodes and both subclusters can communicate nonblocking. But of course this way no node of one subcluster can communicate with a node of the other one and you need a resource management system able to assign the nodes of one subcluster to one job. Kind regards, -- Mapsolute GmbH Frank Gruellich Map24 Systems and Networks Duesseldorfer Strasse 40a 65760 Eschborn Germany Phone: +49 6196 77756-414 Fax: +49 6196 77756-100 http://www.mapsolute.com From camilo.hernandez at gmail.com Tue Mar 6 09:20:58 2007 From: camilo.hernandez at gmail.com (Juan Camilo Hernandez) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] Benchmark between Dell Poweredge 1950 And 1435 Message-ID: <4d2c60b30703060920j565b59dcnb7d916d77271d86f@mail.gmail.com> Hello.. I would like to know what server has the best performance for HPC systems between The Dell Poweredge 1950 (Xeon) And 1435SC (Opteron). Please send me suggestions...
Here are the complete specifications for both servers: Poweredge 1435SC Dual Core AMD Opteron 2216 2.4GHz 3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs Poweredge 1950 Dual Core Intel Xeon 5130 2.0Ghz 2GB 533MHz (4x512MB), Single Ranked DIMMs -- Juan Camilo Hernandez Ingenieria Sanitaria Universidad de Antioquia GIGAX - http://www.gigax.org -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070306/8a6916ce/attachment.html From mitch48 at yahoo.com Tue Mar 6 10:29:55 2007 From: mitch48 at yahoo.com (Tom Mitchell) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] IB switches: managed or not? In-Reply-To: References: Message-ID: <20070306182955.GC9529@xtl1.xtl.tenegg.com> On Tue, Mar 06, 2007 at 12:06:07AM +1100, Andrew Robbie (GMail) wrote: > Date: Tue, 6 Mar 2007 00:06:07 +1100 > From: "Andrew Robbie (GMail)" > To: beowulf@beowulf.org > Subject: [Beowulf] IB switches: managed or not? > > > Hi, > I am building a small (~16) node cluster with an IB interconnect. I need to > decide whether I will buy a cheaper, dumb switch and run OpenSM, or get a > more expensive switch with a built in subnet manager. The largest this > system would every grow is 32 nodes (two 24 port switches). A year ago the hands down answer was "built in subnet manager". Today, the OpenSM folk have made big improvements. We may be at or past the tipping point with OpenSM code quality. For LARGE clusters OpenSM or a vendor provided host based SM may be a requirement because the cards for many "built in subnet managers" simply run out of memory someplace beyond a gross (144) nodes and thousands. One big OpenSM bug/challenge is fail over. Make sure that exactly one copy of OpenSM is running. Once things are fine and dandy explore having a second copy but not three+. As far as I know there is nothing like Ethereal/Wireshark that applies to IB. 
There is no raw packet interface that I know of and if there was the bandwidth/memory issue would be a challenge for all modern processors. The managed switches do give you access to good statistics from the ports in the switch. My opinion is that you should save yourself some gray hair and get a managed switch as your first switch. The second IB switch can be managed by the first switch. Try and get all the IB parts from the same vendor. > Various vendors (integrators, not switch OEMs) have stated to me that > managed switches are the go, and that OpenSM is (a) buggy, and (b) very time > consuming to set up. But, a managed name brand switch seems to cost a lot > more than a non-managed one using the Mellanox reference design kit > (rebadged, but I suspect made by Flextronics...). > My other query is about diagnostic software. With an ethernet switch it is > pretty easy to fire up Ethereal (sorry Wireshark, but it is such a silly > name) or Etherape and get a look at what is going on. If I buy a Cisco or > Voltaire etc do they come with tools that let me get accurate > representations of what is going on? Or are their tools really for large IB > networks? > Regards, > Andrew > [v2] > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- T o m M i t c h e l l Found me a new place to hang my hat :-) Now it got bought. From angelv at iac.es Wed Mar 7 01:22:26 2007 From: angelv at iac.es (Angel de Vicente) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <200703011550.49618.jaime.perea@gmail.com> References: <200703011550.49618.jaime.perea@gmail.com> Message-ID: <82r6s1h0h9.fsf@kohji.angelv.es> Hi, > I have a small (16 dual xeon machines) cluster. We are going to add > an additional machine which is only going to serve a big filesystem via > a gigabit interface. 
> > Does anybody knows what is better for a cluster of this size, exporting the > filesystem via NFS or use another alternative such as a cluster filesystem > like GFS or OCFS? If you have a well-defined set of applications to run in this cluster, you can probably perform some benchmarks and make a more informed decision. If, like us, you don't know in advance what sort of applications you will be running in the cluster, you could have this extra node as an NFS server for /home and /scratch and install a parallel file system as well. This is what we have, with PVFS2, which for some users is just convenient (suddenly they think we bought a 4TB disk), and for some others is down to speed. They decide whether to use NFS or PVFS. We were hammering for weeks at PVFS in a test cluster without a glitch, but in any case I tell our users to not store in PVFS anything that they could not easily recreate. It is quite stable these days, but we don't have RAIDs or HA stuff for the PVFS servers, so if one is down the whole thing suffers. Cheers, Angel de Vicente From jaime.perea at gmail.com Wed Mar 7 05:08:39 2007 From: jaime.perea at gmail.com (Jaime Perea) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] network filesystem In-Reply-To: <82r6s1h0h9.fsf@kohji.angelv.es> References: <200703011550.49618.jaime.perea@gmail.com> <82r6s1h0h9.fsf@kohji.angelv.es> Message-ID: <200703071408.39751.jaime.perea@gmail.com> Hi, Well, it seems that this is a hot topic. I'm impressed by the quality of the answers! I think that since the disk server machine is not yet installed it will be useful to do a few tests in advance. My idea was to put this filesystem as /home, so there is going to be a lot of traffic of small files. I tend to think that for doing parallel intensive work it is better to use a distributed fs such as pvfs2. I asked the list because I imagined that one can take advantage of things like local cache for the gfs and ocfs2 cases.
We have been smoothly working with gfs but with a two node and a FC san configuration. We also tested ocfs2 exporting a device via AoE (vblade) in a 3 node configuration, and I like the results (and much easier configuration btw) But when going to 16 nodes things can change a lot!, perhaps in that case nfs4 is the best solution (apart from being much easier to configure and maintain) . Thanks a lot! -- Jaime D. Perea Duarte. Linux registered user #10472 Dep. Astrofisica Extragalactica. Instituto de Astrofisica de Andalucia (CSIC) Apdo. 3004, 18080 Granada, Spain. From oplehto at csc.fi Wed Mar 7 08:12:33 2007 From: oplehto at csc.fi (Olli-Pekka Lehto) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] HPL on an ad-hoc cluster Message-ID: <45EEE471.4050404@csc.fi> I'm currently evaluating the possibility of building a ad-hoc cluster (aka. flash mob) at a large computer hobbyist event using Linux live CDs. The "cluster" would potentially feature well over a thousand personal computers connected by a good GigE -network. While thinking up ideas for potential demos, running HPL naturally came up. However the traditional MPI implementation will not cut it as the "cluster" in question is very volatile. It's fairly certain that a number of nodes will drop out from the cluster during the time it would take to run a reasonably-sized HPL benchmark on the system. I have thought up some possible workarounds for this: -Making a purpose-built implementation of HPL with elaborate software checkpointing and migration mechanism. Probably too demanding. -Using FT-MPI to make the HPL more resilient to node failures. I don't have hands-on experience with FT-MPI so I'm not sure how much effort this would take. -Running a short subset (single iteration of the main loop?) of HPL repeatedly until we get lucky and a run completes. Not that elegant but obviously the simplest choice. How well would the single iteration be representative of running the complete benchmark on the system? 
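As a rough illustration of the "until we get lucky" option above, the chance that a single run survives can be estimated with a very simple model. Everything here is an assumption made for the sake of the sketch, not a measurement: nodes are taken to drop out independently, each with the same probability of staying up for the length of one run, and the example figure of 99.9% is invented.

```python
def completion_probability(nodes, per_node_survival):
    """Probability that all nodes stay up for one whole run,
    assuming independent and identical per-node survival odds."""
    return per_node_survival ** nodes


def expected_attempts(nodes, per_node_survival):
    """Mean number of runs until one completes. Repeated independent
    tries follow a geometric distribution, so the mean is 1/p."""
    return 1.0 / completion_probability(nodes, per_node_survival)


# Hypothetical numbers: 1000 nodes, each 99.9% likely to last one run.
p = completion_probability(1000, 0.999)
print(f"chance one run completes: {p:.3f}")
print(f"expected attempts: {expected_attempts(1000, 0.999):.2f}")
```

Under this model the odds fall off quickly with the node count, which is one argument for keeping each attempt short.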
So, do you think this is a pipe dream or a feasible project? Which path would you take to implement this? Olli-Pekka -- Olli-Pekka Lehto, Systems Specialist, Systems Services, CSC PO Box 405 02101 Espoo, Finland; tel +358 9 457 2215, fax +358 9 4572302 CSC is the Finnish IT Center for Science, www.csc.fi, e-mail: Olli-Pekka.Lehto@csc.fi From jlb17 at duke.edu Thu Mar 8 04:53:09 2007 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] Benchmark between Dell Poweredge 1950 And 1435 In-Reply-To: <4d2c60b30703060920j565b59dcnb7d916d77271d86f@mail.gmail.com> References: <4d2c60b30703060920j565b59dcnb7d916d77271d86f@mail.gmail.com> Message-ID: On Tue, 6 Mar 2007 at 12:20pm, Juan Camilo Hernandez wrote > I would like to know what server has the best performance for HPC systems > between The Dell Poweredge 1950 (Xeon) And 1435SC (Opteron). Please send me > suggestions... > > Here are the complete specifications for both servers: > > Poweredge 1435SC > Dual Core AMD Opteron 2216 2.4GHz > 3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs > > Poweredge 1950 > Dual Core Intel Xeon 5130 2.0Ghz > 2GB 533MHz (4x512MB), Single Ranked DIMMs Here are some benchmarks I did on similar (though non-Dell) systems: http://www.duke.edu/~jlb17/optxeon.pdf -- Joshua Baker-LePain Department of Biomedical Engineering Duke University From i.kozin at dl.ac.uk Thu Mar 8 05:09:13 2007 From: i.kozin at dl.ac.uk (Kozin, I (Igor)) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] number of NFS daemons Message-ID: Hello! I was looking at our NFS server performance recently and was puzzled by the number of daemons it was running - 33. It might be the default for Suse 10.1 but I am not sure. It's usually recommended to set the number to a multiple of 8 with 32 being perhaps the most popular. I've read that 16 or 32 nfsd daemons produce the most efficient blocking for I/O operations. How many NFS daemons are people using on a dedicated NFS server?
Best, Igor I. Kozin (i.kozin at dl.ac.uk) CCLRC Daresbury Laboratory, WA4 4AD, UK skype: in_kozin tel: +44 (0) 1925 603308 http://www.cse.clrc.ac.uk/disco From charliep at cs.earlham.edu Thu Mar 8 05:54:13 2007 From: charliep at cs.earlham.edu (Charlie Peck) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] HPL on an ad-hoc cluster In-Reply-To: <45EEE471.4050404@csc.fi> References: <45EEE471.4050404@csc.fi> Message-ID: <343DBA1F-8F8A-450B-8A99-C524A2A089D6@cs.earlham.edu> On Mar 7, 2007, at 11:12 AM, Olli-Pekka Lehto wrote: > ... > So, do you think that is this a pipe dream or a feasible project? > Which path would you take to implement this? Consider something embarrassingly parallel with a work-pool model. Your assignment servers could be on stable machines, clients come and go as need be, you measure the rate at which work is being done and the total work done. If you are looking for a live CD Linux distro with cluster computing tools built-in consider the Bootable Cluster CD, http:// bccd.cs.uni.edu (full disclosure, I help a little bit on that project). charlie From rgb at phy.duke.edu Thu Mar 8 06:25:16 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] Benchmark between Dell Poweredge 1950 And 1435 In-Reply-To: <4d2c60b30703060920j565b59dcnb7d916d77271d86f@mail.gmail.com> References: <4d2c60b30703060920j565b59dcnb7d916d77271d86f@mail.gmail.com> Message-ID: On Tue, 6 Mar 2007, Juan Camilo Hernandez wrote: > Hello.. > > I would like to know what server has the best performance for HPC systems > between The Dell Poweredge 1950 (Xeon) And 1435SC (Opteron). Please send me > suggestions... > > Here are the complete specifications for both servers: > > Poweredge 1435SC > Dual Core AMD Opteron 2216 2.4GHz > 3GB RAM 667MHz, 2x512MB and 2x1GB Single Ranked DIMMs > > Poweredge 1950 > Dual Core Intel Xeon 5130 2.0Ghz > 2GB 533MHz (4x512MB), Single Ranked DIMMs > Almost certainly the opteron. 
For a variety of reasons, but higher clock certainly helps -- it would probably have been faster at equivalent clock anyway. Now that I've "answered", let me tell you why you shouldn't believe me and what you should actually do to answer your own question. There is a standard litany we like to chant on this list: "Your Mileage May Vary" "A benchmark in hand is worth any number of anecdotal reports" "The best benchmark is your own application" "What do you plan to do with it?" "It depends..." "In particular, it depends on your application (mix), its memory and disk and network requirements, the topology and type of your network, the communication and memory access pattern used by your application (mix), the compiler and library used, and a few dozen other variables, major and minor, which is why nobody is going to tell you one is always better than the other even if >>they<< think it is true..." "And then there is the cost -- the REAL question is which one has the better cost-benefit, not which one is the cheapest or fastest independently. Ask yourself the question -- with a fixed budget to spend, which architecture lets me get the most done in the least time." So if you like, I wouldn't be doing you a favor by telling you >>definitely<< the opteron only partly because it might not be true. If you believed me (because I sound so glib and because you don't know that AMD once sent me a cool tee-shirt and Intel hasn't, although I do have a pair of these cool little contamination-suited Intel dude keychains that come close) then you might be tempted to skip the CORRECT cluster engineering step(s) of: a) Study your application (mix) -- figure out in at least general terms its (their) communication patterns, its (their) memory requirements (size and access pattern), its (their) CPU requirements. Some applications are "I/O bound" -- run at a speed determined by the access speed of disk, for example.
Some applications are "memory bound" -- they spend all of their time fetching data from memory, relatively little on actually doing something to it. Some applications (especially parallel cluster applications) are "network bound" and run at a rate that is determined by the latency or bandwidth of a network connection, further complicated in the case of real parallel code by the communication PATTERNS which can cause bottlenecks outside of the system altogether. Some applications (the happiest ones, I tend to think:-) run at a rate that is limited by CPU clock, clock, clock and nothing but clock, although different CPU architectures (e.g. Xeon and Opteron, 32 or 64 bit) have a different BASE performance at any given clock. b) If at all possible, and it nearly always is possible, beg, borrow, steal, buy, or rent a system or two in your competing architectures and run YOUR CODE compiled with YOUR PLANNED COMPILER on those systems and just measure its performance. This is actually a whole lot easier than the stuff in step a) and a lot more likely to be accurate, but I still don't advise skipping a). If you are planning on buying more than a handful of systems, it is actually often worth your while to >>buy<< one of each of two or three or even four candidate system, test them, and then buy the other 127 or however many nodes you plan to put in your cluster of the winner, instead of buying 128 of the wrong kind. You can recycle the losers as servers, really powerful desktops, whatever. A really good vendor will often loan you systems (or network access to systems) to do this testing. A really good compiler vendor (e.g. pathscale) will even/often lend you a compiler for a trial period to do the testing. Or there may be list humans who own a system who will set up a trial account for you -- it's a pretty friendly list;-) c) Don't lock yourself in to only Dells (or any single distributer) while looking over systems. 
I personally do not dislike Dell, although I know people that do. Their hardware is not the most reliable that I've ever used -- far from it, actually -- but their service plans tend to be very good, their cost is reasonable, and they aren't linux-averse although I think that they're still working on becoming actively linux-friendly. However, there are a number of other tier 1 and tier 2 vendors you should be considering with hardware that is as good or better (in my opinion MUCH better) and with equally attractive prices and service deals. IBM, for example, is also linux-friendly and tends to make excellent if gold-plated hardware. Penguin Computing is my own personal favorite, largely because with the exception of one DOA system out of a good size stack of Altus's we've gotten (no doubt the one that "fell off the truck" and likely not Penguin's fault) I have yet to see an Altus fail in harness. Seriously. Pretty extraordinary, really, given that they run at full load pretty much 24x7 for as long as years at this point. I've heard that their service is really good -- maybe one day I'll have a chance to find out...:-). Penguin will almost certainly let you prototype on their systems d) When you've done all your research above, then DO THE COST BENEFIT ANALYSIS. If your application is network bound, don't worry so much about system clock and speed, worry about getting a really high speed cluster network to match (which is expensive, so you may want to get CHEAPER SLOWER nodes if the app isn't CPU bound anyway). If your application is memory bound, you may want to skip the dual cores and get two single cores or quad single cores -- otherwise you might just be using two cores at a time while the other cores are waiting in line to get at memory, wasting all the money you spent on the dual cores in the first place. 
If your application is disk bound then look more closely at disk and less at CPU -- what kind of bus, what kind of disk subsystem, what are the bottlenecks (per system) and the costs of minimizing them. As you can hopefully now see, the RIGHT question to have asked isn't which of two particular systems out of twenty on the market is "best" in some amorphous way, it is which of the twenty systems in the two thousand possible ways of configuring them with network and disk and memory and CPU and compiler will get the most work done for your investment of a fixed amount of money. Answer that, and then make your purchase with confidence. I'm sure that other list-humans have experiences or suggestions to share here. If you are very unsure of your abilities to carry out the list of chores above, there are at least 2 or 3 professional cluster consultants on the list who would probably help you for a moderate fee -- ask them to contact you offline if you are interested as they generally won't spam the list beyond maybe letting you know that they exist while helping to answer your original question. They can do anything from helping you with the prototyping and analysis to providing you with a cost-competitive turnkey cluster, depending on your needs and cluster management skills. I myself provide the kind of dear-abby advice above on-list and charge only beer (should we ever meet). Mind you, at this point if I ever actually received the beer due me according to this rule, I would die in a gutter somewhere inside six months with my liver in complete failure, so it is probably just as well that I generally don't go to cluster meetings and so forth...;-) rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C.
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From bill at Princeton.EDU Thu Mar 8 06:29:59 2007 From: bill at Princeton.EDU (Bill Wichser) Date: Fri May 9 01:05:43 2008 Subject: [Beowulf] number of NFS daemons In-Reply-To: References: Message-ID: <45F01DE7.20705@princeton.edu> Igor, Once upon a time there was a hardcoded limit on the number of threads an nfsd could support. Twenty is the number I seem to recall but I haven't researched this for awhile. I've used this as a baseline for how many daemons to start by: (num of mounts * num of nodes) / 20 Bill Kozin, I (Igor) wrote: > Hello! > I was looking at our NFS server performance recently > and was puzzled by the number of the daemons it was > running - 33. It might be the default for Suse 10.1 > but I am not sure. It's usually recommended to set the > number to a multiple of 8 with 32 being perhaps the > most popular. I've read that 16 or 32 nfsd daemons > produces the most efficient blocking for I/O operations. > > How many NFS daemons people are using on a dedicated > NFS server? > > Best, > Igor > > I. Kozin (i.kozin at dl.ac.uk)