From tjrc at sanger.ac.uk Tue May 1 01:16:56 2007
From: tjrc at sanger.ac.uk (Tim Cutts)
Date: Sat May 10 01:06:02 2008
Subject: [Beowulf] Sorry sorry sorry
Message-ID: <53A48808-E518-4C0B-B003-821790E4599E@sanger.ac.uk>

Ouch. I am now the colour of a beetroot/cranberry/red-fruit-of-your-choice.

No idea how that happened. I suspect incorrect automatic mail client
address autocompletion by Apple Mail, coupled with idiotic user (me)
sending mail too late at night.

Sorry about that folks. At least it wasn't anything incriminating...

Tim

From j.boyle at manchester.ac.uk Tue May 1 07:32:26 2007
From: j.boyle at manchester.ac.uk (Jonathan Boyle)
Date: Sat May 10 01:06:03 2008
Subject: Fwd: Re: [Beowulf] Why is communication so expensive for very small messages?
Message-ID: <200704261216.37229.j.boyle@manchester.ac.uk>

Thanks, we're using 1.2.6, so we'll have to look into upgrading.

---------- Forwarded Message ----------

Subject: Re: [Beowulf] Why is communication so expensive for very small messages?
Date: Tue, 24 Apr 2007 18:03:51 -0600
From: "Michael H. Frese"
To: Jonathan Boyle
Cc: beowulf@beowul.org

Sorry, the most recent version of mpich1 is 1.2.7. The older version
that was doing the message aggregation was 1.2.1.

>You don't say which version of mpich you are using, but we found small
>messages taking 1 ms last fall. Upgrading from an old version of mpich1
>(ca. 2001) to the most recent version (ca. 2005, 1.2.27?) fixed the
>problem. The problem was probably one of the OS holding the teensy little
>messages hoping for more data to send it with -- message aggregation, I
>suppose it is called. The newer version of mpich must have set the OS
>flags properly to prevent that.
>
>I can't tell you about mpich2, as we have no experience with that yet.

Mike Frese

At 09:54 AM 4/24/2007, you wrote:
>I apologise if this is a naive question, but I'm new to this world of
>beowulfs.
>
>I'm using C++/mpi, to get a feel for communication costs I ran tests using
>mpptest and my own programs.
>
>For 2 processor blocking calls, mpptest indicates a latency of about 30
>microseconds.
>
>However when I measure communication times in my own program using a loop as
>follows....
>
>MPI_Barrier(MPI_COMM_WORLD);
>start = MPI_Wtime();
>for (unsigned t=1; t<=5000; t++)
>{
>  if (my_rank==0)
>  {
>    MPI_Send(data, size, MPI_INT, 1, tag, MPI_COMM_WORLD);
>  }
>  else
>  {
>    MPI_Recv(data, size, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
>  }
>}
>end = MPI_Wtime();
>
>for size>=4, I get a latency of about 30 microseconds as expected, however
>for size<4, communication costs increase massively, and latency now appears
>to be 1ms!
>
>Firstly, I assume this isn't normal?
>
>Secondly, can anyone suggest what's going on, or where I can go for more
>information.
>
>Many thanks.
>
>We're using mpich.
>
>Processors are Intel(R) Xeon(TM) CPU 3.60GHz.
>
>Interconnects are Dell PowerConnect 5324 24-port gigabit switches.
>
>
>_______________________________________________
>Beowulf mailing list, Beowulf@beowulf.org
>To change your subscription (digest mode or unsubscribe) visit
>http://www.beowulf.org/mailman/listinfo/beowulf

-------------------------------------------------------

From mathog at caltech.edu Tue May 1 12:27:46 2007
From: mathog at caltech.edu (David Mathog)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Strange hardware? problems
Message-ID:

"Robert G. Brown" wrote

> I've been coding, one way or another, for coming up on 35 years or
> thereabouts, starting with paper tape, going through cards (lots of
> cards), and up the evolutionary ladder. In all of that time, I've
> encountered one -- count it, one -- time that a consistent error in code
> I was running was due to a real failure in the hardware I was running on
> and not a bug in my own code.

RGB has an extra 5 years on me, but my experience has been similar:
only very, very, very rarely is a program fault the result of a true
hardware issue. (This excludes anything that runs from one box to
another over a cable or fiber, where hardware issues are more common.)
We once tracked a bug in an FFT subroutine running on an array processor
to faulty memory, and right down to a memory pattern suggesting two
address pins were shorted together. On opening the beast up, sure
enough, the short was right where it had to be, and it was repaired
with a scalpel. This was around 1982.

Anyway, one caveat. With the proliferation of x86 variants I now
on occasion hit a binary which has been compiled for some other
processor variant that blows up when it tries to use an instruction
which is not supported on the processor it is actually running on. As I
mentioned previously, valgrind can catch these for you. Or recompile
using switches you know are supported on the target processor.

Regards,

David Mathog
mathog@caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

From orion at cora.nwra.com Tue May 1 14:41:28 2007
From: orion at cora.nwra.com (Orion Poplawski)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Re:Strange hardware? problems
In-Reply-To:
References:
Message-ID: <4637B408.8070706@cora.nwra.com>

David Mathog wrote:
> Since the same code runs differently on two different Opteron models
> it's probably either a memory access issue or the use of a compiler
> flag that enables some feature on one model that is not present
> on the other. For instance, SSE3 vs. SSE2, although I don't know
> enough about these models to tell you what the most likely flag would
> be. (The fact that it runs ok on the newer one and blows up on the
> older one is consistent with this type of error.)
>
> Assuming gcc, recompile with:
>
>   -O0 -g -std=c99 -Wall
>
> and clean up any warnings that result until you get a clean build.
> Repeat with -O3 and -O2, as for strange reasons that sometimes uncovers
> logic problems not seen at -O0. Then run the resulting binary
> within valgrind. Fix any memory access violations which are found.
> Valgrind can also alert you to the use of unsupported operations.
>

The code compiles quite cleanly, but I am seeing different behavior with
different compiler flags and different compilers. We'll see if I can
bisect the problem into a small enough box. Thanks for the poke in that
direction...

-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion@cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com

From rgb at phy.duke.edu Tue May 1 16:28:08 2007
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Strange hardware? problems
In-Reply-To:
References:
Message-ID:

On Tue, 1 May 2007, David Mathog wrote:

> Anyway, one caveat. With the proliferation of x86 variants I now
> on occasion hit a binary which has been compiled for some other
> processor variant that blows up when it tries to use an instruction
> which is not supported on the processor it is actually running on. As I
> mentioned previously, valgrind can catch these for you. Or recompile
> using switches you know are supported on the target processor.

Completely agreed. In fact in my case the problem was a mix of compiler
and the fact that I was using an Intel CPU which had an obscure
multiplication bug that was eventually worked around in the compiler.

The point is that EVEN if the problem turns out to be hardware or
compiler or something nominally "beyond your control", the solution is
always the same.
Instrument the hell out of your code, run it to failure,
accumulate data on the failure thereby, reinstrument to catch the
failure more tightly, iterate until you know that this run failed
between these two instructions and that the values of all possible loop
indices and variables at that time were the following vector of numbers
and that they all are/are not what they should be and if not they began
to diverge here, and here are the values of everything around THAT
point.

Then you may have to literally single step through the logic to see
where either you asked the machine to multiply six (and the variable
indeed contained six) times seven (and the variable indeed contained
seven) and the stupid computer returned forty-ONE. Or where it returned
42 but three lines further on your index went out of bounds on the
dynamic array pointer and you overwrote 42 with 0xA321FD07 (random
garbage).

The worst bug I can recall ever having to squash in my own code was back
in my fortran days, where I had strong type checking and everything. On
a single line in a program of several thousand lines I had typed an N
instead of an M in a program that used both (in fact N was absolute
value of M). Where I did this both made sense, but N was not in fact
initialized and hence had a fixed value of zero. Zero was a possible
value -- these were angular momentum indices -- and the function values
returned were not only plausible they were correct for certain values of
the input parameters to the overall program. They were just wrong,
usually by a fixed factor, for others (and fortunately I had a pretty
strong idea of what right was). EVEN instrumenting the code to where I
literally single stepped through the fortran -- and this was code on
cards, so there was nothing like an interactive debugger, mind -- I
swear I stared at the code for close to a week before the N/M typo
finally jumped out of all those lines and smacked me right between the
eyes.
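
As a small illustration of the kind of instrumentation described above,
the following C++ sketch logs every store into a dynamic array and
refuses writes past the end, so an index that runs out of bounds is
reported where it happens instead of silently overwriting a neighbouring
value. The helper name checked_store is hypothetical; it is an
assumption made for this example and is not taken from any code
discussed in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical helper (an assumption for illustration): store `value`
// at index `i`, logging the attempt first so a later failure can be
// traced back to the exact index and value involved.
bool checked_store(std::vector<int>& v, std::size_t i, int value)
{
    std::cerr << "store: index=" << i << " size=" << v.size()
              << " value=" << value << '\n';
    try {
        // at() throws std::out_of_range rather than writing past the end.
        v.at(i) = value;
    } catch (const std::out_of_range&) {
        std::cerr << "store: index " << i << " is out of bounds\n";
        return false;
    }
    assert(v[i] == value);  // the write landed where we expected
    return true;
}
```

Calling checked_store(v, 3, 7) on a three-element vector returns false
and leaves the vector untouched, while the log line records the
offending index.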

The point being don't expect the process to be easy. At least you have
the advantage of having a high probability of failure. The worst that
can happen to you is that as you instrument the code with output lines
the problem will disappear. This, too, has happened to me on more than
one occasion -- changing the alignment of the code even in small ways
sometimes causes a failure to be missed if the problem is with pointers
and memory or a failure like the one David describes.

I actually debugged a "heisenbug" like this at a different level in
scripted code on Friday. A long and complex program was being started
up on a dedicated server as the last step in the boot process. The
program itself wrote extensively to output during startup and was
initiated via a fairly standard init.d script that backgrounded the
startup call. The system was booting fine, you could see the
application startup occurring "successfully", but post-boot it wasn't
running! Consistently. If you started it by hand, it came up
perfectly. When I started to instrument the startup script to "watch"
it start up, it suddenly started to come up during the boot.

It turned out to be a race condition between the program and the
completion of the rc boot process by init. The program took five
seconds to start and wrote to stdout the whole way in the background.
When the rc script finished in the foreground, it went away taking the
backgrounded script's tty with it, so it crashed without a trace.
Running it by hand from a tty obviously worked fine as long as you
watched. Running it by script in the boot worked fine as long as you
watched.

To get it to work WITHOUT watching, one had to either add a 6 second
sleep to the startup script to wait for it to finish writing to stdout
(and hence get a /var/log/messages trace) or redirect stdout and stderr
to /dev/null (and lose it). Or rewrite the code itself that was being
backgrounded to log directly and not write to stdout except in debug
mode...
but it wasn't my code. Or maybe even move the code up to where at
least five seconds worth of other startup remained before rc boot
completed.

Bugs can be SUBTLE. Bugs can be heisenbugs like this that are other
people's "fault" (but your problem). Be patient, be systematic, be
meditative and await Enlightenment.

rgb

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From pxrist at gmail.com Tue May 1 03:34:57 2007
From: pxrist at gmail.com (Panagiotis Christopoulos)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Problem while booting diskless node.
In-Reply-To: <82C63820-6368-481D-BE08-74B573EC3E30@ulb.ac.be>
References: <82C63820-6368-481D-BE08-74B573EC3E30@ulb.ac.be>
Message-ID: <200705011334.57546.pxrist@gmail.com>

On Monday 30 April 2007 13:04, Maxime Kinet wrote:
> Hi,
>
> I'm trying to set up my first cluster with diskless nodes. To achieve
> that, I'm using PXElinux on a server, running Fedora Core 6, and a
> NFS-mounted root partition on the node.
> Everything works perfectly
> (getting the IP address, loading the kernel and mounting the
> filesystem) until the node has to run some binaries located into /sbin
> during the boot process. Apparently it's unable to execute them
> because they have been compiled with dynamically linked libraries and
> not statically. The /sbin directory of the node is a simple copy of
> the one of the server.

I suppose that you have done something like creating a /diskless
directory on your NFS/TFTP/DHCP server and copying files (e.g. /usr/*)
from the server into that /diskless dir with the same hierarchy, so that
the resulting structure holds the NFS-exported root fs of your nodes.
I'm not an expert, but since you mention dynamic libraries, I cannot
understand why this is a problem: did you copy /sbin into /diskless but
not /lib or /usr/lib?
The problem with dynamic libraries arises because your /diskless does
not have these libraries.

> I tried to avoid the problem using the busybox
> tools, and it worked a bit better but then it couldn't execute bash
> scripts such as rc.sysinit.

Have you created an init in your busybox to chroot (exec switch_root)
into your NFS root fs after mounting it?

> Has anybody ever encountered such problems and what should I do to
> solve it? recompile the kernel of the node or of the server? change
> the distribution? Are there any other simpler methods to proceed than
> using PXE?

There are two things you can do. As Douglas Eadline said, it starts with
deciding whether you want to reinvent the wheel or not. If you have time
and machines, if you know that your teachers won't get annoyed and you
can work in a university lab, so you will not pay for the power supply
yourself :p continue with Fedora and all these brainstorming things. You
will learn Linux administration and probably you will do amazing things.
If you don't have time etc., the guys at Warewulf have been doing the
same job for about 7 years, and they provide all the knowledge they
gained in a simple installation process.

Back to the technical stuff: from what you say, I think that something
is wrong with your /diskless dir (if you have one, of course). I cannot
understand why you want to use busybox. We use busybox when we want an
initramfs to do specific jobs (such as unlocking and mounting encrypted
partitions; yes, I know this is not the best example I could give)
before chrooting into our real root fs and exec'ing init, as Mark Hahn
said, or if we are running embedded. Also, I don't think that you have
to change your distribution, and if you don't like PXE, you can see how
the guys at LTSP boot (I think they use both PXE and "etherboot" and you
can make a choice), but for me, syslinux is fine!

This was my point of view. I hope I helped, and if you want to ask
something, feel free to send me a mail, or of course, ask again on the
list,

Panagiotis Christopoulos
System Administrator
Technological Institute of Athens
Department of Informatics

From supercomputer at gmail.com Tue May 1 05:34:52 2007
From: supercomputer at gmail.com (Chris Vaughan)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Syslog Server-Traffic
Message-ID: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com>

Hello,

I'm researching setting up a cluster and I'm curious as to whether or
not it's a good idea to set up a syslog server. The question I have
is whether the traffic created from logging is going to slow down my
network to the point of poor performance? What are people's
experiences with logging? The cluster will have a management network
and a computational network.

This is what I'm thinking:

Greater than 64 Nodes, Yes.
Greater than 64 and less than 128, Maybe?
Greater than 128 No

Any input would be great,

Thanks!

-- 
------------------------------
Christopher Vaughan

From lbickley at bickleywest.com Tue May 1 08:00:29 2007
From: lbickley at bickleywest.com (Lyle Bickley)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Sorry sorry sorry
In-Reply-To: <53A48808-E518-4C0B-B003-821790E4599E@sanger.ac.uk>
References: <53A48808-E518-4C0B-B003-821790E4599E@sanger.ac.uk>
Message-ID: <200705010800.29733.lbickley@bickleywest.com>

On Tuesday 01 May 2007 01:16, Tim Cutts wrote:
> Ouch. I am now the colour of a beetroot/cranberry/red-fruit-of-your-choice.
>
> No idea how that happened. I suspect incorrect automatic mail client
> address autocompletion by Apple Mail, coupled with idiotic user (me)
> sending mail too late at night.
>
> Sorry about that folks. At least it wasn't anything incriminating...

Gosh, I thought you included the Beowulf list because you were suggesting
that any of us visiting the UK (and Cambridge in particular) were welcome
in your home ;-)

Lyle

-- 
Lyle Bickley
Bickley Consulting West Inc.
Mountain View, CA
http://bickleywest.com
"Black holes are where God is dividing by zero"

From dkondo at lri.fr Wed May 2 06:33:43 2007
From: dkondo at lri.fr (Derrick Kondo)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] [CFP] EuroPVM/MPI'07 -- submission deadline extended to May 14th
Message-ID: <60ec14620705020633r2ed22633k773717e55c1df862@mail.gmail.com>

The full paper submission deadline for EuroPVM/MPI 2007 has been
extended to May 14th at 11:59 AM (noon) UTC

***However, please submit paper abstracts by May 7th.***

************************************************************************
***                                                                  ***
***                         CALL FOR PAPERS                          ***
***                                                                  ***
************************************************************************

EuroPVM/MPI 2007
14th European PVM/MPI Users' Group Meeting
Paris, France, September 30 - October 3, 2007

web: http://www.pvmmpi07.org
e-mail: chairs@pvmmpi07.org

submission deadline for paper abstracts: May 7th, 2007
submission deadline for full papers and poster abstracts:
    extended to May 14th, 2007 at 11:59 AM (noon) UTC
submission site: http://pvmmpi07.lri.fr/submissions

organized by
Project Grand-Large (http://grand-large.lri.fr/index.php/Accueil)
from INRIA Futurs (http://www-futurs.inria.fr)

-------------------------------------------------------------------------------------------

BACKGROUND AND TOPICS

PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) have
evolved into the standard interfaces for high-performance parallel
programming in the message-passing paradigm. EuroPVM/MPI is the most
prominent meeting dedicated to the latest developments of PVM and MPI
such as new support tools, implementation and applications using these
interfaces.
The EuroPVM/MPI meeting naturally encourages discussions of new
message-passing and other parallel and distributed programming
paradigms beyond MPI and PVM.

The 14th European PVM/MPI Users' Group Meeting will be a forum for users
and developers of PVM, MPI, and other message-passing programming
environments. Through the presentation of contributed papers, vendor
presentations, poster presentations and invited talks, attendees will
have the opportunity to share ideas and experiences to contribute to the
improvement and furthering of message-passing and related parallel
programming paradigms.

Topics of interest for the meeting include, but are not limited to:

* PVM and MPI implementation issues and improvements
* Latest extensions to PVM and MPI
* PVM and MPI for high-performance computing, clusters and grid
  environments
* New message-passing and hybrid parallel programming paradigms
* Interaction between message-passing software and hardware
* Fault tolerance in message-passing programs
* Performance evaluation of PVM and MPI applications
* Tools and environments for PVM and MPI
* Algorithms using the message-passing paradigm
* Applications in science and engineering based on message-passing

This year special emphasis will be put on large-scale issues, such as
those related to hardware and interconnect technologies, or the
potential or demonstrated shortcomings of PVM or MPI.

As in the preceding years, the special session 'ParSim' will focus on
numerical simulation for parallel engineering environments. EuroPVM/MPI
2007 will also hold the new 'Outstanding Papers' session introduced in
2006, where the best papers selected by the program committee will be
presented.

SUBMISSION INFORMATION

Submission site: http://pvmmpi07.lri.fr/submissions

Contributors are invited to submit a full paper as a PDF (or Postscript)
document not exceeding 8 pages in English (2 pages for poster abstracts
and Late and Breaking Results).
The title page should contain an abstract of at most 100 words and five
specific keywords. The paper needs to be formatted according to the
Springer LNCS guidelines [2]. The usage of LaTeX for preparation of the
contribution as well as the submission in camera ready format is
strongly recommended. Style files can be found at the URL [2].

New work that is not yet mature for a full paper, short observations,
and similar brief announcements are invited for the poster session.
Contributions to the poster session should be submitted in the form of a
two-page abstract. All these contributions will be fully peer reviewed
by the program committee.

Submissions to the special session 'Current Trends in Numerical
Simulation for Parallel Engineering Environments' (ParSim 2007) are
handled and reviewed by the respective session chairs. For more
information please refer to the ParSim website [1].

All accepted submissions are expected to be presented at the conference
by one of the authors, which requires registration for the conference.

IMPORTANT DATES

Submission of paper abstracts                   May 7th, 2007
Submission of full papers and poster abstracts  May 14th, 2007 at 11:59 AM (noon) UTC
Notification of authors                         June 19th, 2007
Camera-ready papers                             July 9th, 2007
Submission of Late and Breaking Results         September 15th, 2007
Tutorials                                       September 30th, 2007
Conference                                      October 1st-3rd, 2007

For up-to-date information, visit the conference web site at
http://www.pvmmpi07.org.

PROCEEDINGS

In addition, selected papers of the conference, including those from the
'Outstanding Papers' session, will be considered for publication in a
special issue of Parallel Computing in an extended format.

GENERAL CHAIR

* Jack Dongarra (University of Tennessee)

PROGRAM CHAIRS

* Franck Cappello (INRIA Futurs)
* Thomas Herault (Universite Paris Sud-XI / INRIA Futurs)

CONFERENCE VENUE

The conference will be held in the historical, cultural and economic
center of Paris, the capital of France.
The city, which is renowned for its neo-classical architecture, hosts
many museums and galleries and has an active nightlife. The symbol of
Paris is the 324 metre (1,063 ft) Eiffel Tower on the banks of the
Seine. Dubbed "the City of Light" (la Ville Lumiere) since the 19th
century, Paris is regarded by many as one of the most beautiful and
romantic cities in the world. It is also the most visited city in the
world with more than 30 million foreign visitors per year. Paris is
easily reachable from any European capital and most of the large
European, American and Asian cities. It is an ideal starting point for
visiting European institutes and cities.

REFERENCES

[1] ParSim 2007: http://wwwbode.in.tum.de/Par/arch/events/parsim07/
[2] Springer Guidelines: http://www.springer.de/comp/lncs/authors.html

From landman at scalableinformatics.com Wed May 2 07:16:14 2007
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] fast file copying
In-Reply-To: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org>
References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org>
Message-ID: <46389D2E.7040305@scalableinformatics.com>

Geoff Galitz wrote:
>
> Hi folks,
>
> During an HPC talk some years ago, I recall someone mentioned a tool
> which can copy large datasets across a cluster using a ring topology.
> Perhaps someone here knows of this tool?

There are a few, commercial and open source. On the commercial side
are exludus, xcp, and maybe one or two others. Exludus is basically a
file pre-caching mechanism. Java based. xcp (by Scalable) is MPI
based. It does a pretty good job of moving data.

On the open source side, I haven't seen things other than the udp
broadcast based tools (we had written one several years ago, named mcp),
but anyone using a cluster will tell you that udp broadcast can be very
detrimental to non-udp broadcast usage of the switch (say for logins,
NFS, command and control, ...)

>
> More to the point, we are pushing around datasets that are about
> 1Gbyte. The datasets are pushed out to dozens of nodes all at once and
> we foresee saturating the I/O system on our cluster as we grow. We are
> limited to using just the available disks and are looking for a
> reasonable solution that can support this kind of simultaneous access.

xcp might help.

> Currently we push the data out using rsync, but if I don't get any
> better ideas I may simply move to a pull system where the data is
> fetched by HTTP. I can get better throttling that way, at least.

For a few dozen nodes, this might work.

Joe

>
> -geoff
>
>
> Geoff Galitz
> geoff@galitz.org

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

From scheinin at crs4.it Wed May 2 07:30:09 2007
From: scheinin at crs4.it (Alan Louis Scheinine)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] fast file copying
In-Reply-To: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org>
References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org>
Message-ID: <4638A071.1080008@crs4.it>

One possibility is nettee.
http://saf.bio.caltech.edu/nettee.html

The current version of nettee is 0.1.7, from July 20, 2005.

nettee is a network "tee" program. It can typically transfer data
between N nodes at (nearly) the full bandwidth provided by the switch
which connects them. It is handy for cloning nodes or moving large
database files.
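
To put rough numbers on why a chained ("tee"-style) copy is attractive
for the case Geoff describes, the C++ sketch below compares a source
that sends the whole file to each node in turn with a pipeline in which
every node forwards chunks to the next node while still receiving. This
is a simplified model only, not nettee's code: the function names, the
32-node count, the 1 MB chunk size and the ideal 125 MB/s gigabit rate
are all assumptions for illustration, and protocol overhead and disk
speed are ignored. The 1 Gbyte dataset size is the figure from the
thread.

```cpp
#include <cassert>

// Serial push: the source's single link carries the full file once per
// destination node, so the time grows linearly with the node count.
double serial_push_seconds(double file_bytes, int nodes,
                           double link_bytes_per_sec)
{
    assert(nodes > 0 && link_bytes_per_sec > 0.0);
    return nodes * file_bytes / link_bytes_per_sec;
}

// Pipelined chain: each node forwards a chunk as soon as it arrives.
// The last node finishes one file transfer time plus (nodes - 1)
// chunk times (the pipeline fill delay) after the start.
double chained_copy_seconds(double file_bytes, int nodes,
                            double link_bytes_per_sec, double chunk_bytes)
{
    assert(nodes > 0 && link_bytes_per_sec > 0.0 && chunk_bytes > 0.0);
    return (file_bytes + (nodes - 1) * chunk_bytes) / link_bytes_per_sec;
}
```

With the assumed figures (1e9 bytes, 32 nodes, 125e6 bytes/s, 1e6-byte
chunks) the serial push comes to 256 seconds and the chain to about 8.2
seconds, which is the sense in which such a tool can run at nearly the
full bandwidth of the switch.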

From eugen at leitl.org Wed May 2 07:50:39 2007
From: eugen at leitl.org (Eugen Leitl)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] EuroPVM/MPI'07 -- submission deadline extended to May 14th
Message-ID: <20070502145039.GK17691@leitl.org>

----- Forwarded message from Rolf Rabenseifner -----

From: Rolf Rabenseifner
Date: Wed, 2 May 2007 16:50:28 +0200 (CEST)
To: eugen@leitl.org
Subject: EuroPVM/MPI'07 -- submission deadline extended to May 14th

Dear HLRS User or member of my course-invitation-list,

as member of the program committee, I'm sending you the CFP for the
EuroPVM/MPI 2007. The deadline for full paper and poster abstract
submissions is May 14th, 2007 (with deadline for abstracts already on
May 7th).

Best regards
Rolf Rabenseifner

+++ We apologize if you receive this CfP more than once +++

----- End forwarded message -----
-- 
Eugen* Leitl leitl http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From hahn at mcmaster.ca Wed May 2 21:54:54 2007
From: hahn at mcmaster.ca (Mark Hahn)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Syslog Server-Traffic
In-Reply-To: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com>
References: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com>
Message-ID:

> I'm researching setting up a cluster and I'm curious as to whether or
> not it's a good idea to set up a syslog server. The question I have

depends on how much syslog activity you have, and whether you care to
look at it (in one spot).
there _are_ scalable and more robust system/event logging approaches, but plain old syslog is pretty good. > is whether the traffic created from logging is going to slow down my > network to the point of poor performance? a syslog message is normally a smallish UDP packet - say 100 bytes. if you have 200 nodes each doing 5 per second, that's still only 100KB/s - a pretty small fraction of a server's gigabit bandwidth. and if you actually have 5/s, something's probably wrong... > experiences with logging. The cluster will have a management network > and a computational network. I'm always skeptical about this advice - it's obvious that it might be good in cases where a node sustains a nontrivial stream of management traffic (say, NFS traffic using jumbo frames) which would interfere with possible latency-sensitive/small MPI packets. but how often does that happen? consider that with a non-jumbo gigabit net, a full packet is only 15 us more than the ~40 or so for a minimal one. further, I observe MPI codes mostly getting packed into full nodes, and not interfering with themselves much (distinct MPI and file IO phases to the program.). I could more readily imagine segregating traffic to two nets based on packet size or TOS. or bonding them in the first place. in any case, I think you'd have to work pretty hard to generate enough syslog traffic to matter much. > Greater than 64 Nodes, Yes. > Greater than 64 and less than 128, Maybe? > Greater than 128 No for a smallish cluster like 64 nodes, I don't think I'd worry about syslog even if the net were just 100bT. for going above 200 nodes, I'd probably try to do some measurements and extrapolation, but the numbers above make it look minor. 
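The back-of-the-envelope estimate in the reply above can be written out as a short Python sketch. It only reproduces the arithmetic from the message (200 nodes, 5 messages per second, about 100 bytes each, gigabit link). The function name `syslog_load` is made up for illustration, UDP/IP/Ethernet header overhead is ignored, and the 5 messages per second used for the 64-node 100bT case is an assumption, since the message gives no rate for that case.

```python
def syslog_load(nodes, msgs_per_sec, bytes_per_msg=100,
                link_bits_per_sec=1_000_000_000):
    """Estimate aggregate syslog payload arriving at the log server.

    Returns (payload bytes per second, fraction of the link used).
    Packet header overhead is deliberately ignored (assumption).
    """
    payload_bytes_per_sec = nodes * msgs_per_sec * bytes_per_msg
    link_fraction = payload_bytes_per_sec * 8 / link_bits_per_sec
    return payload_bytes_per_sec, link_fraction


# Case from the message: 200 nodes, 5 msgs/s each, gigabit server link.
gig_rate, gig_fraction = syslog_load(200, 5)

# Assumed case: 64 nodes at the same (pessimistic) rate on 100 Mbit/s.
small_rate, small_fraction = syslog_load(64, 5,
                                         link_bits_per_sec=100_000_000)

print(f"200 nodes on gigabit: {gig_rate / 1000:.0f} KB/s, "
      f"{gig_fraction:.2%} of the link")
print(f"64 nodes on 100bT:    {small_rate / 1000:.0f} KB/s, "
      f"{small_fraction:.2%} of the link")
```

Under these assumptions the gigabit case comes to 100 KB/s, well under a tenth of a percent of the link, which matches the conclusion that syslog traffic alone is unlikely to matter.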
From felix.rauch.valenti at gmail.com Thu May 3 01:06:04 2007 From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] fast file copying In-Reply-To: <4638A071.1080008@crs4.it> References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> <4638A071.1080008@crs4.it> Message-ID: <4eafc81b0705030106g72471b29u8344516d559e24f7@mail.gmail.com> On 03/05/07, Alan Louis Scheinine wrote: > One possibility is nettee. > http://saf.bio.caltech.edu/nettee.html > > The current version of nettee is 0.1.7, from July 20,2005. > > nettee is a network "tee" program. It can typically transfer > data between N nodes at (nearly) the full bandwidth provided by the switch > which connects them. It is handy for cloning nodes or moving large > database files. As a related side note: If the bandwidth you get is not what you expect, it may well be that your switch is bad (or that your disks are slow). That was my experience a couple of years ago, so we implemented a switch benchmark called "Switchbench", that helps to identify the bandwidth bottleneck in a network. - Felix From erwan at seanodes.com Thu May 3 03:14:14 2007 From: erwan at seanodes.com (Erwan Velu) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Archives looks like broken Message-ID: <4639B5F6.30203@seanodes.com> Hey folks, I just realize the archive website looks like broken : http://www.scyld.com/pipermail/beowulf/ Can anyone fix it ? Thanks, Erwan, From orion at cora.nwra.com Thu May 3 15:48:46 2007 From: orion at cora.nwra.com (Orion Poplawski) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Please help test compiler/hardware issue Message-ID: <463A66CE.3040100@cora.nwra.com> Okay, I have a test case for the problem I reported before that I've attached. 
We have two pairs of identical machines: - 2 Tyan S2882 Dual Processor 244 stepping 10 - 2 Tyan S2882-D Dual processor dual core Opteron 275 stepping 2 The attached code when compiled with the Portland Group Fortran compiler with -O2 and run on either of the 244's will abort in random locations: [orion@coop00 rams.debug]$ pgf95 -O2 -o testatob testatob.f90 [orion@coop00 rams.debug]$ ./testatob checkatob abort n= 246500 , i= 4685 a(i)= 8712085. b(i)= 8465585. Abort [orion@coop00 rams.debug]$ ./testatob checkatob abort n= 246500 , i= 145817 a(i)= 9592717. b(i)= 8853217. Abort [orion@coop01 rams.debug]$ time ./testatob checkatob abort n= 246500 , i= 118169 a(i)= 9565069. b(i)= 8825569. Aborted real 0m31.842s user 0m16.476s sys 0m0.060s Haven't seen it run longer than 1 minute yet. However, it runs fine on the 275's (or at least I haven't seen it crash yet). It also runs fine on the 244's when compiled with -O1. So, I guess this points to a hardware issue, but it may be a somewhat generalized hardware issue. I'd love to hear reports on other (particularly other Tyan S2882 dual 244's) systems. -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com -------------- next part -------------- A non-text attachment was scrubbed... Name: testatob.f90 Type: text/x-fortran Size: 844 bytes Desc: not available Url : http://www.scyld.com/pipermail/beowulf/attachments/20070503/3a67034e/testatob.bin From rgb at phy.duke.edu Thu May 3 17:11:45 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Please help test compiler/hardware issue In-Reply-To: <463A66CE.3040100@cora.nwra.com> References: <463A66CE.3040100@cora.nwra.com> Message-ID: On Thu, 3 May 2007, Orion Poplawski wrote: > > Okay, I have a test case for the problem I reported before that I've > attached. 
>
> We have two pairs of identical machines:
>
> - 2 Tyan S2882 Dual Processor 244 stepping 10
> - 2 Tyan S2882-D Dual processor dual core Opteron 275 stepping 2
>
> The attached code when compiled with the Portland Group Fortran compiler with
> -O2 and run on either of the 244's will abort in random locations:

What about gfortran? Or pathscale?

Mind you, I made myself actually look at the code below (shudder) in spite of it being fortran, and it looks ok as far as >>I<< can tell after not doing fortran unless my life depends on it for twenty years or so. To me it is weird to use a(1) both as the address of a(1) (as an argument to the subroutines) and as the contents of a(1) = 1, but hey.

It seems really really odd that any compiler or any program would fail on this piece of code, though. I wonder if a C memcpy would fail? Or what does stream (with a check) do? Stream's copy isn't much more than this.

Maybe somebody who has used fortran more recently than the mid-eighties can comment further on the code, but to me it looks like a very odd compiler bug.

rgb

>
> [orion@coop00 rams.debug]$ pgf95 -O2 -o testatob testatob.f90
> [orion@coop00 rams.debug]$ ./testatob
> checkatob abort n= 246500 , i= 4685 a(i)= 8712085.
> b(i)= 8465585.
> Abort
> [orion@coop00 rams.debug]$ ./testatob
> checkatob abort n= 246500 , i= 145817 a(i)= 9592717.
> b(i)= 8853217.
> Abort
>
> [orion@coop01 rams.debug]$ time ./testatob
> checkatob abort n= 246500 , i= 118169 a(i)= 9565069.
> b(i)= 8825569.
> Aborted
>
> real 0m31.842s
> user 0m16.476s
> sys 0m0.060s
>
>
> Haven't seen it run longer than 1 minute yet.
>
> However, it runs fine on the 275's (or at least I haven't seen it crash yet).
> It also runs fine on the 244's when compiled with -O1.
>
> So, I guess this points to a hardware issue, but it may be a somewhat
> generalized hardware issue. I'd love to hear reports on other (particularly
> other Tyan S2882 dual 244's) systems.
> > -- Robert G.
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From orion at cora.nwra.com Thu May 3 17:34:37 2007 From: orion at cora.nwra.com (orion@cora.nwra.com) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Please help test compiler/hardware issue In-Reply-To: <463A66CE.3040100@cora.nwra.com> References: <463A66CE.3040100@cora.nwra.com> Message-ID: <4949.71.208.238.171.1178238877.squirrel@www.cora.nwra.com> > > Okay, I have a test case for the problem I reported before Statically compiled binary at http://www.cora.nwra.com/~orion/testatob.bz2 for those of you without the PGF compiler to try. -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion@cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From bill at cse.ucdavis.edu Fri May 4 01:48:50 2007 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] fast file copying In-Reply-To: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> Message-ID: <463AF372.1010107@cse.ucdavis.edu> Geoff Galitz wrote: > > Hi folks, > > During an HPC talk some years ago, I recall someone mentioned a tool > which can copy large datasets across a cluster using a ring topology. > Perhaps someone here knows of this tool? Not sure about a ring topology, seems kinda silly... why not bit-torrent? It's opensource, extremely common, and already integrated into at least one cluster distribution. There's a zillion implementation, your favorite language is likely to have a few (at least for python, c, and java). I've installed a 170+ node rocks cluster in 10 minutes or so, the RPMs are distributed by bit-torrent so that it doesn't matter if one node dies as part of the install. 
Nor does it matter if your node list has some strange mapping to your physical network (which is often what you get when you ask a batch queue for 5% of a cluster).

> More to the point, we are pushing around datasets that are about
> 1Gbyte. The datasets are pushed out to dozens of nodes all at once and

How often? I just bit-torrented a 1GB file to 165 nodes in 3 minutes, of which 1.5 minutes was down to the lazy way I launched it (the last node didn't start until 1.5 minutes into the run). BTW, 140 or so of those nodes already had 1 job per CPU running.

> we foresee saturating the I/O system on our cluster as we grow. We are
> limited to using just the available disks and are looking for a
> reasonable solution that can support this kind of simultaneous access.

There are various ways to maximize I/O with bit-torrent. Various seeders allow uploading each block only once (usually called super seeder mode). Assuming you have a few GB ram on the file server you could even prefetch the file before torrenting (i.e. dd if=file_to_server of=/dev/null) since the limit on bit-torrent bandwidth is often how quickly you can seek. Additionally you can make the chunk size larger to reduce the number of seeks. On the client side preallocation can greatly reduce the number of seeks.

> Currently we push the data out using rsync, but if I don't get any
> better ideas I may simply move to a pull system where the data is
> fetched by HTTP. I can get better throttling that way, at least.
>

If you have a low churn rate you could generate a diff (with rsync) and distribute that via bit-torrent.

What kind of per node bandwidths are you hoping for? 1GB sounds really easy unless you have to do it rather often.

From mkinet at ulb.ac.be Wed May 2 08:11:20 2007
From: mkinet at ulb.ac.be (Maxime Kinet)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Problem while booting diskless node.
In-Reply-To: <200705011334.57546.pxrist@gmail.com>
References: <82C63820-6368-481D-BE08-74B573EC3E30@ulb.ac.be> <200705011334.57546.pxrist@gmail.com>
Message-ID:

ok, I succeeded. Thanks to all for helpful comments.

------------------
Maxime Kinet
Université Libre de Bruxelles
Physique Statistique et Plasmas, CP 231
Campus Plaine - Boulevard du Triomphe, 1050 Bruxelles.
Tel. : +32-2-650.59.08
e-mail : mkinet@ulb.ac.be

On 01 May 2007, at 12:34, Panagiotis Christopoulos wrote:

> On Monday 30 April 2007 13:04, Maxime Kinet wrote:
>> Hi,
>>
>> I'm trying to set up my first cluster with diskless nodes. To achieve
>> that, I'm using PXElinux on a server, running Fedora Core 6, and a
>> NFS-mounted root partition on the node.
>> Everything works perfectly
>> (getting the IP address, loading the kernel and mounting the
>> filesystem) until the node has to run some binaries located into /sbin
>> during the boot process. Apparently it's unable to execute them
>> because they have been compiled with dynamically linked libraries and
>> not statically. The /sbin directory of the node is a simple copy of
>> the one of the server.
> I suppose, that you have done something like, creating a /diskless
> directory inside your nfs,tftp,dhcp etc... server, copying files
> (eg. /usr/* ) from the server, inside that /diskless dir with the same
> hierarchy, and this resulted in a structure which would home, your nfs
> exported, root fs of your nodes.
> I'm not an expert, but because you said about dynamic libraries, I
> cannot understand, why this is a problem, you copied /sbin inside
> /diskless, but you didn't copy /lib or /usr/lib? The problem with
> dynamic libraries, starts because your /diskless does not have these
> libraries.
>> I tried to avoid the problem using the busybox
>> tools, and it worked a bit better but then it couldn't execute bash
>> scripts such as rc.sysinit.
> Have you created an init in your busybox, to chroot (exec switch_root)
> inside your nfs root fs after mounting it?
>> Has anybody ever encountered such problems and what should I do to
>> solve it? recompile the kernel of the node or of the server? change
>> the distribution? Is there any other simpler method to proceed than
>> using PXE?
> There are two things you can do. As Douglas Eadline said, it starts by
> thinking if you want to reinvent the wheel or not. If you have time,
> machines, if you know that your teachers won't get annoyed and you can
> work in a university lab, so you will not pay for the power supply
> yourself :p continue with fedora and all these brainstorming things.
> You will learn linux administration and probably you will do amazing
> things. If you don't have time etc. the guys in warewulf, are doing
> the same job for about 7 years, and they provide you all this
> knowledge they gained, in a simple installation process.
> Back in the technical stuff, from your sayings, I think that something
> is wrong with your /diskless dir (if you have one, of course). I cannot
> understand why you want to use busybox. We use busybox when we want an
> initramfs to do specific jobs (such as unlocking and mounting encrypted
> partitions, yes, I know this is not the best example I could give)
> before chrooting inside our real root fs and exec init as Mark Hahn
> said or if we are running embedded. Also, I don't think that you have
> to change your distribution, and if you don't like PXE, you can see
> how the guys in LTSP boot (I think they use both PXE and "etherboot"
> and you can make a choice), but for me, syslinux is fine!
> > This was my point of view, I hope I helped and if you want to ask > something, > feel free to send me a mail, or of course, ask again in the list, > > Panagiotis Christopoulos > System Administrator > Technological Institute of Athens > Department of Informatics From gmkurtzer at gmail.com Wed May 2 09:56:11 2007 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] fast file copying In-Reply-To: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> Message-ID: <2AF0822D-6311-4004-ACB8-75ECE6E0A4A0@gmail.com> I gave a talk where I referred to a "ring" boot mechanism for Warewulf using "Dolly". http://www.cs.inf.ethz.ch/CoPs/patagonia/dolly.html Thinking back now, I don't remember you being at that talk, so you are probably thinking of something else. ;) On Apr 30, 2007, at 11:35 AM, Geoff Galitz wrote: > > Hi folks, > > During an HPC talk some years ago, I recall someone mentioned a > tool which can copy large datasets across a cluster using a ring > topology. Perhaps someone here knows of this tool? > > More to the point, we are pushing around datasets that are about > 1Gbyte. The datasets are pushed out to dozens of nodes all at once > and we foresee saturating the I/O system on our cluster as we > grow. We are limited to using just the available disks and are > looking for a reasonable solution that can support this kind of > simultaneous access. Currently we push the data out using rsync, > but if I don't get any better ideas I may simply move to a pull > system where the data is fetched by HTTP. I can get better > throttling that way, at least. 
> > -geoff > > > Geoff Galitz > geoff@galitz.org > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- Greg Kurtzer I believe the world would be a better place if people didn't believe in their beliefs. -- gmk From estair at ilm.com Wed May 2 13:30:41 2007 From: estair at ilm.com (Eli Stair) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Syslog Server-Traffic In-Reply-To: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com> References: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com> Message-ID: <4638F4F1.9060705@ilm.com> If you engineer the config well, you won't have any significant amount of "insignificant" traffic. Whether your situation is vulnerable to minute amounts of traffic and client-side processing of packets sent is site-specific. Using syslog-ng (or several other options), you can configure the compute nodes to only send log messages you want to be aware of (or restrict from sending known messages you don't want logged), as well as doing rate-limiting to avoid spamming your network/logserver in the event of a typical freak-out event. /eli Chris Vaughan wrote: > Hello, > > I'm researching setting up a cluster and I'm curious as to whether or > not it's a good idea to set up a syslog server. The question I have > is whether the traffic created from logging is going to slow down my > network to the point of poor performance? What are peoples > experiences with logging. The cluster will have a management network > and a computational network. > > This is what I'm thinking: > > Greater than 64 Nodes, Yes. > Greater than 64 and less than 128, Maybe? > Greater than 128 No > > Any input would be great, Thanks! 
> > -- > ------------------------------ > Christopher Vaughan > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From victux at gmail.com Wed May 2 15:57:23 2007 From: victux at gmail.com (Victor Gomez) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] SSH without login in nodes Message-ID: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> Hi, Im config a cluster with ssh password less, but the users can login into nodes. I want the clients, use de queue system (Torque, its works fine), without access into nodes. In the past, use rlogin, without rlogin. Thanks. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070502/b07b4055/attachment.html From vaughanc at gmail.com Thu May 3 05:22:51 2007 From: vaughanc at gmail.com (Chris Vaughan) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Syslog Server-Traffic In-Reply-To: References: <216ee070705010534r7cc4eddr9328b8f8ea223925@mail.gmail.com> Message-ID: <216ee070705030522n770e2ef2ka54479449c286194@mail.gmail.com> Thank you all for the input, it's been very helpful in making my clustering decisions. On 5/3/07, Mark Hahn wrote: > > I'm researching setting up a cluster and I'm curious as to whether or > > not it's a good idea to set up a syslog server. The question I have > > depends on how much syslog activity you have, and whether you care > to look at it (in one spot). there _are_ scalable and more robust > system/event logging approaches, but plain old syslog is pretty good. > > > is whether the traffic created from logging is going to slow down my > > network to the point of poor performance? > > a syslog message is normally a smallish UDP packet - say 100 bytes. 
> if you have 200 nodes each doing 5 per second, that's still only > 100KB/s - a pretty small fraction of a server's gigabit bandwidth. > and if you actually have 5/s, something's probably wrong... > > > experiences with logging. The cluster will have a management network > > and a computational network. > > I'm always skeptical about this advice - it's obvious that it might be good > in cases where a node sustains a nontrivial stream of management traffic > (say, NFS traffic using jumbo frames) which would interfere with possible > latency-sensitive/small MPI packets. > > but how often does that happen? consider that with a non-jumbo gigabit net, > a full packet is only 15 us more than the ~40 or so for a minimal one. > further, I observe MPI codes mostly getting packed into full nodes, > and not interfering with themselves much (distinct MPI and file IO phases > to the program.). > > I could more readily imagine segregating traffic to two nets based on > packet size or TOS. or bonding them in the first place. in any case, > I think you'd have to work pretty hard to generate enough syslog traffic > to matter much. > > > Greater than 64 Nodes, Yes. > > Greater than 64 and less than 128, Maybe? > > Greater than 128 No > > for a smallish cluster like 64 nodes, I don't think I'd worry about > syslog even if the net were just 100bT. for going above 200 nodes, > I'd probably try to do some measurements and extrapolation, but the > numbers above make it look minor. 
> -- ------------------------------ Christopher Vaughan From james_cuff at harvard.edu Thu May 3 17:52:02 2007 From: james_cuff at harvard.edu (James Cuff) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Please help test compiler/hardware issue In-Reply-To: <4949.71.208.238.171.1178238877.squirrel@www.cora.nwra.com> References: <463A66CE.3040100@cora.nwra.com> <4949.71.208.238.171.1178238877.squirrel@www.cora.nwra.com> Message-ID: <8F859709-B371-4B03-86D2-ACC7E9DB1E5C@harvard.edu> Hi Orion, I'm thinking you may have bad memory/hardware on one of those nodes here mate... Compiles and runs fine here in 32 bit ubuntu fiesty: jcuff@harold:~$ uname -a Linux harold 2.6.20-15-386 #2 Sun Apr 15 07:34:00 UTC 2007 i686 GNU/ Linux jcuff@harold:~$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Pentium(R) 4 CPU 2.53GHz jcuff@harold:~$ gfortran -O3 -o tt test.f90 jcuff@harold:~$ time ./tt ^C real 8m11.766s user 8m5.862s sys 0m4.280s Also your 64 bit static compiled version runs fine even on a rather crappy "64 bit" Celeron I have on FC 5: [jcuff@gw ~]$ cat /proc/cpuinfo | grep "model name" model name : Intel(R) Celeron(R) CPU 2.93GHz [jcuff@gw ~]$ uname -a Linux 2.6.20-1.2312.fc5 #1 SMP Tue Apr 10 15:14:58 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux [jcuff@gw ~]$ time ./testatob ^C real 5m5.794s user 3m38.785s sys 0m9.890s Hope this helps. Best, j. -- James Cuff, D. Phil. Director of Research Computing, Life Sciences Division. Bauer Laboratory, 7 Divinity Avenue, Cambridge, MA. 02138 Tel: 617-384-5065 Direct Dial: 617-384-7647 On May 3, 2007, at 8:34 PM, wrote: > > > > Okay, I have a test case for the problem I reported before > > Statically compiled binary at http://www.cora.nwra.com/~orion/ > testatob.bz2 > for those of you without the PGF compiler to try. 
> > -- > Orion Poplawski > Technical Manager 303-415-9701 x222 > NWRA/CoRA Division FAX: 303-415-9702 > 3380 Mitchell Lane orion@cora.nwra.com > Boulder, CO 80301 http://www.cora.nwra.com > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From jtracy at ist.ucf.edu Fri May 4 06:40:37 2007 From: jtracy at ist.ucf.edu (Judd Tracy) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] LVM performance problems Message-ID: <463B37D5.6020008@ist.ucf.edu> I am trying to bring up a small file server and am noticing some serious performance issues when using LVM. I created a software raid /dev/md0 which I can read at ~195MB/s using the raw raid device, but as soon as I put LVM on top of it the read speeds drop to ~95MB/s using a raw lvm partition without a filesystem. When I use and xfs filesystem on top of the lvm partition it drops down to ~40MB/s. This seems like a processing power issue, but the machine is a dual processor opteron system with 4GB of ram in it. Does anyone have some insight into why I might be having such problems with the system. Judd Tracy Institute for Simulation and Training University of Central Florida From peter.st.john at gmail.com Fri May 4 07:53:33 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Please help test compiler/hardware issue In-Reply-To: <8F859709-B371-4B03-86D2-ACC7E9DB1E5C@harvard.edu> References: <463A66CE.3040100@cora.nwra.com> <4949.71.208.238.171.1178238877.squirrel@www.cora.nwra.com> <8F859709-B371-4B03-86D2-ACC7E9DB1E5C@harvard.edu> Message-ID: I don't understand what "allocatable" and "allocate" do. It would seem that atob writes an integer (assigned by a(i) = i) to an address which had also been specified by the a(i)=i assignment, and was not necessarily allocated to a. 
That would be expected to generate random errors, and since the example has hardcoded numbers like 8460901, it could write to a range that **normally** is writeable in user space, but which is not guaranteed to be by the allocation. If it were C like this: int *a; a = malloc(10); for(i = 0; i< 10; i++) a[i] = i; *(a[5]) = 5; that is, I'm presuming that the contents at the address "5" can be written with the value 5, but "5" is not necessarily in the address space allocated by malloc. I'm thinking this must not be what the subroutine ATOB does, maybe a call by reference instead of call by value confusion (to me). However, the example looks like it was written to show up a compiler dependency and not to stress test a CPU. In fact, it looks like it was written by a malicious C programmer :-) but I have an alibi. Peter On 5/3/07, James Cuff wrote: > > > Hi Orion, > > I'm thinking you may have bad memory/hardware on one of those nodes > here mate... > > Compiles and runs fine here in 32 bit ubuntu fiesty: > > jcuff@harold:~$ uname -a > Linux harold 2.6.20-15-386 #2 Sun Apr 15 07:34:00 UTC 2007 i686 GNU/ > Linux > > jcuff@harold:~$ cat /proc/cpuinfo | grep "model name" > model name : Intel(R) Pentium(R) 4 CPU 2.53GHz > > > jcuff@harold:~$ gfortran -O3 -o tt test.f90 > jcuff@harold:~$ time ./tt > ^C > real 8m11.766s > user 8m5.862s > sys 0m4.280s > > > Also your 64 bit static compiled version runs fine even on a rather > crappy "64 bit" Celeron I have on FC 5: > > [jcuff@gw ~]$ cat /proc/cpuinfo | grep "model name" > model name : Intel(R) Celeron(R) CPU 2.93GHz > > [jcuff@gw ~]$ uname -a > Linux 2.6.20-1.2312.fc5 #1 SMP Tue Apr 10 15:14:58 EDT 2007 x86_64 > x86_64 x86_64 GNU/Linux > > > [jcuff@gw ~]$ time ./testatob > ^C > real 5m5.794s > user 3m38.785s > sys 0m9.890s > > > Hope this helps. > > Best, > > j. > > -- > James Cuff, D. Phil. > Director of Research Computing, Life Sciences Division. > Bauer Laboratory, 7 Divinity Avenue, Cambridge, MA. 
02138 > Tel: 617-384-5065 Direct Dial: 617-384-7647 > > > On May 3, 2007, at 8:34 PM, wrote: > > > > > > > Okay, I have a test case for the problem I reported before > > > > Statically compiled binary at http://www.cora.nwra.com/~orion/ > > testatob.bz2 > > for those of you without the PGF compiler to try. > > > > -- > > Orion Poplawski > > Technical Manager 303-415-9701 x222 > > NWRA/CoRA Division FAX: 303-415-9702 > > 3380 Mitchell Lane orion@cora.nwra.com > > Boulder, CO 80301 http://www.cora.nwra.com > > > > _______________________________________________ > > Beowulf mailing list, Beowulf@beowulf.org > > To change your subscription (digest mode or unsubscribe) visit > > http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070504/e11ba066/attachment.html From mathog at caltech.edu Fri May 4 08:24:28 2007 From: mathog at caltech.edu (David Mathog) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] Re: fast file copying Message-ID: Felix Rauch Valenti wrote: > On 03/05/07, Alan Louis Scheinine wrote: > > One possibility is nettee. > > http://saf.bio.caltech.edu/nettee.html > > As a related side note: If the bandwidth you get is not what you > expect, it may well be that your switch is bad (or that your disks are > slow). That was my experience a couple of years ago, so we implemented > a switch benchmark called "Switchbench", that helps to identify the > bandwidth bottleneck in a network. Since both dolly and nettee have been mentioned in this thread, it's probably appropriate to point out that they are very closely related. Felix wrote Dolly, and I forked his code and modified it a bit to arrive at nettee. 
Felix's point about nettee and slow disks is especially relevant when imaging nodes because historically some of the small linux environments used for such installations did not set DMA on the disks by default, and that slowed down the write speed on the disks dramatically. When file transfers are being carried out on busy systems it's also a good idea to put "buffer" or an equivalent program in the output stream on each node, so that a brief contention for IO writes to the local disk doesn't slow down the entire chain. Always use "buffer" or an equivalent if the local write stream is being piped through "tar -xf -" or an equivalent dearchiving command, as the dearchiving step can slow down the write speed if it has to create a large number of small files in a directory. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From orion at cora.nwra.com Fri May 4 08:43:37 2007 From: orion at cora.nwra.com (Orion Poplawski) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] compiler/hardware issue fixed In-Reply-To: <463A66CE.3040100@cora.nwra.com> References: <463A66CE.3040100@cora.nwra.com> Message-ID: <463B54A9.9060707@cora.nwra.com> Orion Poplawski wrote: > So, I guess this points to a hardware issue, but it may be a somewhat > generalized hardware issue. I'd love to hear reports on other > (particularly other Tyan S2882 dual 244's) systems. I updated the BIOS on the 244's and the problem appears to have gone away. I should have done this long ago, but I had mistakenly thought that I couldn't PXE boot the flash update utility. It's also somewhat gratifying to understand a bit more what the issue was. So, those of you with Tyan S2882s, update your bios! 
--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion@cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com

From orion at cora.nwra.com Fri May 4 08:48:35 2007
From: orion at cora.nwra.com (Orion Poplawski)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] Please help test compiler/hardware issue
In-Reply-To:
References: <463A66CE.3040100@cora.nwra.com> <4949.71.208.238.171.1178238877.squirrel@www.cora.nwra.com> <8F859709-B371-4B03-86D2-ACC7E9DB1E5C@harvard.edu>
Message-ID: <463B55D3.906@cora.nwra.com>

Peter St. John wrote:
> I'm thinking this must not be what the subroutine ATOB does, maybe a
> call by reference instead of call by value confusion (to me).

Yup. All Fortran calls are by reference. So the routine is just copying one section of A to another section of A. Should be fine as long as your hardware works.

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA/CoRA Division                    FAX: 303-415-9702
3380 Mitchell Lane                  orion@cora.nwra.com
Boulder, CO 80301              http://www.cora.nwra.com

From mwill at penguincomputing.com Fri May 4 09:04:50 2007
From: mwill at penguincomputing.com (Michael Will)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] LVM performance problems
In-Reply-To: <463B37D5.6020008@ist.ucf.edu>
Message-ID: <433093DF7AD7444DA65EFAFE3987879C418346@orca.penguincomputing.com>

What model server is this? What level raid is the software raid volume, how many drives and what type? How exactly did you establish the performance number?

I have done performance benchmarking with LVM over hardware raid luns and have not been able to show a significant performance degradation, and it is actually faster when you stripe across several raid-protected luns.

Some general disk benchmarking rules: Write files at least twice as large as your RAM (so that's 8G minimum in your case) so caching does not inflate the numbers. Also don't forget to measure the time for it to be synced to disk.
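
The two rules above (file size of at least twice RAM, and counting the sync time) can be sketched as a pair of shell helpers. This is only an illustration: the function names `test_size_mb` and `mb_per_s` are made up here, and the timings in the example are invented, not measurements from this thread.

```shell
#!/bin/sh
# Sketch only; helper names are hypothetical, not from the thread.

# Test file size in MiB: twice the RAM (given in MiB), so the page cache
# cannot hold the whole file.
test_size_mb() {
    echo $(( $1 * 2 ))
}

# Throughput in MB/s: megabytes moved divided by the combined wall-clock
# seconds of the dd run and the sync that follows it.
mb_per_s() {
    awk -v mb="$1" -v dd_s="$2" -v sync_s="$3" \
        'BEGIN { printf "%.1f\n", mb / (dd_s + sync_s) }'
}

# Made-up timings for a 4096 MiB machine: dd took 38.5 s, sync took 2.5 s.
size=$(test_size_mb 4096)      # 8192
mb_per_s "$size" 38.5 2.5      # prints 199.8
```

Leaving the sync time out of the division would report 8192 / 38.5, roughly 213 MB/s, for the same run, which is the kind of inflation the rule is meant to avoid.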
Example of large-block, large-file I/O benchmarking on a 4G system:

    time dd if=/dev/zero of=/mnt/targetfilesystem/largefile bs=1M count=8192
    time sync

If you divide 8192 by the aggregate of the seconds used by both commands then you have a good MB/s estimate.

For read you can just do:

    time dd of=/dev/null if=/mnt/targetfilesystem/largefile bs=1M count=8192

To do some more intense testing including mixed read/write, I use bonnie++. If you want to turn it into a real science and get very specific data for specific access patterns, you might want to look at iozone.

Michael

-----Original Message-----
From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org] On Behalf Of Judd Tracy
Sent: Friday, May 04, 2007 6:41 AM
To: beowulf@beowulf.org
Subject: [Beowulf] LVM performance problems

I am trying to bring up a small file server and am noticing some serious performance issues when using LVM. I created a software raid /dev/md0 which I can read at ~195MB/s using the raw raid device, but as soon as I put LVM on top of it the read speeds drop to ~95MB/s using a raw lvm partition without a filesystem. When I use an xfs filesystem on top of the lvm partition it drops down to ~40MB/s. This seems like a processing power issue, but the machine is a dual processor opteron system with 4GB of ram in it. Does anyone have some insight into why I might be having such problems with the system?
Judd Tracy Institute for Simulation and Training University of Central Florida _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From bill at cse.ucdavis.edu Fri May 4 11:41:21 2007 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] LVM performance problems In-Reply-To: <463B37D5.6020008@ist.ucf.edu> References: <463B37D5.6020008@ist.ucf.edu> Message-ID: <463B7E51.5020305@cse.ucdavis.edu> Judd Tracy wrote: > I am trying to bring up a small file server and am noticing some serious > performance issues when using LVM. I created a software raid /dev/md0 How many drives? Which raid level? What stripesize? What RAID controller? > which I can read at ~195MB/s using the raw raid device, but as soon as I You did benchmark reads using accesses significantly larger than ram... right? > put LVM on top of it the read speeds drop to ~95MB/s using a raw lvm Exactly what config in LVM? What stripesize? > partition without a filesystem. When I use and xfs filesystem on top of > the lvm partition it drops down to ~40MB/s. This seems like a Exactly what mkfs.xfs parameters did you use? In particular switch, sunit, and related parameters. > processing power issue, but the machine is a dual processor opteron > system with 4GB of ram in it. Does anyone have some insight into why I > might be having such problems with the system. I've not seen many differences in my testing. Drives are capable of 30-60MB/sec each for sequential reads/writes, the trick is to balance the load across all disks for your workload. Beware, optimizing your setup for dd, and bonnie might lead to worse performance if your workload doesn't use similar access patterns. The interaction between N disks, reads/writes of M size, and various strip sizes (in the raid and in the filesystem) is rather complex. 
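
To make the interaction between request size and stripe size a little more concrete, here is a rough sketch. It assumes plain RAID-0 style striping and chunk-aligned requests, neither of which is stated in the thread, and `disks_touched` is a made-up helper name. Real md/LVM/XFS stacks add parity, alignment offsets and kernel readahead (which enlarges the effective request), so treat the numbers as an illustration only.

```shell
#!/bin/sh
# Sketch only: assumes RAID-0 style striping and chunk-aligned requests.
# disks_touched REQUEST_KB CHUNK_KB NDISKS   (hypothetical helper)
disks_touched() {
    chunks=$(( ($1 + $2 - 1) / $2 ))     # chunks covered, rounded up
    if [ "$chunks" -gt "$3" ]; then
        echo "$3"                        # cannot exceed the disk count
    else
        echo "$chunks"
    fi
}

disks_touched 1024 64 4    # 1 MiB requests, 64 KiB chunks: prints 4
disks_touched 64 64 4      # 64 KiB requests, 64 KiB chunks: prints 1
```

Under these assumptions a request no larger than one chunk keeps a single disk busy at a time, while a request spanning several chunks can be serviced by several disks in parallel.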
So my guess is that your access pattern used to hit many of your disks in parallel, but is now bottlenecked by a single disk, not that you are somehow CPU limited.

From peter.st.john at gmail.com Fri May 4 13:06:51 2007
From: peter.st.john at gmail.com (Peter St. John)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] SSH without login in nodes
In-Reply-To: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com>
References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com>
Message-ID:

There was a typographical error in the question. I had a brief exchange with señor Gomez and he confirmed this translation:

I am configuring a cluster with ssh (but without passwords) and currently the users can log in to compute nodes.
I wish the clients to use the queue system (Torque, it works fine) without being able to access the compute nodes.
In the past, we used rsh without allowing rlogin.

Unfortunately my Spanish seems to be better than my Beowulfry so I can't help much :-)

Peter

On 5/2/07, Victor Gomez wrote:
>
> Hi,
>
> Im config a cluster with ssh password less, but the users can login into
> nodes.
>
> I want the clients, use de queue system (Torque, its works fine), without
> access into nodes.
>
> In the past, use rlogin, without rlogin.
>
> Thanks.
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20070504/dece5e08/attachment.html

From kilian at stanford.edu Fri May 4 13:40:50 2007
From: kilian at stanford.edu (Kilian CAVALOTTI)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] SSH without login in nodes
In-Reply-To:
References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com>
Message-ID: <200705041340.50461.kilian@stanford.edu>

Hi,

On Friday 04 May 2007 01:06:51 pm Peter St. John wrote:
> There was a typographical error in the question. I had a brief exchange
> with señor Gomez and he confirmed this translation:
>
> I am configuring a cluster with ssh (but without passwords) and
> currently the users can log in to compute nodes.
> I wish the clients to use the queue system (Torque, it works fine)
> without being able to access the compute nodes.
> In the past, we used rsh without allowing rlogin.

What you can do is configure PAM on the nodes to only allow login for a specific set of users, if any. It should come with any modern distro.

Be sure your /etc/pam.d/authconfig contains a reference to pam_access, like:

    account required /lib/security/$ISA/pam_access.so

And configure /etc/security/access.conf to match your needs, like:

    # Allow administrative login from everywhere
    +:wheel staff:ALL
    # Prevent user logins
    -:users:ALL

You can have a look at http://www.informit.com/articles/article.asp?p=165226&seqNum=12&rl=1 for more info.

Cheers,
--
Kilian

From siegert at sfu.ca Fri May 4 15:53:49 2007
From: siegert at sfu.ca (Martin Siegert)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] MPI application benchmarks
Message-ID: <20070504225349.GC17163@stikine.ucs.sfu.ca>

Hi,

this is partially triggered by the mentioning of the SPEC MPI2007 benchmark: what are people using as a benchmark suite for RFP purposes?

We will be purchasing a shared cluster for a wide community (currently more than 1000 users).
Thus, the common response on this list to evaluate hardware - "use your own application as benchmark" - does not work: users change, users' applications change, etc., etc. Thus, I need a benchmark suite that tests a wide spectrum of properties. In that respect the SPEC MPI2007 benchmark appears to be ideal. Alas, it does not appear to be open source (please correct me if I am wrong) - so far I have not even been able to figure out which applications are being used in that benchmark suite. Thus, although RFP evaluations are mentioned first under likely uses for SPEC MPI2007, not being open source seems to contradict that statement. Thus: what are people using? I have seen the HPC Challenge Benchmark. We almost certainly will include gromacs as an application benchmark. Which other applications do you suggest? Thanks in advance! Cheers, Martin -- Martin Siegert Head, HPC@SFU WestGrid Site Lead Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert@sfu.ca Canada V5A 1S6 From csamuel at vpac.org Fri May 4 21:42:58 2007 From: csamuel at vpac.org (Chris Samuel) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] SSH without login in nodes In-Reply-To: References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> Message-ID: <200705051442.59044.csamuel@vpac.org> On Sat, 5 May 2007, Peter St. John wrote: > I am configuring a cluster with ssh (but without passwords) and currently > the users can log in to compute nodes. I wish the clients to use the queue > system (Torque, it works fine) without being able to access the compute > nodes. In the past, we used rsh without allowing rlogin. We use a very ugly hack (this was already in place when I arrived) which has been very effective over the past few years at doing that and doesn't prevent people using SSH based MPI launchers (though we don't recommend them being used). 
Basically it's just the following in /etc/profile on our compute nodes.

    if echo $HOSTNAME | egrep -q '^node' ; then
      if [ ! $PBS_ENVIRONMENT ];
        then if [ $USER != "root" ];
          then if [ "$GROUP" != "systems" ];
            then exit;
          fi;
        fi;
      fi;
    fi;

How's that ?

cheers,
Chris
--
 Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager
 Victorian Partnership for Advanced Computing http://www.vpac.org/
 Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia

From kilian at stanford.edu Sat May 5 09:24:37 2007
From: kilian at stanford.edu (Kilian CAVALOTTI)
Date: Sat May 10 01:06:03 2008
Subject: [Beowulf] SSH without login in nodes
In-Reply-To: <200705051442.59044.csamuel@vpac.org>
References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> <200705051442.59044.csamuel@vpac.org>
Message-ID: <200705050924.37869.kilian@stanford.edu>

On Friday 04 May 2007 21:42:58 Chris Samuel wrote:
> We use a very ugly hack (this was already in place when I arrived) which
> has been very effective over the past few years at doing that and
> doesn't prevent people using SSH based MPI launchers (though we don't
> recommend them being used).
>
> Basically it's just the following in /etc/profile on our compute nodes.
>
> if echo $HOSTNAME | egrep -q '^node' ; then
>   if [ ! $PBS_ENVIRONMENT ];
>     then if [ $USER != "root" ];
>       then if [ "$GROUP" != "systems" ];
>         then exit;
>       fi;
>     fi;
>   fi;
> fi;
>
>
> How's that ?

Not that ugly, actually. But what if users do a

    ssh node -t "bash --noprofile"

? ;)

To handle SSH-based MPI launchers, we've disabled user logins from our frontend node to the compute nodes, but allowed them between compute nodes. That way the scheduler takes care of dispatching the initial process on a first node (no SSH involved), and then SSH connections can be used to dispatch the MPI daemons on the other nodes, from the initial one.

Cheers,
--
Kilian

From brian.ropers.huilman at gmail.com Sat May 5 10:10:03 2007
From: brian.ropers.huilman at gmail.com (Brian D.
Ropers-Huilman) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <20070504225349.GC17163@stikine.ucs.sfu.ca> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> Message-ID: On 5/4/07, Martin Siegert wrote: > We will be purchasing a shared cluster for a wide community (currently > more than 1000 users). Thus, the common response on this list to evaluate > hardware - "use your own application as benchmark" - does not work: > users change, users' applications change, etc., etc. Thus, I need a > benchmark suite that tests a wide spectrum of properties. My answer is still to "use your own application(s)." Poll your users and find out what they have and what they are going to run. Find some who already have codes that scale well (>1000 cores) and ask them to participate. Many vendors will allow you to run your own codes on systems they have at their own sites before you decide to purchase. These vendor-hosted systems are typically only 256 cores or less, but it gives you some idea as to how your codes might run. I also suggest picking some representative synthetic benchmarks to test floating point and integer operations, memory bandwidth, MPI ping-pongs (the SPEC MPI2007, among others, would fit here), the HPC Challenge codes, and the like. Many sites will then take all of these results (synthetic + their own applications) and aggregate the results, possibly with weighting factors, into a single number. If you do this over a number of years and number of systems, with the same benchmarks, you can even start to normalize against a "base" system and take things like different core counts and costs into account. -- Brian D. Ropers-Huilman, Director Systems Administration and Technical Operations Supercomputing Institute 599 Walter Library +1 612-626-5948 (V) 117 Pleasant Street S.E. 
+1 612-624-8861 (F) University of Minnesota Twin Cities Campus Minneapolis, MN 55455-0255 http://www.msi.umn.edu/ From csamuel at vpac.org Sun May 6 00:58:54 2007 From: csamuel at vpac.org (Chris Samuel) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] SSH without login in nodes In-Reply-To: <200705050924.37869.kilian@stanford.edu> References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> <200705051442.59044.csamuel@vpac.org> <200705050924.37869.kilian@stanford.edu> Message-ID: <200705061758.54756.csamuel@vpac.org> On Sun, 6 May 2007, Kilian CAVALOTTI wrote: > Not that ugly, actually. But what if users do a > ssh node -t "bash --noprofile"? ;) Then if any of the 500 odd tried we would spot them with some other scripts and chase them about it. We've not had to do that yet, though, fortunately! > To handle of SSH based MPI launchers, we've disabled user logins from our > frontend node to the compute nodes, but allowed them between compute > nodes. So that the scheduler takes care of dispatching the initial process > on a first node (no SSH involved), and then SSH connections can be used to > dispatch the MPI daemons on the other nodes, from the initial one. Now that there's the Torque PAM module (pam_pbssimpleauth) that Garrick wrote I'm tempted to set that up, but given our current system works I haven't dared break it. :-) cheers! 
Chris -- Christopher Samuel - (03)9925 4751 - VPAC Deputy Systems Manager Victorian Partnership for Advanced Computing http://www.vpac.org/ Bldg 91, 110 Victoria Street, Carlton South, VIC 3053, Australia From bernd-schubert at gmx.de Fri May 4 08:12:48 2007 From: bernd-schubert at gmx.de (Bernd Schubert) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] LVM performance problems In-Reply-To: <463B37D5.6020008@ist.ucf.edu> References: <463B37D5.6020008@ist.ucf.edu> Message-ID: <200705041712.48667.bernd-schubert@gmx.de> On Friday 04 May 2007 15:40:37 Judd Tracy wrote: > I am trying to bring up a small file server and am noticing some serious > performance issues when using LVM. I created a software raid /dev/md0 > which I can read at ~195MB/s using the raw raid device, but as soon as I > put LVM on top of it the read speeds drop to ~95MB/s using a raw lvm > partition without a filesystem. When I use and xfs filesystem on top of > the lvm partition it drops down to ~40MB/s. This seems like a > processing power issue, but the machine is a dual processor opteron > system with 4GB of ram in it. Does anyone have some insight into why I > might be having such problems with the system. try something like this: blockdev --setra 8192 /dev/{volumegroup}/{logical_volume} You should also try smaller and larger numbers. Hope it helps, Bernd From jlforrest at berkeley.edu Fri May 4 08:17:08 2007 From: jlforrest at berkeley.edu (Jon Forrest) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] fast file copying In-Reply-To: <4eafc81b0705030106g72471b29u8344516d559e24f7@mail.gmail.com> References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> <4638A071.1080008@crs4.it> <4eafc81b0705030106g72471b29u8344516d559e24f7@mail.gmail.com> Message-ID: <463B4E74.2020604@berkeley.edu> Felix Rauch Valenti wrote: > As a related side note: If the bandwidth you get is not what you > expect, it may well be that your switch is bad (or that your disks are > slow). 
That was my experience a couple of years ago, so we implemented > a switch benchmark called "Switchbench", that helps to identify the > bandwidth bottleneck in a network. I've downloaded this program and it looks very promising. One suggestion - many of us have set up our clusters to allow "rsh" access without requiring passwords. Your program is written to use "ssh" for accessing the cluster nodes. It would be nice if your program could let the user choose which of these methods to use. Cordially, -- Jon Forrest Unix Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest@berkeley.edu From jtracy at ist.ucf.edu Fri May 4 08:37:20 2007 From: jtracy at ist.ucf.edu (Judd Tracy) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] LVM performance problems In-Reply-To: <200705041712.48667.bernd-schubert@gmx.de> References: <463B37D5.6020008@ist.ucf.edu> <200705041712.48667.bernd-schubert@gmx.de> Message-ID: <463B5330.9080000@ist.ucf.edu> That seemed to work just fine, it now averages ~205MB/s read thanks. Judd Bernd Schubert wrote: > On Friday 04 May 2007 15:40:37 Judd Tracy wrote: > >> I am trying to bring up a small file server and am noticing some serious >> performance issues when using LVM. I created a software raid /dev/md0 >> which I can read at ~195MB/s using the raw raid device, but as soon as I >> put LVM on top of it the read speeds drop to ~95MB/s using a raw lvm >> partition without a filesystem. When I use and xfs filesystem on top of >> the lvm partition it drops down to ~40MB/s. This seems like a >> processing power issue, but the machine is a dual processor opteron >> system with 4GB of ram in it. Does anyone have some insight into why I >> might be having such problems with the system. >> > > > try something like this: > blockdev --setra 8192 /dev/{volumegroup}/{logical_volume} > > You should also try smaller and larger numbers. 
> > Hope it helps, > Bernd > > From rosing at peakfive.com Fri May 4 08:49:36 2007 From: rosing at peakfive.com (Matt) Date: Sat May 10 01:06:03 2008 Subject: [Beowulf] programming questions here? Message-ID: <17979.22032.204709.28791@lala.site> Hi, I'm looking for a mailing list, newsgroup, or whatever, where people are interested in talking about practical programming issues on parallel machines. Optimizing and maintaining code, software engineering for high performance code, or debugging, are the types of things I'm interested in. I'm not really interested in research so much as practical information that can be put to use now. Are there many programmers lurking on this list? It looks mostly to be OS issues (which makes sense, given the subject). Comp.parallel originally used to be a place for what I'm looking for but, I'm not sure how to say this, it's basically just a place to announce conferences anymore. Thanks, Matt From victux at gmail.com Fri May 4 10:57:12 2007 From: victux at gmail.com (Victor Gomez) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] SSH without log in to nodes Message-ID: <48f9d1380705041057nc8bf5dejad97d6a253783909@mail.gmail.com> Hi, I am configuring a cluster with ssh (but without passwords) and currently the users can log in to compute nodes. I wish the clients to use the queue system (Torque, it works fine) without being able to access the compute nodes. In the past, we used rsh without allowing rlogin. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070504/a937cead/attachment.html From openlinuxsource at gmail.com Sat May 5 08:16:33 2007 From: openlinuxsource at gmail.com (Amy Lee) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] Help: About EMBOSS Message-ID: <463C9FD1.7020700@gmail.com> Hello, I'm a Chinese student in an agricultural university. And I built a Linux Beowulf cluster for Bioinformatics Department. 
According to plan, we should use EMBOSS in this cluster. I'd like to know, whether I can use EMBOSS to do a big problem on these nodes? I mean the whole nodes can solve one problem together at the same time, not arrange different jobs to nodes. And how about Sun Grid Engine? Is it useful in this plan? Thanks in advance! Amy Lee From mark.t.79 at gmail.com Sat May 5 23:48:02 2007 From: mark.t.79 at gmail.com (Mark Thompson) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] new to clusters... Message-ID: <213386ce0705052348q6c1bd4eatb78fb307c7c6ff38@mail.gmail.com> I am new to clusters and I am trying to learn as much as possible. I currently run the new ubuntu feisty fawn and am learning as much as I can about networking and such. I have several older computers ranging from a pentium 3 500 mhz to a athlon 1.8 ghz and even have a dual athlon 1.6 ghz server. I am working on getting more second hand computers for purposes of learning clustering and would like to know others input on my idea as well as suggestions for doing this. Any input would be helpful. Also, I would like suggestions for what to do with it once I have it up and running. I wouldn't mind providing services for others....would give me good experience with administration and such. thanks for your time. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070506/6408783d/attachment.html From toon at moene.indiv.nluug.nl Sun May 6 05:15:56 2007 From: toon at moene.indiv.nluug.nl (Toon Moene) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <20070504225349.GC17163@stikine.ucs.sfu.ca> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> Message-ID: <463DC6FC.4060202@moene.indiv.nluug.nl> Martin Siegert wrote: > Hi, > > this is partially triggered by the mentioniong of the SPEC MPI2007 > benchmark: what are people using as a benchmark suite for RFP purposes? Our own code. 
Anyone who does something different for serious RFP purposes is playing with their lives (at least in our surroundings - civil servants are heavily watched as far as fraud / or attempted fraud in these case go). -- Toon Moene - e-mail: toon@moene.indiv.nluug.nl - phone: +31 346 214290 Saturnushof 14, 3738 XG Maartensdijk, The Netherlands At home: http://moene.indiv.nluug.nl/~toon/ Who's working on GNU Fortran: http://gcc.gnu.org/ml/gcc/2007-01/msg00059.html From tjrc at sanger.ac.uk Sun May 6 08:08:26 2007 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] Help: About EMBOSS In-Reply-To: <463C9FD1.7020700@gmail.com> References: <463C9FD1.7020700@gmail.com> Message-ID: <748A4510-5AB4-45A4-A7E4-49E0C58D3AC2@sanger.ac.uk> On 5 May 2007, at 4:16 pm, Amy Lee wrote: > Hello, > > I'm a Chinese student in an agricultural university. And I built a > Linux Beowulf cluster for Bioinformatics Department. According to > plan, we should use EMBOSS in this cluster. > > I'd like to know, whether I can use EMBOSS to do a big problem on > these nodes? I mean the whole nodes can solve one problem together > at the same time, not arrange different jobs to nodes. > > And how about Sun Grid Engine? Is it useful in this plan? As far as I am aware, EMBOSS consists almost entirely of single threaded programs, so its role in a cluster environment is going to be for solving embarrassingly parallel problems where you can split the workload into many small independent jobs. Without knowing what the actual problem you are trying to solve is, I can't really advise on whether any of the programs in EMBOSS are what you want. Tim From rgb at phy.duke.edu Sun May 6 08:40:04 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] SSH without login in nodes In-Reply-To: <200705061758.54756.csamuel@vpac.org> References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> <200705051442.59044.csamuel@vpac.org> <200705050924.37869.kilian@stanford.edu> <200705061758.54756.csamuel@vpac.org> Message-ID: On Sun, 6 May 2007, Chris Samuel wrote: > On Sun, 6 May 2007, Kilian CAVALOTTI wrote: > >> Not that ugly, actually. But what if users do a >> ssh node -t "bash --noprofile"? ;) > > Then if any of the 500 odd tried we would spot them with some other scripts > and chase them about it. We've not had to do that yet, though, fortunately! Yes, this is the other solution. Do nothing fancy in script-land. Just tell your user base "Do Not Login To The Nodes Directly And Run Jobs". Put up a TRIVIAL script to monitor and mail admin if someone should do so. Then keep a sucker rod handy to punish offenders (with the direct support and authorization to chasten from the cluster's owner(s)). In most cases with a moderate size user base, you'll have at most one or two offenses, will whack the offenders upside the head mouthing phrases like "loss of privileges to use the cluster at all", word will get out, and things will be just fine. If you organize the cluster on an isolated network so that the nodes are only visible "through" the head node, most users will never even bother to work out "how" they can login to nodes directly, especially if you tell them that You Will Be Watching and They'd Better Not If They Know What Is Good For Them. This MIGHT not work for a cluster with a very large, very dynamic, user base -- a Grid-like environment or a large public cluster with 1000 potential users. I would bet that one could make it work even then with minimal effort, but there is no doubt that you'd be bopping folks more often as a large population is bound to have a wise-ass would-be hacker in it. Find them, bop them, offer them a job. 
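
A minimal sketch of the kind of trivial monitoring script described above, under stated assumptions: `offenders` is a made-up function name, the allowed accounts are passed as arguments, and it only sees sessions that show up in `who` output, so non-interactive ssh commands would typically be missed.

```shell
#!/bin/sh
# Sketch only; the function name and the allowed-user list are hypothetical.
# offenders ALLOWED_USER...
# Reads `who`-style lines on stdin and prints, once each, every logged-in
# user who is not in the allowed list.
offenders() {
    allowed=" $* "
    awk '{ print $1 }' | sort -u | while read -r user; do
        case "$allowed" in
            *" $user "*) ;;              # permitted account, ignore
            *) echo "$user" ;;           # anyone else is reported
        esac
    done
}

# Example with canned input instead of a live `who`:
printf 'root pts/0\nalice pts/1\n' | offenders root    # prints alice
```

On a node this might be run from cron as something like `who | offenders root`, with any non-empty output mailed to the admin; which accounts to allow and how to notify are site decisions.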
rgb > >> To handle of SSH based MPI launchers, we've disabled user logins from our >> frontend node to the compute nodes, but allowed them between compute >> nodes. So that the scheduler takes care of dispatching the initial process >> on a first node (no SSH involved), and then SSH connections can be used to >> dispatch the MPI daemons on the other nodes, from the initial one. > > Now that there's the Torque PAM module (pam_pbssimpleauth) that Garrick wrote > I'm tempted to set that up, but given our current system works I haven't > dared break it. :-) > > cheers! > Chris > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun May 6 08:46:46 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <463DC6FC.4060202@moene.indiv.nluug.nl> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> Message-ID: On Sun, 6 May 2007, Toon Moene wrote: > Martin Siegert wrote: >> Hi, >> >> this is partially triggered by the mentioniong of the SPEC MPI2007 >> benchmark: what are people using as a benchmark suite for RFP purposes? > > Our own code. > > Anyone who does something different for serious RFP purposes is playing with > their lives (at least in our surroundings - civil servants are heavily > watched as far as fraud / or attempted fraud in these case go). Wow do YOU have enlightened guardians of the public trust. Here, one is lucky if one's grant or corporate review officers understand anything but "Megaflops". Oops, that shows my age. I meant "Gigaflops" or maybe even "Teraflops". If you ask them what a "Flop" is, sometimes they might even be able to answer! One dimension is all that these folks can usually handle, at least without an extensive education process. 
Computer science grant review and in SOME cases general scientific grant review excepted, of course -- in some branches of physics there are actually folks that have heard of clusters at this point who know better. But there are also a whole lot of people stuck back there at "Flops", or "MIPs" for whom aggregate capacity in one simple measure is understandable, and for whom the nonlinear task scaling of a cluster will forever be a closed book. rgb > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sun May 6 08:53:54 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] programming questions here? In-Reply-To: <17979.22032.204709.28791@lala.site> References: <17979.22032.204709.28791@lala.site> Message-ID: On Fri, 4 May 2007, Matt wrote: > Hi, > > I'm looking for a mailing list, newsgroup, or whatever, where people > are interested in talking about practical programming issues on > parallel machines. Optimizing and maintaining code, software > engineering for high performance code, or debugging, are the types of > things I'm interested in. I'm not really interested in research so > much as practical information that can be put to use now. > > Are there many programmers lurking on this list? It looks mostly to be > OS issues (which makes sense, given the subject). There are many programmers on the list. Indeed, I'd guess that the majority of posters are coders as well as admins, cluster engineers, researchers. If you want to see the coders in action, just post a question like "Is C or Pascal a better language for writing parallel code." Just be sure to put on a biohazard suit first. 
Actually (more seriously) if you look back at the list archives you'll see lots of discussions on writing PVM and MPI code, a number of "friendly" religious wars on which language is the best, a fair number of articles on specific programming issues associated with a piece of code. All completely reasonable usage of the list, even if the issue isn't strictly about parallel code -- folks are happy to help out or talk about coding because they ARE coders and therefore like to show off their chops;-). There are also introductory resources on parallel coding methodology archived and delivered by places like the cluster monkey website saved from the short-lived Cluster World Magazine columns. Lots of articles on e.g. MPI, a few on PVM (mostly written by me:-). This kind of discussion doesn't usually 'dominate' the list traffic, but it is definitely an important component. rgb > > Comp.parallel originally used to be a place for what I'm looking for > but, I'm not sure how to say this, it's basically just a place to > announce conferences anymore. > > Thanks, > > Matt > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Sun May 6 09:29:18 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] quickie OFED build question Message-ID: <463E025E.90409@scalableinformatics.com> Will ask this on the OFED mailing lists in a bit, but has anyone successfully built OFED 1.x against OpenSuSE 10.2 ? I have 1.0, 1.1, and 1.2-rc2 failing. Works on OpenSuSE 10.1. 
Nature of the problem appears to be some kernel changes between versions (2.6.16 vs 2.6.18) that pulled some specific things out. Also, I want to turn off the -m32 switch usage, badly. Any clues/patches out there? Google hasn't been very helpful here (must be searching for wrong things). Thanks. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From siegert at sfu.ca Sun May 6 12:05:20 2007 From: siegert at sfu.ca (Martin Siegert) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <463DC6FC.4060202@moene.indiv.nluug.nl> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> Message-ID: <20070506190520.GB28930@stikine.ucs.sfu.ca> On Sun, May 06, 2007 at 02:15:56PM +0200, Toon Moene wrote: > Martin Siegert wrote: > >Hi, > > > >this is partially triggered by the mentioning of the SPEC MPI2007 > >benchmark: what are people using as a benchmark suite for RFP purposes? > > Our own code. Sigh. I thought I could avoid that response. Our own code (due to the no. of users who all believe that their code is the most important and therefore must be benchmarked) is so massive that any potential RFP respondent would have to work a year to run the code. Thus, we have to find a sensible cross section that is representative of the applications we care about. Since we will be purchasing several facilities with different performance characteristics (which are somewhat flexible depending on the price/performance ratio) we would like to set up a benchmark suite that covers the whole spectrum. > Anyone who does something different for serious RFP purposes is playing > with their lives (at least in our surroundings - civil servants are > heavily watched as far as fraud / or attempted fraud in these cases go).
I believe we can handle that. I did not say that we wouldn't run, test, modify/add applications for our own purposes. That's why we require open source. Since "own code" is the only answer that I appear to be getting, let me rephrase the question: which applications do you include as "your own code"? E.g., we will almost certainly include gromacs (which still leaves the question of the input parameters, etc.). Cheers, Martin -- Martin Siegert Head, HPC@SFU WestGrid Site Lead Academic Computing Services phone: (604) 291-4691 Simon Fraser University fax: (604) 291-4242 Burnaby, British Columbia email: siegert@sfu.ca Canada V5A 1S6 From James.P.Lux at jpl.nasa.gov Sun May 6 16:33:44 2007 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] programming questions here? In-Reply-To: References: <17979.22032.204709.28791@lala.site> Message-ID: <6.2.3.4.2.20070506162939.02f3d6b0@mail.jpl.nasa.gov> At 08:53 AM 5/6/2007, Robert G. Brown wrote: >O >There are many programmers on the list. Indeed, I'd guess that the >majority of posters are coders as well as admins, cluster engineers, >researchers. If you want to see the coders in action, just post a >question like "Is C or Pascal a better language for writing parallel >code." Just be sure to put on a biohazard suit first. Actually, if you really want to see a lot of traffic, you need to post a question like: "I know that SNOBOL is the best language for parallel programming, but because distro X has a lame implementation, am I really stuck using the brain-dead constructs of Ada, or would I be better using distro Y and APL, and rewriting the compiler for my needs." {Language names changed to avoid stepping on too many landmines} It's the combination of the absolute assertion of superiority for one choice and the denigration of multiple other choices that will really bring out the comments. As you say, fireproof suit required. James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From felix.rauch.valenti at gmail.com Sun May 6 20:20:38 2007 From: felix.rauch.valenti at gmail.com (Felix Rauch Valenti) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] fast file copying In-Reply-To: <463AF372.1010107@cse.ucdavis.edu> References: <48867455-3FE7-469D-99B3-6E5E9B54D507@galitz.org> <463AF372.1010107@cse.ucdavis.edu> Message-ID: <4eafc81b0705062020v199aa5abh2fe4517ed8afc123@mail.gmail.com> On 04/05/07, Bill Broadley wrote: > Geoff Galitz wrote: > > During an HPC talk some years ago, I recall someone mentioned a tool > > which can copy large datasets across a cluster using a ring topology. > > Perhaps someone here knows of this tool? > > Not sure about a ring topology, seems kinda silly... Why would that be silly? To clarify: The transmission through the ring happens in parallel, i.e., while a node n receives the data stream from node n-1, it writes the stream to disk and at the same time forwards it to node n+1. I have yet to see a tool that can achieve better data rates in practice, for reliable, high speed and large scale data distribution in clusters. > > More to the point, we are pushing around datasets that are about > > 1Gbyte. The datasets are pushed out to dozens of nodes all at once and > > How often? I just bit-torrented a 1GB file to 165 nodes in 3 minutes, > 1.5 minutes was the lazy way I launched it (the last node didn't > start until 1.5 minutes into the run). BTW, 140 or so of those nodes > already had 1 job per CPU running. 1 GB file in 1.5 minutes translates to about 11 MB/s, which sounds a lot like Fast Ethernet (100 Mbps). By today's standards that's relatively slow and it's quite likely that the network will be the bottleneck for almost any tool. > There are various ways to maximize I/O with bit-torrent.
Various > seeders allow uploading each block only once (usually called super > seeder mode). Assuming you have a few GB ram on the file server > you could even prefetch the file before torrenting (i.e. dd if=file_to_server > of=/dev/null) since the limit on bit-torrent bandwidth is often how > quickly you can seek. > > Additionally you can make the chunk size larger to reduce the number > of seeks. On the client side preallocation can greatly reduce > the number of seeks. More advantages of the ring topology: It uploads every block on every node exactly once, no prefetching and no seeks are required (if you replicate a whole partition or a single large file). If you are interested in more details about the technology, like models and performance measurements (somewhat old by now), check out the second paper in this list: http://www.cs.inf.ethz.ch/cops/patagonia/#relmat - Felix From hahn at mcmaster.ca Sun May 6 21:36:34 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] SSH without login in nodes In-Reply-To: References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> <200705051442.59044.csamuel@vpac.org> <200705050924.37869.kilian@stanford.edu> <200705061758.54756.csamuel@vpac.org> Message-ID: > Yes, this is the other solution. Do nothing fancy in script-land. Just right on! we have >1800 users from more institutions than I can count, and we just tell users to submit everything nontrivial through the queueing system. only rarely do we need to scold users for running nontrivial jobs on login nodes, and even more rarely do any of them mess with compute nodes directly. we make no effort to actually _prevent_ users from ssh'ing to compute nodes. IMO, part of this is making it easy to submit jobs. in our environment, just prefix your command by "sqsub" and it goes onto a compute node. (that is, extra job scripts are a mistake, and it's important to propagate cwd and env transparently to jobs.)
if compilers took more than milliseconds to run, people could certainly define FC="sqsub f90" if they wanted, and it would work correctly. From hahn at mcmaster.ca Sun May 6 22:13:45 2007 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <20070506190520.GB28930@stikine.ucs.sfu.ca> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> <20070506190520.GB28930@stikine.ucs.sfu.ca> Message-ID: > Sigh. I thought I could avoid that response. Our own code (due to the no. > of users who all believe that their code is the most important and > therefore must be benchmarked) is so massive that any potential RFP > respondent would have to work a year to run the code. Thus, we have to sure. the suggestion is only useful if the cluster is dedicated to a single purpose or two. for anything else, I really think that microbenchmarks are the only way to go. after all, your code probably doesn't do anything which is truly unique, but rather is some combination of a theoretical microbenchmark "basis set". no, I don't know how to establish the factor weights, or whether this approach really provides a good predictor. but isn't it the obvious way, even the only tractable way? >> Anyone who does something different for serious RFP purposes is playing >> with their lives (at least in our surroundings - civil servants are >> heavily watched as far as fraud / or attempted fraud in these case go). I don't really understand this statement. no one is really going to audit your decision and make you prove that you bought from the "correct" vendor - you simply need to have a plausible rationale for the decision. > E.g., we will almost certainly include gromacs (which still leaves the > question of the input parameters, etc.). that's what makes the "your own code" suggestion so uselessly narrow.
I'd be surprised if gromacs couldn't be persuaded (through varied inputs and config) to prefer most any particular hardware: IB vs 10G, x86_64 vs ia64 vs power, even more-cheaper-smaller vs fewer-fatter nodes. this is, of course, complicated by the fact that some workloads use 5 MB/core, and others would like 6000x that much. the former are probably serial, and the latter are probably not large-tight-mpi. I know of no really good way to grok this in its fullness. From landman at scalableinformatics.com Sun May 6 22:43:06 2007 From: landman at scalableinformatics.com (Joe Landman) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] quickie OFED build question In-Reply-To: <463E025E.90409@scalableinformatics.com> References: <463E025E.90409@scalableinformatics.com> Message-ID: <463EBC6A.3010701@scalableinformatics.com> Joe Landman wrote: > Will ask this on the OFED mailing lists in a bit, but has anyone > successfully built OFED 1.x against OpenSuSE 10.2 ? > > I have 1.0, 1.1, and 1.2-rc2 failing. Works on OpenSuSE 10.1. Nature > of the problem appears to be some kernel changes between versions > (2.6.16 vs 2.6.18) that pulled some specific things out. Also, I want > to turn off the -m32 switch usage, badly. Any clues/patches out there? > Google hasn't been very helpful here (must be searching for wrong things). So the answer is ... update the kernel. 2.6.18 in the OpenSuSE 10.2 flavor is not compatible (currently) with OFED-1.2 (nightly, past rc2). Looks like they are focussing on specific RH/SuSE builds. Haven't seen it on Debian/Ubuntu. Built it on OpenSuSe with updated 2.6.20.11 kernel. Thought you would like to know. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 or +1 866 888 3112 cell : +1 734 612 4615 From rgb at phy.duke.edu Sun May 6 23:22:26 2007 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] SSH without login in nodes In-Reply-To: References: <48f9d1380705021557i1196eb25w44cf4ef8ebca0572@mail.gmail.com> <200705051442.59044.csamuel@vpac.org> <200705050924.37869.kilian@stanford.edu> <200705061758.54756.csamuel@vpac.org> Message-ID: On Mon, 7 May 2007, Mark Hahn wrote: >> Yes, this is the other solution. Do nothing fancy in script-land. Just > > right on! we have >1800 users from more institutions than I can count, > and we just tell users to submit everything nontrivial through the queueing > system. only rarely do we need to scold users for running nontrivial jobs on > login nodes, and even more rarely do any of them mess with compute nodes > directly. we make no effort to actually _prevent_ users from ssh'ing to > compute nodes. > > IMO, part of this is making it easy to submit jobs. in our environment, > just prefix your command by "sqsub" and it goes onto a compute node. > (that is, extra job scripts are a mistake, and it's important to propogate > cwd and env transparently to jobs.) if compilers took more than milliseconds > to run, people could certainly define FC="sqsub f90" if they wanted, and it > would work correctly. Ya. In the olde days they would describe this as the difference between fascist administration style and -- not so fascist administration style. It is important to differentiate between security requirements and the human desire to control other humans. Even security is a cost-benefit trade-off, but it is one where being a bit more "stringent" is appropriate. Where it isn't security, you'll probably save energy effort time ulcers if you just back off on rigid control and try instead to use human measures. In fact, even to use a queuing system in the first place requires enough users and conflicting requirements to make it worth all the hassle. 
You can go a long ways on a small mixed-ownership cluster where people tend to run long jobs on their own nodes all the time and share other people's when they are idle with the shout-down-the hall "hey Joe -- you using those nodes?" scheduler. rgb > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From toon.knapen at fft.be Mon May 7 00:56:10 2007 From: toon.knapen at fft.be (Toon Knapen) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> <20070506190520.GB28930@stikine.ucs.sfu.ca> Message-ID: <463EDB9A.6000009@fft.be> Mark Hahn wrote: >> Sigh. I thought I could avoid that response. Our own code (due to the no. >> of users who all believe that their code is the most important and >> therefore must be benchmarked) is so massive that any potential RFP >> respondent would have to work a year to run the code. Thus, we have to > > sure. the suggestion is only useful if the cluster is dedicated to a > single purpose or two. for anything else, I really think that > microbenchmarks are the only way to go. after all, your code probably > doesn't do anything which is truely unique, but rather is some > combination of a theoretical microbenchmark "basis set". no, I don't > know how to establish the factor weights, or whether this approach > really provides a good predictor. but isn't it the obvious way, > even the only tractable way? > Agreed. On one hand you need micro-benchmark. OTOH you need your users to specify what are the sensitive points of their application. 
First of all, I suppose their applications are parallel, but are they BW-bound or latency-bound? How much time do the applications spend on communication? Are the apps capable of running in mixed-mode (MPI combined with multithreading), ... Why don't you make a list of multiple-choice questions in a style as described above and ask your users to fill that in. This also solves the 'weighting factor' because the users that respond to your question _care_ about the machine being suitable while the others care less. t From peter.st.john at gmail.com Mon May 7 07:23:35 2007 From: peter.st.john at gmail.com (Peter St. John) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] new to clusters... In-Reply-To: <213386ce0705052348q6c1bd4eatb78fb307c7c6ff38@mail.gmail.com> References: <213386ce0705052348q6c1bd4eatb78fb307c7c6ff38@mail.gmail.com> Message-ID: Mark, You might be interested in the discussion that followed Kyle Spaan "A Start in Parallel Programming" from March 13. I believe we had some discussion of relatively simple parallelizable apps, as well as a language flame war pitting the Righteous against the Misguided :-) I believe RGB actually detonated a small fissionable device. Peter On 5/6/07, Mark Thompson wrote: > > I am new to clusters and I am trying to learn as much as possible. I > currently run the new ubuntu feisty fawn and am learning as much as I can > about networking and such. I have several older computers ranging from a > pentium 3 500 mhz to a athlon 1.8 ghz and even have a dual athlon 1.6 ghz > server. I am working on getting more second hand computers for purposes of > learning clustering and would like to know others input on my idea as well > as suggestions for doing this. Any input would be helpful. Also, I would > like suggestions for what to do with it once I have it up and running. I > wouldn't mind providing services for others....would give me good experience > with administration and such. thanks for your time.
> > Mark > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20070507/8269a3af/attachment.html From elken at pathscale.com Mon May 7 08:30:42 2007 From: elken at pathscale.com (Tom Elken) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks Message-ID: <463F4622.2000803@pathscale.com> "Martin Siegert" wrote: > In that respect the SPEC MPI2007 benchmark appears to be ideal. Alas, > it does not appear to be open source (please correct me if I am wrong) SPEC MPI2007 will not be open source, but the price will not be very high either (and that price could be left up to the vendors to pay, bidders on your procurement). > - so far I have not even been able to figure out which applications are > being used in that benchmark suite. The names of the SPEC MPI2007 benchmarks have not been made public yet. SPEC confidentiality rules do not allow that until the suite is released. One reason for that is that it is still possible to drop a candidate code from the suite (though unlikely) and SPEC wants to spare the author of such a code possible embarrassment. The expected release date is late June this year. To give more of a hint about what MPI2007 will be like, it will consist of real applications that have been made more portable and had correctness tests defined. The types of applications of the CPU2006 floating point suite ( http://www.spec.org/cpu2006/CFP2006/ ) are similar to what will be in the MPI2007 suite, though only a few are in common between the two suites. I don't think I'm allowed to reveal the price of the MPI2007 suite yet, but it should be somewhat similar to the price structure for CPU2006 (hint, hint). > From: "Brian D. 
Ropers-Huilman" Subject: To: "Martin Siegert" Cc: beowulf@beowulf.org Message-ID: Content-Type: text/plain; charset=ISO-8859-1; format=flowed On 5/4/07, Martin Siegert wrote: >> > We will be purchasing a shared cluster for a wide community (currently >> > more than 1000 users). Thus, the common response on this list to evaluate >> > hardware - "use your own application as benchmark" - does not work: >> > users change, users' applications change, etc., etc. Thus, I need a >> > benchmark suite that tests a wide spectrum of properties. > > My answer is still to "use your own application(s)." Poll your users > and find out what they have and what they are going to run. Find some > who already have codes that scale well (>1000 cores) and ask them to > participate. Many vendors will allow you to run your own codes on > systems they have at their own sites before you decide to purchase. > These vendor-hosted systems are typically only 256 cores or less, but > it gives you some idea as to how your codes might run. > > I also suggest picking some representative synthetic benchmarks to > test floating point and integer operations, memory bandwidth, MPI > ping-pongs (the SPEC MPI2007, among others, would fit here), SPEC MPI2007 will fit in the class of what I call standard benchmarks, but not "synthetic" benchmarks. SPEC starts with the real applications as you download them, and then tries to make as few changes as possible to make it so the codes can be built, and the results can be validated on a wide range of compilers and platforms. > the HPC > Challenge codes, and the like. Many sites will then take all of these > results (synthetic + their own applications) and aggregate the > results, possibly with weighting factors, into a single number. Agreed. This is a good strategy that is used quite often. Benchmarks like SPEC CPU2006 are often added to various synthetic benchmarks and user applications as benchmark requirements for an RFP.
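For readers who want the "aggregate with weighting factors" step spelled out, here is a minimal sketch, not any site's actual procedure. It assumes each benchmark result is first normalized to a ratio (reference time divided by measured time, so bigger is better) and that the ratios are combined with a weighted geometric mean. The function name aggregate_score, the benchmark names, the weights and the two sample bids are all hypothetical.

```python
import math


def aggregate_score(ratios, weights):
    """Weighted geometric mean of per-benchmark speedup ratios.

    ratios  -- dict: benchmark name -> reference_time / measured_time (> 0)
    weights -- dict: benchmark name -> non-negative importance weight
    """
    total = sum(weights[name] for name in ratios)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Average the logs with the given weights, then map back with exp().
    log_sum = sum(weights[name] * math.log(ratios[name]) for name in ratios)
    return math.exp(log_sum / total)


# Hypothetical weights and results; none of these numbers are measurements.
weights = {"stream": 1.0, "pingpong": 1.0, "user_app": 2.0}
bid_a = {"stream": 2.0, "pingpong": 0.5, "user_app": 1.0}
bid_b = {"stream": 1.0, "pingpong": 1.0, "user_app": 2.0}

for label, bid in (("bid A", bid_a), ("bid B", bid_b)):
    print(label, round(aggregate_score(bid, weights), 3))
```

A geometric mean is used here so that one very large ratio cannot dominate the total the way it would in a plain weighted sum; a site could equally well choose an arithmetic mean, and the choice of weights remains a judgment call.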
We hope that the same will happen with SPEC MPI2007. Hopefully having a suite of applications available will make you feel like you don't have to develop quite as large a (portable) suite of your own applications as you would otherwise, or to deal with as many questions from vendors as they try to get your codes to run. -Tom Elken QLogic Corp. and the SPEC High Performance Group. From wrankin at ee.duke.edu Mon May 7 14:07:05 2007 From: wrankin at ee.duke.edu (Bill Rankin) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <463EDB9A.6000009@fft.be> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> <20070506190520.GB28930@stikine.ucs.sfu.ca> <463EDB9A.6000009@fft.be> Message-ID: <93630D73-4A52-40B6-B983-D45AA4A8BD94@ee.duke.edu> Toon Knapen wrote: > Mark Hahn wrote: >> sure. the suggestion is only useful if the cluster is dedicated >> to a single purpose or two. for anything else, I really think >> that microbenchmarks are the only way to go. I'm not sure that I agree with this - there are just so many different micro benchmarks that I would worry that relying upon them for anything other than basic system validation (which they are very good at) leaves the potential for some very big holes in your requirements. Especially in a general purpose system like the one proposed. > Why don't you make a list of multiple-choice questions in a style > as described above and ask your users to fill that in. This solves > also the 'weighting factor' because the users that respond to your > question _care_ about the machine being suitable while the others > care less. This is an excellent idea. Even with 1000's of users, you still need to understand the mix of application types you will see. Then you can make some informed judgments on the overall system architecture. Some questions that you will need to consider: - What is the "experience level" of your audience?
- Are they writing their own MPI or pthreads code? Or are they just using canned apps? What apps and how do you handle licenses? - Do you have a significant population who are looking to do larger scale (100+ process) MPI jobs? Is it worth the expense to invest in high-speed interconnects, or is GigE sufficient? - Pay attention to the storage system, both in scale and performance. It's often both a hidden bottleneck and a single point of failure. - Do you have users who want to toss around Terabytes of data? - How do you plan on backing up these Terabytes of data? These are just a few off the top of my head. There are more. But I think that a quick fairly simple survey will be invaluable towards planning this facility. Good luck, -bill From rgb at phy.duke.edu Mon May 7 14:59:18 2007 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat May 10 01:06:04 2008 Subject: [Beowulf] MPI application benchmarks In-Reply-To: <93630D73-4A52-40B6-B983-D45AA4A8BD94@ee.duke.edu> References: <20070504225349.GC17163@stikine.ucs.sfu.ca> <463DC6FC.4060202@moene.indiv.nluug.nl> <20070506190520.GB28930@stikine.ucs.sfu.ca> <463EDB9A.6000009@fft.be> <93630D73-4A52-40B6-B983-D45AA4A8BD94@ee.duke.edu> Message-ID: On Mon, 7 May 2007, Bill Rankin wrote: > > Toon Knapen wrote: >> Mark Hahn wrote: >>> sure. the suggestion is only useful if the cluster is dedicated to a >>> single purpose or two. for anything else, I really think that >>> microbenchmarks are the only way to go. > > I'm not sure that I agree with this - there are just so many different micro > benchmarks that I would worry that relying upon them for anything other than > basic system validation (which they are very good at) leaves the potential > for some very big holes in your requirements. Especially in a general > purpose system like the one proposed. I think that there is something t