From brockp at umich.edu Mon Mar 1 07:19:43 2010 From: brockp at umich.edu (Brock Palen) Date: Mon, 1 Mar 2010 10:19:43 -0500 Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> <20100223154639.GB695@sopalepc> Message-ID: <1BB43403-C149-49CE-85D6-33131ED70677@umich.edu> Just as a follow-up to this message: the MPICH2 show is now out, if you want to hear Bill and Rusty talk about MPICH2, what it does and where it came from: http://www.rce-cast.com/index.php/Podcast/rce-28-mpich2.html Thanks Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 On Feb 24, 2010, at 12:40 AM, Sangamesh B wrote: > Hi, > > I take it you are developing MPI codes and want to run them in a cluster > environment. If so, I suggest you use Open MPI, because: > > Open MPI is well developed and stable. > It has a very good FAQ section, where you can clear up your doubts > easily. > It has a built-in tight-integration method with cluster > schedulers - SGE, PBS, LSF, etc. > It has an option to choose Ethernet or InfiniBand network > connectivity at run-time. > > Thanks, > Sangamesh > > On Tue, Feb 23, 2010 at 9:16 PM, Douglas Guptill > wrote: > On Tue, Feb 23, 2010 at 09:25:45AM -0500, Brock Palen wrote: > > > (shameless plug) if you want, listen to our podcast on OpenMPI > > http://www.rce-cast.com/index.php/Podcast/rce01-openmpi.html > > > > The MPICH2 show is recorded (edited it last night, almost done!), and > > will be released this Saturday, midnight Eastern. > > If you want to hear the rough cut, to compare to OpenMPI, email me and I > > will send you the unfinished mp3. > > That sounds like a nice pair. OpenMPI vs MPICH2. > > Douglas. > From ljdursi at scinet.utoronto.ca Mon Mar 1 08:29:49 2010 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Mon, 01 Mar 2010 11:29:49 -0500 Subject: [Beowulf] confidential data on public HPC cluster Message-ID: <4B8BEB7D.9050904@scinet.utoronto.ca> Hi; We're a fairly typical academic HPC centre, and we're starting to have users talk to us about using our new clusters for projects that have various requirements for keeping data confidential. We expect these to be the first of many requests, so we want to think now about how we can and can't help such users. We have people here quite familiar with general cluster security issues, but as is usually the case in academia, we're normally concerned about hardening the cluster from the outside, and less about protecting the users from each other. We've started doing some research, but presumably people on this list have run into these issues in the past and can give us some guidance. Obviously, the degree to which we and our clusters can be of use to these users depends on the details and stringency of their legal, contractual, or other requirements. If even having small fractions of the data unencrypted in memory on a node that someone else could log in to (even if only as root) is not allowed, then I imagine it's going to be hard for them to use any machine they don't physically control. But presumably many other users will have less strict conditions on what is and isn't allowed. Are there good discussions of this somewhere?
What resources do you point users to when they have such requirements, and what sorts of things can we put in place on our end to make life easier for such users without imposing new requirements on the rest of our user base? - Jonathan -- Jonathan Dursi From jlforrest at berkeley.edu Mon Mar 1 08:51:37 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 01 Mar 2010 08:51:37 -0800 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca> References: <4B8BEB7D.9050904@scinet.utoronto.ca> Message-ID: <4B8BF099.6050709@berkeley.edu> On 3/1/2010 8:29 AM, Jonathan Dursi wrote: > Are there good discussions of this somewhere? What resources do you > point users to when they have such requirements, and what sorts of > things can we put in place on our end to make life easier for such users > without imposing new requirements on the rest of our user base? My suggestion is to follow the wit and wisdom of Nancy Reagan, and "just say no". That is getting intimate knowledge of all the cracks and crevices of sensitive/confidential data rules will be a huge time sink, and will probably take your attention away from the presumably more enjoyable benefits of a modern HPC cluster. Everytime I read about some embarrassing break-in to a confidential data storage environment, I count my lucky stars that I don't have any such data. I know that I couldn't do any better than the people whose names are now on the front page of the paper. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From hahn at mcmaster.ca Mon Mar 1 09:35:07 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 1 Mar 2010 12:35:07 -0500 (EST) Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca> References: <4B8BEB7D.9050904@scinet.utoronto.ca> Message-ID: > requirements for keeping data confidential. We expect these to be the it's critically important to pin down exactly what they mean by that. for instance, anything involving human subjects, not limited to clinical data, needs to be blinded. that's a standard requirement from any research-ethics board. it's also worth going over the basics of permissions, since researchers often don't understand what rwxrx-- means ;) > other requirements. If even having small fractions of the data unencrypted > in memory on a node that someone else could login to (even if only as root) > is not allowed, then I imagine it's going to be hard for them to use any > machine they don't physically control. But presumably many other users will > have less strict conditions on what is and isn't allowed. researchers also don't think like a security person: there's no way someone can expect confidentiality from root unless the machine is completely under their control (bare machine + install media, etc). we have a facility on campus that has data from StatsCan, and does indeed go to these sort of lengths. but that's completely incompatible with any sort of shared facility. it's easy to imagine "security theater" which might make people feel better though. for instance, one might offer them VM hosting, instead of the traditional just-another-unix-user approach. or even a deal to wipe the machine and install from scratch at the begining of the job - reboot when you're done! 
but these are simply making it harder to compromise, and IMO would just lead to a tar pit of obfuscation, not real security. (for instance, compromising a running VM is probably not hard, but tweaking the image before it runs would be easier. does the occupant then try to validate the integrity of the VM? how hard is it to intercept that check? can they then detect the interception? this applies to installing a node from media, as well.) ultimately, someone somewhere needs admin access, so it's not really a question of whether disclosure is possible, but rather who you trust. as a sysadmin, I wouldn't be upset about being asked to go through a background check, and my employer could obtain bonding for me. an audit-trail is "post-coital", but may still make sensitive clients more comfortable (though it's likely to be security theatre as well...) consider, for instance, if a group's storage is on a separate server, whose access is limited to specific admins, and whose mountd logs are available for the group's perusal. even setting up jobs to use sshfs back to the group's own server may make them feel better because they'll be able to look at the logs (again, not impregnable, just harder to get.) regards, mark hahn PS: my apology to anyone allergic to innuendo! From john.hearns at mclaren.com Mon Mar 1 10:30:08 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Mon, 1 Mar 2010 18:30:08 -0000 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca> References: <4B8BEB7D.9050904@scinet.utoronto.ca> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> I think Mark Hahn has given a lot of good advice here. It depends on the nature of the data. Is it: a) industrially confidential b) clinically confidential (ie patient identifiable data) c) government confidential (ie government department stats) d) Nuclear eyes-only If (d) you're on your own. Actually, the way you should look at this is "what happens if this data does leak out" and this depends on who gets hold of it Data from (c) leaking would probably cause little lasting or real damage - but the headlines in the press along the lines of "Government cannot keep data safe" are pretty embarrassing. I think the response is to demonstrate you took reasonable steps to secure the data. The audit trail concept is good. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From james.p.lux at jpl.nasa.gov Mon Mar 1 10:57:45 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Mon, 1 Mar 2010 10:57:45 -0800 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> References: <4B8BEB7D.9050904@scinet.utoronto.ca> <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> Message-ID: Don't forget export controls, too. (both ITAR, internationally, and also (at least for US) Commerce department > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Hearns, John > Sent: Monday, March 01, 2010 10:30 AM > To: beowulf at beowulf.org > Subject: RE: [Beowulf] confidential data on public HPC cluster > > I think Mark Hahn has given a lot of good advice here. 
> > It depends on the nature of the data. Is it: > > a) industrially confidential > > b) clinically confidential (ie patient identifiable data) > > c) government confidential (ie government department stats) > > d) Nuclear eyes-only > > If (d) you're on your own. > > > Actually, the way you should look at this is > "what happens if this data does leak out" and this depends on who gets > hold of it > > Data from (c) leaking would probably cause little lasting or real damage > - but the headlines in the press > along the lines of "Government cannot keep data safe" are pretty > embarrassing. > > I think the response is to demonstrate you took reasonable steps to > secure the data. > > The audit trail concept is good. > > > The contents of this email are confidential and for the exclusive use of the intended recipient. If > you receive this email in error you should not copy it, retransmit it, use it or disclose its contents > but should return it to the sender immediately and delete your copy. > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From ljdursi at scinet.utoronto.ca Mon Mar 1 11:08:43 2010 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Mon, 01 Mar 2010 14:08:43 -0500 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: References: <4B8BEB7D.9050904@scinet.utoronto.ca> <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B8C10BB.1080907@scinet.utoronto.ca> These are all good things to keep in mind. There must be people out there with users who do biomed work with its attendant confidentiality issues, or users who work on commercial confidential data sets -- engineering or otherwise. What do those users do on your systems, and have you had to implement things on the system side to help them? - Jonathan -- Jonathan Dursi From ashley at pittman.co.uk Mon Mar 1 15:24:11 2010 From: ashley at pittman.co.uk (Ashley Pittman) Date: Mon, 1 Mar 2010 23:24:11 +0000 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <4B8C10BB.1080907@scinet.utoronto.ca> References: <4B8BEB7D.9050904@scinet.utoronto.ca> <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> <4B8C10BB.1080907@scinet.utoronto.ca> Message-ID: <9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk> On 1 Mar 2010, at 19:08, Jonathan Dursi wrote: > These are all good things to keep in mind. > > There must be people out there with users who do biomed work with its attendant confidentiality issues, or users who work on commercial confidential data sets -- engineering or otherwise. When we put this very question to the medical ethics board the conclusion was that it was ok to send patient data (3d scans in this case) over the wider academic network as long as the data was not traceable to an individual patient. I don't think confidentially of data was ever assumed as it's very difficult to do with shared resources, merely care was taken that the data would not be of potential use to 3rd parties. As I recall the concern was about the data initially leaving the hospital, once it had done that there was little distinction between it being on a cluster or traveling over the wire somewhere to get there. Ashley. -- Ashley Pittman, Bath, UK. 
Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From gmpc at sanger.ac.uk Tue Mar 2 02:25:56 2010 From: gmpc at sanger.ac.uk (Guy Coates) Date: Tue, 02 Mar 2010 10:25:56 +0000 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk> References: <4B8BEB7D.9050904@scinet.utoronto.ca> <68A57CCFD4005646957BD2D18E60667B0F89B9DF@milexchmb1.mil.tagmclarengroup.com> <4B8C10BB.1080907@scinet.utoronto.ca> <9175CD08-82D5-4848-88B4-DBF253BE9380@pittman.co.uk> Message-ID: <4B8CE7B4.6010304@sanger.ac.uk> Ashley Pittman wrote: > On 1 Mar 2010, at 19:08, Jonathan Dursi wrote: > >> These are all good things to keep in mind. >> >> There must be people out there with users who do biomed work with its attendant confidentiality issues, > or users who work on commercial confidential data sets -- engineering or otherwise. Hi all, The usual answer you will get from lawyers and compliance officers is that: "You should take reasonable care to ensure that data is kept appropriately." However, most (all?) biomedical projects should have some sort of data-access agreement (DAA). That document states what patients have given consent for, who should have access to the data and under what conditions. That should give you a good starting point for working out what your security policy should be. (If you are going to be doing systems stuff for the group, you should also have signed the agreement.) Generally speaking, the greater the chance of being to trace data back to a specific individual, then the more paranoid you have to be about the data. It is up to the primary investigators, lawyers, compliance officers and sys-admins to turn that into a security policy. At Sanger, we run through the whole range of security policies. We have projects that deal routinely with full medical histories. They run on a set of machines physically separated from the rest of our datacentre infrastructure, with data held in encrypted databases with 2 factor logins. Data is not allowed to be removed from that setting. We have other projects that are using anonymised datasets, and that data can be held on our main cluster with the appropriate unix access controls. In the future we will probably have projects whose security requirements would be somewhere in the middle of those two extremes. The key do dealing with those projects are the words "reasonable care". Would we worry about data being kept un-encrypted in memory? Probably not. Would we put in place an automated audit process to ensure data kept on filesystems have appropriate ACLs set? Probably yes. And remember, if someone goes out of their way to get access to data that they should not, then that is a contravention of the AUP and/or local computer crime laws. (You do make your users sign an AUP, right...?) There are some example DAAs below. https://www.wtccc.org.uk/info/access_to_data_samples.shtml https://www.wtccc.org.uk/docs/Data_Access_Agreement_v17.pdf http://www.icgc.org/icgc_document/policies_and_guidelines/informed_consent_access_and_ethical_oversight Cheers, Guy -- Dr. 
Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rigved.sharma123 at gmail.com Mon Mar 1 02:54:33 2010 From: rigved.sharma123 at gmail.com (rigved sharma) Date: Mon, 1 Mar 2010 16:24:33 +0530 Subject: [Beowulf] error while make mpijava on amd_64 Message-ID: hi, i am getting this error when i do make for mpijava: make[2]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src/Java' --- Making C make[2]: Entering directory `/misc/local/mpiJAVA/mpiJava/src/C' /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_MPI.o mpi_MPI.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Comm.o mpi_Comm .c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Op.o mpi_Op.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Datatype.o mpi_ Datatype.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Intracomm.o mpi _Intracomm.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Intercomm.o mpi _Intercomm.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Cartcomm.o mpi_ Cartcomm.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Graphcomm.o mpi _Graphcomm.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Group.o mpi_Gro up.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Status.o mpi_St atus.c mpi_Status.c:244:8: warning: extra tokens at end of #endif directive /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Request.o mpi_R equest.c /usr/local/mpich-1.2.6/bin/mpicc -c -I/usr/java/j2sdk1.4.2/include -I/usr/java/j2sdk1.4.2/include/ -I/usr/local/mpich-1.2.6/include -o mpi_Errhandler.o mp i_Errhandler.c rm -f ../../lib/libmpijava.so /usr/local/mpich-1.2.6/bin/mpicc -o ../../lib/libmpijava.so \ -L/usr/local/mpich-1.2.6/lib mpi_MPI.o mpi_Comm.o mpi_Op.o mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o mpi_Cartcomm.o mpi_Gr aphcomm.o mpi_Group.o mpi_Status.o mpi_Request.o mpi_Errhandler.o ; /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21): In function `_start': : undefined reference to `main' collect2: ld returned 1 exit status make[2]: *** [../../lib/libmpi.so] Error 1 make[2]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src/C' make[1]: *** [all] Error 2 make[1]: Leaving directory `/misc/local/mpiJAVA/mpiJava/src' make: *** [all] Error 2 
----------------------------------- uname -a : Linux,testmc,2.6.9-42.0.2.EL_lustre.1.4.7.3smp #1 SMP 2006 x86_64 x86_64 x86_64 GNU/Linux, mpich :/usr/local/mpich-1.2.6 java : /usr/java/j2sdk1.4.2 both are part of path variable...what is wrong? -------------- next part -------------- An HTML attachment was scrubbed... URL: From chenyon1 at iit.edu Mon Mar 1 12:38:38 2010 From: chenyon1 at iit.edu (Yong Chen) Date: Mon, 01 Mar 2010 14:38:38 -0600 Subject: [Beowulf] [hpc-announce] Submission deadline of P2S2-2010 extended to 3/10/2010 Message-ID: [Apologies if you got multiple copies of this email. If you'd like to opt out of these announcements, information on how to unsubscribe is available at the bottom of this email.] Dear Colleague: We would like to inform you that the paper submission deadline of the Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) has been extended to March 10th, 2010. A full CFP can be found below. Thank you. CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication sub-subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings) . SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link. JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. 
IMPORTANT DATES --------------- Paper Submission: March 10th, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Heshan Lin, Virginia Tech * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinksi, Lawrence Livermore National Laboratory * Sayantan Sur, Ohio State University * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== You can unsubscribe from the hpc-announce mailing list here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ======================================================================== From hahn at mcmaster.ca Tue Mar 2 20:37:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 2 Mar 2010 23:37:06 -0500 (EST) Subject: [Beowulf] error while make mpijava on amd_64 In-Reply-To: References: Message-ID: > i am getting this error when i do make for mpijava: isn't mpijava very old? > /usr/local/mpich-1.2.6/bin/mpicc -o ../../lib/libmpijava.so \ > -L/usr/local/mpich-1.2.6/lib mpi_MPI.o mpi_Comm.o > mpi_Op.o mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o > mpi_Cartcomm.o mpi_Gr > aphcomm.o mpi_Group.o mpi_Status.o mpi_Request.o mpi_Errhandler.o ; > /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21): > In function `_start': > : undefined reference to `main' which is gcc's way of saying "I'm trying to link an executable not a shared library." it needs -shared in there. likely it also needs -fPIC when compiling the .o files. or maybe just stick to static archives, which are generally simpler... From hahn at mcmaster.ca Wed Mar 3 12:05:01 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 3 Mar 2010 15:05:01 -0500 (EST) Subject: [Beowulf] error while make mpijava on amd_64 In-Reply-To: References: Message-ID: > we r not getting latest free download version for mpijava for linux for this is version 1.2.5 circa jan 2003, right? right away this should set off some alarms, since any maintained package would have had some updates since then. it's slightly unreasonable to expect such an old package, especially one which is inherently "glueware" to be viable after sitting so long. > AMD 64 bit. 
Also can u suggest the solution n for the error i forwarded.ur > explaination is not very clear 2 me..:( using a static library would be easy ("ar r mpijava.a *.o"), but now that I give it another thought, you probably want this to act as a java extension, which probably requires being a shared library. in principle, to get the .so to work, you need to add -fPIC to each of the component compiles (which produce .o files), then add -shared to the last link-like stage which combines the .o files into a .so file. you'll have to look at the Makefile to find out where to add these flags. offhand, I'd think you should add -fPIC to CFLAGS and -static to LDFLAGS (but I don't have a copy of the Makefile.) as a testament to the whimsical nature of getting 2003-vintage software to compile, I can't even find a working download link for mpijava. web sites also wither and die if not cared for... -mark From mm at yuhu.biz Thu Mar 4 04:27:38 2010 From: mm at yuhu.biz (Marian Marinov) Date: Thu, 4 Mar 2010 14:27:38 +0200 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: References: <4B56246E.2050505@abdn.ac.uk> Message-ID: <201003041427.47481.mm@yuhu.biz> On Wednesday 20 January 2010 01:06:27 Rahul Nabar wrote: > On Tue, Jan 19, 2010 at 3:30 PM, Tony Travis wrote: > > I responded to Rahul who started this thread because his requirements > > seemed to be similar to mine: i.e. a small-scale DIY Beowulf cluster. In > > this context, every penny counts and we do not throw things away until > > they are actually dead: Old servers become new compute nodes, and so on. > > I think that lot of people reading this list are interested in running > > small Beowulf clusters for relatively small projects, like me. I've found > > the Beowulf list to be a mine of useful information, but we are not all > > running huge Beowulf clusters or supporting them commerically. > > I don't know about the others on the list, but you describe my > situation pretty accurately Tony! :) Small budget, primitive hardware > that's rarely retired etc. Sounds familiar. > Linux-Mag had a very good article about Software RAID0 vs LVM Stripe performance: http://www.linux-mag.com/cache/7582/1.html You should read it. -- Best regards, Marian Marinov -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 198 bytes Desc: This is a digitally signed message part. URL: From rigved.sharma123 at gmail.com Wed Mar 3 11:49:42 2010 From: rigved.sharma123 at gmail.com (rigved sharma) Date: Thu, 4 Mar 2010 01:19:42 +0530 Subject: [Beowulf] error while make mpijava on amd_64 In-Reply-To: References: Message-ID: Hi Mark/Friends, we r not getting latest free download version for mpijava for linux for AMD 64 bit. Also can u suggest the solution n for the error i forwarded.ur explaination is not very clear 2 me..:( On Wed, Mar 3, 2010 at 10:07 AM, Mark Hahn wrote: > i am getting this error when i do make for mpijava: >> > > isn't mpijava very old? > > > /usr/local/mpich-1.2.6/bin/mpicc -o ../../lib/libmpijava.so \ >> -L/usr/local/mpich-1.2.6/lib mpi_MPI.o mpi_Comm.o >> mpi_Op.o mpi_Datatype.o mpi_Intracomm.o mpi_Intercomm.o >> mpi_Cartcomm.o mpi_Gr >> aphcomm.o mpi_Group.o mpi_Status.o mpi_Request.o mpi_Errhandler.o ; >> >> /usr/lib/gcc/x86_64-redhat-linux/3.4.6/../../../../lib64/crt1.o(.text+0x21): >> In function `_start': >> : undefined reference to `main' >> > > which is gcc's way of saying "I'm trying to link an executable not a shared > library." 
it needs -shared in there. likely it also needs -fPIC when > compiling the .o files. or maybe just stick to static archives, which are > generally simpler... > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Fri Mar 5 07:23:47 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 5 Mar 2010 10:23:47 -0500 Subject: [Beowulf] copying data between clusters Message-ID: How does one copy large (20TB) amounts of data from one cluster to another? Assuming that each node in the cluster can only do about 30MB/sec between clusters and i want to preserve the uid/gid/timestamps, etc I know how i do it, but i'm curious what methods other people use... Just a general survey... From landman at scalableinformatics.com Fri Mar 5 08:00:03 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Mar 2010 11:00:03 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: References: Message-ID: <4B912A83.5090703@scalableinformatics.com> Michael Di Domenico wrote: > How does one copy large (20TB) amounts of data from one cluster to another? > > Assuming that each node in the cluster can only do about 30MB/sec > between clusters and i want to preserve the uid/gid/timestamps, etc > > I know how i do it, but i'm curious what methods other people use... I am biased of course, but Fedex-net with one of these: http://scalableinformatics.com/jackrabbit 1GB @ 30 MB/s is about 33s. 1TB @ 30 MB/s is about 33000s. Or more than 1/3 of a day. 20TB @ 30 MB/s ... you are looking at ~7 days to write. If you have a 1GB/s disk write speed (less than the above unit can do), 1TB takes ~1000s, 20TB takes 20000s, about 1/4 of a day. If the clusters are close enough (same data center) this could be a shared storage but you will need a fast network between them. If the clusters are far enough to avoid direct connection, chances are 30 MB/s may be optimistic on getting data between them. BTW: 30 MB/s sounds suspiciously like either a) 1GbE sustained NFS speed for some nodes or b) the speed of an IDE drive. > > Just a general survey... > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kyron at neuralbs.com Fri Mar 5 08:18:54 2010 From: kyron at neuralbs.com (kyron) Date: Fri, 05 Mar 2010 11:18:54 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: <4B912A83.5090703@scalableinformatics.com> References: <4B912A83.5090703@scalableinformatics.com> Message-ID: <58a7213ba2ad240c2da27b8d33311f57@localhost> On Fri, 05 Mar 2010 11:00:03 -0500, Joe Landman wrote: > Michael Di Domenico wrote: >> How does one copy large (20TB) amounts of data from one cluster to >> another? >> >> Assuming that each node in the cluster can only do about 30MB/sec >> between clusters and i want to preserve the uid/gid/timestamps, etc >> >> I know how i do it, but i'm curious what methods other people use... Could you clarify? Are-you actually sending from NodeXX-clusterA to NodeXX-ClusterB ? Are-we to assume aggregate bandwidth of Node*BW (as long as you don't saturate the switch fabric)? 
Also, given my comment below, I am assuming the 20TB of data is actually segmented (20TB/NodeCount) across the nodes and not 20TB*NodeCount. > I am biased of course, but Fedex-net with one of these: > http://scalableinformatics.com/jackrabbit > > 1GB @ 30 MB/s is about 33s. 1TB @ 30 MB/s is about 33000s. Or more > than 1/3 of a day. 20TB @ 30 MB/s ... you are looking at ~7 days to write. > > If you have a 1GB/s disk write speed (less than the above unit can do), > 1TB takes ~1000s, 20TB takes 20000s, about 1/4 of a day. > > If the clusters are close enough (same data center) this could be a > shared storage but you will need a fast network between them. If the > clusters are far enough to avoid direct connection, chances are 30 MB/s > may be optimistic on getting data between them. > > BTW: 30 MB/s sounds suspiciously like either a) 1GbE sustained NFS speed > for some nodes or b) the speed of an IDE drive. Given I haven't seen single 20TB drives out there yet, I doubt it to be the case. I wouldn't throw in NFS as a limiting factor (just yet) as I have been able to have sustained 250MB/s data transfer rates (2xGigE using channel bonding). And this figure is without jumbo frames so I do have some protocol overhead loss. The sending server is a PERC 5/i raid with 4*300G*15kRPM drives while the receiving well...was loading onto RAM ;) Eric Thibodeau From jmdavis1 at vcu.edu Fri Mar 5 08:22:14 2010 From: jmdavis1 at vcu.edu (Mike Davis) Date: Fri, 05 Mar 2010 11:22:14 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: References: Message-ID: <4B912FB6.5010109@vcu.edu> Michael Di Domenico wrote: > How does one copy large (20TB) amounts of data from one cluster to another? > > Assuming that each node in the cluster can only do about 30MB/sec > between clusters and i want to preserve the uid/gid/timestamps, etc > If the clusters are co-lo I wouldn't copy I would use shared storage. If they are not co-located I would use patience. Seriously though, for a one time copy, I would consider copying to an external system and then physically moving that system. To do this and preserve ownerships you will need to duplicate accounts and groups. From landman at scalableinformatics.com Fri Mar 5 08:27:22 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 05 Mar 2010 11:27:22 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: <58a7213ba2ad240c2da27b8d33311f57@localhost> References: <4B912A83.5090703@scalableinformatics.com> <58a7213ba2ad240c2da27b8d33311f57@localhost> Message-ID: <4B9130EA.8000100@scalableinformatics.com> kyron wrote: > Given I haven't seen single 20TB drives out there yet, I doubt it to be > the case. I wouldn't throw in NFS as a limiting factor (just yet) as I have I was commenting on the 30 MB/s figure. Not whether or not he had 20TB attached to it (though if he did ... that would be painful). > been able to have sustained 250MB/s data transfer rates (2xGigE using > channel bonding). And this figure is without jumbo frames so I do have some > protocol overhead loss. The sending server is a PERC 5/i raid with > 4*300G*15kRPM drives while the receiving well...was loading onto RAM ;) We are getting sustained 1+GB/s over 10GbE with NFS on a per unit basis. For IB its somewhat faster. Backing store is able to handle this easily. I think Michael may be thinking about the performance of a single node GbE or IDE rather than the necessary r/w performance to populate 20+ TB of data for data motion. 
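One quick way to pin down where that 30 MB/s ceiling actually sits is to measure the network and the disk paths separately before planning the copy. A minimal sketch, assuming iperf is available on both sides (nodeA, nodeB and /scratch are placeholder names):

    # raw network throughput between a clusterA node and a clusterB node
    nodeB$ iperf -s
    nodeA$ iperf -c nodeB -t 30

    # sustained write speed on the receiving filesystem
    nodeB$ dd if=/dev/zero of=/scratch/ddtest bs=1M count=4096 conv=fdatasync

If the wire comfortably beats 30 MB/s, the limit is the disks or the copy tool rather than the network, and spreading the transfer across several nodes is what will help.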
> > > Eric Thibodeau -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From kyron at neuralbs.com Fri Mar 5 08:30:44 2010 From: kyron at neuralbs.com (kyron) Date: Fri, 05 Mar 2010 11:30:44 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: <4B912FB6.5010109@vcu.edu> References: <4B912FB6.5010109@vcu.edu> Message-ID: On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis wrote: > Michael Di Domenico wrote: >> How does one copy large (20TB) amounts of data from one cluster to >> another? >> >> Assuming that each node in the cluster can only do about 30MB/sec >> between clusters and i want to preserve the uid/gid/timestamps, etc >> > If the clusters are co-lo I wouldn't copy I would use shared storage. If > they are not co-located I would use patience. > > Seriously though, for a one time copy, I would consider copying to an > external system and then physically moving that system. To do this and > preserve ownerships you will need to duplicate accounts and groups. ...and we are all assuming non-compressibility; otherwise, use pbzip2 ;) From akshar.bhosale at gmail.com Thu Mar 4 11:14:22 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Fri, 5 Mar 2010 00:44:22 +0530 Subject: [Beowulf] error while make mpijava on amd_64 In-Reply-To: References: Message-ID: Hi Mark, Many thanks 2 u.. Regards, Rigved On Thu, Mar 4, 2010 at 1:35 AM, Mark Hahn wrote: > we r not getting latest free download version for mpijava for linux for >> > > this is version 1.2.5 circa jan 2003, right? right away this should set > off some alarms, since any maintained package would have had some updates > since then. it's slightly unreasonable to expect such an old > package, especially one which is inherently "glueware" to be viable after > sitting so long. > > > AMD 64 bit. Also can u suggest the solution n for the error i forwarded.ur >> explaination is not very clear 2 me..:( >> > > using a static library would be easy ("ar r mpijava.a *.o"), but now that I > give it another thought, you probably want this to act as a java extension, > which probably requires being a shared library. > > in principle, to get the .so to work, you need to add -fPIC to each of the > component compiles (which produce .o files), then add -shared > to the last link-like stage which combines the .o files into a .so file. > you'll have to look at the Makefile to find out where to add these flags. > offhand, I'd think you should add -fPIC to CFLAGS and -static to LDFLAGS > (but I don't have a copy of the Makefile.) > > as a testament to the whimsical nature of getting 2003-vintage software > to compile, I can't even find a working download link for mpijava. > web sites also wither and die if not cared for... > > -mark > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wrankin.1 at gmail.com Fri Mar 5 07:59:08 2010 From: wrankin.1 at gmail.com (Bill Rankin) Date: Fri, 5 Mar 2010 10:59:08 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: References: Message-ID: Umm, you have your network guys pull a fiber run (or two) from your cluster's file server over to the other cluster's core network switch? Alternately, you unbolt and pull the shelf of FC disks out of the rack, put them on a cart and wheel them over to the other cluster's filer. (1/2 :-) It's an ill-defined problem. What's your network topology? Per-node bandwidth is pretty meaningless if you are oversubscribed (and most clusters are). What's the biggest pipe between cluster A and cluster B? -bill On Fri, Mar 5, 2010 at 10:23 AM, Michael Di Domenico wrote: > How does one copy large (20TB) amounts of data from one cluster to another? > > Assuming that each node in the cluster can only do about 30MB/sec > between clusters and I want to preserve the uid/gid/timestamps, etc. > > I know how I do it, but I'm curious what methods other people use... > > Just a general survey... -- Bill Rankin wrankin1 at gmail.com From mdidomenico4 at gmail.com Fri Mar 5 09:32:37 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 5 Mar 2010 12:32:37 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: As I'd expect from the smartest sysadmins on the planet, everyone has over-analyzed the issue... :) Let's see if I can clarify. Assume there are two clusters, clusterA and clusterB. Each cluster is 32 nodes and has 50TB of storage attached. The aggregate network bandwidth between the clusters is 800MB/sec; the problem is that the per-node bandwidth on clusterB is 30MB/sec. So if I use a single node to copy the 20TB of data from clusterB, yes, it's going to take me 7 days to copy everything. I'd like to parallelize that across multiple nodes to drive the aggregate up. I was hoping someone would pop up and say, hey, use this magical piece of software (which I'm unable to locate)... On Fri, Mar 5, 2010 at 11:30 AM, kyron wrote: > On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis wrote: >> Michael Di Domenico wrote: >>> How does one copy large (20TB) amounts of data from one cluster to >>> another? >>> >>> Assuming that each node in the cluster can only do about 30MB/sec >>> between clusters and I want to preserve the uid/gid/timestamps, etc >>> >> If the clusters are co-lo I wouldn't copy, I would use shared storage. If >> they are not co-located I would use patience. >> >> Seriously though, for a one-time copy, I would consider copying to an >> external system and then physically moving that system. To do this and >> preserve ownerships you will need to duplicate accounts and groups.
> > > ...and we are all assuming non-compressibility; otherwise, use pbzip2 ;) From john.hearns at mclaren.com Fri Mar 5 10:05:38 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 5 Mar 2010 18:05:38 -0000 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com> > > I'd like to parallelize that across multiple nodes to drive the aggregate > > up > > I was hoping someone would pop up and say, hey, use this magical piece of > > software (which I'm unable to locate)... > My recommendation also would be to use an external storage device - a USB drive would be useful, and I have been involved in a couple of industrial projects where data has been brought to a cluster on an external USB drive. It is, as people say, quite an efficient way to transfer the data. I gather that for high-def digital cinema a RAID array is physically shipped to the cinema - I guess that also helps with data security, as you could do some sort of encryption on the drives, though I might be wrong. In the digital media world, there are some fast parallel SCP boxes which are an industry standard - I gather they cost $$$$ but do make transfers faster. I forget the name, and if they don't really do parallel SCP forgive me - it's something along those lines. Re. moving data to/from a cluster over a WAN link, I did look at this recently. You can set up a FUSE filesystem running over SSH. This actually works quite well from the point of view of ease of setting up and usability, but I didn't try any serious data transfer over it - and of course it cannot be faster than ssh anyway! I did also have a look at the types of tools used by grids for bulk data transfer, but not much more than looking. Here's an interesting link I found: http://fasterdata.es.net/tools.html PS: you don't say how you are transferring the data - if via rsync, have you looked at the encryption options you are using? John Hearns From john.hearns at mclaren.com Fri Mar 5 10:07:14 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 5 Mar 2010 18:07:14 -0000 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com> These might be the boxes used by post-production/animation: http://www.rocketstream.com/company/overview/default.aspx From bill at cse.ucdavis.edu Fri Mar 5 10:10:36 2010 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Fri, 05 Mar 2010 10:10:36 -0800 Subject: [Beowulf] copying data between clusters In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com> References: <4B912FB6.5010109@vcu.edu> <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B91491C.6040604@cse.ucdavis.edu> Grid-ftp?
http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html From jlforrest at berkeley.edu Fri Mar 5 10:34:55 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Fri, 05 Mar 2010 10:34:55 -0800 Subject: [Beowulf] copying data between clusters In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com> References: <4B912FB6.5010109@vcu.edu> <68A57CCFD4005646957BD2D18E60667B0F9E7F4C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4B914ECF.60102@berkeley.edu> On 3/5/2010 10:05 AM, Hearns, John wrote: > > My recommendation also would be to use an external storage device - a > USB drive would be useful, and I have been involved in a couple of > industrial projects where data has been brought to a cluster on an > external USB drive. It is as people say quite an efficient way to > transfer the data. Yes, except the speed of even USB 2.0 would make this an unpleasant experience. These days many external drives support eSATA, which runs at regular SATA speeds so you're not facing the USB bottleneck. If your host doesn't have an eSATA connector you can buy a PCI card for not much money. Once USB 3 is ubiquitous this problem (e.g USB 2.0 vs eSATA) will go away. Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From ljdursi at scinet.utoronto.ca Fri Mar 5 11:16:12 2010 From: ljdursi at scinet.utoronto.ca (Jonathan Dursi) Date: Fri, 05 Mar 2010 14:16:12 -0500 Subject: [Beowulf] copying data between clusters In-Reply-To: <4B91491C.6040604@cse.ucdavis.edu> References: <4B912FB6.5010109@vcu.edu> <68A57CCFD4005646957BD2D18E60667B0F9E7F51@milexchmb1.mil.tagmclarengroup.com> <4B91491C.6040604@cse.ucdavis.edu> Message-ID: <4B91587C.2070004@scinet.utoronto.ca> On 03/05/2010 01:10 PM, Bill Broadley wrote: > Grid-ftp? http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html If you don't already have the globus framework set up on both ends, getting it installed just for gridftp is a huge amount of work; especially since the advantage of gridftp doesn't derive from its grid-nature at all, it's just multi-channel and the protocol has big windows. We've had good luck with rsync over hpn-ssh http://www.psc.edu/networking/projects/hpn-ssh/ There's a java package out of CERN called FDT http://monalisa.cern.ch/FDT/ which looks promising but we've not had much luck getting it to be particularly fast; but maybe we're doing something wrong. - Jonathan -- Jonathan Dursi From richard.walsh at comcast.net Fri Mar 5 11:16:56 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 5 Mar 2010 19:16:56 +0000 (UTC) Subject: [Beowulf] Configuring PBS for a mixed CPU-GPU and QDR-DDR cluster ... Message-ID: <316095273.10450501267816616006.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, I am augmenting a DDR switched SGI ICE system with one that largely network-separate (a few 4x DDR links connect them) and QDR switched. The QDR "half" also includes GPUs (one per socket). Has anyone configured PBS to manage these kinds of natural divisions as a single cluster. Some part of the QDR-GPU "half" will be dedicated to GPU work, but I would like the rest of that part of the system to run either category of work. rbw -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mathog at caltech.edu Fri Mar 5 14:14:25 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 05 Mar 2010 14:14:25 -0800 Subject: [Beowulf] Re: copying data between clusters Message-ID: Michael Di Domenico wrote: > lets see if i can clarify > > assuming there are two clusters - clusterA and clusterB > > Each cluster is 32nodes and has 50TB of storage attached Attached how? Is the 50TB sitting on one file server on each cluster, or is it distributed across the cluster? We need more details. > > the aggregate network bandwidth between the clusters is 800MB/sec > > the problem is the per-node bandwidth on clusterB is 30MB/sec Is there a switch on each cluster so that each node can write directly to the interconnect between clusters? Specifically, can node A12 write to node B12? Sounds like there might be, and since you seem to care about the per-node bandwidth on the target it sounds like you have a situation where the data is distributed on A and will again be distributed across nodes on B. If that's what you mean, then you just need to queue up a job on each node to do something like: (cd $DATADIRECTORY ; tar -cf - . ) \ | ssh matching_target_node 'cd $DATADIRECTORY; tar -xf - ) It will run in parallel using up all of your interconnect bandwidth. If on the other hand, the only per node rate you care about is the one fileserver on B, then it is a different problem. On the other, other hand, if you can temporarily store the data on each node of B, and the cumulative bandwidth that way is 800MB/s you could conceivably transfer it in parallel from A to all 32 destinations in B, and put the mess back together in B later. However, if you are still rate limited to 30Mb/sec on a single B fileserver then the total time to complete this operation will not change, only the time the data is in transit between the clusters will be reduced. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ntmoore at gmail.com Sat Mar 6 08:27:01 2010 From: ntmoore at gmail.com (Nathan Moore) Date: Sat, 6 Mar 2010 10:27:01 -0600 Subject: [Beowulf] best sophomore-level FPGA reference? Message-ID: <6009416b1003060827s39e717dbve97b8de1c559492a@mail.gmail.com> Hi All, I regularly teach a sophomore/junior level course on digital circuits. I've recently started paying attention to PLD/FPGA hardware, particularly Actel's bargain-basement igloo-nano development board, http://www.actel.com/products/hardware/devkits_boards/igloonano_starter.aspx, which is offered at a price that my students could conceivably buy for the course. I'd like to spend some time talking about FPGA's but have run into two small problems - asking you-all seems like the easiest solution: - What's your favorite text that serves as a technical introduction to FPGA's? (I'm thinking of something parallel to Essick's "Introduction to LabVIEW") My main text for the course is Floyd's "Digital Fundametals" which I think is mediocre overall. - What's your favorite class project involving FPGA programming? The obvios targets for a project right now seem like they should feature VHDL (obviously), and tend towards the architectural "sweet-spot" of FPGA computation, which to me seems like massively parallel computation of something. I've been thinking about a cartoon of the features in a digital camera, eg zoom in on an image (bitmap), rotate an image, etc. If you have something built that you're willing to share I'd be most grateful. I hope this isn't too off-topic. 
I didn't think of FPGA's in the same terms as beowulfs until I started reading the amazon reviews for http://www.amazon.com/gp/product/0471687839/ref=oss_product Nathan Moore From lindahl at pbm.com Sat Mar 6 15:36:07 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Sat, 6 Mar 2010 15:36:07 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: <4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com> References: <265537950.7749891267205793764.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com> Message-ID: <20100306233607.GA5410@bx9.net> On Fri, Feb 26, 2010 at 01:20:49PM -0500, Lawrence Stewart wrote: > Personally, I believe our thinking about interconnects has been > poisoned by thinking that NICs are I/O devices. We would be better > off if they were coprocessors. Threads should be able to send > messages by writing to registers, and arriving packets should > activate a hyperthread that has full core capabilities for acting on > them, and with the ability to interact coherently with the memory > hierarchy from the same end as other processors. I'm up for dedicating 1+ normal processor cores to doing the special stuff. Nodes have a lot of cores these days, and all-2-sided programs don't have to dedicate a core & thus would pay nothing. In the MPI 1-sided model, you'd probably want to run all the cores on separate programs and have the dedicated core get access to the appropriate process' address space. -- greg From fitz at cs.earlham.edu Fri Mar 5 10:00:49 2010 From: fitz at cs.earlham.edu (Andrew Fitz Gibbon) Date: Fri, 5 Mar 2010 12:00:49 -0600 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: <853EAB81-DB3D-42F2-8467-F6EC5F5A8B21@cs.earlham.edu> On Mar 5, 2010, at 11:32 AM, Michael Di Domenico wrote: > I was hoping someone would pop up say, hey use this magical piece of > software. (of which im unable to locate).. You might want to take a look at GridFTP from Globus (http:// globus.org). Among other things, it has support for parallel data streams and is specifically designed for transferring lots of data between clusters. It's distributed as part of the Toolkit, and it's not too hard to build /just/ GridFTP. As with any recommended software, YMMV. ---------------- Andrew Fitz Gibbon fitz at cs.earlham.edu From scott at cse.ucdavis.edu Fri Mar 5 10:21:01 2010 From: scott at cse.ucdavis.edu (Scott Beardsley) Date: Fri, 05 Mar 2010 10:21:01 -0800 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: <4B914B8D.8090206@cse.ucdavis.edu> > I'd like to paralyze that across multiple nodes to drive the aggregate up > > I was hoping someone would pop up say, hey use this magical piece of > software. (of which im unable to locate).. Sounds like what we are doing with hadoop and gridftp-hdfs for our LHC cluster. Basically you would run N gridftp servers where N is the number of nodes (and preferably uncontended uplinks as well) on the destination cluster. Then run something like bestman[1] (it'll act as a director). It is a non-trivial stack of software but it should get the job done. 
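For anyone trying the GridFTP suggestions above without the full grid stack, the client end is just globus-url-copy once a server is listening on the receiving side. A rough sketch only (host and paths are placeholders; -p sets the number of parallel TCP streams, -tcp-bs the TCP buffer size):

    globus-url-copy -p 8 -tcp-bs 2097152 \
        file:///data/bigfile gsiftp://clusterB-gw/data/bigfile

Like most FTP-style tools it won't carry uid/gid ownership across by itself, so ownership and timestamps still need a fix-up pass on the far side (or matched accounts up front, as suggested earlier in the thread).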
Scott ----------- [1] https://sdm.lbl.gov/bestman/ From dgs at slac.stanford.edu Fri Mar 5 14:54:48 2010 From: dgs at slac.stanford.edu (David Simas) Date: Fri, 5 Mar 2010 14:54:48 -0800 Subject: [Beowulf] copying data between clusters In-Reply-To: References: <4B912FB6.5010109@vcu.edu> Message-ID: <20100305225448.GB31413@horus.slac.stanford.edu> On Fri, Mar 05, 2010 at 12:32:37PM -0500, Michael Di Domenico wrote: > As i expect from the smartest sysadmins on the planet, everyone has > over analyzed the issue... :) > > lets see if i can clarify > > assuming there are two clusters - clusterA and clusterB > > Each cluster is 32nodes and has 50TB of storage attached > > the aggregate network bandwidth between the clusters is 800MB/sec > > the problem is the per-node bandwidth on clusterB is 30MB/sec > > so i use a single node to copy the 20TB of data from clusterB, yes > it's going to take me 7days to copy everything > > I'd like to paralyze that across multiple nodes to drive the aggregate up > > I was hoping someone would pop up say, hey use this magical piece of > software. (of which im unable to locate).. You might be able to use "dar" for this: http://dar.linux.free.fr/ Dar will let you slice up your 20 TB of data into even sized pieces that you can transfer in parallel, than re-construct on the receiving side. David S. > > > > On Fri, Mar 5, 2010 at 11:30 AM, kyron wrote: > > On Fri, 05 Mar 2010 11:22:14 -0500, Mike Davis wrote: > >> Michael Di Domenico wrote: > >>> How does one copy large (20TB) amounts of data from one cluster to > >>> another? > >>> > >>> Assuming that each node in the cluster can only do about 30MB/sec > >>> between clusters and i want to preserve the uid/gid/timestamps, etc > >>> > >> If the clusters are co-lo I wouldn't copy I would use shared storage. If > > > >> they are not co-located I would use patience. > >> > >> Seriously though, for a one time copy, I would consider copying to an > >> external system and then physically moving that system. To do this and > >> preserve ownerships you will need to duplicate accounts and groups. > > > > > > ...and we are all assuming non-compressibility; otherwise, use pbzip2 ;) > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From markcf at rocketmail.com Sun Mar 7 07:08:27 2010 From: markcf at rocketmail.com (MArk Fennema) Date: Sun, 7 Mar 2010 07:08:27 -0800 (PST) Subject: [Beowulf] Windows Master, Linux Slaves Message-ID: <901979.89098.qm@web65303.mail.ac2.yahoo.com> I'm not sure at all if this would be at all beowulfy, or even if it would be possible. That's why I'm asking you. What I want to set up is a cluster computer that can run standard windows applications. Random download games, Microsoft office, etc. So I was wondering, is it at all possible to run a windows master computer that's controlling Linux slaves, and if I did, would it improve the performance of usual applications (or make it possible to run more of them at the same time). I know this isn't the most useful or the cheapest way to make a computer like this, but it's kind of an experiment. __________________________________________________________________ Yahoo! Canada Toolbar: Search from anywhere on the web, and bookmark your favourite sites. Download it now http://ca.toolbar.yahoo.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hahn at mcmaster.ca Sun Mar 7 12:05:52 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sun, 7 Mar 2010 15:05:52 -0500 (EST) Subject: [Beowulf] Windows Master, Linux Slaves In-Reply-To: <901979.89098.qm@web65303.mail.ac2.yahoo.com> References: <901979.89098.qm@web65303.mail.ac2.yahoo.com> Message-ID: > I'm not sure at all if this would be at all beowulfy, or even if it would >be possible. That's why I'm asking you. What I want to set up is a cluster >computer that can run standard windows applications. Random download games, >Microsoft office, etc. So I was wondering, is it at all possible to run a >windows master computer that's controlling Linux slaves, and if I did, would >it improve the performance of usual applications (or make it possible to run >more of them at the same time). I know this isn't the most useful or the >cheapest way to make a computer like this, but it's kind of an experiment. beowulf is mainly about leveraging (commodity hardware, open software), but it's not explicitly _anti_ windows. the problem here is that to gain an advantage from parallelism (shared or distributed memory), an app almost certainly needs to be parallelism-aware, indeed, _designed_ for it. if you wanted to run a bunch of non-interacting windows programs, and wanted to distribute them across cluster nodes (with, for instance, a shared filesystem), this could certainly be done. personally, I'd start out with a linux cluster and run VMs so that the whole would be managable and secure. each windows instance would see itself alone in the VM, of course (separte desktops, registries, etc) as far as I know, MSFT will want their pound of flesh for each instance, though. in general, there are easy hardware fixes to speed up your programs. SSDs, lots of ram, 4-socket, 48-core motherboards, etc. even things like software-based shared memory single-image systems like ScaleMP. they cost money, but so does person-time. the main thing about beowulf is that there's a LOT of programs already written for MPI clusters, so the task is "just" tuning (for some combination of performance, greenness, price, convenience, etc). From eagles051387 at gmail.com Sun Mar 7 21:37:32 2010 From: eagles051387 at gmail.com (Jonathan Aquilina) Date: Mon, 8 Mar 2010 06:37:32 +0100 Subject: [Beowulf] Windows Master, Linux Slaves In-Reply-To: References: <901979.89098.qm@web65303.mail.ac2.yahoo.com> Message-ID: @Mark F i am not sure if it can be done with xp. i know windows have a HPC version and server 2003 include clustering capabilities. I agree with mark H that you should go with linux and run windows in a vm. -- Jonathan Aquilina -------------- next part -------------- An HTML attachment was scrubbed... URL: From michf at post.tau.ac.il Mon Mar 8 07:14:00 2010 From: michf at post.tau.ac.il (Micha Feigin) Date: Mon, 8 Mar 2010 17:14:00 +0200 Subject: [Beowulf] assigning cores to queues with torque Message-ID: <20100308171400.5128b849@vivalunalitshi.luna.local> I have a small local cluster in our lab that I'm trying to setup with minimum hustle to support both cpu and gpu processing where only some of the nodes have a gpu and those have only two gpu for four cores. 
It is currently setup using torque from ubuntu (2.3.6) with the torque supplied scheduler (set it up with maui initially but it was a bit of a pain for such a small cluster so I switched) This cluster is used by very few people in a very controlled environment so I don't really need any protection from each other, the queues are just for convenience to allow remote execution The problem: I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either 1. a priority to use the gpu equiped nodes last 2. or better, use only two out of four cores on the gpu equiped nodes It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell (i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else gpu queue uses 2 cores on gpu1, 2 cores on gpu2) I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that. Is it possible to achieve any of these two with torque, or is there any other simple enough queue manager that can do this (preferably with a debian package in some way to simplify maintanance). I only manage this cluster since no one else knows how to and it's supposed to take as little of my time as possible I'm looking for the simplest solution to implement and not the most versatile one. Thanks From Glen.Beane at jax.org Mon Mar 8 07:39:08 2010 From: Glen.Beane at jax.org (Glen Beane) Date: Mon, 8 Mar 2010 10:39:08 -0500 Subject: [Beowulf] assigning cores to queues with torque In-Reply-To: <20100308171400.5128b849@vivalunalitshi.luna.local> Message-ID: On 3/8/10 10:14 AM, "Micha Feigin" wrote: I have a small local cluster in our lab that I'm trying to setup with minimum hustle to support both cpu and gpu processing where only some of the nodes have a gpu and those have only two gpu for four cores. It is currently setup using torque from ubuntu (2.3.6) with the torque supplied scheduler (set it up with maui initially but it was a bit of a pain for such a small cluster so I switched) This cluster is used by very few people in a very controlled environment so I don't really need any protection from each other, the queues are just for convenience to allow remote execution The problem: I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either 1. a priority to use the gpu equiped nodes last 2. or better, use only two out of four cores on the gpu equiped nodes It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell (i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else gpu queue uses 2 cores on gpu1, 2 cores on gpu2) I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that. Is it possible to achieve any of these two with torque, or is there any other simple enough queue manager that can do this (preferably with a debian package in some way to simplify maintanance). I only manage this cluster since no one else knows how to and it's supposed to take as little of my time as possible I'm looking for the simplest solution to implement and not the most versatile one. 
you can define a resource "gpu" in your TORQUE nodes file: hostname np=4 gpu and then users can request -l nodes=1:ppn=4:gpu to get assigned a node with a gpu, but to do anything more advanced you'll need Maui or Moab. You should try the maui users mailing list, or the torque users mailing list to see if anyone else has some ideas -------------- next part -------------- An HTML attachment was scrubbed... URL: From vallard at benincosa.com Sun Mar 7 14:42:06 2010 From: vallard at benincosa.com (Vallard Benincosa) Date: Sun, 7 Mar 2010 14:42:06 -0800 Subject: [Beowulf] Windows Master, Linux Slaves In-Reply-To: <901979.89098.qm@web65303.mail.ac2.yahoo.com> References: <901979.89098.qm@web65303.mail.ac2.yahoo.com> Message-ID: We do it the opposite way quite frequently: Linux master node with Windows slaves. You could at the very least install the Linux servers from Windows over the network. Setup a Windows DHCP/HTTP server. Then you could at the very least 'kickstart' or 'autoyast' the nodes. Not sure about the software level of controlling the apps. On Sun, Mar 7, 2010 at 7:08 AM, MArk Fennema wrote: > I'm not sure at all if this would be at all beowulfy, or even if it would > be possible. That's why I'm asking you. What I want to set up is a cluster > computer that can run standard windows applications. Random download games, > Microsoft office, etc. So I was wondering, is it at all possible to run a > windows master computer that's controlling Linux slaves, and if I did, would > it improve the performance of usual applications (or make it possible to run > more of them at the same time). I know this isn't the most useful or the > cheapest way to make a computer like this, but it's kind of an experiment. > > ------------------------------ > > *Yahoo! Canada Toolbar :* Search from anywhere on the web and bookmark > your favourite sites. Download it now! > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.walsh at comcast.net Mon Mar 8 09:20:32 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Mon, 8 Mar 2010 17:20:32 +0000 (UTC) Subject: [Beowulf] assigning cores to queues with torque In-Reply-To: <20100308171400.5128b849@vivalunalitshi.luna.local> Message-ID: <1394126184.11221941268068832543.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Micha Feigin wrote: >The problem: > >I want to allow gpu related jobs to run only on the gpu >equiped nodes (i.e more jobs then GPUs will be queued), >I want to run other jobs on all nodes with either: > > 1. a priority to use the gpu equiped nodes last > 2. or better, use only two out of four cores on the gpu equiped nodes In PBS Pro you would do the following (torque may have something similar): 1. Create a custom resource called "ngpus" in the resourcedef file as in: ngpus type=long flag=nh 2. This resource should then be explicitly set on each node that includes a GPU to the number it includes: set node compute-0-5 resources_available.ncpus = 8 set node compute-0-5 resources_available.ngpus = 2 Here I have set the number of cpus per node (8) explicitly to defeat hyper-threading and the actual number of gpus per node (2). 
On the other nodes you might have: set node compute-0-5 resources_available.ncpus = 8 set node compute-0-5 resources_available.ngpus = 0 Indicating that there are no gpus to allocate. 3. You would then use the '-l select' option in your job file as follows: #PBS -l select=4:ncpus=2:ngpus=2 This requests 4 PBS resource chunks. Each includes 2 cpus and 2 gpus. Because the resource request is "chunked" these 2 cpu x 2 gpu chunks would be placed together on one physical node. Because you marked some nodes as having 2 gpus in the nodes file and some to have 0 gpus, only those that have them will get allocated. As a consumable resource, as soon as 2 were allocated the total available would drop to 0. In total you would have asked for 4 chunks distributed to 4 physical nodes (because only one of these chunks can fit on a single node). This also ensures a 1:1 mapping of cpus to gpus, although it does nothing about tying each cpu to a different socket. You would to do that in the script with numactl probably. There are other ways to approach by tying physical nodes to queues, which you might wish to do to set up a dedicate slice for GPU development. You may also be able to do this in PBS using the v-node abstraction. There might be some reason to have two production routing queues that map to slight different parts of the system. Not sure how this could be approximated in Torque, but perhaps this will give you some leads. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Mon Mar 8 12:28:09 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 08 Mar 2010 12:28:09 -0800 Subject: [Beowulf] SATA L shaped cable terminology question Message-ID: SATA cables may be purchased with an L shaped connector. I just looked at 10 of them and they were all like this when looking into the open end of the angled connector (ASCII art): +-------------------+ | | | --------------+ | | | | +------+ +-------+ | | | | | | | | There are some cables out there labeled as "left angle" cables. SOME of those have associated pictures like the one above, and others like the one below: +-------------------+ | | | | +-------------+ | | | +------+ +-------+ | | | | | | | | The question is, are all cables labeled as "Left angle" supposed to look like the lower illustration? (My guess is that this is the case and the illustrations which aren't like this are erroneously from the "right angle" variant of the cable.) Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Mon Mar 8 14:44:27 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 8 Mar 2010 17:44:27 -0500 Subject: [Beowulf] SATA L shaped cable terminology question In-Reply-To: References: Message-ID: my guess would be there are two cables because the sata cables are not overly flexible at the joint having the connector tail in either the up direction or down direction could save a loop of cable in the chassis On Mon, Mar 8, 2010 at 3:28 PM, David Mathog wrote: > SATA cables may be purchased with an L shaped connector. ?I just looked > at 10 of them and they were all like this when looking into the open end > of the angled connector (ASCII art): > > ?+-------------------+ > ?| ? ? ? ? 
? ? ? ? ? | > ?| ?--------------+ ?| > ?| ? ? ? ? ? ? ? ?| ?| > ?+------+ ? ?+-------+ > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > > > There are some cables out there labeled as "left angle" cables. ?SOME of > those have associated pictures like the one above, and others like the > one below: > > ?+-------------------+ > ?| ?| ? ? ? ? ? ? ? ?| > ?| ?+-------------+ ?| > ?| ? ? ? ? ? ? ? ? ? | > ?+------+ ? ?+-------+ > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > ? ? ? ? | ? ?| > > The question is, are all cables labeled as "Left angle" supposed to look > like the lower illustration? ?(My guess is that this is the case and the > illustrations which aren't like this are erroneously from the "right > angle" variant of the cable.) > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From gdjacobs at gmail.com Tue Mar 9 13:15:43 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue, 09 Mar 2010 15:15:43 -0600 Subject: [Beowulf] Arima motherboards with SATA2 drives In-Reply-To: References: Message-ID: <4B96BA7F.9060509@gmail.com> David Mathog wrote: > Have any of you seen a patched BIOS for the Arima HDAM* motherboards > that resolves the issue of the Sil 3114 SATA controller locking up when > it sees a SATA II disk? (Even a disk jumpered to Sata I speeds.) > Silicon Image released a BIOS fix for this, but since all of these > motherboards use a Phoenix BIOS, it is not like an AMI or Award BIOS, > where there are published methods for swapping out the broken chunk of > BIOS (5.0.49) for the one with the fix (5.4.0.3). Sure, one could work > around this on a single disk system, at least, with an IDE to SATA2 > converter, or a PCI(X) Sata(2) controller, but reflashing the BIOS would > be easier. Or it would be if Flextronics, who bought this product line > from Arima, would issue another BIOS update :-(. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech I'm assuming that the boffin Flextronics has handling legacy support for Arima is not being very responsive? Well, editing the BIOS image for the mainboard seams kind of dodgy. If chassis space isn't a problem, I would think replacing the controller would be a better solution. I'm also unsure if Coreboot is a viable option, although it seems the HDAMA is supported. I'm not 100% sure if the Sil controller is, though. Interesting problem, though. If you want to, find a copy of BNTBTC (Bog Number Two's BIOS Tools Collection) and install Phoenix Bios Editor. Hopefully you have no missing VBVM 6.0 files. You'll need to find the correct module for replacement. I was just checking the ROM image myself, but I was using an older BIOS editor and things were a little gnarly. We'll see with the new version... -- Geoffrey D. Jacobs From mathog at caltech.edu Tue Mar 9 13:41:00 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 09 Mar 2010 13:41:00 -0800 Subject: [Beowulf] cpufreq, multiple cores, load Message-ID: I am currently configuring a new (for us) dual Opteron 280 system. cpufreq works on this system, moving each pair of cores between 1000 and 2400MHz using the "ondemand" governor. 
The interesting thing, at least from my point of view, is how rapidly the power savings degrade as CPU load increases. Naively one might have thought it would go up in steps corresponding to the power difference between 2400MHz full CPU load on a core and 1000MHz idle on the same core. Not so. Here is the data measured with our Kill-A-Watt, view the table with a fixed width font: Watts CPU0 MHz CPU1 MHz cpuburn# governor core0 core1 core0 core1 115 1000 1000 1000 1000 0 ondemand 157 2400 2400 1000 1000 1 ondemand 199 2400 2400 2400 2400 2 ondemand 214 2400 2400 2400 2400 3 ondemand 228 2400 2400 2400 2400 4 ondemand 172 2400 2400 2400 2400 0 performance 186 2400 2400 2400 2400 1 performance 199 2400 2400 2400 2400 2 performance 214 2400 2400 2400 2400 3 performance 228 2400 2400 2400 2400 4 performance Starting one cpuburn flips both of the cores on the first CPU to the faster clock speed, resulting in a 42W (37%) increase in power consumption. Starting a second cpuburn apparently schedules it on one of the cores on the unused second processor, rather than on the equally unused, but already sped up, second core on the first CPU. This flips the remaining two cores also to 2400 MHz, negating any further benefit from "ondemand" as more cpuburn processes are added. That is, there are 5 states for increasing cpuburn load but, only the lowest two have different power consumption for "ondemand" than for "performance". I have not repeated this experiment yet with the "conservative" governor, but since cpuburn is intentionally such a cpu hog, I expect the results would be about the same. Anyway, this is just 4 cores total, but it makes me wonder what happens with a system having say, two quad core processors, where if the same sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still only 1/4 of total available CPU capacity) will effectively negate the energy saving modes. For instance, imagine some future 24 core behemoth that acts the same way. One might almost dispense with power saving modes altogether if one CPU intensive job is going to kick the other 23 cores into a high power state. Or do the newer CPUs, either AMD's or Intel's, allow different frequencies on each core of a CPU? (Kernel 2.6.31.12, Arima HDAMAI motherboard, Mandriva 2010.0). Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From bill at cse.ucdavis.edu Tue Mar 9 13:50:05 2010 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 09 Mar 2010 13:50:05 -0800 Subject: [Beowulf] cpufreq, multiple cores, load In-Reply-To: References: Message-ID: <4B96C28D.1000900@cse.ucdavis.edu> David Mathog wrote: > Starting a second cpuburn apparently schedules it > on one of the cores on the unused second processor, rather than > on the equally unused, but already sped up, second core on the first > CPU. Since that gives the most additional performance that seems a reasonable default. So you add an additional cache and memory system. Numactl or the related system calls would let you schedule it on the first CPU if desired. > This flips the remaining two cores also to 2400 MHz, negating any > further benefit from "ondemand" as more cpuburn processes are added. > That is, there are 5 states for increasing cpuburn load but, only the > lowest two have different power consumption for "ondemand" than for > "performance". Unless you choose to do otherwise. 
If you are willing to ignore the 2nd chips cache and memory system you could keep power lower until over half the system is busy. > Anyway, this is just 4 cores total, but it makes me wonder what happens > with a system having say, two quad core processors, where if the same > sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still > only 1/4 of total available CPU capacity) will effectively negate the > energy saving modes. For instance, imagine some future 24 core behemoth > that acts the same way. One might almost dispense with power saving > modes altogether if one CPU intensive job is going to kick the other 23 > cores into a high power state. Or do the newer CPUs, either AMD's or > Intel's, allow different frequencies on each core of a CPU? I seem to recall that one of the recent AMD tweaks was to allow additional tweaks. I forget if it was voltage, or clock speed that can now be controller per core. I believe the north bridge and memory bus also can enter a lower power state when not in use. Next time I have a nehalem dual socket in my office I'll test it. From gdjacobs at gmail.com Tue Mar 9 15:17:47 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue, 09 Mar 2010 17:17:47 -0600 Subject: [Beowulf] Arima motherboards with SATA2 drives In-Reply-To: References: Message-ID: <4B96D71B.9080504@gmail.com> David Mathog wrote: >> I'm assuming that the boffin Flextronics has handling legacy support for >> Arima is not being very responsive? > > If by "very" you mean "at all", then you would be accurate. > >> Well, editing the BIOS image for the mainboard seams kind of dodgy. > > That's what I ended up doing though, and it worked. By any chance is the flash ROM socketed and did you have a spare board for hot swapping? That sort of insurance make's me breathe easier when doing weird firmware updates. >> If >> chassis space isn't a problem, I would think replacing the controller >> would be a better solution. > > That's an option for the one machine I moved to a different case, the > others I'm thinking about getting are in strange little cases, and they > will only have room for one disk, so just getting the one controller to > see a SATA II disk will be good enough. Yeah, so you were stuck. >> If you want to, find a copy of BNTBTC (Bog Number Two's BIOS Tools >> Collection) and install Phoenix Bios Editor. Hopefully you have no >> missing VBVM 6.0 files. You'll need to find the correct module for >> replacement. I was just checking the ROM image myself, but I was using >> an older BIOS editor and things were a little gnarly. We'll see with the >> new version... That would be Borg... > Found this thread: > > http://forums.mydigitallife.info/threads/13358-How-to-Use-New-Phoenix-Bios-Mod-Tool-to-Modify-Phoenix-Dell-Insyde-EFI-Bios-Files?s=016f5c20a0849a623a806a9a440db2fb > > with a link to PBE 2.1.0.0. Used that on the 1.11 BIOS (downloaded > from Flextronic), stuffed in the SiI stuff (5403.bin), flashed the ROM, > and it worked, seeing the SATA II disk. Note, there were two > complications. 1st, PBE installs owned by the installer, and there are > access issues, solved those by changing ownership to Everbody:FULL at > the top level of that directory tree. Second, replacing a module. I > tried that using the menu interface, and it seemed to have done it, but > the resulting BIOS still had the old SiI section. 
So used PBE to > unpack, from another window, copied 5403.bin to ....\TEMP\OPROM3.ROM > (where it lived in this BIOS), changed and unchanged a string to enable > the build command, then BUILD, then save BIOS. It seems BIOS modding has become somewhat of a cottage industry as a way of getting around WGA and/or being able to boot Windows 7. Examining the module binaries, I see they did us the favor of providing copyrights in the first line. Very considerate of them. Distinguishing between the various addon modules is easy this way. > Tried another tool called Phoenix_Tool_1.24 (from the thread cited > above) and it could break out the pieces of the BIOS, then you could > replace one, and run prepare and catenate. Except that what comes out > won't flash with phlash.exe. See my entry near the end of the forum > cited above. I used a little tool of mine called "intercept" to > intercept all program calls, and it wasn't doing anything special with > prepare.exe, catenate.exe, or the other programs. The only thing left > was that PBE itself must be appending some stuff after the catenate > output, and that seems to be metadata for phlash16. Probably one could > use another flash program that didn't need these, but I wasn't going to > try that. In any case, these appended bytes were the same when PBE > built an unmodified and a modified BIOS, so it would apparently be safe > to just use "dd" and hack the difference in length off the original BIOS > image and append it to the output from catenate. The phlash command I > used was: > > phlash16 HDAMAI.11C /C /Mode=3 /CS /EXIT /PN /V > > Since the legal status of these versions of PBE is somewhat dubious, I > didn't post these last few results in public anywhere. I won't tell if you won't. > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech Hey, glad it's working. I guess the war horses will soldier on some more. -- Geoffrey D. Jacobs From hahn at mcmaster.ca Tue Mar 9 20:14:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 9 Mar 2010 23:14:59 -0500 (EST) Subject: [Beowulf] cpufreq, multiple cores, load In-Reply-To: References: Message-ID: > 2400MHz using the "ondemand" governor. The interesting thing, at least > from my point of view, is how rapidly the power savings degrade as CPU > load increases. well, one of the big themes in recent chip development is factoring out pieces that can be separately clocked. it's also true that the platforms themselves have improved (lower voltage ddr3, power/flop, etc) > Watts CPU0 MHz CPU1 MHz cpuburn# governor > core0 core1 core0 core1 > 115 1000 1000 1000 1000 0 ondemand > 157 2400 2400 1000 1000 1 ondemand > 172 2400 2400 2400 2400 0 performance > 186 2400 2400 2400 2400 1 performance > 199 2400 2400 2400 2400 2 ondemand/performance > 214 2400 2400 2400 2400 3 ondemand/performance > 228 2400 2400 2400 2400 4 ondemand/performance I rearranged your table a bit. 157-115=42W ramps up a socket from 1-2.4. 186-157=29W gets the second socket going (bit surprising). 172-115=57W is the clock-related savings when idle. 13/15/14W for adding core load once the sockets are spun up. I'm guessing it's the latter that surprised you - per core power savings are minor, dominated by socket-related power. but Intel and AMD have been talking along those lines for a few years... > consumption. 
Starting a second cpuburn apparently schedules it > on one of the cores on the unused second processor, rather than > on the equally unused, but already sped up, second core on the first well, that's a kernel/scheduler choice, probably pretty hackable. or else just use numactl to direct the second cpuburn away from the second socket. in fact, using numactl to control memory allocation would probably be a good idea anyway. > Anyway, this is just 4 cores total, but it makes me wonder what happens > with a system having say, two quad core processors, where if the same > sort of scheduling/cpufreq logic apply, two CPU saturating jobs (still > only 1/4 of total available CPU capacity) will effectively negate the > energy saving modes. I'm guessing that we can have pretty high-res power savings if we're willing to do some extra things like controlling which procs use which cores, which memory banks, perhaps a syscall interface to control clock modulation... > cores into a high power state. Or do the newer CPUs, either AMD's or > Intel's, allow different frequencies on each core of a CPU? I've seen that on AMD presentations, but iirc it was an istanbul thing. -mark From gmpc at sanger.ac.uk Wed Mar 10 01:20:57 2010 From: gmpc at sanger.ac.uk (Guy Coates) Date: Wed, 10 Mar 2010 09:20:57 +0000 Subject: [Beowulf] cpufreq, multiple cores, load In-Reply-To: References: Message-ID: <4B976479.40802@sanger.ac.uk> >> consumption. Starting a second cpuburn apparently schedules it >> on one of the cores on the unused second processor, rather than >> on the equally unused, but already sped up, second core on the first > > well, that's a kernel/scheduler choice, probably pretty hackable. > or else just use numactl to direct the second cpuburn away from > the second socket. in fact, using numactl to control memory allocation > would probably be a good idea anyway. > You can change that schedular behaviour by twiddling sched_mc_power_savings echo 1 > /sys/devices/system/cpu/sched_mc_power_savings http://www.lesswatts.org/tips/cpu.php "'sched_mc_power_savings' tunable under /sys/devices/system/cpu/ controls the Multi-core related tunable. By default, this is set to '0' (for optimal performance). By setting this to '1', under light load scenarios, the process load is distributed such that all the cores in a processor package are busy before distributing the process load to other processor packages." Cheers, Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1HH, UK Tel: +44 (0)1223 834244 x 6925 Fax: +44 (0)1223 496802 -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. 
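For reference, a minimal sketch pulling together the two suggestions above (Mark's numactl pinning and Guy's sched_mc_power_savings knob). It assumes a kernel built with CONFIG_SCHED_MC, that NUMA node 0 corresponds to the first socket, and that "cpuburn" stands in for whichever burn binary is in use; the sysfs writes need root:

   # prefer filling one package before waking the second (default 0 = spread for performance)
   echo 1 > /sys/devices/system/cpu/sched_mc_power_savings

   # confirm which governor each core is running
   cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

   # pin a second instance onto socket 0 instead of letting the scheduler
   # spin up socket 1 (numactl must be installed)
   numactl --cpunodebind=0 --membind=0 ./cpuburn &

Whether the second socket then stays at its low clock still depends on the ondemand governor seeing both of its cores idle.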
From john.hearns at mclaren.com Wed Mar 10 01:46:33 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 10 Mar 2010 09:46:33 -0000 Subject: [Beowulf] cpufreq, multiple cores, load In-Reply-To: <4B976479.40802@sanger.ac.uk> References: <4B976479.40802@sanger.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com> > > > You can change that schedular behaviour by twiddling > sched_mc_power_savings > > echo 1 > /sys/devices/system/cpu/sched_mc_power_savings > > > http://www.lesswatts.org/tips/cpu.php > Sorry to set off at a slight tangent - on a related matter, does anyone have a good writeup on the CPU scaling controls in Linux for Nehalem - ie controlling TurboBoost. Yes, I have Googled but as above someone here normally had a blindlingly good resource. The reason I ask is that we have some Nehalem boxes, which will principally run a serial application. Could be useful to see what effect turning on turboboost has. I also saw there is a Gnome taskbar applet for showing CPU frequencies - anyone got recommendations for a good monitor for that sort of thing? Gkrellem springs to mind. John Hearns McLaren Racing The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From cap at nsc.liu.se Wed Mar 10 04:39:41 2010 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Wed, 10 Mar 2010 13:39:41 +0100 Subject: [Beowulf] cpufreq, multiple cores, load In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com> References: <4B976479.40802@sanger.ac.uk> <68A57CCFD4005646957BD2D18E60667B0FA56FD0@milexchmb1.mil.tagmclarengroup.com> Message-ID: <201003101339.45496.cap@nsc.liu.se> On Wednesday 10 March 2010, Hearns, John wrote: ... > Sorry to set off at a slight tangent - on a related matter, does anyone > have a good writeup on the CPU scaling controls in Linux for Nehalem - > ie controlling TurboBoost. > Yes, I have Googled but as above someone here normally had a blindlingly > good resource. Turbo mode shows up as its own frequency step (one Mhz over normal max): cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies 2268000 2267000 2133000 2000000 1867000 1733000 1600000 2267(Mhz) is the highest "normal" setting on this E5520. 2268(Mhz) is the turbo mode setting. Linux will treat 2268 as a normal available frequency (and since it's the highest this is what will get chosen under load). If you wan't to run without turbo mode just set the frequency statically to 2267 (of course this will be true even at idle then). When the system is running at 2268 it's using turbo mode, the actual frequency will be somewhere between 2267 and MAX_TURBO. Where it will end up between these two depends on power/thermal head room and what MAX_TURBO is depends on the specific cpu model. For E55* MAX_TURBO is "one step up", that is a 2.267 E5520 can go to 2.4 (normal max of E5530). For X55* it's two steps up (a 2.667 can go to 2.93 etc.). > The reason I ask is that we have some Nehalem boxes, which will > principally run a serial application. > Could be useful to see what effect turning on turboboost has. > I also saw there is a Gnome taskbar applet for showing CPU frequencies - > anyone got recommendations for a good monitor for that sort of thing? > Gkrellem springs to mind. 
Note that when running turbo mode linux will think it's running a normal frequency mode one Mhz up (2268 Mhz for the E5520) not at the true frequency. /Peter > John Hearns > McLaren Racing -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part. URL: From mathog at caltech.edu Wed Mar 10 13:10:09 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 10 Mar 2010 13:10:09 -0800 Subject: [Beowulf] cpufreq, multiple cores, load Message-ID: Guy Coates wrote: > You can change that schedular behaviour by twiddling sched_mc_power_savings > > echo 1 > /sys/devices/system/cpu/sched_mc_power_savings > > > http://www.lesswatts.org/tips/cpu.php Good link. Setting this option made a slight difference when 2 cpuburn processes were running, and not for any other number. Instead of 199W power consumption was reduced to 171W, and two cores showed 2400 MHz and two 1000 MHz, instead of all 4 at 2400 MHz. Hopefully AMD has addressed this in later processors, because the 40W hit induced by rev'ing up all cores on a processor is much bigger than the ~16W hit associated with actually running the program. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From tom.ammon at utah.edu Wed Mar 10 14:14:26 2010 From: tom.ammon at utah.edu (Tom Ammon) Date: Wed, 10 Mar 2010 15:14:26 -0700 Subject: [Beowulf] ARP timers on RHEL4 vs. RHEL5 Message-ID: <4B9819C2.10100@utah.edu> Hi, I've been trying to figure out how to adjust the ARP timeout on kernel 2.6.9 and I found the following in /proc/sys/net/ipv4/neigh/ib0 (its an IB interface I am interested in changing) with the following values. This is on kernel 2.6.9-89ELsmp (RHEL4) : [root at up255 ib0]# cat anycast_delay 99 [root at up255 ib0]# cat app_solicit 0 [root at up255 ib0]# cat base_reachable_time 30 [root at up255 ib0]# cat delay_first_probe_time 5 [root at up255 ib0]# cat gc_stale_time 60 [root at up255 ib0]# cat locktime 99 [root at up255 ib0]# cat mcast_solicit 3 [root at up255 ib0]# cat proxy_delay 79 [root at up255 ib0]# cat proxy_qlen 64 [root at up255 ib0]# cat retrans_time 99 [root at up255 ib0]# cat ucast_solicit 3 [root at up255 ib0]# cat unres_qlen 3 When I test this, along with per-flow ECMP (using the iproute2 utils), I see that the ARP cache is timing out about every 10 minutes (I observe this by load balancing an iperf flow between two different gateway machines and then graphing the interface traffic) On a newer kernel, 2.6.18-164.11.1.el5 (RHEL5), I see mostly the same parms available, but a few new ones have been added. However, all of the parms that are the same name between the two kernels are the same values: [root at gateway2 ib0]# cat anycast_delay 99 [root at gateway2 ib0]# cat app_solicit 0 [root at gateway2 ib0]# cat base_reachable_time 30 [root at gateway2 ib0]# cat base_reachable_time_ms 30000 [root at gateway2 ib0]# cat delay_first_probe_time 5 [root at gateway2 ib0]# cat gc_stale_time 60 [root at gateway2 ib0]# cat locktime 99 [root at gateway2 ib0]# cat mcast_solicit 3 [root at gateway2 ib0]# cat proxy_delay 79 [root at gateway2 ib0]# cat proxy_qlen 64 [root at gateway2 ib0]# cat retrans_time 99 [root at gateway2 ib0]# cat retrans_time_ms 1000 [root at gateway2 ib0]# cat ucast_solicit 3 [root at gateway2 ib0]# cat unres_qlen 3 Yet when I observe the same traffic flow with this machine, the ARP cache times out about once per minute. 
Is there another set of parameters somewhere that govern how often the kernel times out the ARP cache? If so, where might I find that? Is there any kernel documentation that talks about changing ARP timers on the linux kernel? Tom Ammon -- -------------------------------------------------------------------- Tom Ammon Network Engineer Office: 801.587.0976 Mobile: 801.674.9273 Center for High Performance Computing University of Utah http://www.chpc.utah.edu From gdjacobs at gmail.com Thu Mar 11 10:13:49 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Thu, 11 Mar 2010 12:13:49 -0600 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA8F107.7010805@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090910073957.GA8487@gretchen.aei.mpg.de> <4AA8F107.7010805@avalon.umaryland.edu> Message-ID: <4B9932DD.4050502@gmail.com> psc wrote: > Thank you all for the answers. Would you guys please share with me some > good brands of those > 200+ 1GB Ethernet switches? I think I'll leave our current clusters > alone , but the new cluster I > will design for about 500 to 1000 nodes --- I don't think that we will > go much above since for big jobs > our scientists using outside resources. We do all our calculations and > analysis on the nodes and only the final produce > we sent to the frontend , also we don't run jobs across the nodes , so I > don't need to get too much creative with the network > beside being sure that I can expand the cluster without having the > switches as a limitation (our current situation) > > thank you again! In alphabetical order... Alcatel, Cisco, Extreme, Force10, Foundry, HP, Juniper, Nortel (although they're not terribly stable, financially). All these companies make the requisite hardware in a variety of configurations, from small switches, to big modular chassis. I have only personally used Procurves. > Henning Fehrmann wrote: >> Hi >> >> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: >> >>> I wonder what would be the sensible biggest cluster possible based on >>> 1GB Ethernet network . >>> >> Hmmm, may I cheat and use a 10Gb core switch? >> >> If you setup a cluster with few thousand nodes you have to ask yourself >> whether this network should be non-blocking or not. >> >> For a non blocking network you need the right core-switch technology. >> Unfortunately, there are not many vendors out there which provide >> non-blocking Ethernet based core switches but I am aware of at least >> two. One provides or will provide 144 10Gb Ethernet ports. Another one >> sells switches with more than 1000 1 GB ports. >> You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If >> you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. >> >> It might be also possible to cross connect two of these core-switches >> with the help of some smaller switches so that one ends up with 288 >> 10Gb ports and, in principle, one might connect 2880 nodes in a >> non-blocking way, but we did not have the possibility to test it >> successfully yet. One of problems is that the internal hash table can >> not store that many mac addresses. Anyway, one probably needs to change >> the mac addresses of the nodes to avoid an overflow of the hash tables. >> An overflow might cause arp storms. >> >> Once this works one runs into some smaller problems. One of them is the arp >> cache of the nodes. 
It should be adjusted to hold as many mac addresses >> as you have nodes in the cluster. >> >> >> >>> And especially how would you connect those 1GB >>> switches together -- now we have (on one of our four clusters) Two 48 >>> ports gigabit switches connected together with 6 patch cables and I just >>> ran out of ports for expansion and wonder where to go from here as we >>> already have four clusters and it would be great to stop adding cluster >>> and start expending them beyond number of outlets on the switch/s .... >>> NFS and 1GB Ethernet works great for us and we want to stick with it , >>> but we would love to find a way how to overcome the current "switch >>> limitation". >>> >> With NFS you can nicely test the setup. Use one NFS server and let all >> nodes write different files into it and look what happens. >> >> Cheers, >> Henning >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Geoffrey D. Jacobs From gdjacobs at gmail.com Thu Mar 11 10:16:29 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Thu, 11 Mar 2010 12:16:29 -0600 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA8F107.7010805@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090910073957.GA8487@gretchen.aei.mpg.de> <4AA8F107.7010805@avalon.umaryland.edu> Message-ID: <4B99337D.8070203@gmail.com> psc wrote: > Thank you all for the answers. Would you guys please share with me some > good brands of those > 200+ 1GB Ethernet switches? I think I'll leave our current clusters > alone , but the new cluster I > will design for about 500 to 1000 nodes --- I don't think that we will > go much above since for big jobs > our scientists using outside resources. We do all our calculations and > analysis on the nodes and only the final produce > we sent to the frontend , also we don't run jobs across the nodes , so I > don't need to get too much creative with the network > beside being sure that I can expand the cluster without having the > switches as a limitation (our current situation) > > thank you again! > > > Henning Fehrmann wrote: >> Hi >> >> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: >> >>> I wonder what would be the sensible biggest cluster possible based on >>> 1GB Ethernet network . >>> >> Hmmm, may I cheat and use a 10Gb core switch? >> >> If you setup a cluster with few thousand nodes you have to ask yourself >> whether this network should be non-blocking or not. >> >> For a non blocking network you need the right core-switch technology. >> Unfortunately, there are not many vendors out there which provide >> non-blocking Ethernet based core switches but I am aware of at least >> two. One provides or will provide 144 10Gb Ethernet ports. Another one >> sells switches with more than 1000 1 GB ports. >> You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If >> you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. >> >> It might be also possible to cross connect two of these core-switches >> with the help of some smaller switches so that one ends up with 288 >> 10Gb ports and, in principle, one might connect 2880 nodes in a >> non-blocking way, but we did not have the possibility to test it >> successfully yet. 
One of problems is that the internal hash table can >> not store that many mac addresses. Anyway, one probably needs to change >> the mac addresses of the nodes to avoid an overflow of the hash tables. >> An overflow might cause arp storms. >> >> Once this works one runs into some smaller problems. One of them is the arp >> cache of the nodes. It should be adjusted to hold as many mac addresses >> as you have nodes in the cluster. >> >> >> >>> And especially how would you connect those 1GB >>> switches together -- now we have (on one of our four clusters) Two 48 >>> ports gigabit switches connected together with 6 patch cables and I just >>> ran out of ports for expansion and wonder where to go from here as we >>> already have four clusters and it would be great to stop adding cluster >>> and start expending them beyond number of outlets on the switch/s .... >>> NFS and 1GB Ethernet works great for us and we want to stick with it , >>> but we would love to find a way how to overcome the current "switch >>> limitation". >>> >> With NFS you can nicely test the setup. Use one NFS server and let all >> nodes write different files into it and look what happens. >> >> Cheers, >> Henning >> > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf It looks like Allied Telesis makes chassis switches now too. -- Geoffrey D. Jacobs From carsten.aulbert at aei.mpg.de Thu Mar 11 22:41:18 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Fri, 12 Mar 2010 07:41:18 +0100 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: =?utf-8?q?how=09large_of_an_installation_have_people_used?= NFS, with? In-Reply-To: <4B99337D.8070203@gmail.com> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA8F107.7010805@avalon.umaryland.edu> <4B99337D.8070203@gmail.com> Message-ID: <201003120741.20840.carsten.aulbert@aei.mpg.de> On Thursday 11 March 2010 19:16:29 Geoff Jacobs wrote: > > It looks like Allied Telesis makes chassis switches now too. > as well as Fortinet (I don't think Henning named them), they took over the WovenSystems stuff after the latter went under. Cheers CArsten -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 1871 bytes Desc: not available URL: From hahn at mcmaster.ca Fri Mar 12 07:21:38 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 12 Mar 2010 10:21:38 -0500 (EST) Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: =?utf-8?q?how=09large_of_an_installation_have_people_used?= NFS, with? In-Reply-To: <201003120741.20840.carsten.aulbert@aei.mpg.de> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA8F107.7010805@avalon.umaryland.edu> <4B99337D.8070203@gmail.com> <201003120741.20840.carsten.aulbert@aei.mpg.de> Message-ID: >> It looks like Allied Telesis makes chassis switches now too. > > as well as Fortinet (I don't think Henning named them), they took over the > WovenSystems stuff after the latter went under. interesting - I wondered what had happened to Woven. but Woven reminds me of Gnodal, which also aims to produce a smarter fabric that can scalably switch ethernet. the latter seems promising to me, since they're leveraging lessons learned from Quadrics. 
From michf at post.tau.ac.il Wed Mar 10 13:20:56 2010 From: michf at post.tau.ac.il (Micha Feigin) Date: Wed, 10 Mar 2010 23:20:56 +0200 Subject: [Beowulf] assigning cores to queues with torque In-Reply-To: References: <20100308171400.5128b849@vivalunalitshi.luna.local> Message-ID: <20100310232056.65e553b9@vivalunalitshi.luna.local> On Mon, 8 Mar 2010 10:39:08 -0500 Glen Beane wrote: > > > > On 3/8/10 10:14 AM, "Micha Feigin" wrote: > > I have a small local cluster in our lab that I'm trying to setup with minimum > hustle to support both cpu and gpu processing where only some of the nodes have > a gpu and those have only two gpu for four cores. > > It is currently setup using torque from ubuntu (2.3.6) with the torque supplied > scheduler (set it up with maui initially but it was a bit of a pain for such a > small cluster so I switched) > > This cluster is used by very few people in a very controlled environment so I > don't really need any protection from each other, the queues are just for > convenience to allow remote execution > > The problem: > > I want to allow gpu related jobs to run only on the gpu equiped nodes (i.e more jobs then GPUs will be queued), I want to run other jobs on all nodes with either > 1. a priority to use the gpu equiped nodes last > 2. or better, use only two out of four cores on the gpu equiped nodes > > It doesn't seem though that I can map nodes or cores to queues with torque as far as I can tell > (i.e cpu queue uses 2 cores on gpu1, 2 cores on gpu2, all cores on everything else > gpu queue uses 2 cores on gpu1, 2 cores on gpu2) > > I can't seem to set user defined resources so that I can define gpu machines as having gpu resource and schedule according to that. > > Is it possible to achieve any of these two with torque, or is there any other > simple enough queue manager that can do this (preferably with a debian package > in some way to simplify maintanance). I only manage this cluster since no one > else knows how to and it's supposed to take as little of my time as possible > I'm looking for the simplest solution to implement and not the most versatile > one. > > > you can define a resource "gpu" in your TORQUE nodes file: > > hostname np=4 gpu > > and then users can request -l nodes=1:ppn=4:gpu to get assigned a node with a gpu, but to do anything more advanced you'll need Maui or Moab. You should try the maui users mailing list, or the torque users mailing list to see if anyone else has some ideas Thanks, almost perfect. It would have been a complete solution if there was a way to define how many such resources there are as there are 4 cores and 2 GPUs per node. Its good enough for now though as it works perfect when asking for nodes=1:ppn=2 to make sure that I don't get too many GPU jobs. This is a cluster that is used by 3 people that are cooperating at the moment so I can waste the extra core for now to spare man hours for the setup of maui. From chenyon1 at iit.edu Wed Mar 10 20:27:28 2010 From: chenyon1 at iit.edu (Yong Chen) Date: Wed, 10 Mar 2010 22:27:28 -0600 Subject: [Beowulf] [hpc-announce] P2S2-2010 submission still open: deadline extended to 3/17/2010 Message-ID: [Apologies if you got multiple copies of this email. If you'd like to opt out of these announcements, information on how to unsubscribe is available at the bottom of this email.] 
Dear Colleague: We would like to inform you that the paper submission deadline of the Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) has been extended to March 17th, 2010. A full CFP can be found below. Thank you. CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication sub-subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings) . SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link. JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. IMPORTANT DATES --------------- Paper Submission: March 17th, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. 
Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Heshan Lin, Virginia Tech * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinksi, Lawrence Livermore National Laboratory * Sayantan Sur, Ohio State University * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== You can unsubscribe from the hpc-announce mailing list here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ======================================================================== From akshar.bhosale at gmail.com Fri Mar 12 10:08:56 2010 From: akshar.bhosale at gmail.com (akshar bhosale) Date: Fri, 12 Mar 2010 23:38:56 +0530 Subject: [Beowulf] error while using mpirun Message-ID: i have installed mpich 1.2 6 on my desktop (core 2 duo) my test file is : #include #include int main(int argc,char *argv[]) { int rank=0; MPI_Init(&argc,&argv); MPI_Comm_rank(MPI_COMM_WORLD,&rank); printf("my second program rank is %d \n",rank); MPI_Finalize(); return; } --------- when i do /usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do /usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get p0_31341: p4_error: Path to program is invalid while starting /home/npsf/last with rsh on dragon: -1 p4_error: latest msg from perror: No such file or directory error. please suggest the solution. From gdjacobs at gmail.com Fri Mar 12 17:04:26 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 12 Mar 2010 19:04:26 -0600 Subject: [Beowulf] Cluster of Linux and Windows In-Reply-To: References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> Message-ID: <4B9AE49A.1040004@gmail.com> Mark Hahn wrote: >> I am used to work with Arch Linux. What do you think about it? > > the distro is basically irrelevant. clustering is just a matter of your > apps, middleware like mpi (may or may not be provided by the cluster), > probably a shared filesystem, working kernel, network stack, > job-launch mechanism. distros are mainly about desktop gunk that is > completely irrelevant to clusters. Let's not start the great distro debate again! Suffice it to say that most distros will work as long as they satisfy your ISV and hardware requirements. >> And finnaly, I would like to know if Is it possible to get a Cluster >> Working >> with a Server on Arch Linux and the nodes Windows. File server? Sure, Samba is a breeze, not sure if the performance scaling is there. NFS works, I guess. Windows comes with a client for it. Authentication? Again, no problem. I don't mean to be a snob, but this is basic stuff. I'm guessing you're requirements are more esoteric so please try to be more specific about them. > sure, but why? 
windows is generally inferior as an OS platform, > so I would stay away unless you actually require your apps to run > under windows. (remember that linux can use windows storage and > authentication just fine.) If you look back in the archives of this ML, you'll find a thread from when Microsoft released their compute cluster product. It covers quite effectively the positives and negatives of Windows in HPC. Start with this posting by Jon Forrest. http://www.beowulf.org/archive/2008-April/021006.html Do you have an application which can only be run on Windows? >> Or even better the nodes without a defined SO. > > SO=Significant Other? oh, maybe "OS". generally, you want to minimize > the number of things that can go wrong in your system. using uniform OS > on nodes/servers is a good start. but sure, there's no reason you can't > run a cluster where every node is a different OS. they simply need to > agree on the network protocol (which doesn't have to be MPI - in fact, > using something more SOA-like might help if the nodes are heterogenous) Having as much homogeneity in your cluster as possible will help you administer it increasingly as a single resource. As Mark said, it's easier to debug a single set of problems. It's also easier in terms of infrastructure to maintain a minimal set of images for your clients. Perhaps you meant something different? -- Geoffrey D. Jacobs From gdjacobs at gmail.com Fri Mar 12 17:59:59 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 12 Mar 2010 19:59:59 -0600 Subject: [Beowulf] Cluster of Linux and Windows In-Reply-To: <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com> References: <4788ffe70911121401mcd5deaj52b9036197b4f619@mail.gmail.com> <4788ffe70911130534x1ffc7bbdna3ce1adddf54c7c@mail.gmail.com> Message-ID: <4B9AF19F.3070904@gmail.com> Leonardo Machado Moreira wrote: > Basicaly, Is a Cluster Implementation just based on these two libraries > MPI on the Server and SSH on the clients?? Technically you don't need a server as long as all your clients have a copy of your application and are able to talk to each other. File servers and authentication servers just make things easier. MPI is a type of library which allows your application to talk with it's sisters on other computers without you having to do sockets programming or other things just as distasteful. SSH allows you to sign in on one computer and launch jobs on one or more other computers, so you can start your application on all the computers (not technically accurate, but good enough for an overview). > And a program on tcl/tk for example on server to watch the cluster? TCL/TK is a programming language and widget library. Monitoring is done by applications written in TCL/TK and other languages, but it's not a requirement. Unless you use it to write your application, of course. Programs for monitoring hardware, batch scheduling, revision management, and so on are used because they make it easier to maintain and use the cluster optimally. It's best to just start reading here... http://www.clustermonkey.net//content/category/5/14/32/ > Thanks a lot. > > Leonardo Machado Moreira -- Geoffrey D. 
Jacobs From richard.walsh at comcast.net Fri Mar 12 20:11:28 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 13 Mar 2010 04:11:28 +0000 (UTC) Subject: [Beowulf] error while using mpirun In-Reply-To: Message-ID: <573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Akshar bhosale wrote: >When i do: > >/usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do >/usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get > >p0_31341: p4_error: Path to program is invalid while starting >/home/npsf/last with rsh on dragon: -1 >p4_error: latest msg from perror: No such file or directory error. > >please suggest the solution. Looks like the directory that your MPI executable 'test' is in is: /home/npsf/last Correct? This directory needs to be visible on each node used by MPI to run your program. You might also need to put a ./ in front of the name of the executable, as in ./test . You also need to be able to 'rsh' to each of those nodes. Because you have not specified a 'machines' file, MPI is using the default file in the install tree, which normally lists the nodes in simple sequence. Still, I think the problem is option 1 or 2 above. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed...
URL: From gdjacobs at gmail.com Fri Mar 12 21:43:16 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 12 Mar 2010 23:43:16 -0600 Subject: [Beowulf] hardware RAID versus mdadm versus LVM-striping In-Reply-To: <4B539883.8040003@abdn.ac.uk> References: <4B539883.8040003@abdn.ac.uk> Message-ID: <4B9B25F4.7070901@gmail.com> Tony Travis wrote: > Rahul Nabar wrote: >> If I have a option between doing Hardware RAID versus having software >> raid via mdadm is there a clear winner in terms of performance? Or is >> the answer only resolvable by actual testing? I have a fairly fast >> machine (Nehalem 2.26 GHz 8 cores) and 48 gigs of RAM. > > Hello, Rahul. > > It depends which level of RAID you want to use, and if you want hot-swap > capability. I use inexpensive 3ware 8006-2 RAID1 controllers and stripe > them using "md" software RAID0 to make RAID10 arrays. This gives me good > performance and hot-swap capability (the production md RAID driver does > not support hot-swap). However, where "md" really scores is portability. > My RAID's can only be read by 3ware controllers - I made a considered > descision about this: The 3ware controllers are well-supported by Linux > kernels, but it makes me uneasy using a proprietary RAID format. I do > also use "md" RAID5 which is more space efficient, but read this: > > http://www.baarf.com/ Hot swap is at least partially dependent on the controller. Even most of the built-in controllers now support hot swap. I'm not aware of md hardcoding anything on boot which would prevent a change on the fly, except that you would manually have to initiate a rebuild after swapping. Could you be more specific about what wasn't working as far as hot swapping? Was this with current controllers? Which ones? >> Should I be using the vendor's hardware RAID or mdadm? In case a >> generic answer is not possible, what might be a good way to test the >> two options? Any other implications that I should be thinking about? > > In fact, "mdadm" is just the user-space command for controlling the "md" > driver. The problem with using an on-board RAID controller is that many > of these are 'host' RAID (i.e. need a Windows driver to do the RAID) in > which case you are using the CPU anyway, and they also use proprietary > formats. Generally, I just use SATA mode on the on-board RAID controller > and create an "md" RAID. This means that I can replace a motherboard > withour worrying if it has the same type of RAID controller on-board. Yes, it's pedigreed software instead of mystery meat firmware. >> Finally, there;s always hybrid approaches. I could have several small >> RAID5's at the hardware level (RIAD5 seems ok since I have smaller >> disks ~300 GB so not really in the domain where the RAID6 arguments >> kick in, I think) Then using LVM I can integrate storage while asking >> LVM to stripe across these RAID5's. Thus I'd get striping at two >> levels: LVM (software) and RAID5 (hardware). > > Yes, I think a hybrid approach is good because that's what I use ;-) > > However, I would avoid relying on LVM mirroring for data protection. It > is much safer to stripe a set of RAID1's using LVM. I don't think LVM is > useful unless you are managing a disk farm. The commonest issue in disk > perfomance is decoupling seeks between different spindles, so I put the > system files on a different RAID1-set to /export (or /home) filesystems. LVM is there as a management convenience. It allows you to grow your disk pool more-or-less on demand. 
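To make the hybrid layouts in this thread concrete, here is a rough sketch with hypothetical device names (/dev/sdb and /dev/sdc standing in for the units exported by two hardware RAID1 controllers, /dev/sdd and /dev/sde for two hardware RAID5 sets). The general approach is what Tony and Rahul describe above; the exact commands, sizes and filesystem choice are only illustrative:

# md RAID0 across two hardware RAID1 units -> RAID10 overall
$ mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc

# or: LVM striping across two hardware RAID5 units
$ pvcreate /dev/sdd /dev/sde
$ vgcreate data /dev/sdd /dev/sde
$ lvcreate -i 2 -I 64 -L 500G -n scratch data    # 2 stripes, 64 KB stripe size
$ mkfs.ext3 /dev/data/scratch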
Where it's really beautiful, though, is when you want to migrate data -- Geoffrey D. Jacobs From deadline at eadline.org Mon Mar 15 11:24:56 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 15 Mar 2010 14:24:56 -0400 (EDT) Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.ma il.comcast.net> References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <33623.192.168.1.213.1268677496.squirrel@mail.eadline.org> I have placed a copy of Richard's table on ClusterMonkey in case you want an html view. http://www.clustermonkey.net//content/view/275/33/ -- Doug > > All, > > > In case anyone else has trouble keeping the numbers > straight between IB (SDR, DDR, QDR, EDR) and PCI-Express, > (1.0, 2.0, 30) here are a couple of tables in Excel I just worked > up to help me remember. > > > If anyone finds errors in it please let me know so that I can fix > them. > > > Regards, > > > rbw > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From patrick at myri.com Mon Mar 15 13:27:23 2010 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 15 Mar 2010 16:27:23 -0400 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4B9E982B.9000002@myri.com> Hi Richard, I meant to reply earlier but got busy. On 2/27/2010 11:17 PM, richard.walsh at comcast.net wrote: > If anyone finds errors in it please let me know so that I can fix > them. You don't consider the protocol efficiency, and this is a major issue on PCIe. First of all, I would change the labels "Raw" and "Effective" to "Signal" and "Raw". Then, I would add a third column "Effective" which consider the protocol overhead. The protocol overhead is the amount of raw bandwidth that is not used for useful payload. On PCIe, on the Read side, the data comes in small packets with a 20 Bytes header (could be 24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe chipsets only support 64 Bytes Read Completions MTU, and even the ones that support larger sizes would still use a majority of 64 Bytes completions because it maps well to the transaction size on the memory bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is 64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero number quoted by MVAPICH for example (3 GB/s unidirectional). Bidirectional efficiency is a bit worse because PCIe Acks take some raw bandwidth too. They are coalesced but the pipeline is not very deep, so you end up with roughly 20+20 Gb/s bidirectional. There is a similar protocol efficiency at the IB or Ethernet level, but the MTU is large enough that it's much smaller compared to PCIe. Now, all of this does not matter because Marketers will keep using useless Signal rates. They will even have the balls to (try to) rewrite history about packet rate benchmarks... Patrick From mathog at caltech.edu Mon Mar 15 13:47:09 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 15 Mar 2010 13:47:09 -0700 Subject: [Beowulf] 1000baseT NIC and PXE? 
Message-ID: Sorry if this is a silly question, but do any of the inexpensive 1000baseT NICs support PXE boot? I just finished looking through the offerings on newegg and while a couple of the really really cheap ones had an empty socket for a boot rom, none of the ones without such a socket said definitively in its specs that it could PXE boot (or for that matter, that it couldn't). The older machines are all 100baseT, be nice to give them a network speed boost by dropping in an inexpensive 1000baseT NIC, but not if the new cards won't be able to PXE boot. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Mon Mar 15 14:00:40 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Mon, 15 Mar 2010 17:00:40 -0400 Subject: [Beowulf] 1000baseT NIC and PXE? In-Reply-To: References: Message-ID: surprisingly enough there are still cards that don't come with PXE built into the embedded rom. you'll have to check the specs on the card you're interested in from the mfg website. one thing that bit me in the past, even though the card had pxe, the bios of the machine i was working on had no mechanism to load the option rom (it was a desktop), On Mon, Mar 15, 2010 at 4:47 PM, David Mathog wrote: > Sorry if this is a silly question, but do any of the inexpensive > 1000baseT NICs support PXE boot? ?I just finished looking through the > offerings on newegg and while a couple of the really really cheap ones > had an empty socket for a boot rom, none of the ones without such a > socket said definitively in its specs that it could PXE boot (or for > that matter, that it couldn't). ?The older machines are all 100baseT, be > nice to give them a network speed boost by dropping in an inexpensive > 1000baseT NIC, but not if the new cards won't be able to PXE boot. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From richard.walsh at comcast.net Mon Mar 15 14:24:52 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Mon, 15 Mar 2010 21:24:52 +0000 (UTC) Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <1392498409.1069101268686455945.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: >I meant to respond to this, but got busy. You don't consider the protocol >efficiency, and this is a major issue on PCIe. Yes, I forgot that there is more to the protocol than the 8B/10B encoding, but I am glad to get your input to improve the table (late or otherwise). >First of all, I would change the labels "Raw" and "Effective" to >"Signal" and "Raw". Then, I would add a third column "Effective" which >consider the protocol overhead. The protocol overhead is the amount of I think adding another column for protocol inefficiency column makes some sense. Not sure I know enough to chose the right protocol performance loss multipliers or what the common case values would be (as opposed to best and worst case). It would be good to add Ethernet to the mix (1Gb, 10Gb, and 40Gb) as well. 
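For reference, the arithmetic behind that 76% figure is easy to reproduce. The numbers below assume PCIe Gen2 x8 (32 Gb/s raw after 8b/10b) and the 64-byte read completions with a 20-byte header that Patrick describes, so they are chipset-dependent rather than universal:

$ echo "scale=4; 64/84" | bc        # .7619 -> payload fraction per 64-byte completion
$ echo "scale=2; 32*64/84" | bc     # 24.38 -> usable Gb/s out of 32 Gb/s raw
$ echo "scale=2; 32*64/84/8" | bc   # 3.04  -> GB/s, in line with the ~3 GB/s unidirectional numbers quoted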
Sounds like the 76% multiplier is reasonable for PCI-E (with a "your mileage may vary" footnote). The table cannot perfectly reflect every contributing variable without getting very large. Perhaps, you could provide a table with the Ethernet numbers, and I will do some more research to make estimates for IB? Then I will get a draft to Doug at Cluster Monkey. One more iteration only ... to improve things, but avoid a "protocol holy war" ... ;-) ... >raw bandwidth that is not used for useful payload. On PCIe, on the Read >side, the data comes in small packets with a 20 Bytes header (could be >24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe >chipsets only support 64 Bytes Read Completions MTU, and even the ones >that support larger sizes would still use a majority of 64 Bytes >completions because it maps well to the transaction size on the memory >bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is >64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero >number quoted by MVAPICH for example (3 GB/s unidirectional). >Bidirectional efficiency is a bit worse because PCIe Acks take some raw >bandwidth too. They are coalesced but the pipeline is not very deep, so >you end up with roughly 20+20 Gb/s bidirectional. Thanks for the clear and detailed description. >There is a similar protocol efficiency at the IB or Ethernet level, but >the MTU is large enough that it's much smaller compared to PCIe. Would you estimate less than 1%, 2%, 4% ... ?? >Now, all of this does not matter because Marketers will keep using >useless Signal rates. They will even have the balls to (try to) rewrite >history about packet rate benchmarks... I am hoping the table increases the number of fully informed decisions on these questions. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shainer at mellanox.com Mon Mar 15 14:33:07 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Mon, 15 Mar 2010 14:33:07 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com> To make it more accurate, most PCIe chipsets supports 256B reads, and the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20. Gilad From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of richard.walsh at comcast.net Sent: Monday, March 15, 2010 2:25 PM To: beowulf at beowulf.org Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? On Monday, March 15, 2010 1:27:23 PM GMT Patrick Geoffray wrote: >I meant to respond to this, but got busy. You don't consider the protocol >efficiency, and this is a major issue on PCIe. Yes, I forgot that there is more to the protocol than the 8B/10B encoding, but I am glad to get your input to improve the table (late or otherwise). >First of all, I would change the labels "Raw" and "Effective" to >"Signal" and "Raw". Then, I would add a third column "Effective" which >consider the protocol overhead. The protocol overhead is the amount of I think adding another column for protocol inefficiency column makes some sense. 
Not sure I know enough to chose the right protocol performance loss multipliers or what the common case values would be (as opposed to best and worst case). It would be good to add Ethernet to the mix (1Gb, 10Gb, and 40Gb) as well. Sounds like the 76% multiplier is reasonable for PCI-E (with a "your mileage may vary" footnote). The table cannot perfectly reflect every contributing variable without getting very large. Perhaps, you could provide a table with the Ethernet numbers, and I will do some more research to make estimates for IB? Then I will get a draft to Doug at Cluster Monkey. One more iteration only ... to improve things, but avoid a "protocol holy war" ... ;-) ... >raw bandwidth that is not used for useful payload. On PCIe, on the Read >side, the data comes in small packets with a 20 Bytes header (could be >24 with optional ECRC) for a 64, 128 or 256 Bytes payload. Most PCIe >chipsets only support 64 Bytes Read Completions MTU, and even the ones >that support larger sizes would still use a majority of 64 Bytes >completions because it maps well to the transaction size on the memory >bus (HT, QPI). With 64 Bytes Read Completions, the PCIe efficiency is >64/84 = 76%, so 32 Gb/s becomes 24 Gb/s, which correspond to the hero >number quoted by MVAPICH for example (3 GB/s unidirectional). >Bidirectional efficiency is a bit worse because PCIe Acks take some raw >bandwidth too. They are coalesced but the pipeline is not very deep, so >you end up with roughly 20+20 Gb/s bidirectional. Thanks for the clear and detailed description. >There is a similar protocol efficiency at the IB or Ethernet level, but >the MTU is large enough that it's much smaller compared to PCIe. Would you estimate less than 1%, 2%, 4% ... ?? >Now, all of this does not matter because Marketers will keep using >useless Signal rates. They will even have the balls to (try to) rewrite >history about packet rate benchmarks... I am hoping the table increases the number of fully informed decisions on these questions. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From fitz at cs.earlham.edu Fri Mar 12 17:05:41 2010 From: fitz at cs.earlham.edu (Andrew Fitz Gibbon) Date: Fri, 12 Mar 2010 19:05:41 -0600 Subject: [Beowulf] error while using mpirun In-Reply-To: References: Message-ID: <63ECD1AF-EB0B-46AB-BDB7-878853519DE6@cs.earlham.edu> On Mar 12, 2010, at 12:08 PM, akshar bhosale wrote: > when i do > /usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but > when i do > /usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get > > p0_31341: p4_error: Path to program is invalid while starting > /home/npsf/last with rsh on dragon: -1 > p4_error: latest msg from perror: No such file or directory > error. > please suggest the solution. My suggestion would be to specify something closer to the absolute path to your binary. 
For example: $ /usr/local/mpich-1.2.6/bin/mpirun -np 4 $HOME/test ---------------- Andrew Fitz Gibbon fitz at cs.earlham.edu From rigved.sharma123 at gmail.com Fri Mar 12 22:41:06 2010 From: rigved.sharma123 at gmail.com (rigved sharma) Date: Sat, 13 Mar 2010 12:11:06 +0530 Subject: [Beowulf] error while using mpirun In-Reply-To: <573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <573935419.327311268453488264.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: hi, thanks for your solutions. i tried all solutions given by you ..still same error. can ypu please suggest any other solution? regards, akshar On Sat, Mar 13, 2010 at 9:41 AM, wrote: > > Akshar bhosale wrote: > > >When i do: > > > >/usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do > >/usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get > > > >p0_31341: p4_error: Path to program is invalid while starting > >/home/npsf/last with rsh on dragon: -1 > >p4_error: latest msg from perror: No such file or directory error. > > > >please suggest the solution. > > Looks like the directory that your MPI executable 'test' is in is: > > /home/npsf/last > > Correct? This directory needs to be visible on each node used > by MPI to run your program. You might also need to put a ./ in > front of the name of the executable, as in ./test . You also need > be able the 'rsh' to each of those nodes. Because you have not > specified a 'machines' file, MPI is using the default file in the install > tree which normally lists the nodes in simple sequence. Still, I > think the problem option 1 or 2 above. > > rbw > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From erik at contica.com Sat Mar 13 07:48:26 2010 From: erik at contica.com (Erik Andresen) Date: Sat, 13 Mar 2010 16:48:26 +0100 Subject: [Beowulf] error while using mpirun In-Reply-To: <201003130209.o2D28ttl025995@bluewest.scyld.com> References: <201003130209.o2D28ttl025995@bluewest.scyld.com> Message-ID: <4B9BB3CA.1050208@contica.com> > Date: Fri, 12 Mar 2010 23:38:56 +0530 > From: akshar bhosale > Subject: [Beowulf] error while using mpirun > To: beowulf at beowulf.org, torqueusers at supercluster.org > when i do > /usr/local/mpich-1.2.6/bin/mpicc -o test test.c ,i get test ;but when i do > /usr/local/mpich-1.2.6/bin/mpirun -np 4 test,i get > > p0_31341: p4_error: Path to program is invalid while starting > /home/npsf/last with rsh on dragon: -1 > p4_error: latest msg from perror: No such file or directory > error. > please suggest the solution. > > Classic mistake. Try to use mpirun -np 4 ./test On my system 'which test' returns /usr/bin/test so I guess mpirun tries to run another 'test' than the one you made. Erik Andresen From patrick at myri.com Mon Mar 15 15:47:51 2010 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 15 Mar 2010 18:47:51 -0400 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? 
In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com> References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com> Message-ID: <4B9EB917.8090201@myri.com> On 3/15/2010 5:33 PM, Gilad Shainer wrote: > To make it more accurate, most PCIe chipsets supports 256B reads, and > the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20. I know Marketers lives in their own universe, but here are a few nuts for you to crack: * If most PCIe chipsets would effectively do 256B Completions, why is the max unidirectional bandwidth for QDR/Nehalem is 3026 MB/s (24.2 GB/s) as reported in the latest MVAPICH announcement ? 3026 MB/s is 73.4% efficiency compared to raw bandwidth of 4 GB for Gen2 8x. With 256B Completions, the PCIe efficiency would be 92.7%, so someone would be losing 19.3% ? Would that be your silicon ? * For 64B Completions: 64/84 is 0.7619, and 0.7619 * 32 = 24.38 Gb/s. How do you get 26 Gb/s again ? * PCIe is a reliable protocol, there are Acks in the other direction. If you claim that one way is 26 GB/s and two-way is 26+26 Gb/s, does that mean you have invented a reliable protocol that does not need acks ? * If bidirectional is 26+26, why is the max bidirectional bandwidth reported by MVAPICH is 5858 MB/s, ie 46.8 Gb/s or 23.4+23.4 Gb/s ? Granted, it's more than 20+20, but it depends a lot on the chipset-dependent pipeline depth. BTW, Greg's offer is still pending... Patrick From Shainer at mellanox.com Mon Mar 15 16:09:53 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Mon, 15 Mar 2010 16:09:53 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net><9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com> <4B9EB917.8090201@myri.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com> I don?t appreciate those kind of responses and it is not appropriate for this mailing list. Please fix in future emails. I am standing behind any info I put out, and definitely don?t do down estimations as you do. It was nice to see that you fixed your 20+20 numbers to 24+23 (that was marketing that you did?), but I suggest you do a better search to look on numbers of recent systems, with decent Bios setting. Gen2 system can provide 3300MB/s uni or >6500MB bi dir. Of course you can find versions that gives lower performance, and I can send you some instruction to get the PCIe BW even lower than 20 for your own performance testing if you want to. It still will be much higher than what you can do with Myri10G... Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Patrick Geoffray Sent: Monday, March 15, 2010 3:48 PM To: beowulf at beowulf.org Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? On 3/15/2010 5:33 PM, Gilad Shainer wrote: > To make it more accurate, most PCIe chipsets supports 256B reads, and > the data bandwidth is 26Gb/s, which makes it 26+26, not 20+20. I know Marketers lives in their own universe, but here are a few nuts for you to crack: * If most PCIe chipsets would effectively do 256B Completions, why is the max unidirectional bandwidth for QDR/Nehalem is 3026 MB/s (24.2 GB/s) as reported in the latest MVAPICH announcement ? 3026 MB/s is 73.4% efficiency compared to raw bandwidth of 4 GB for Gen2 8x. 
With 256B Completions, the PCIe efficiency would be 92.7%, so someone would be losing 19.3% ? Would that be your silicon ? * For 64B Completions: 64/84 is 0.7619, and 0.7619 * 32 = 24.38 Gb/s. How do you get 26 Gb/s again ? * PCIe is a reliable protocol, there are Acks in the other direction. If you claim that one way is 26 GB/s and two-way is 26+26 Gb/s, does that mean you have invented a reliable protocol that does not need acks ? * If bidirectional is 26+26, why is the max bidirectional bandwidth reported by MVAPICH is 5858 MB/s, ie 46.8 Gb/s or 23.4+23.4 Gb/s ? Granted, it's more than 20+20, but it depends a lot on the chipset-dependent pipeline depth. BTW, Greg's offer is still pending... Patrick _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From patrick at myri.com Mon Mar 15 16:30:15 2010 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 15 Mar 2010 19:30:15 -0400 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4B9EC307.5070009@myri.com> On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote: > to best and worst case). It would be good to add Ethernet to the mix > (1Gb, 10Gb, and 40Gb) as well. 10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is still in draft. Last time I looked at 40 Gb/s, it was pretty much four 10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate. > >There is a similar protocol efficiency at the IB or Ethernet level, but > >the MTU is large enough that it's much smaller compared to PCIe. > > Would you estimate less than 1%, 2%, 4% ... ?? It depends on the packet size. For example, 14 Bytes Ethernet header on 1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less than that. I don't know the header size in IB, but with an MTU of 2K or 4K, it's negligible. However, things are different for tiny packets. The minimum packet size on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC, etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684 MB/s if you remove the 14B Ethernet header (54% efficiency). I don't think you can put all that on an Excel sheet :-) Patrick From tom.ammon at utah.edu Mon Mar 15 16:45:44 2010 From: tom.ammon at utah.edu (Tom Ammon) Date: Mon, 15 Mar 2010 17:45:44 -0600 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B9EC307.5070009@myri.com> References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <4B9EC307.5070009@myri.com> Message-ID: <4B9EC6A8.2000800@utah.edu> If I understand correctly, 40GbE is 64/66 encoded. Tom On 3/15/2010 5:30 PM, Patrick Geoffray wrote: > On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote: > >> to best and worst case). It would be good to add Ethernet to the mix >> (1Gb, 10Gb, and 40Gb) as well. >> > 10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw > bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is > still in draft. 
Last time I looked at 40 Gb/s, it was pretty much four > 10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate. > > >> >There is a similar protocol efficiency at the IB or Ethernet level, but >> >the MTU is large enough that it's much smaller compared to PCIe. >> >> Would you estimate less than 1%, 2%, 4% ... ?? >> > It depends on the packet size. For example, 14 Bytes Ethernet header on > 1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less > than that. I don't know the header size in IB, but with an MTU of 2K or > 4K, it's negligible. > > However, things are different for tiny packets. The minimum packet size > on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is > 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC, > etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684 > MB/s if you remove the 14B Ethernet header (54% efficiency). > > I don't think you can put all that on an Excel sheet :-) > > Patrick > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------- Tom Ammon Network Engineer Office: 801.587.0976 Mobile: 801.674.9273 Center for High Performance Computing University of Utah http://www.chpc.utah.edu From Shainer at mellanox.com Mon Mar 15 16:54:08 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Mon, 15 Mar 2010 16:54:08 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F01E1558F@mtiexch01.mti.com> That's correct ----- Original Message ----- From: beowulf-bounces at beowulf.org To: Patrick Geoffray Cc: beowulf at beowulf.org Sent: Mon Mar 15 16:45:44 2010 Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? If I understand correctly, 40GbE is 64/66 encoded. Tom On 3/15/2010 5:30 PM, Patrick Geoffray wrote: > On 3/15/2010 5:24 PM, richard.walsh at comcast.net wrote: > >> to best and worst case). It would be good to add Ethernet to the mix >> (1Gb, 10Gb, and 40Gb) as well. >> > 10 Gb Ethernet uses 8b/10b with a signal rate of 12.5 Gb/s, for a raw > bandwidth of 10 Gb/s. I don't know how 1Gb is encoded and 40 Gb/s is > still in draft. Last time I looked at 40 Gb/s, it was pretty much four > 10 Gb links put together, so I would say 8b/10b with 50 Gb/s signal rate. > > >> >There is a similar protocol efficiency at the IB or Ethernet level, but >> >the MTU is large enough that it's much smaller compared to PCIe. >> >> Would you estimate less than 1%, 2%, 4% ... ?? >> > It depends on the packet size. For example, 14 Bytes Ethernet header on > 1500 Bytes MTU, that's 1%. For Jumbo frames at 9000B MTU, it's much less > than that. I don't know the header size in IB, but with an MTU of 2K or > 4K, it's negligible. > > However, things are different for tiny packets. The minimum packet size > on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is > 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC, > etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684 > MB/s if you remove the 14B Ethernet header (54% efficiency). 
> > I don't think you can put all that on an Excel sheet :-) > > Patrick > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- -------------------------------------------------------------------- Tom Ammon Network Engineer Office: 801.587.0976 Mobile: 801.674.9273 Center for High Performance Computing University of Utah http://www.chpc.utah.edu _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom.elken at qlogic.com Mon Mar 15 17:03:14 2010 From: tom.elken at qlogic.com (Tom Elken) Date: Mon, 15 Mar 2010 17:03:14 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com><9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com> <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> Message-ID: <35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org> > On Behalf Of Gilad Shainer > > ... OSU has different benchmarks > so you can measure message coalescing or real message rate. [ As a refresher for the wider audience , as Gilad defined earlier: " Message coalescing is when you incorporate multiple MPI messages in a single network packet." And I agree with this definition :) ] Gilad, Sorry for the delayed QLogic response on this. I was on vacation when this thread started up. But now that it has been revived, ... Which OSU benchmarks have message-coalescing built into the source? > Nowadays it seems that QLogic > promotes the message rate as non coalescing data and I almost got > bought > by their marketing machine till I looked on at the data on the wire... > interesting what the bits and bytes and symbols can tell you... Message-coalescing has been done in benchmark source code, such as HPC Challenge's MPI RandomAccess benchmark. In that case, coalescing is performed when the SANDIA_OPT2 define is turned on during the build. More typically message coalescing is a feature of some MPIs and they use various heuristics for when it is active. MVAPICH has an environment variable -- VIADEV_USE_COALESCE -- which can turn this feature on or off. HP-MPI has coalescing heuristics on by default when using IB-Verbs, off by default when using QLogic's PSM. Open MPI has enabled message-coalescing heuristics for more recent versions when running over IB verbs. There is nothing wrong with message coalescing features in the MPI. Only when you are trying to measure the raw message rate of the network adapter, it is best to not use message coalescing feature so you can measure what you set out to measure. QLogic MPI does not have a message coalescing feature, and that is what we use to measure MPI message rate on our IB adapters. We also measure using MVAPICH with it's message coalescing feature turned off, and get virtually identical message rate performance to that with QLogic MPI. 
I don't know what you were measuring on the wire, but with the osu_mbw_mr benchmark and QLogic MPI, for the small 1 to 8 byte message sizes where we achieve maximum message rate, each message is in its own 56 byte packet with no coalescing. I asked a couple of our engineers who have looked at a lot of PCIe traces to make sure of this. Regards, -Tom > > > > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] > On Behalf Of Greg Lindahl > Sent: Friday, February 19, 2010 2:06 PM > To: beowulf at beowulf.org > Subject: Re: [Beowulf] Q: IB message rate & large core counts (per > node)? > > > Mellanox latest message rate numbers with ConnectX-2 more than > > doubled versus the old cards, and are for real message rate - > > separate messages on the wire. The competitor numbers are with using > > message coalescing, so it is not real separate messages on the wire, > > or not really message rate. > > Gilad, > > I think you forgot which side you're supposed to be supporting. > > The only people I have ever seen publish message rate with coalesced > messages are DK Panda (with Mellanox cards) and Mellanox. > > QLogic always hated coalesced messages, and if you look back in the > archive for this mailing list, you'll see me denouncing coalesced > messages as meanless about 1 microsecond after the first result was > published by Prof. Panda. > > Looking around the Internet, I don't see any numbers ever published by > PathScale/QLogic using coalesced messages. > > At the end of the day, the only reason microbenchmarks are useful is > when they help explain why one interconnect does better than another > on real applications. No customer should ever choose which adapter to > buy based on microbenchmarks. > > -- greg > (formerly employed by QLogic) > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Mon Mar 15 17:06:01 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 15 Mar 2010 17:06:01 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B9EC307.5070009@myri.com> References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <4B9EC307.5070009@myri.com> Message-ID: <20100316000601.GA8362@bx9.net> On Mon, Mar 15, 2010 at 07:30:15PM -0400, Patrick Geoffray wrote: > However, things are different for tiny packets. The minimum packet size > on Ethernet is 60 Bytes. The maximum packet rate (not coalesced !) is > 14.88 Mpps on a 10GE link, counting everything (inter-packet gap, CRC, > etc). If you do the math, that's 14.88*60 = 892 MB/s on the link, or 684 > MB/s if you remove the 14B Ethernet header (54% efficiency). There's the additional complexity, for tiny packets, that different cards will have different outgoing inter-packet gaps, usually greater than the minimum. Switches can merge streams from multiple hosts and reduce that inter-packet gap on the receiving side, if multiple hosts talk to one. This is true for both Ethernet and IB. And if we're talking MPI, different MPIs have different header sizes. 
That's where a graph of non-coalesced MPI bandwidth (not including all the overheads) as a function of message size and core counts is interesting. The spreadsheet can kinda be useful for large messages, but not for data sizes < 1/2 MTU. But even for large data sizes, there are plenty of other factors which your real application can trip over. -- greg From hahn at mcmaster.ca Mon Mar 15 19:20:54 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 15 Mar 2010 22:20:54 -0400 (EDT) Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com> References: <512379913.1083501268688292427.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net><9FA59C95FFCBB34EA5E42C1A8573784F02761204@mtiexch01.mti.com> <4B9EB917.8090201@myri.com> <9FA59C95FFCBB34EA5E42C1A8573784F0276123F@mtiexch01.mti.com> Message-ID: > I don???t appreciate those kind of responses and it is not >appropriate for this mailing list. Please fix in future emails. I am your assert some numbers, perhaps correct, but Patrick provides useful explanations. I prefer the latter. >system can provide 3300MB/s uni or >6500MB bi dir. Of course you can find your numbers are 10% different from Patrick's. wow. could we all just drop the silly personal byplay? From lindahl at pbm.com Mon Mar 15 21:48:07 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Mon, 15 Mar 2010 21:48:07 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <35AAF1E4A771E142979F27B51793A48887348B241D@AVEXMB1.qlogic.org> Message-ID: <20100316044807.GB27065@bx9.net> On Mon, Mar 15, 2010 at 05:03:14PM -0700, Tom Elken wrote: > QLogic MPI does not have a message coalescing feature, and that is > what we use to measure MPI message rate on our IB adapters. Thank you for making that clear, Tom. -- greg From niftyompi at niftyegg.com Mon Mar 15 21:58:36 2010 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Mon, 15 Mar 2010 21:58:36 -0700 Subject: [Beowulf] confidential data on public HPC cluster In-Reply-To: <4B8BEB7D.9050904@scinet.utoronto.ca> References: <4B8BEB7D.9050904@scinet.utoronto.ca> Message-ID: <20100316045836.GA3252@compegg> On Mon, Mar 01, 2010 at 11:29:49AM -0500, Jonathan Dursi wrote: > > Hi; > > We're a fairly typical academic HPC centre, and we're starting to > have users talk to us about using our new clusters for projects that > have various requirements for keeping data confidential. "Various requirements" should spell it out for you. The requirements result in consequences and a price. Multiple groups may have conflicting requirements and cannot play together. If they want timeshare does that desire argue with their requirements? In one senario you can isolate storage and kickstart (clean load) all the compute hosts between project access. i.e. It is possible for each group to have its own "Head Node" with a dedicated NFS resource and allow only one "Head Node" to be physically connected to cluster at a time. Requirements should specify staff requirements and more including physical access. The cost of a breach can dwarf the cost of dedicated individual disk farms and clusters. If their requirements cost you then they need to put skin in the game. Your best solution might be to turn it back at them and make "various requirements" of them that you can live with! 
This might require a legal review as well. -- T o m M i t c h e l l Found me a new hat, now what? From carsten.aulbert at aei.mpg.de Tue Mar 16 08:27:30 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Tue, 16 Mar 2010 16:27:30 +0100 Subject: [Beowulf] HPL as a learning experience Message-ID: <201003161627.32012.carsten.aulbert@aei.mpg.de> Hi all, I wanted to run high performance linpack mostly for fun (and of course to learn more about it and stress test a couple of machines). However, so far I've had very mixed results. I downloaded the 2.0 version released in September 2008 and managed it to compile with mpich 1.2.7 on Debian Lenny. The resulting xhpl file is dynamically linked like this: linux-vdso.so.1 => (0x00007fffca372000) libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000) librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000) libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000) libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000) libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000) libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000) /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000) Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB physical RAM), I used http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html to generate files for a single and a dual core test [1] and [2]. Starting the single core run does not pose any problem: /usr/bin/mpirun.mpich -np 1 -machinefile machines /nfs/xhpl where machines is just a simple file containing 4 times the name of this host. So far so good. ============================================================================ T/V N NB P Q Time Gflops ---------------------------------------------------------------------------- WR11C2R4 14592 128 1 1 407.94 5.078e+00 ---------------------------------------------------------------------------- ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0087653 ...... PASSED ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0209927 ...... PASSED ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0045327 ...... PASSED ============================================================================ When starting the two core run, I receive the following error message after a couple of seconds (after RSS hits the VIRT RAM value in top): /usr/bin/mpirun.mpich -np 2 -machinefile machines /nfs/xhpl p0_20535: p4_error: interrupt SIGSEGV: 11 rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32 SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've come with google, but right now I have no idea how to proceed. I somehow doubt that this venerable program is so buggy that I'd hit it on my first day ;) Any ideas where I might do something wrong? Cheers Carsten [1] single core test HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 8 device out (6=stdout,7=stderr,file) 1 # of problems sizes (N) 14592 Ns 1 # of NBs 128 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 1 Ps 1 Qs 16.0 threshold 1 # of panel fact 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 4 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 
1 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 1 DEPTHs (>=0) 2 SWAP (0=bin-exch,1=long,2=mix) 64 swapping threshold 0 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0) ##### This line (no. 32) is ignored (it serves as a separator). ###### 0 Number of additional problem sizes for PTRANS 1200 10000 30000 values of N 0 number of additional blocking sizes for PTRANS 40 9 8 13 13 20 16 32 64 values of NB [2] dual core setup HPLinpack benchmark input file Innovative Computing Laboratory, University of Tennessee HPL.out output file name (if any) 8 device out (6=stdout,7=stderr,file) 1 # of problems sizes (N) 14592 Ns 1 # of NBs 128 NBs 0 PMAP process mapping (0=Row-,1=Column-major) 1 # of process grids (P x Q) 1 Ps 2 Qs 16.0 threshold 1 # of panel fact 2 PFACTs (0=left, 1=Crout, 2=Right) 1 # of recursive stopping criterium 4 NBMINs (>= 1) 1 # of panels in recursion 2 NDIVs 1 # of recursive panel fact. 1 RFACTs (0=left, 1=Crout, 2=Right) 1 # of broadcast 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) 1 # of lookahead depth 1 DEPTHs (>=0) 2 SWAP (0=bin-exch,1=long,2=mix) 64 swapping threshold 0 L1 in (0=transposed,1=no-transposed) form 0 U in (0=transposed,1=no-transposed) form 1 Equilibration (0=no,1=yes) 8 memory alignment in double (> 0) ##### This line (no. 32) is ignored (it serves as a separator). ###### 0 Number of additional problem sizes for PTRANS 1200 10000 30000 values of N 0 number of additional blocking sizes for PTRANS 40 9 8 13 13 20 16 32 64 values of NB From gus at ldeo.columbia.edu Tue Mar 16 10:00:07 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 16 Mar 2010 13:00:07 -0400 Subject: [Beowulf] HPL as a learning experience In-Reply-To: <201003161627.32012.carsten.aulbert@aei.mpg.de> References: <201003161627.32012.carsten.aulbert@aei.mpg.de> Message-ID: <4B9FB917.1070905@ldeo.columbia.edu> Hi Carsten The problem is most likely mpich 1.2.7. MPICH-1 is old and no longer maintained. It is based on the P4 lower level libraries, which don't seem to talk properly to current Linux kernels and/or to current Ethernet card drivers. There were several postings on this list, on the ROCKS Clusters list, on the MPICH list, etc, reporting errors very similar to yours: a p4 error followed by a segmentation fault. The MPICH developers recommend upgrading to MPICH2 because of these problems, besides performance, ease of use, etc. The easy fix is to use another MPI, say, OpenMPI or MPICH2. I would guess they are available as packages for Debian. However, you can build both very easily from source using just gcc/g++/gfortran. Get the source code and documentation, then read the README files, FAQ (OpenMPI), and Install Guide, User Guide (MPICH2) for details: OpenMPI http://www.open-mpi.org/ http://www.open-mpi.org/software/ompi/v1.4/ http://www.open-mpi.org/faq/ http://www.open-mpi.org/faq/?category=building MPICH2: http://www.mcs.anl.gov/research/projects/mpich2/ http://www.mcs.anl.gov/research/projects/mpich2/downloads/index.php?s=downloads http://www.mcs.anl.gov/research/projects/mpich2/documentation/index.php?s=docs I compiled and ran HPL here with both OpenMPI and MPICH2 (and MVAPICH2 as well), and it works just fine, over Ethernet and over Infiniband. I hope this helps. 
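For what it's worth, on Debian the switch might look roughly like this; the package names and the Make.<arch> variable names are from memory, so treat it as a sketch rather than a recipe:

# install Open MPI from the distro (or build it from source as described in its docs)
$ apt-get install openmpi-bin libopenmpi-dev

# in hpl-2.0/Make.<arch>, point the build at the Open MPI compiler wrapper, e.g.
#   CC     = mpicc
#   LINKER = mpicc
# (MPdir/MPinc/MPlib can then be left empty, since mpicc supplies the MPI paths)

$ cd hpl-2.0 && make arch=<arch>
$ mpirun -np 2 -machinefile machines ./bin/<arch>/xhpl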
Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Carsten Aulbert wrote: > Hi all, > > I wanted to run high performance linpack mostly for fun (and of course to > learn more about it and stress test a couple of machines). However, so far > I've had very mixed results. > > I downloaded the 2.0 version released in September 2008 and managed it to > compile with mpich 1.2.7 on Debian Lenny. The resulting xhpl file is > dynamically linked like this: > > linux-vdso.so.1 => (0x00007fffca372000) > libpthread.so.0 => /lib/libpthread.so.0 (0x00007fb47bca8000) > librt.so.1 => /lib/librt.so.1 (0x00007fb47ba9f000) > libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fb47b7c4000) > libm.so.6 => /lib/libm.so.6 (0x00007fb47b541000) > libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fb47b32a000) > libc.so.6 => /lib/libc.so.6 (0x00007fb47afd7000) > /lib64/ld-linux-x86-64.so.2 (0x00007fb47bec4000) > > Then I wanted to run a couple of tests on a single quad-CPU node (with 12 GB > physical RAM), I used > > http://www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html > > to generate files for a single and a dual core test [1] and [2]. > > Starting the single core run does not pose any problem: > /usr/bin/mpirun.mpich -np 1 -machinefile machines /nfs/xhpl > > where machines is just a simple file containing 4 times the name of this host. > So far so good. > ============================================================================ > T/V N NB P Q Time Gflops > ---------------------------------------------------------------------------- > WR11C2R4 14592 128 1 1 407.94 5.078e+00 > ---------------------------------------------------------------------------- > ||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0087653 ...... PASSED > ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0209927 ...... PASSED > ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0045327 ...... PASSED > ============================================================================ > > When starting the two core run, I receive the following error message after a > couple of seconds (after RSS hits the VIRT RAM value in top): > > /usr/bin/mpirun.mpich -np 2 -machinefile machines /nfs/xhpl > p0_20535: p4_error: interrupt SIGSEGV: 11 > rm_l_1_20540: (1.804688) net_send: could not write to fd=5, errno = 32 > > SIGSEGV with p4_error indicates a seg fault within hpl - that's as far as I've > come with google, but right now I have no idea how to proceed. I somehow doubt > that this venerable program is so buggy that I'd hit it on my first day ;) > > Any ideas where I might do something wrong? > > Cheers > > Carsten > > [1] > single core test > HPLinpack benchmark input file > Innovative Computing Laboratory, University of Tennessee > HPL.out output file name (if any) > 8 device out (6=stdout,7=stderr,file) > 1 # of problems sizes (N) > 14592 Ns > 1 # of NBs > 128 NBs > 0 PMAP process mapping (0=Row-,1=Column-major) > 1 # of process grids (P x Q) > 1 Ps > 1 Qs > 16.0 threshold > 1 # of panel fact > 2 PFACTs (0=left, 1=Crout, 2=Right) > 1 # of recursive stopping criterium > 4 NBMINs (>= 1) > 1 # of panels in recursion > 2 NDIVs > 1 # of recursive panel fact. 
> 1 RFACTs (0=left, 1=Crout, 2=Right) > 1 # of broadcast > 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) > 1 # of lookahead depth > 1 DEPTHs (>=0) > 2 SWAP (0=bin-exch,1=long,2=mix) > 64 swapping threshold > 0 L1 in (0=transposed,1=no-transposed) form > 0 U in (0=transposed,1=no-transposed) form > 1 Equilibration (0=no,1=yes) > 8 memory alignment in double (> 0) > ##### This line (no. 32) is ignored (it serves as a separator). ###### > 0 Number of additional problem sizes for PTRANS > 1200 10000 30000 values of N > 0 number of additional blocking sizes for PTRANS > 40 9 8 13 13 20 16 32 64 values of NB > > [2] > dual core setup > HPLinpack benchmark input file > Innovative Computing Laboratory, University of Tennessee > HPL.out output file name (if any) > 8 device out (6=stdout,7=stderr,file) > 1 # of problems sizes (N) > 14592 Ns > 1 # of NBs > 128 NBs > 0 PMAP process mapping (0=Row-,1=Column-major) > 1 # of process grids (P x Q) > 1 Ps > 2 Qs > 16.0 threshold > 1 # of panel fact > 2 PFACTs (0=left, 1=Crout, 2=Right) > 1 # of recursive stopping criterium > 4 NBMINs (>= 1) > 1 # of panels in recursion > 2 NDIVs > 1 # of recursive panel fact. > 1 RFACTs (0=left, 1=Crout, 2=Right) > 1 # of broadcast > 1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM) > 1 # of lookahead depth > 1 DEPTHs (>=0) > 2 SWAP (0=bin-exch,1=long,2=mix) > 64 swapping threshold > 0 L1 in (0=transposed,1=no-transposed) form > 0 U in (0=transposed,1=no-transposed) form > 1 Equilibration (0=no,1=yes) > 8 memory alignment in double (> 0) > ##### This line (no. 32) is ignored (it serves as a separator). ###### > 0 Number of additional problem sizes for PTRANS > 1200 10000 30000 values of N > 0 number of additional blocking sizes for PTRANS > 40 9 8 13 13 20 16 32 64 values of NB > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Tue Mar 16 10:38:07 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 16 Mar 2010 10:38:07 -0700 Subject: [Beowulf] 1000baseT NIC and PXE? Message-ID: Michael Di Domenico wrote: > > surprisingly enough there are still cards that don't come with PXE > built into the embedded rom. you'll have to check the specs on the > card you're interested in from the mfg website. Here are the docs for a typical inexpensive 1000baseT card: ftp://ftp10.dlink.com/pdfs/products/DGE-530T/DGE-530T_ds.pdf There is no empty socket on it for a boot rom, yet the specs say nothing about whether or not it will PXE boot. Let's turn this question around a bit. Can anyone suggest a specific inexpensive 1000baseT card which provides PXE and is otherwise reliable and fast? That is, one that you have personally used to boot a machine into your cluster. Similarly, the names of any models that should be avoided would also be useful information. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mdidomenico4 at gmail.com Tue Mar 16 14:52:30 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Tue, 16 Mar 2010 17:52:30 -0400 Subject: [Beowulf] 1000baseT NIC and PXE? In-Reply-To: References: Message-ID: On Tue, Mar 16, 2010 at 1:38 PM, David Mathog wrote: > Michael Di Domenico wrote: >> >> surprisingly enough there are still cards that don't come with PXE >> built into the embedded rom. 
you'll have to check the specs on the >> card you're interested in from the mfg website. > > Here are the docs for a typical inexpensive 1000baseT card: > > ftp://ftp10.dlink.com/pdfs/products/DGE-530T/DGE-530T_ds.pdf > > There is no empty socket on it for a boot rom, yet the specs say nothing > about whether or not it will PXE boot. > > Let's turn this question around a bit. Can anyone suggest a specific > inexpensive 1000baseT card which provides PXE and is otherwise reliable > and fast? That is, one that you have personally used to boot a machine > into your cluster. Similarly, the names of any models that should be > avoided would also be useful information. I can't speak for the D-Link card, but I primarily choose Intel NICs, something like this http://www.newegg.com/Product/Product.aspx?Item=N82E16833106121 Any of the Intel NICs that support your slot specification and support "Intel Boot Agent" should support PXE From greg.matthews at diamond.ac.uk Wed Mar 17 02:54:05 2010 From: greg.matthews at diamond.ac.uk (Gregory Matthews) Date: Wed, 17 Mar 2010 09:54:05 +0000 Subject: [Beowulf] 1000baseT NIC and PXE? In-Reply-To: References: Message-ID: <4BA0A6BD.90208@diamond.ac.uk> David Mathog wrote: > Let's turn this question around a bit. Can anyone suggest a specific > inexpensive 1000baseT card which provides PXE and is otherwise reliable > and fast? That is, one that you have personally used to boot a machine > into your cluster. Similarly, the names of any models that should be > avoided would also be useful information. We have had many problems with the Intel NICs that come embedded on the Supermicro twin boards when paired with Nortel switches. Setting the MTU to anything other than 1500 results in cards coming up in a strange state 50% of the time. Also, PXE has been a problem with certain versions of the driver, not to mention the extra confusion over e1000 and e1000e. These report themselves as: Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet Controller (Copper) (rev 01) GREG > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Greg Matthews 01235 778658 Senior Computer Systems Administrator Diamond Light Source, Oxfordshire, UK From andrew.robbie at gmail.com Wed Mar 17 06:06:28 2010 From: andrew.robbie at gmail.com (Andrew Robbie (Gmail)) Date: Thu, 18 Mar 2010 00:06:28 +1100 Subject: [Beowulf] 1000baseT NIC and PXE? In-Reply-To: References: Message-ID: <92BA00FE-E346-4ABA-82B5-D96F4E8D4F55@gmail.com> On 16/03/2010, at 7:47 AM, David Mathog wrote: > Sorry if this is a silly question, but do any of the inexpensive > 1000baseT NICs support PXE boot? Can I suggest you consult the rom-o-matic database maintained by the etherboot/gPXE project? The raw list is at: http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC For example, it shows that the DGE-530T is supported with the skge driver. It can be problematic mapping brand names/part numbers to chipsets, though. If you can buy one to test, that helps... Do you mean PCI, PCI-64, PCI-X or PCIe cards? Standard PCI is really too slow for GigE. Can you say which your motherboard supports? It is a good idea to make sure the NIC can work at 3.3v or 5v for flexibility.
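One way to take the brand-name guesswork out of it is to go by PCI IDs rather than marketing names; a rough sketch (the ID shown is only an example, check your own lspci output):

lspci -nn | grep -i ethernet
# prints something like: ... D-Link System Inc DGE-530T Gigabit Ethernet Adapter [1186:4b01]
# then look for that vendor:device pair in the NIC list above to find the matching gPXE driver,
# or match it against the driver list on rom-o-matic when building a boot image for that chip
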
Regards, Andrew From dnlombar at ichips.intel.com Wed Mar 17 09:24:25 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Wed, 17 Mar 2010 09:24:25 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <33623.192.168.1.213.1268677496.squirrel@mail.eadline.org> References: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> <33623.192.168.1.213.1268677496.squirrel@mail.eadline.org> Message-ID: <20100317162425.GA19667@nlxcldnl2.cl.intel.com> On Mon, Mar 15, 2010 at 11:24:56AM -0700, Douglas Eadline wrote: > I have placed a copy of Richard's table on ClusterMonkey > in case you want an html view. > > http://www.clustermonkey.net//content/view/275/33/ > IBTA shows 20Gb/s for EDR: -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From Shainer at mellanox.com Wed Mar 17 10:37:54 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Wed, 17 Mar 2010 10:37:54 -0700 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F01E155B9@mtiexch01.mti.com> The EDR speed will be 25.78Gb/s per lane or 100Gb/s data rate for 4x port. It was not made public on the IBTA web site, probably will be updated in the comming days. Gilad ----- Original Message ----- From: beowulf-bounces at beowulf.org To: Douglas Eadline Cc: richard.walsh at comcast.net ; beowulf at beowulf.org Sent: Wed Mar 17 09:24:25 2010 Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? On Mon, Mar 15, 2010 at 11:24:56AM -0700, Douglas Eadline wrote: > I have placed a copy of Richard's table on ClusterMonkey > in case you want an html view. > > http://www.clustermonkey.net//content/view/275/33/ > IBTA shows 20Gb/s for EDR: -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Wed Mar 17 12:55:42 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 17 Mar 2010 12:55:42 -0700 Subject: [Beowulf] Re: 1000baseT NIC and PXE? Message-ID: Andrew Robbie wrote: > Can I suggest you consult the rom-o-matic database maintained by the > etherboot/gPXE project? The raw list is at: > http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC > > For example, it shows that the DGE-530T is supported with the skge > driver. I did visit that site, but the documentation was not exactly "make PXE work when it didn't before" newbie friendly. It looked very helpful if one already knows what it means! Yesterday I called a bunch of tech support lines for the various cheap NIC manufacturers and was that ever a miserable experience. D-link and netsys both sent me to Indian phone support hell, in one instance so bad the questioner got into a loop on her script and started asking the same questions again, at which point I threw an exception and hung up. For those cards which had an empty socket to hold an EEPROM, nobody could (or would) tell me what type of chip to put in there, or where to get the software to load on it. The most honest answer of the day was (not an exact quote) "we just make the things, ask Realtek if it can do that". 
(Still waiting for Realtek to reply.) Gus Correa sent me the simplest solution - PXE boot using the existing 100baseT on the mobo and use the new gigabit card once the system comes up. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From carsten.aulbert at aei.mpg.de Wed Mar 17 13:07:39 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed, 17 Mar 2010 21:07:39 +0100 Subject: [Beowulf] HPL as a learning experience In-Reply-To: <4B9FB917.1070905@ldeo.columbia.edu> References: <201003161627.32012.carsten.aulbert@aei.mpg.de> <4B9FB917.1070905@ldeo.columbia.edu> Message-ID: <201003172107.47989.carsten.aulbert@aei.mpg.de> Hi On Tuesday 16 March 2010 18:00:07 Gus Correa wrote: > The problem is most likely mpich 1.2.7. > MPICH-1 is old and no longer maintained. > It is based on the P4 lower level libraries, which don't > seem to talk properly to current Linux kernels and/or > to current Ethernet card drivers. [...] > The easy fix is to use another MPI, say, OpenMPI or MPICH2. > I would guess they are available as packages for Debian. > You were right, switching over to openmpi solved this issue at once. Sorry for not trying that before causing noise here :) Now, the hard part might begin to narrow down the parameter space... Cheers Carsten From landman at scalableinformatics.com Wed Mar 17 15:54:29 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 17 Mar 2010 18:54:29 -0400 Subject: [Beowulf] Re: 1000baseT NIC and PXE? In-Reply-To: References: Message-ID: <0C96F3A2-F1D4-4281-A7AC-BFB9B0765418@scalableinformatics.com> Why not use a USB stick with gpxe.USB? This would provide the greatest flexibility. Please pardon brevity and typos ... Sent from my iPhone On Mar 17, 2010, at 3:55 PM, "David Mathog" wrote: > Andrew Robbie wrote: >> Can I suggest you consult the rom-o-matic database maintained by the >> etherboot/gPXE project? The raw list is at: >> http://rom-o-matic.net/gpxe/gpxe-git/gpxe.git/src/bin/NIC >> >> For example, it shows that the DGE-530T is supported with the skge >> driver. > > I did visit that site, but the documentation was not exactly "make PXE > work when it didn't before" newbie friendly. It looked very helpful > if > one already knows what it means! > > Yesterday I called a bunch of tech support lines for the various cheap > NIC manufacturers and was that ever a miserable experience. D-link > and > netsys both sent me to Indian phone support hell, in one instance so > bad > the questioner got into a loop on her script and started asking the > same > questions again, at which point I threw an exception and hung up. For > those cards which had an empty socket to hold an EEPROM, nobody could > (or would) tell me what type of chip to put in there, or where to get > the software to load on it. The most honest answer of the day was > (not > an exact quote) "we just make the things, ask Realtek if it can do > that". (Still waiting for Realtek to reply.) > > Gus Correa sent me the simplest solution - PXE boot using the existing > 100baseT on the mobo and use the new gigabit card once the system > comes up. 
> > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Wed Mar 17 16:11:26 2010 From: mathog at caltech.edu (David Mathog) Date: Wed, 17 Mar 2010 16:11:26 -0700 Subject: [Beowulf] Re: 1000baseT NIC and PXE? Message-ID: Joe Landman wrote: > Why not use a USB stick with gpxe.USB? > > This would provide the greatest flexibility. Not sure if these machines will boot from a USB key, never tried it. They are old enough that they might not. If it works, then yes, this would be a good option. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From mathog at caltech.edu Thu Mar 18 14:14:04 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 18 Mar 2010 14:14:04 -0700 Subject: [Beowulf] Re: 1000baseT NIC and PXE? Message-ID: Joe Landman wrote: > Why not use a USB stick with gpxe.USB? Gave gPXE.USB a try using the motherboard's 100baseT NIC. gPXE started after the right BIOS settings were entered and the USB showed up in the boot list. Unfortunately it used MAC "ad:ad:ad:ad:00:00" instead of the actual hardware MAC "00:e0:81:22:cc:3d", so DHCP wasn't set up to send it anything other than an address. ^B to get into the gPXE command line, but it wasn't accepting or echoing keyboard input. (Possibly terminal output was going out the serial port, anyway, it seemed to be locked up at the command line.) How does one make gPXE use the MAC it finds on the NIC instead of ad:ad:ad:ad:00:00? The nodes on this system are not interchangeable, node1 has data that node2 doesn't, and so forth. The cluster is soon to become heterogeneous. So the master does need to know who and what it is responding to. If there are multiple nics how is gPXE configured to use a particular one? (If they have different hardware I guess just include that one driver, but what if there are two the same?) Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From david.ritch at gmail.com Thu Mar 18 05:57:39 2010 From: david.ritch at gmail.com (David B. Ritch) Date: Thu, 18 Mar 2010 08:57:39 -0400 Subject: [Beowulf] PXE Booting and Interface Bonding Message-ID: <4BA22343.8030802@gmail.com> What is the best practice for the use of multiple NICs on cluster nodes? I've found that when I enable Etherchannel bonding in out network equipment, PXE booting does not work. It breaks the initial DHCP discover request, presumably because the response may not return to the same NIC. However, bonding the interfaces is a clear win for node availability and for performance. The best solution that I have is to turn off port channels in our network equipment, and use Linux kernel bonding, in balance-alb mode. This provides adaptive load balancing by ARP manipulation. Thanks in advance! David From ebiederm at xmission.com Sat Mar 20 12:20:49 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Sat, 20 Mar 2010 12:20:49 -0700 Subject: [Beowulf] PXE Booting and Interface Bonding In-Reply-To: <4BA22343.8030802@gmail.com> (David B. Ritch's message of "Thu\, 18 Mar 2010 08\:57\:39 -0400") References: <4BA22343.8030802@gmail.com> Message-ID: "David B. 
Ritch" writes: > What is the best practice for the use of multiple NICs on cluster nodes? > > I've found that when I enable Etherchannel bonding in out network > equipment, PXE booting does not work. It breaks the initial DHCP > discover request, presumably because the response may not return to the > same NIC. However, bonding the interfaces is a clear win for node > availability and for performance. > > The best solution that I have is to turn off port channels in our > network equipment, and use Linux kernel bonding, in balance-alb mode. > This provides adaptive load balancing by ARP manipulation. A few years ago I added 802.3ad LAG support to etherboot now gpxe. When used it negotiates tells the LAG that there is only one member. You might want to try that. Eric From ebiederm at xmission.com Sat Mar 20 16:06:08 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Sat, 20 Mar 2010 16:06:08 -0700 Subject: [Beowulf] PXE Booting and Interface Bonding In-Reply-To: <4BA53056.6010507@gmail.com> (David B. Ritch's message of "Sat\, 20 Mar 2010 16\:30\:14 -0400") References: <4BA22343.8030802@gmail.com> <4BA53056.6010507@gmail.com> Message-ID: "David B. Ritch" writes: > Eric, > > Thank you - that sounds like a good idea. However, I'm not sure that > we'll have an opportunity to replace the bootloader on our > motherboards. I'd love to see that become standard! > > Is gPXE widely used? How else would one approach this? Yes. At least I have seen it as the boot firmware on several 10Gig nics. You can also get gPXE to boot off of just about anything, so you should be able to at least try it out. Eric From gdjacobs at gmail.com Sat Mar 20 23:57:51 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Sun, 21 Mar 2010 01:57:51 -0500 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B7AD159.50000@scalableinformatics.com> Message-ID: <4BA5C36F.9010609@gmail.com> Lux, Jim (337C) wrote: > > James Lux, P.E. Task Manager, SOMD Software Defined Radios Flight > Communications Systems Section Jet Propulsion Laboratory 4800 Oak > Grove Drive, Mail Stop 161-213 Pasadena, CA, 91109 +1(818)354-2075 > phone +1(818)393-6875 fax > >> -----Original Message----- From: beowulf-bounces at beowulf.org >> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Doug O'Neal Sent: >> Tuesday, February 16, 2010 10:21 AM To: beowulf at beowulf.org >> Subject: [Beowulf] Re: Third-party drives not permitted on new Dell >> servers? >> >> On 02/16/2010 12:52 PM, Lux, Jim (337C) wrote: >>> >>> On 2/16/10 9:09 AM, "Joe Landman" >>> wrote: >>>> 5X markup? We must be doing something wrong :/ >>>> >>> >>> Depends on what the price includes. I could easily see a >>> commodity drive in a case lot being dropped on the loading dock >>> at, say, $100 each, and the drive with installation, system >>> integrator testing, downstream support, etc. being $500. Doesn't >>> take many hours on the phone tracking down an idiosyncracy or >>> setup to cost $500 in labor. >> But when you're installing anywhere from eight to forty-eight >> drives in a single system the required hours to make up that >> $400/drive overhead does get larger. And if you spread the system >> integrator testing over eight drives per unit and hundreds to >> thousands of units the cost per drive shouldn't be measured in >> hundreds of dollars. >> > > True, IFF the costing strategy is based on that sort of approach. > Various companies can and do price the NRE and support tail cost in a > variety of ways. 
They might have a "notional" system size and base > the pricing model on that: Say they, through research, find that > most customers are buying, say, 32 systems at a crack. Now the > support tail (which is basically "per system") is spread across only > 32 drives, not thousands. If you happen to buy 64 systems, then you > basically are paying twice. Most companies don't have infinite > granularity in this sort of thing, and try to pick a few breakpoints > that make sense. But in this case, they're selling not 32 controllers, or whatever. They're selling thousands or tens of thousands of controllers and tens or hundreds of thousands of drives across the entire product line. Do they qualify drives per system, or across the line (perhaps per controller model)? > (NRE = non recurring engineering) As far as the NRE goes, say they > get a batch of a dozen drives each of half a dozen kinds. They have > to set up half a dozen test systems (either in parallel or > sequentially), run the tests on all of them, and wind up with maybe 2 > or 3 leading candidates that they decide to list on their "approved > disk" list. The cost of testing the disks that didn't make the cut > has to be added to the cost of the disks that did. > > There's a lot that goes into pricing that isn't obvious at first > glance, or even second glance, especially if you're looking at a > single instance (your own purchase) and trying to work backwards from > there. There are weird anomalies that crop up in supposedly > commodity items from things like fuel prices (e.g. you happened to > buy that container load of disks when fuel prices were high, so > shipping cost more). A couple years ago, there were huge fluctuations > in the price of copper, so there would be 2:1 differences in the > retail cost of copper wire and tubing at the local Home Depot and > Lowes, basically depending on when they happened to have bought the > stuff wholesale. (this is the kind of thing that arbitrageurs look > for, of course) Logically, one would not see consistency in the markup in such a case. Nor would the tier one vendors be consistently marked up at similar amounts. > Some of it is "paying for convenience", too. Rather than do all the > testing yourself, or writing a detailed requirements and procurement > document for a third party, both of which cost you some non-zero > amount of time and money, you just pay the increased price to a > vendor who's done it for you. It's like eating sausage. You can buy > already made sausage, and the sausage maker has done the > experimenting with seasoning and process controls to come out with > something that people like. Or, you can spend the time to make it > yourself, potentially saving some money and getting a more customized > sausage taste, BUT, you're most likely going to have some > less-than-ideal sausage in the process. They should be willing and able to convince people of the value of each and every product they sell, and that includes justifying the non-interoperability of their disk controllers with 3rd party HDDs. > The more computers or sausage you're consuming, the more likely it is > that you could do better with a customized approach, but, even > there, you may be faced with resource limits (e.g. you could spend > your time getting a better deal on the disks or you could spend your > time doing research with the disks. Ultimately, the research MUST > get done, so you have to trade off how much you're willing to spend.) 
> Jim and Joe both are likely to have more of an idea of the realities going on inside Dell than I. Michael Will likely does as well as some others on list. However, it's up to Dell to justify their decisions to those on list who have concerns of this nature either now or when asked to in the bidding process. Just like how Joe was able to explain one of the subtle, relevant problems in system integration, all in one email! It's as simple as that. They should be able to justify their position, without sounding like they're high on Prozac. Whenever there's a discussion on vendor markup, I always think on the Audiophile scene. In particular: http://www.usa.denon.com/ProductDetails/3429.asp http://www.positive-feedback.com/Issue32/anjou.htm I think I can speak on behalf of everyone that we do not want computer hardware vendors to degrade to this level. -- Geoffrey D. Jacobs From hearnsj at googlemail.com Sun Mar 21 01:52:59 2010 From: hearnsj at googlemail.com (John Hearns) Date: Sun, 21 Mar 2010 08:52:59 +0000 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> <4B7A032C.2080207@scalableinformatics.com> Message-ID: <9f8092cc1003210152p2425a946s129e040aedc22e2b@mail.gmail.com> On 16 February 2010 07:08, Mark Hahn wrote: > > I think the real paradigm shift is that disks have become a consumable > which you want to be able to replace in 1-2 product generations (2-3 years). > along with this, disks just aren't that important, individually - even > something _huge_ like seagate's firmware problem, for instance, only drove > up random failures, no? You have just hit a very big nail on the head. Let's think about current RAID arrays - you have to replace a drive with the same type - take Fibrechannel arrays for instance - they have different drive speeds, and sizes of course. ot FC, but once when doing support I replaced a SATA drive by one of the same size. But not the same manufacturer - and it had just a couple of sectors less, so was not accepted in as a spare drive. We could go on - but the point being that once you select a storage array you are bound into that type of disk. I'm now still getting speedy and good service on a FC array which is rather elderly - replacment drives have been on the shelf for years. Anyway, Mark prompts me to think back to the IBM Storage Tank concept - drive goes bad and it is popped out of a hatch like a vending machine. Remember, this is the Beowulf list and Beowulf is about applying COTS technology. We're in the Web 2.0 age, with Google, Microsoft et. al. deploying containerised data centres - and somehow I don't reckon they keep all their data on some huge EMC fibrechannel array with a dual FC fabric and a live mirror to another lockstep duplicate array in another building, via dark fibre, with endless discussions on going to 8Gbit FC (yadda yadda, you get the point). As Mark says - storage is storage. It should be bought by the pallet load, and deployed like Lego bricks. From james.p.lux at jpl.nasa.gov Sun Mar 21 08:20:51 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sun, 21 Mar 2010 08:20:51 -0700 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: <4BA5C36F.9010609@gmail.com> Message-ID: On 3/20/10 11:57 PM, "Geoff Jacobs" wrote: > > Whenever there's a discussion on vendor markup, I always think on the > Audiophile scene. 
In particular: > http://www.usa.denon.com/ProductDetails/3429.asp > > ?Additionally, signal directional markings are provided for optimum signal transfer.? You mean you don?t carefully align your ethernet cables so that they are oriented in the direction of predominant data flow? I'm sure all the hot stuff cluster folks take a look at the data flow among nodes before each job and have the techs go reorient the cables. Or provide dual cables and paths, at the very least. > > > http://www.positive-feedback.com/Issue32/anjou.htm > > I think I can speak on behalf of everyone that we do not want computer > hardware vendors to degrade to this level. Aughhh! And I just spent the last year in my garage perfecting my Beophile (tm pending) interconnect cables. I have them carefully aligned to magnetic north to cure, waiting for the residual stresses to decay. > > -- > Geoffrey D. Jacobs > From james.p.lux at jpl.nasa.gov Sun Mar 21 08:41:55 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sun, 21 Mar 2010 08:41:55 -0700 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <9f8092cc1003210152p2425a946s129e040aedc22e2b@mail.gmail.com> Message-ID: On 3/21/10 1:52 AM, "John Hearns" wrote: > On 16 February 2010 07:08, Mark Hahn wrote: > >> >> I think the real paradigm shift is that disks have become a consumable >> which you want to be able to replace in 1-2 product generations (2-3 years). >> along with this, disks just aren't that important, individually - even >> something _huge_ like seagate's firmware problem, for instance, only drove >> up random failures, no? > > > > You have just hit a very big nail on the head. > > Remember, this is the Beowulf list and Beowulf is about applying COTS > technology. To a certain extent Beowulfery has strayed from its roots. Originally, it was:Hey, I can do supercomputing that's competitive (or almost as good) as the big iron with cheap consumer gear. But now it has succeeded to the point that it's the dominant way of supercomputing, and the emphasis is on optimizing performance, almost to the point of CDC carefully trimming the wire lengths on the fast big vector machines. Clusters these days leverage commodity, but the nodes tend to be more specialized, compared the run of the mill desktops that we started with. Would you see anything like the StoneSouperComputer today? Talk about a heterogenous cluster. We're in the Web 2.0 age, with Google, Microsoft et. al. > deploying containerised data centres - and somehow I don't reckon they > keep all their data on some huge EMC fibrechannel array with a dual FC > fabric and a live mirror to another lockstep duplicate array in > another building, via dark fibre, with endless discussions on going to > 8Gbit FC (yadda yadda, you get the point). > > As Mark says - storage is storage. It should be bought by the pallet > load, and deployed like Lego bricks. Yes. And the true direction of classic Beowulfery should be to deal with the non-ideal/heterogenous nature of this approach. Jim From david.ritch.lists at gmail.com Sat Mar 20 13:30:14 2010 From: david.ritch.lists at gmail.com (David B. Ritch) Date: Sat, 20 Mar 2010 16:30:14 -0400 Subject: [Beowulf] PXE Booting and Interface Bonding In-Reply-To: References: <4BA22343.8030802@gmail.com> Message-ID: <4BA53056.6010507@gmail.com> Eric, Thank you - that sounds like a good idea. However, I'm not sure that we'll have an opportunity to replace the bootloader on our motherboards. I'd love to see that become standard! 
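In the meantime, for anyone following the thread, the kernel-bonding fallback I described is roughly the following (a sketch only; RHEL-style config, and the interface names and address are just examples, adjust for your distro):

# /etc/modprobe.conf
alias bond0 bonding
options bond0 mode=balance-alb miimon=100
# or, for a quick test by hand:
modprobe bonding mode=balance-alb miimon=100
ifconfig bond0 10.0.0.10 netmask 255.255.255.0 up
ifenslave bond0 eth0 eth1

Since balance-alb needs no switch-side port-channel configuration, PXE on a single NIC still works, and the bond only comes into play once the kernel is up.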
Is gPXE widely used? How else would one approach this? Thanks! dbr On 3/20/2010 3:20 PM, Eric W. Biederman wrote: > "David B. Ritch" writes: > > >> What is the best practice for the use of multiple NICs on cluster nodes? >> >> I've found that when I enable Etherchannel bonding in out network >> equipment, PXE booting does not work. It breaks the initial DHCP >> discover request, presumably because the response may not return to the >> same NIC. However, bonding the interfaces is a clear win for node >> availability and for performance. >> >> The best solution that I have is to turn off port channels in our >> network equipment, and use Linux kernel bonding, in balance-alb mode. >> This provides adaptive load balancing by ARP manipulation. >> > A few years ago I added 802.3ad LAG support to etherboot now gpxe. > > When used it negotiates tells the LAG that there is only one member. > You might want to try that. > > Eric > > > From thakur at mcs.anl.gov Wed Mar 24 06:46:06 2010 From: thakur at mcs.anl.gov (Rajeev Thakur) Date: Wed, 24 Mar 2010 08:46:06 -0500 Subject: [Beowulf] [hpc-announce] FW: EuroMPI 2010 Call for Papers (extended deadline April 19th) Message-ID: <5874958B3C0140A6BEBA541CDC14ED06@thakurlaptop> -----Original Message----- From: Rolf Rabenseifner Sent: Monday, March 22, 2010 11:05 AM To: thakur at mcs.anl.gov Subject: EuroMPI 2010 Call for Papers (extended deadline) Please excuse the cross-posting. Due to many requests, we have postponed the submission deadline by 2 weeks. ------------------------------------------------------------------------ -------- CALL FOR PAPERS -- Extension of Deadlines 17th European MPI Users' Group Meeting (EuroMPI 2010) http://www.eurompi2010.org Stuttgart, Germany, September 12th-15th 2010 Extended submission deadline: April 19th 2010 ------------------------------------------------------------------------ -------- MPI (Message Passing Interface) has evolved into the standard interfaces for high-performance parallel programming in the message-passing paradigm. EuroMPI is the most prominent meeting dedicated to the latest developments of MPI, its use, including support tools, and implementation, and to applications using these interfaces. The 17th European MPI Users' Group Meeting will be a forum for users and developers of MPI and other message-passing programming environments. Through the presentation of contributed papers, poster presentations and invited talks, attendees will have the opportunity to share ideas and experiences to contribute to the improvement and furthering of message-passing and related parallel programming paradigms. Topics of interest for the meeting include, but are not limited to: - MPI implementation issues and improvements - Latest extensions to MPI - MPI for high-performance computing, clusters and grid environments - New message-passing and hybrid parallel programming paradigms - Interaction between message-passing software and hardware - Fault tolerance in message-passing programs - Performance evaluation of MPI applications - Tools and environments for MPI - Algorithms using the message-passing paradigm - Applications in science and engineering based on message-passing Submissions on applications demonstrating both the potential and shortcomings of message passing programming and specifically MPI are particularly welcome. SUBMISSION INFORMATION Contributors are invited to submit a full paper as a PDF document not exceeding 8 pages in English (2 pages for poster abstracts). 
The title page should contain an abstract of at most 100 words and five specific keywords. The conference proceedings consisting of abstracts of invited talks, full papers, and two page abstracts for the posters will be published by Springer in the LNCS series. Papers need to be formatted according to the Springer LNCS guidelines. The usage of LaTeX for preparation of the contribution as well as the submission in camera ready format is strongly recommended. Style files can be found at http://www.springer.de/comp/lncs/authors.html . Papers are to be submitted electronically via the online submission system at http://www.easychair.org/conferences/?conf=eurompi2010 Submissions to the ParSim2010 session are handled and reviewed by the respective session chairs. For more information please refer to the ParSim website http://www.lrr.in.tum.de/~trinitic/parsim10/ . All accepted submissions are expected to be presented at the conference by one of the authors, which requires registration for the conference. IMPORTANT DATES EuroMPI Conference September 12-15th, 2010 Submission of full papers April 19th, 2010 (extended deadline) Notification of authors May 17th, 2010 Camera ready papers June 10th, 2010 As in the previous years, the special session 'ParSim' will focus on numerical simulation for parallel engineering environments. EuroMPI 2010 will also hold the 'Outstanding Papers' session, where the best papers selected by the program committee will be presented. For further Information please see the conference website: http://www.eurompi2010.org General Chair: Jack Dongarra, University of Tennessee, USA Program Chair: Michael Resch, HLRS, University of Stuttgart, Germany Program Co-Chairs: Rainer Keller, ORNL, USA and HLRS, Germany Edgar Gabriel, University of Houston, USA Program Committee: Richard Barrett, Oak Ridge National Laboratory, USA Gil Bloch, Mellanox, Israel George Bosilca, University of Tennessee, USA Ron Brightwell, Sandia National Laboratories, New Mexico Franck Cappello, University of Illinois, USA / INRIA, France Barbara Chapman, University of Houston, USA Yiannis Cotronis, University of Athens Erik D.'Hollander, Ghent University, Belgium Jean-Christophe Desplat, ICHEC, Ireland Frederic Desprez, INRIA, France Jack Dongarra, University of Tennessee, USA Edgar Gabriel, University of Houston, USA Javier Garcia-Blas, Universidad Carlos III de Madrid, Spain Al Geist, Oak Ridge National Laboratory, USA Michael Gerndt, Technical University Muenchen, Germany Ganesh Gopalakrishnan, University of Utah, USA Sergei Gorlatch, University of Muenster, Germany Andrzej Goscinski, Deakin University, Australia Richard L. 
Graham, Oak Ridge National Laboratory, USA William Gropp, University of Illinois Urbana-Champaign, USA Thomas Herault, INRIA/LRI, France Torsten Hoefler, Indiana University, USA Josh Hursey, Indiana University, USA Yutaka Ishikawa, University of Tokyo, Japan Tahar Kechadi, University College Dublin, Ireland Rainer Keller, Oak Ridge National Laboratory, USA Stefan Lankes, RWTH Aachen, Germany Jesper Larsson-Traeff, University of Vienna, Austria Alexey Lastovetsky, University College Dublin, Ireland Andrew Lumsdaine, Indiana University, USA Ewing Rusty Lusk, Argonne National Laboratory, USA Thomas Margalef, Universitat Autonoma de Barcelona, Spain Jean-Francois Mehaut, IMAG, France Bernd Mohr, Forschungszentrum Juelich, Germany Raymond Namyst, University of Bordeaux, France Rolf Rabenseifner, HLRS, University of Stuttgart, Germany Michael Resch, HLRS, University of Stuttgart, Germany Casiano Rodriguez-Leon, Universidad de la Laguna, Spain Robert Ross, Argonne National Laboratory, USA Martin Schulz, Lawrence Livermore National Laboratory, USA Stephen F. Siegel, University of Delaware, USA Jeffrey Squyres, Cisco, Inc., USA Bronis R. de Supinski, Lawrence Livermore National Laboratory, USA Rajeev Thakur, Argonne National Laboratory, USA Carsten Trinitis, Technische Universitaet Muenchen, Germany Jan Westerholm, Abo Akademi University, Finland Roland Wismueller, Universitaet Siegen, Germany Joachim Worringen, International Algorithmic Trading GmbH, Germany From hugo.hernandez at nih.gov Fri Mar 26 09:32:05 2010 From: hugo.hernandez at nih.gov (Hernandez, Hugo (NIH/NIAID) [C]) Date: Fri, 26 Mar 2010 12:32:05 -0400 Subject: [Beowulf] Problems installing HPL 2.0 Message-ID: Hello there, Can somebody help me on a problem I am experiencing when trying to install HPL 2.0 in our system? The error comes as the HPL_dlamch.c isn?t working because it can?t find the hpl.h include file. The file already exists in $(TOPdir)/include/hpl.h. Do I am missing something? I have added the directory /myApps/hpl-2.0/include into my LD_LIBRARY_PATH without any result. Could you please let me know how to work on this problem? All help will be really appreciated! 
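One detail I notice in the build output below: every source file is compiled with the full set of -I include flags except HPL_dlamch.c, which is compiled with no flags at all and is exactly where the hpl.h error appears:

/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlange.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include [...] ../HPL_dlange.c
/usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlamch.o -c ../HPL_dlamch.c

If I read the HPL makefiles correctly, HPL_dlamch.c is the one file built with $(CCNOOPT), which I left empty in my make.arch below; the stock templates appear to set CCNOOPT = $(HPL_DEFS), so that may be relevant.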
Many thanks, -Hugo My system configuration: RHEL 5.4 Linux myMachine.com 2.6.18-164.el5 #1 SMP Tue Aug 18 15:51:48 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux openmpi-1.3.2-2.el5 /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc /usr/lib64/openmpi/1.3.2-gcc/bin/mpif90 Linear Algebra library: GotoBLAS2 Here is my make.arch file: SHELL = /bin/sh # CD = cd CP = cp LN_S = ln -s MKDIR = mkdir RM = /bin/rm -f TOUCH = touch # ARCH = Linux_x86_64 # - HPL Directory Structure / HPL library ------------------------------ TOPdir = /niaidAdmin/apps/hpl-2.0 INCdir = $(TOPdir)/include BINdir = $(TOPdir)/bin/$(ARCH) LIBdir = $(TOPdir)/lib/$(ARCH) # HPLlib = $(LIBdir)/libhpl.a # - Message Passing library (MPI) -------------------------------------- MPdir = /usr MPinc =-I$(MPdir)/include/openmpi MPlib = $(MPdir)/lib64/openmpi/libmpi.so # - Linear Algebra library (BLAS or VSIPL) ----------------------------- LAdir = /niaidAdmin/apps/GotoBLAS2 LAinc =-I$(MPdir)/include LAlib = $(LAdir)/libblas.so.3 $(LAdir)/atlas/libblas.so.3 # F2CDEFS = # HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib) # HPL_OPTS = -DHPL_CALL_CBLAS # ---------------------------------------------------------------------- HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES) # - Compilers / linkers - Optimization flags --------------------------- CC = /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc CCNOOPT = CCFLAGS = $(HPL_DEFS) -pipe -O3 -funroll-loops # LINKER = /usr/lib64/openmpi/1.3.2-gcc/bin/mpif90 LINKFLAGS = $(CCFLAGS) # ARCHIVER = ar ARFLAGS = r RANLIB = echo And here is the error message: [root at test hpl-2.0]# make arch=niaid make -f Make.top startup_dir arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' mkdir include/niaid mkdir: cannot create directory `include/niaid': File exists make[1]: [startup_dir] Error 1 (ignored) mkdir lib mkdir: cannot create directory `lib': File exists make[1]: [startup_dir] Error 1 (ignored) mkdir lib/niaid mkdir: cannot create directory `lib/niaid': File exists make[1]: [startup_dir] Error 1 (ignored) mkdir bin mkdir: cannot create directory `bin': File exists make[1]: [startup_dir] Error 1 (ignored) mkdir bin/niaid mkdir: cannot create directory `bin/niaid': File exists make[1]: [startup_dir] Error 1 (ignored) make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top startup_src arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/auxil arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/auxil ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/auxil/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/blas arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/blas ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/blas/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/comm arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/comm ; mkdir niaid ) mkdir: cannot create directory `niaid': 
File exists make[2]: [leaf] Error 1 (ignored) ( cd src/comm/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/grid arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/grid ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/grid/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/panel arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/panel ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/panel/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/pauxil arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/pauxil ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/pauxil/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/pfact arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/pfact ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/pfact/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=src/pgesv arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/pgesv ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd src/pgesv/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top startup_tst arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=testing/matgen arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd testing/matgen ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd testing/matgen/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=testing/timer arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd testing/timer ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd 
testing/timer/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=testing/pmatgen arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd testing/pmatgen ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd testing/pmatgen/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=testing/ptimer arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd testing/ptimer ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd testing/ptimer/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top leaf le=testing/ptest arch=niaid make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd testing/ptest ; mkdir niaid ) mkdir: cannot create directory `niaid': File exists make[2]: [leaf] Error 1 (ignored) ( cd testing/ptest/niaid ; \ ln -s /niaidAdmin/apps/hpl-2.0/Make.niaid Make.inc ) ln: creating symbolic link `Make.inc' to `/niaidAdmin/apps/hpl-2.0/Make.niaid': File exists make[2]: [leaf] Error 1 (ignored) make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top refresh_src arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' cp makes/Make.auxil src/auxil/niaid/Makefile cp makes/Make.blas src/blas/niaid/Makefile cp makes/Make.comm src/comm/niaid/Makefile cp makes/Make.grid src/grid/niaid/Makefile cp makes/Make.panel src/panel/niaid/Makefile cp makes/Make.pauxil src/pauxil/niaid/Makefile cp makes/Make.pfact src/pfact/niaid/Makefile cp makes/Make.pgesv src/pgesv/niaid/Makefile make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top refresh_tst arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' cp makes/Make.matgen testing/matgen/niaid/Makefile cp makes/Make.timer testing/timer/niaid/Makefile cp makes/Make.pmatgen testing/pmatgen/niaid/Makefile cp makes/Make.ptimer testing/ptimer/niaid/Makefile cp makes/Make.ptest testing/ptest/niaid/Makefile make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top refresh_src arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' cp makes/Make.auxil src/auxil/niaid/Makefile cp makes/Make.blas src/blas/niaid/Makefile cp makes/Make.comm src/comm/niaid/Makefile cp makes/Make.grid src/grid/niaid/Makefile cp makes/Make.panel src/panel/niaid/Makefile cp makes/Make.pauxil src/pauxil/niaid/Makefile cp makes/Make.pfact src/pfact/niaid/Makefile cp makes/Make.pgesv src/pgesv/niaid/Makefile make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top refresh_tst arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' cp makes/Make.matgen testing/matgen/niaid/Makefile cp makes/Make.timer testing/timer/niaid/Makefile cp makes/Make.pmatgen testing/pmatgen/niaid/Makefile cp makes/Make.ptimer testing/ptimer/niaid/Makefile cp makes/Make.ptest testing/ptest/niaid/Makefile make[1]: 
Leaving directory `/niaidAdmin/apps/hpl-2.0' make -f Make.top build_src arch=niaid make[1]: Entering directory `/niaidAdmin/apps/hpl-2.0' ( cd src/auxil/niaid; make ) make[2]: Entering directory `/niaidAdmin/apps/hpl-2.0/src/auxil/niaid' /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlacpy.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_dlacpy.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlatcpy.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_dlatcpy.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_fprintf.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_fprintf.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_warn.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_warn.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_abort.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_abort.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlaprnt.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_dlaprnt.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlange.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_dlange.c /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlamch.o -c ../HPL_dlamch.c ../HPL_dlamch.c:50:17: error: hpl.h: No such file or directory ../HPL_dlamch.c:57: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c:60: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c:64: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c:67: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c:70: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c:74: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?STDC_ARGS? ../HPL_dlamch.c: In function ?HPL_dlamch?: ../HPL_dlamch.c:85: error: expected ?=?, ?,?, ?;?, ?asm? or ?__attribute__? before ?CMACH? ../HPL_dlamch.c:164: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c:164: error: (Each undeclared identifier is reported only once ../HPL_dlamch.c:164: error: for each function it appears in.) ../HPL_dlamch.c:164: error: ?HPL_rtwo? undeclared (first use in this function) ../HPL_dlamch.c:166: error: ?HPL_rzero? undeclared (first use in this function) ../HPL_dlamch.c:176: error: ?HPL_MACH_EPS? undeclared (first use in this function) ../HPL_dlamch.c:177: error: ?HPL_MACH_SFMIN? 
undeclared (first use in this function) ../HPL_dlamch.c:178: error: ?HPL_MACH_BASE? undeclared (first use in this function) ../HPL_dlamch.c:179: error: ?HPL_MACH_PREC? undeclared (first use in this function) ../HPL_dlamch.c:180: error: ?HPL_MACH_MLEN? undeclared (first use in this function) ../HPL_dlamch.c:181: error: ?HPL_MACH_RND? undeclared (first use in this function) ../HPL_dlamch.c:182: error: ?HPL_MACH_EMIN? undeclared (first use in this function) ../HPL_dlamch.c:183: error: ?HPL_MACH_RMIN? undeclared (first use in this function) ../HPL_dlamch.c:184: error: ?HPL_MACH_EMAX? undeclared (first use in this function) ../HPL_dlamch.c:185: error: ?HPL_MACH_RMAX? undeclared (first use in this function) ../HPL_dlamch.c: In function ?HPL_dlamc1?: ../HPL_dlamch.c:262: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c:274: error: ?HPL_rtwo? undeclared (first use in this function) ../HPL_dlamch.c: At top level: ../HPL_dlamch.c:345: warning: conflicting types for ?HPL_dlamc2? ../HPL_dlamch.c:345: error: static declaration of ?HPL_dlamc2? follows non-static declaration ../HPL_dlamch.c:161: error: previous implicit declaration of ?HPL_dlamc2? was here ../HPL_dlamch.c: In function ?HPL_dlamc2?: ../HPL_dlamch.c:416: error: ?HPL_rzero? undeclared (first use in this function) ../HPL_dlamch.c:416: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c:416: error: ?HPL_rtwo? undeclared (first use in this function) ../HPL_dlamch.c:545: error: ?stderr? undeclared (first use in this function) ../HPL_dlamch.c: At top level: ../HPL_dlamch.c:583: error: conflicting types for ?HPL_dlamc3? ../HPL_dlamch.c:274: error: previous implicit declaration of ?HPL_dlamc3? was here ../HPL_dlamch.c:626: warning: conflicting types for ?HPL_dlamc4? ../HPL_dlamch.c:626: error: static declaration of ?HPL_dlamc4? follows non-static declaration ../HPL_dlamch.c:464: error: previous implicit declaration of ?HPL_dlamc4? was here ../HPL_dlamch.c: In function ?HPL_dlamc4?: ../HPL_dlamch.c:667: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c:668: error: ?HPL_rzero? undeclared (first use in this function) ../HPL_dlamch.c: At top level: ../HPL_dlamch.c:698: warning: conflicting types for ?HPL_dlamc5? ../HPL_dlamch.c:698: error: static declaration of ?HPL_dlamc5? follows non-static declaration ../HPL_dlamch.c:570: error: previous implicit declaration of ?HPL_dlamc5? was here ../HPL_dlamch.c: In function ?HPL_dlamc5?: ../HPL_dlamch.c:748: error: ?HPL_rzero? undeclared (first use in this function) ../HPL_dlamch.c:812: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c: At top level: ../HPL_dlamch.c:842: error: conflicting types for ?HPL_dipow? ../HPL_dlamch.c:164: error: previous implicit declaration of ?HPL_dipow? was here ../HPL_dlamch.c: In function ?HPL_dipow?: ../HPL_dlamch.c:866: error: ?HPL_rone? undeclared (first use in this function) ../HPL_dlamch.c:871: error: ?HPL_rzero? undeclared (first use in this function) make[2]: *** [HPL_dlamch.o] Error 1 make[2]: Leaving directory `/niaidAdmin/apps/hpl-2.0/src/auxil/niaid' make[1]: *** [build_src] Error 2 make[1]: Leaving directory `/niaidAdmin/apps/hpl-2.0' make: *** [build] Error 2 -- Hugo R. Hernandez, Contractor Dell Perot Systems Sr. Systems Administrator Mac & Linux Server Team, OCICB/OEB National Institutes of Health National Institute of Allergy & Infectious Diseases 10401 Fernwood Drive Fernwood West - Rm. 
2009 Bethesda, MD 20817 Phone: 301-841-4203 Cell: 240-479-1888 Fax: 301-480-0784 www.dell.com/perotsystems -- "If your efforts were met with indifference, don't be discouraged; the sun puts on a marvelous show every morning while most people are still sleeping." - Anonymous (Brazilian) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. From hugo.hernandez at nih.gov Fri Mar 26 10:31:08 2010 From: hugo.hernandez at nih.gov (Hernandez, Hugo (NIH/NIAID) [C]) Date: Fri, 26 Mar 2010 13:31:08 -0400 Subject: [Beowulf] Problems installing HPL 2.0 In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0FDB66F2@milexchmb1.mil.tagmclarengroup.com> Message-ID: Hello John, Thanks for your answer. I have set my TOPdir and INCdir correctly. I did change /myApps to /niaidAdmin/apps but I still have the same problem. I will try your suggestion about the last line (...leave it as '.'). -Hugo # - HPL Directory Structure / HPL library ------------------------------ TOPdir = /niaidAdmin/apps/hpl-2.0 INCdir = $(TOPdir)/include BINdir = $(TOPdir)/bin/$(ARCH) LIBdir = $(TOPdir)/lib/$(ARCH) # HPLlib = $(LIBdir)/libhpl.a # - Message Passing library (MPI) -------------------------------------- MPdir = /usr MPinc =-I$(MPdir)/include/openmpi MPlib = $(MPdir)/lib64/openmpi/libmpi.so # - Linear Algebra library (BLAS or VSIPL) ----------------------------- LAdir = /niaidAdmin/apps/GotoBLAS2 LAinc =-I$(MPdir)/include LAlib = $(LAdir)/libblas.so.3 $(LAdir)/atlas/libblas.so.3 # F2CDEFS = # HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc) On 3/26/10 1:21 PM, "Hearns, John" wrote: > Ahem. > In your makefile TOPdir is set to TOPdir = /niaidAdmin/apps/hpl-2.0 > > then INCdir is $(TOPdir)/include > > You need to set TOPdir explicitly as /myApps/hpl-2.0 > Or even better I would leave it as '.' > > > The contents of this email are confidential and for the exclusive use of the > intended recipient. If you receive this email in error you should not copy > it, retransmit it, use it or disclose its contents but should return it to the > sender immediately and delete your copy. -- Hugo R. Hernandez, Contractor Dell Perot Systems Sr. Systems Administrator Mac & Linux Server Team, OCICB/OEB National Institutes of Health National Institute of Allergy & Infectious Diseases 10401 Fernwood Drive Fernwood West - Rm. 2009 Bethesda, MD 20817 Phone: 301-841-4203 Cell: 240-479-1888 Fax: 301-480-0784 www.dell.com/perotsystems -- "If your efforts were met with indifference, don't be discouraged; the sun puts on a marvelous show every morning while most people are still sleeping." - Anonymous (Brazilian) Disclaimer: The information in this e-mail and any of its attachments is confidential and may contain sensitive information. It should not be used by anyone who is not the original intended recipient. If you have received this e-mail in error please inform the sender and delete it from your mailbox or any other storage devices. National Institute of Allergy and Infectious Diseases shall not accept liability for any statements made that are sender's own and not expressly made on behalf of the NIAID by one of its representatives.
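For context on the compile failure earlier in this thread: in the stock HPL 2.0 Make.<arch> template (Make.UNKNOWN), the include paths reach the compiler through HPL_DEFS, which feeds both CCFLAGS and CCNOOPT, and HPL_dlamch.c is the one file built with CCNOOPT so that it is not optimized. Below is a minimal sketch of how that tail of a Make.niaid could look, assuming the stock template layout and reusing the paths quoted above; the values are illustrative, not Hugo's actual file.

  # - Compilers / linkers - sketch only, following the stock Make.UNKNOWN layout -
  HPL_OPTS     = -DHPL_CALL_CBLAS
  HPL_DEFS     = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
  CC           = /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc
  # CCNOOPT is what HPL_dlamch.c is compiled with; if it is left empty,
  # the -I paths (and hence hpl.h) disappear from exactly that one compile line.
  CCNOOPT      = $(HPL_DEFS)
  CCFLAGS      = $(HPL_DEFS) -pipe -O3 -funroll-loops

A bare "mpicc -o HPL_dlamch.o -c ../HPL_dlamch.c" line in the build log above, with no -I flags at all, is consistent with CCNOOPT having been left empty or undefined in the arch file.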
From mathog at caltech.edu Fri Mar 26 15:48:50 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 26 Mar 2010 15:48:50 -0700 Subject: [Beowulf] mysterious slow SATA on one machine Message-ID: I'm hoping somebody has seen this before and can suggest what might be going on. One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM, Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow SATA IO. This is the case for two different disks (WD10EARS and ST340014AS), two different disk schedulers, and two different OS's (Mandriva 2010.0 and PLD 2.97 rescue linux.) Using a different brand of cable, and plugging into a different SATA port didn't help either. However, move those disks to another machine (Asus A8N5X, Nvidia CK804 SATA controller, single core, 1 GB RAM, Knoppix) and they are both much faster. Raw results from various experiments here: http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf For the sustained write test both disks on the slow system take about 102s to write 4GB to disk, or around 41.3 MB/s. That isn't horrible horrible, but it isn't great either. On the faster machine the WD10EARS does the job in 39 seconds, and even the old Seagate is done in 74s. It strikes me that something must be rate limiting both disks to about the same throughput. The Sil 3114 chip is somehow interfaced through the PCI bus, but even if that is only 33MHz it is still 4 bytes wide and should be able to handle around 132 MB/s, 3X what I'm seeing. All of the PCI and PCI-X slots are unoccupied. I have no previous experience with the Sil 3114 or the Arima board, so don't know if this is typical for either. Perhaps the oddest part of this is that during these tests the disk light on the slow system blinks but is often off for long periods. Conversely, on the faster system the disk light stays on pretty steadily. As if on the slower system it is doing something else when it should be doing disk IO. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From coutinho at dcc.ufmg.br Fri Mar 26 17:30:08 2010 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Fri, 26 Mar 2010 21:30:08 -0300 Subject: [Beowulf] mysterious slow SATA on one machine In-Reply-To: References: Message-ID: AFAIK all your disks and the Nvidia CK804 support NCQ, but the Sil 3114 doesn't. This could explain the lower drive throughput under the Sil 3114 controller. 2010/3/26 David Mathog > I'm hoping somebody has seen this before and can suggest what might be > going on. > > One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM, > Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow > SATA IO. This is the case for two different disks (WD10EARS and > ST340014AS), two different disk schedulers, and two different OS's > (Mandriva 2010.0 and PLD 2.97 rescue linux.) Using a different brand of > cable, and plugging into a different SATA port didn't help either. > However, move those disks to another machine (Asus A8N5X, Nvidia CK804 > SATA controller, single core, 1 GB RAM, Knoppix) and they are both much > faster.
Raw results from various experiments here: > > http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf > http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf > > For the sustained write test both disks on the slow system take about > 102s to write 4GB to disk, or around 41.3GB/s. That isn't horrible > horrible, but it isn't great either. On the faster machine the WD10EARS > does the job in 39 seconds, and even the old Seagate is done in 74s. It > strikes me that something must be rate limiting both disks to about the > same throughput. The Sil 3114 chip is somehow interfaced through the > PCI bus, but even if that is only 33MHz it is still 4 bytes wide and > should be able to handle around 132 MB/s, 3X what I'm seeing. All of > the PCI and PCI-X slots are unoccupied. I have no previous experience > with the Sil 3114 or the Arima board, so don't know if this is typical > for either. > > Perhaps the oddest part of this is that during these tests the disk > light on the slow system blinks but is often off for long periods. > Conversely, on the faster system the disk light stays on pretty > steadily. As if on the slower system it is doing something else when it > should be doing disk IO. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gdjacobs at gmail.com Fri Mar 26 18:18:25 2010 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Fri, 26 Mar 2010 20:18:25 -0500 Subject: [Beowulf] mysterious slow SATA on one machine In-Reply-To: References: Message-ID: <4BAD5CE1.9030802@gmail.com> David Mathog wrote: > I'm hoping somebody has seen this before and can suggest what might be > going on. > > One machine (Arima HDAMA-I board, dual Opteron 280, 4GB RAM, > Sil 3114 Sata controller, Sil 5.4.03 firmware) has mysteriously slow > SATA IO. This is the case for two different disks (WD10EARS and > ST340014AS), two different disk schedulers, and two different OS's > (Mandriva 2010.0 and PLD 2.97 rescue linux.) Using a different brand of > cable, and plugging into a different SATA port didn't help either. > However, move those disks to another machine (Asus A8N5X, Nvidia CK804 > SATA controller, single core, 1 GB RAM, Knoppix) and they are both much > faster. Raw results from various experiments here: > > http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf > http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf > > For the sustained write test both disks on the slow system take about > 102s to write 4GB to disk, or around 41.3GB/s. That isn't horrible > horrible, but it isn't great either. On the faster machine the WD10EARS > does the job in 39 seconds, and even the old Seagate is done in 74s. It > strikes me that something must be rate limiting both disks to about the > same throughput. The Sil 3114 chip is somehow interfaced through the > PCI bus, but even if that is only 33MHz it is still 4 bytes wide and > should be able to handle around 132 MB/s, 3X what I'm seeing. All of > the PCI and PCI-X slots are unoccupied. I have no previous experience > with the Sil 3114 or the Arima board, so don't know if this is typical > for either. 
> > Perhaps the oddest part of this is that during these tests the disk > light on the slow system blinks but is often off for long periods. > Conversely, on the faster system the disk light stays on pretty > steadily. As if on the slower system it is doing something else when it > should be doing disk IO. As mentioned to David in a separate post, I see similar (worse) performance deltas using an S-I controller. I see the same delta using sata_sil driving an ATI SB4xx south bridge. It might be kernel related, as escalated here: https://bugzilla.redhat.com/show_bug.cgi?id=502499 -- Geoffrey D. Jacobs From hahn at mcmaster.ca Fri Mar 26 19:40:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 26 Mar 2010 22:40:06 -0400 (EDT) Subject: [Beowulf] Problems installing HPL 2.0 In-Reply-To: References: Message-ID: > Can somebody help me on a problem I am experiencing when trying to install >HPL 2.0 in our system? The error comes as the HPL_dlamch.c isn?t working >because it can?t find the hpl.h include file. The file already exists in >$(TOPdir)/include/hpl.h. Do I am missing something? the makefile isn't passing the -I when compiling that file: > /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlange.o -c -DHPL_CALL_CBLAS -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include -I/niaidAdmin/apps/hpl-2.0/include/Linux_x86_64 -I/usr/include -I/usr/include/openmpi -pipe -O3 -funroll-loops ../HPL_dlange.c > /usr/lib64/openmpi/1.3.2-gcc/bin/mpicc -o HPL_dlamch.o -c ../HPL_dlamch.c > ../HPL_dlamch.c:50:17: error: hpl.h: No such file or directory everything between the -DHPL_CALL... and -funroll-loops is missing. you need to look at the makefile in src/auxil/niaid... > I have added the directory /myApps/hpl-2.0/include into my LD_LIBRARY_PATH > without any result. thank goodness! ;) that would make no sense, since LD_LIBRARY_PATH is strictly a runtime/library thing, nothing to do with compile-time/headers. > Disclaimer: The information in this e-mail and any of its attachments is >confidential and may contain sensitive information. It should not be used by >anyone who is not the original intended recipient. If you have received this >e-mail in error please inform the sender and delete it from your mailbox or >any other storage devices. National Institute of Allergy and Infectious >Diseases shall not accept liability for any statements made that are >sender's own and not expressly made on behalf of the NIAID by one of its >representatives. I assume you know that this boilerplate is completely meaningless... regards, mark hahn. From mathog at caltech.edu Mon Mar 29 10:55:29 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 29 Mar 2010 10:55:29 -0700 Subject: [Beowulf] mysterious slow SATA on one machine Message-ID: > David Mathog wrote: >Raw results from various experiments here: > > http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf > http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf > Some progress, see updated files above. lspci showed there were two devices on the bus where the Sil 3114 was located, it and the ATI Rage VGA controller. It also showed the pci latency for the former was 32 and the latter 66. The VGA controller is currently running in text mode, without the atyfb module loaded, and with nothing happening on it (it just has the text login prompt dislayed), and changing its latency up or down makes no difference to disk speed (not shown in the files cited above). 
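For anyone reproducing this kind of diagnosis, here is a short sketch of the two checks discussed in this thread: NCQ support and PCI latency timers. The names below are placeholders (the controller is assumed at PCI address 01:07.0 and the disk at /dev/sda; take the real ones from lspci and dmesg), and note that setpci expects hexadecimal values while lspci prints decimal:

  # Is NCQ negotiated? A queue depth of 1 usually means no NCQ on this controller/driver.
  hdparm -I /dev/sda | grep -i queue
  cat /sys/block/sda/device/queue_depth

  # Show the controller and its current PCI latency timer (decimal in lspci output).
  lspci -v -s 01:07.0 | grep -i latency

  # Read, then raise, the latency timer with setpci (hex: 0x90 = 144 PCI clocks).
  setpci -s 01:07.0 latency_timer
  setpci -s 01:07.0 latency_timer=90

On a controller whose driver does not implement NCQ, such as the Sil 3114 mentioned above, the queue depth will typically read 1 even for NCQ-capable drives.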
The machine was started once with the VGA jumpered off, but it didn't boot, so it wasn't possible to completely remove it from the equation. However, increasing the pci_latency on the Sil 3114 as far as it would go (to 144 as shown in lspci) with setpci -s '01:07.0' latency_timer=99 made a considerable difference. For the older disk it brought it up to approximately the same speed as on the nvidia ck804 controller. For the WD10EARS it sped things up about 50%, but didn't manage to match the nvidia controller, possibly because of the absence of the ncq mentioned previously in this thread. Also on the 3114 it topped out at around 106MB/sec for the fastest bonnie++ applications, and that is pretty close to the 132MB/s limit on the 32 bit PCI bus, whereas on the ck804 peaks were 140MB/s, which would be more than the bus can carry. Assuming the PCI on the Arima board is really 33 MHz, like the manual says, and not 66 MHz, as lspci reports. Vibration may still be playing some role, but apparently that wasn't the primary problem. I may still put in a PCI-X SATA controller though, as there is still another 40% performance to go on the WD drive, and that will provide enough bus bandwidth to support that. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From cousins at umit.maine.edu Mon Mar 29 12:51:42 2010 From: cousins at umit.maine.edu (Steve Cousins) Date: Mon, 29 Mar 2010 15:51:42 -0400 (EDT) Subject: [Beowulf] HP 10 GbE card use/warranty Message-ID: I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have been trying to get them to work in non-HP Linux systems. After failing to be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't find it) I installed one of the supported distributions (or close enough): CentOS 5. Driver installation went fine and I was able to get one of the cards to work. The other one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also, the Activity LED is constant but no Link LED lights up. Since I can't get the eth device loaded I can't update the firmware. So, I have one good one and one bad one. Pretty clear-cut. I've tried to get technical support from HP to RMA it and I end up in no-mans land. It seems that they do not have support for peripherals like this. They consider this part of a system and the only systems they seem to think these go in are their Proliant servers so I end up at Proliant tech support. But I have no serial number for a Proliant server so it is a dead end. I tried "Customer Satisfaction" and got none. They insist that these cards are *only* for HP Proliant servers but I have not seen any indication of this at the sites I go to to buy this type of card. I bought this from a reseller when we bought a bunch of Procurve equipment with some 10 GbE modules for a switch. The reseller is seeing what he can do but in the mean time I thought I'd check here to see if anyone has run into this sort of thing from HP. I've always had good luck with HP but it has mainly been from the Procurve section and they won't touch this either. Anyone running HP cards in non-HP equipment? Any tips on getting an RMA for this? I'm thinking of trying to track down an HP server on campus to work with just to get the RMA but I doubt if there are too many of these that aren't in use and still under warranty. 
Thanks, Steve ______________________________________________________________________ Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473 (207) 581-4302 ~ cousins at umit.maine.edu ~ (207) 866-6552 From chekh at pcbi.upenn.edu Mon Mar 29 13:28:53 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Mon, 29 Mar 2010 16:28:53 -0400 Subject: [Beowulf] HP 10 GbE card use/warranty In-Reply-To: References: Message-ID: <20100329162853.aa18be0b.chekh@pcbi.upenn.edu> Hi Steve, I can report that I had the same problem. I have a bunch of NC510c cards in IBM X3650 servers. One broke, with the symptom that it would just hang any machine I put it into. Luckily (or with great foresight), I had previously ordered a spare to have on hand. At one point, the HP tech on the line was able to find the serial number in the HP system somewhere but only confirm that my card was out of warranty (it was manufactured long before we purchased it). In the end, our VAR was able to convince their distributor to mail me a new working card, free. Took a couple of weeks. Here, I can even copy in our e-mail chain, somewhat sanitized: ***** From: helpful VAR Jeff To: Alex Chekholko Subject: RE: FW: bad network card Date: Thu, 23 Oct 2008 08:46:25 -0400 Good morning Alex, We finally got the distributor to ship you a new card - it should be shipping today. Sorry for all the trouble. ________________________________________ From: Alex Chekholko [chekh at pcbi.upenn.edu] Sent: Monday, October 20, 2008 12:15 PM To: helpful VAR Jeff Cc: manager at genomics.upenn.edu Subject: Re: FW: bad network card Hi Jeff, Tech support line that I used is 1-800-474-6836 The last time I spoke to HP on Oct 8th, I ended up in the queue Tech Support -> Networking -> ProCurve This time I spoke to HP, I ended up in the queue Tech Support -> Servers -> Proliant They couldn't find any record of that card anywhere in their system and said I have to get a replacement through the reseller the card was purchased from and recommended I call 1-888-943-8476 (turned out to be a "customer satisfaction" line). I called back again and went to Tech Support -> Networking -> ProCurve They initially couldn't find it in the system, until I recollected that they called it a "server adapter" last time; that allowed them to find it, and then they dumped me back to the initial menu. Apparently, the serial number doesn't help, and the part number doesn't show up in the system (for the ProCurve folks). Then I called the "customer satisfaction" line, who suggested the HP Parts Store (1-800-227-8164 option 2) and transferred me there. They said that Tech Support is the only place that could authorize warranty replacement, and were unable to look up the order numbers listed below, but connected me to "HW Support Orders" department (1-800-525-7104) who then forwarded me to the front menu of Tech Support. Total time, 2hrs. Good luck. On the UPenn side, the PO number is 2010394. Regards, Alex On Mon, 13 Oct 2008 17:33:53 -0400 helpful VAR wrote: > Alex, > > Per note below can you try HP support one more time? Sorry for the trouble! Thanks. > > -----Original Message----- > From: distributor "arrow" On Behalf Of ISS Team > Sent: Monday, October 13, 2008 5:18 PM > To: helpful VAR Jeff > Cc: other arrow folks > Subject: RE: bad network card > > Hi Jeff. 
We came up with, we hope, information that will prove to HP > that the card is under warranty. It was ordered through Synnex and the > order numbers are 24927045 and 26715052. They shipped from Synnex on > March 20, 2008. > > Part# 414129-B21 (they ordered a qty of 3) > HP NC510C PCIE 10 GIGABIT SERVER ADAPTER > Reseller Info: > helpful VAR... > > End-User Info: > > our address > PHILADELPHIA, PA 19104 > > Please have the end user call HP and give them this information. Please > get back with me if HP still won't replace it. > > Thanks. > > ***** On Mon, 29 Mar 2010 15:51:42 -0400 (EDT) Steve Cousins wrote: > > I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have > been trying to get them to work in non-HP Linux systems. After failing to > be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks > to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't > find it) I installed one of the supported distributions (or close enough): > CentOS 5. Driver installation went fine and I was able to get one of the > cards to work. The other one seems to be bad. The nx_nic driver loads but > no eth2 device shows up. Also, the Activity LED is constant but no Link > LED lights up. Since I can't get the eth device loaded I can't update the > firmware. > > So, I have one good one and one bad one. Pretty clear-cut. I've tried to > get technical support from HP to RMA it and I end up in no-mans land. It > seems that they do not have support for peripherals like this. They > consider this part of a system and the only systems they seem to think > these go in are their Proliant servers so I end up at Proliant tech > support. But I have no serial number for a Proliant server so it is a dead > end. > > I tried "Customer Satisfaction" and got none. > > They insist that these cards are *only* for HP Proliant servers but I have > not seen any indication of this at the sites I go to to buy this type of > card. I bought this from a reseller when we bought a bunch of Procurve > equipment with some 10 GbE modules for a switch. The reseller is seeing > what he can do but in the mean time I thought I'd check here to see if > anyone has run into this sort of thing from HP. I've always had good luck > with HP but it has mainly been from the Procurve section and they won't > touch this either. > > Anyone running HP cards in non-HP equipment? Any tips on getting an RMA > for this? I'm thinking of trying to track down an HP server on campus to > work with just to get the RMA but I doubt if there are too many of these > that aren't in use and still under warranty. 
> > Thanks, > > Steve > ______________________________________________________________________ > Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine > Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive > Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473 > (207) 581-4302 ~ cousins at umit.maine.edu ~ (207) 866-6552 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Alex Chekholko chekh at pcbi.upenn.edu From mathog at caltech.edu Mon Mar 29 16:33:18 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 29 Mar 2010 16:33:18 -0700 Subject: [Beowulf] mysterious slow SATA on one machine Message-ID: > > David Mathog wrote: > >Raw results from various experiments here: > > > > http://saf.bio.caltech.edu/pub/pickup/bonnie++.rtf > > http://saf.bio.caltech.edu/pub/pickup/sustained_write.rtf > > > > Some progress, see updated files above. And a step back... With the latency set to 22 on the VGA, and 144 on the Sil 3114, three consecutive boots varying (and this is probably a red herring) only the type of partition 5 (swap partition, which is the first logical partition in the one extended partition, following the first partition, which is both real and the boot partition) Boot Type bonnie++ (-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- line) A 83 48104 91 65832 32 33587 13 48852 89 106061 16 219.9 1 B 82 47572 89 47176 18 19601 6 35147 79 59726 8 188.2 1 C 83 48662 92 65424 26 32901 12 48648 89 105560 15 214.9 1 Conversely, the sustained write test was about the same for all 3 boots, although slightly (5%) faster for A,C than B. Run the bonnie++ test over and over during each uptime and the results come out more or less the same. Reboot, and they changed. Could the partition type really matter? Change it back, reboot, giving D: D 82 48188 90 63984 27 33609 13 47391 89 106234 16 217.9 1 So no, the partition type isn't the story, or at least not the whole story. Something else must be going on. If feels like there is a bit somewhere that is blowing in the wind... Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ebiederm at xmission.com Tue Mar 30 00:00:51 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Tue, 30 Mar 2010 00:00:51 -0700 Subject: [Beowulf] HP 10 GbE card use/warranty In-Reply-To: (Steve Cousins's message of "Mon\, 29 Mar 2010 15\:51\:42 -0400 \(EDT\)") References: Message-ID: Steve Cousins writes: > I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have been > trying to get them to work in non-HP Linux systems. After failing to be able to > comile/install the nx_nic drivers on Fedora 9 and 12 (it checks to see if kernel >>= 2.6.27 and if so looks for net/8021q/vlan.h but can't find it) I suspect that header was simply not packaged in the appropriate rpm. You might be able to get away with commenting out that include. > I installed > one of the supported distributions (or close enough): CentOS 5. Driver > installation went fine and I was able to get one of the cards to work. The other > one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also, > the Activity LED is constant but no Link LED lights up. Since I can't get the > eth device loaded I can't update the firmware. Does fedora not build the in kernel driver? 
They should. Except for occasionally having to flash the firmware up to the latest image I have had good luck with the netxen nics. I don't know anything about the your weird purchase/support situation. Eric From cousins at umit.maine.edu Tue Mar 30 06:24:48 2010 From: cousins at umit.maine.edu (Steve Cousins) Date: Tue, 30 Mar 2010 09:24:48 -0400 (EDT) Subject: [Beowulf] HP 10 GbE card use/warranty In-Reply-To: References: Message-ID: On Tue, 30 Mar 2010, Eric W. Biederman wrote: > Steve Cousins writes: > >>> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have been >>> trying to get them to work in non-HP Linux systems. After failing to be able to >>> comile/install the nx_nic drivers on Fedora 9 and 12 (it checks to see if kernel >>>> = 2.6.27 and if so looks for net/8021q/vlan.h but can't find it) >> > I suspect that header was simply not packaged in the appropriate rpm. > You might be able to get away with commenting out that include. That was the first thing I tried. It lead to a path that got wider as I went. >>> I installed >>> one of the supported distributions (or close enough): CentOS 5. Driver >>> installation went fine and I was able to get one of the cards to work. The other >>> one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also, >>> the Activity LED is constant but no Link LED lights up. Since I can't get the >>> eth device loaded I can't update the firmware. >> > Does fedora not build the in kernel driver? They should. Yes it does. I got basic functionality with it but under large transfers it locks up. No lockups with the nx_nic driver. > Except for occasionally having to flash the firmware up to the latest > image I have had good luck with the netxen nics. I don't know anything > about the your weird purchase/support situation. > Eric > ______________________________________________________________________ Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473 (207) 581-4302 ~ cousins at umit.maine.edu ~ (207) 866-6552 From cousins at umit.maine.edu Tue Mar 30 11:47:59 2010 From: cousins at umit.maine.edu (Steve Cousins) Date: Tue, 30 Mar 2010 14:47:59 -0400 Subject: [Beowulf] HP 10 GbE card use/warranty In-Reply-To: <20100329162853.aa18be0b.chekh@pcbi.upenn.edu> References: <20100329162853.aa18be0b.chekh@pcbi.upenn.edu> Message-ID: Hi Alex, Thanks a lot. Sounds very similar to what I've been doing. I too *luckily* ordered a spare thinking that some day we'd be able to use it in another system. Thanks to some people from the list I've got a lead on some help. Our VAR is still working on it too. I'll let you know either way. I hope this gets the word out that the HP NICs are definitely for HP equipment only. Maybe not functionally but as far as warranty goes keep it in mind. Steve Alex Chekholko writes: >Hi Steve, > >I can report that I had the same problem. I have a bunch of NC510c >cards in IBM X3650 servers. One broke, with the symptom that it would >just hang any machine I put it into. Luckily (or with great >foresight), I had previously ordered a spare to have on hand. At one >point, the HP tech on the line was able to find the serial number in >the HP system somewhere but only confirm that my card was out of >warranty (it was manufactured long before we purchased it). In the >end, our VAR was able to convince their distributor to mail me a new >working card, free. 
Took a couple of weeks. > >Here, I can even copy in our e-mail chain, somewhat sanitized: > > >***** > >From: helpful VAR Jeff >To: Alex Chekholko >Subject: RE: FW: bad network card >Date: Thu, 23 Oct 2008 08:46:25 -0400 > >Good morning Alex, > >We finally got the distributor to ship you a new card - it should be >shipping today. Sorry for all the trouble. > >________________________________________ >From: Alex Chekholko [chekh at pcbi.upenn.edu] >Sent: Monday, October 20, 2008 12:15 PM >To: helpful VAR Jeff >Cc: manager at genomics.upenn.edu >Subject: Re: FW: bad network card > >Hi Jeff, > >Tech support line that I used is 1-800-474-6836 > >The last time I spoke to HP on Oct 8th, I ended up in the queue >Tech Support -> Networking -> ProCurve > >This time I spoke to HP, I ended up in the queue >Tech Support -> Servers -> Proliant > >They couldn't find any record of that card anywhere in their system and >said I have to get a replacement through the reseller the card was >purchased from and recommended I call 1-888-943-8476 (turned out to be >a "customer satisfaction" line). > >I called back again and went to >Tech Support -> Networking -> ProCurve > >They initially couldn't find it in the system, until I recollected that >they called it a "server adapter" last time; that allowed them to find >it, and then they dumped me back to the initial menu. Apparently, the >serial number doesn't help, and the part number doesn't show up in the >system (for the ProCurve folks). > >Then I called the "customer satisfaction" line, who suggested the HP >Parts Store (1-800-227-8164 option 2) and transferred me there. They >said that Tech Support is the only place that could authorize warranty >replacement, and were unable to look up the order numbers listed below, >but connected me to "HW Support Orders" department (1-800-525-7104) who >then forwarded me to the front menu of Tech Support. > >Total time, 2hrs. > >Good luck. > >On the UPenn side, the PO number is 2010394. > >Regards, >Alex > >On Mon, 13 Oct 2008 17:33:53 -0400 >helpful VAR wrote: > >> Alex, >> >> Per note below can you try HP support one more time? Sorry for the trouble! Thanks. >> >> -----Original Message----- >> From: distributor "arrow" On Behalf Of ISS Team >> Sent: Monday, October 13, 2008 5:18 PM >> To: helpful VAR Jeff >> Cc: other arrow folks >> Subject: RE: bad network card >> >> Hi Jeff. We came up with, we hope, information that will prove to HP >> that the card is under warranty. It was ordered through Synnex and the >> order numbers are 24927045 and 26715052. They shipped from Synnex on >> March 20, 2008. >> >> Part# 414129-B21 (they ordered a qty of 3) >> HP NC510C PCIE 10 GIGABIT SERVER ADAPTER >> Reseller Info: >> helpful VAR... >> >> End-User Info: >> >> our address >> PHILADELPHIA, PA 19104 >> >> Please have the end user call HP and give them this information. Please >> get back with me if HP still won't replace it. >> >> Thanks. >> >> > >***** > >On Mon, 29 Mar 2010 15:51:42 -0400 (EDT) >Steve Cousins wrote: > >> >> I have a couple of 10 GbE cards from HP (NC510F, NetXen/QLogic) and I have >> been trying to get them to work in non-HP Linux systems. After failing to >> be able to comile/install the nx_nic drivers on Fedora 9 and 12 (it checks >> to see if kernel >= 2.6.27 and if so looks for net/8021q/vlan.h but can't >> find it) I installed one of the supported distributions (or close enough): >> CentOS 5. Driver installation went fine and I was able to get one of the >> cards to work. The other one seems to be bad. 
The nx_nic driver loads but >> no eth2 device shows up. Also, the Activity LED is constant but no Link >> LED lights up. Since I can't get the eth device loaded I can't update the >> firmware. >> >> So, I have one good one and one bad one. Pretty clear-cut. I've tried to >> get technical support from HP to RMA it and I end up in no-mans land. It >> seems that they do not have support for peripherals like this. They >> consider this part of a system and the only systems they seem to think >> these go in are their Proliant servers so I end up at Proliant tech >> support. But I have no serial number for a Proliant server so it is a dead >> end. >> >> I tried "Customer Satisfaction" and got none. >> >> They insist that these cards are *only* for HP Proliant servers but I have >> not seen any indication of this at the sites I go to to buy this type of >> card. I bought this from a reseller when we bought a bunch of Procurve >> equipment with some 10 GbE modules for a switch. The reseller is seeing >> what he can do but in the mean time I thought I'd check here to see if >> anyone has run into this sort of thing from HP. I've always had good luck >> with HP but it has mainly been from the Procurve section and they won't >> touch this either. >> >> Anyone running HP cards in non-HP equipment? Any tips on getting an RMA >> for this? I'm thinking of trying to track down an HP server on campus to >> work with just to get the RMA but I doubt if there are too many of these >> that aren't in use and still under warranty. >> >> Thanks, >> >> Steve >> ______________________________________________________________________ >> Steve Cousins - Supercomputer Engineer/Administrator - Univ of Maine >> Marine Sciences, 452 Aubert Hall Target Tech, 20 Godfrey Drive >> Orono, ME 04469 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Orono, ME 04473 >> (207) 581-4302 ~ cousins at umit.maine.edu ~ (207) 866-6552 >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > >-- >Alex Chekholko chekh at pcbi.upenn.edu ______________________________________________________________________ Steve Cousins, Ocean Modeling Group Email: cousins at umit.maine.edu Marine Sciences, 452 Aubert Hall http://rocky.umeoce.maine.edu Univ. of Maine, Orono, ME 04469 Phone: (207) 581-4302 From ebiederm at xmission.com Wed Mar 31 05:23:27 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Wed, 31 Mar 2010 05:23:27 -0700 Subject: [Beowulf] HP 10 GbE card use/warranty In-Reply-To: (Steve Cousins's message of "Tue\, 30 Mar 2010 09\:24\:48 -0400 \(EDT\)") References: Message-ID: Steve Cousins writes: > >>>> I installed >>>> one of the supported distributions (or close enough): CentOS 5. Driver >>>> installation went fine and I was able to get one of the cards to work. The other >>>> one seems to be bad. The nx_nic driver loads but no eth2 device shows up. Also, >>>> the Activity LED is constant but no Link LED lights up. Since I can't get the >>>> eth device loaded I can't update the firmware. >>> >> Does fedora not build the in kernel driver? They should. > > > Yes it does. I got basic functionality with it but under large transfers it > locks up. No lockups with the nx_nic driver. You might want to work with the maintainers of the netxen driver in the kernel. That have been fairly responsive when I have worked with them. 
Eric From orion at cora.nwra.com Wed Mar 31 08:27:29 2010 From: orion at cora.nwra.com (Orion Poplawski) Date: Wed, 31 Mar 2010 09:27:29 -0600 Subject: [Beowulf] AMD 6100 vs Intel 5600 Message-ID: <4BB369E1.60906@cora.nwra.com> Looks like it's time to start evaluating the AMD 6100 (magny-cours) offerings versus the Intel 5600 (Nehalem-EX?) offerings. Any suggestions for resources? -- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From smulcahy at atlanticlinux.ie Wed Mar 31 08:36:46 2010 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Wed, 31 Mar 2010 16:36:46 +0100 Subject: [Beowulf] AMD 6100 vs Intel 5600 In-Reply-To: <4BB369E1.60906@cora.nwra.com> References: <4BB369E1.60906@cora.nwra.com> Message-ID: <4BB36C0E.1010502@atlanticlinux.ie> Orion Poplawski wrote: > Looks like it's time to start evaluating the AMD 6100 (magny-cours) > offerings versus the Intel 5600 (Nehalem-EX?) offerings. Any > suggestions for resources? > http://www.anandtech.com/show/2978/amd-s-12-core-magny-cours-opteron-6174-vs-intel-s-6-core-xeon/10 has some benchmarks as a starting point. Would be interested to hear from others with more HPC-oriented benchmark results. -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From kilian.cavalotti.work at gmail.com Wed Mar 31 10:37:01 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 31 Mar 2010 19:37:01 +0200 Subject: [Beowulf] AMD 6100 vs Intel 5600 In-Reply-To: <4BB369E1.60906@cora.nwra.com> References: <4BB369E1.60906@cora.nwra.com> Message-ID: On Wed, Mar 31, 2010 at 5:27 PM, Orion Poplawski wrote: > Looks like it's time to start evaluating the AMD 6100 (magny-cours) > offerings versus the Intel 5600 (Nehalem-EX?) offerings. ?Any suggestions > for resources? Just for the sake of precision, Intel 5600 series was codenamed Westmere (dual-socket, 32nm, 6-cores, 3 memory channels). Intel 7500 series was codenamed Beckton, aka Nehalem-EX (quad-socket and beyond, 45nm, 8-cores, 4 memory-channels). I would say that the 2x6-cores Magny-Cours probably has to be compared to Nehalem-EX. Some SPEC results are being posted on http://www.spec.org/cpu2006/results/res2010q1/ Cheers, -- Kilian From bill at cse.ucdavis.edu Wed Mar 31 13:51:18 2010 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 31 Mar 2010 13:51:18 -0700 Subject: [Beowulf] AMD 6100 vs Intel 5600 In-Reply-To: References: <4BB369E1.60906@cora.nwra.com> Message-ID: <4BB3B5C6.8060001@cse.ucdavis.edu> On 03/31/2010 10:37 AM, Kilian CAVALOTTI wrote: > On Wed, Mar 31, 2010 at 5:27 PM, Orion Poplawski wrote: >> Looks like it's time to start evaluating the AMD 6100 (magny-cours) >> offerings versus the Intel 5600 (Nehalem-EX?) offerings. Any suggestions >> for resources? > > Just for the sake of precision, Intel 5600 series was codenamed > Westmere (dual-socket, 32nm, 6-cores, 3 memory channels). Intel 7500 > series was codenamed Beckton, aka Nehalem-EX (quad-socket and beyond, > 45nm, 8-cores, 4 memory-channels). > > I would say that the 2x6-cores Magny-Cours probably has to be compared > to Nehalem-EX. Why? Various vendors try various strategies to differentiate products based on features. For the most part HPC types care about performance per $, performance per watt, and reliability. I'd be pretty surprised to see large HPC cluster built out of Nehalem-EX chips. 
Sure, large NUMA machines from SGI, or HA clusters for running oracle and related business critical applications. The best price/perf from Intel looks to be the 5600, and the best from AMD is the Magny-Cours. Granted these are from AMD but: http://www.amd.com/us/products/server/benchmarks/Pages/memory-bandwidth-stream-two-socket-servers.aspx http://www.amd.com/us/products/server/processors/six-core-opteron/Pages/SPECfp-rate2006-two-socket-servers.aspx http://www.amd.com/us/products/server/processors/six-core-opteron/Pages/SPECint-rate-2006-two-socket-servers.aspx Of course this is all hand waving without system prices though. I have to say I've been pleasantly surprised. At a Supermicro reseller I configured 2 reasonable compute nodes for a particular application and came up with: 1U dual Opteron 6128 (8 core x 2.0 GHz), 32GB DDR3-1333, 2x1TB, IPMI 2.0 = $3802 1U dual Xeon X5650 (6 core x 2.6 GHz), 24GB DDR3-1333, 2x1TB, IPMI 2.0 = $4639 Granted IPC isn't the same, but I was amused to see AMD offering 16 2.0 GHz cores = 32 GHz, and the Intel config had 12 2.6 GHz cores = 31.2 GHz. I've yet to get an account on the new AMD chips to measure our actual application performance, but I have to say AMD looks pretty good at the moment. I figured maybe the 6-core Intel is priced artificially high so I tried a 4 core: 1U Xeon 5620 (4 core x 2.4 GHz), 24GB DDR3-1333, 2x1TB, IPMI 2.0 = $3293 So $3,293 for the cheaper Intel, or pay $509 to upgrade to the AMD and get another 8GB RAM and double the cores. Granted they are at 2.0 GHz instead of 2.4. Seems like AMD's offering more memory bandwidth and SPECfp rate per dollar. Certainly enough to have me looking for an account to measure performance on our codes. From cbergstrom at pathscale.com Wed Mar 31 14:16:30 2010 From: cbergstrom at pathscale.com ("C. Bergström") Date: Thu, 01 Apr 2010 04:16:30 +0700 Subject: [Beowulf] AMD 6100 vs Intel 5600 In-Reply-To: <4BB3B5C6.8060001@cse.ucdavis.edu> References: <4BB369E1.60906@cora.nwra.com> <4BB3B5C6.8060001@cse.ucdavis.edu> Message-ID: <4BB3BBAE.5070802@pathscale.com> Bill Broadley wrote: > ... > Seems like AMD's offering more memory bandwidth and SPECfp rate per > dollar. Certainly enough to have me looking for an account to measure > performance on our codes. > While for my own selfish reasons I'm happy AMD may have some chance at a comeback... I caution everyone to please ignore SPEC* as any indicator of performance. This will be especially true for any benchmarks based on AMD's compiler. Your code will always be the best benchmark and I'm happy to assist anyone offlist that needs help getting unbiased numbers. Best, ./C #pathscale - irc.freenode.net CTOPathScale - twitter From malexand at scaledinfra.com Tue Mar 30 13:30:20 2010 From: malexand at scaledinfra.com (Michael Alexander) Date: Tue, 30 Mar 2010 22:30:20 +0200 Subject: [Beowulf] CfP with Extended Deadline 5th Workshop on Virtualization in High-Performance Cloud Computing (VHPC'10) Message-ID: Apologies if you received multiple copies of this message.
================================================================= CALL FOR PAPERS 5th Workshop on Virtualization in High-Performance Cloud Computing VHPC'10 as part of Euro-Par 2010, Island of Ischia-Naples, Italy ================================================================= Date: August 31, 2010 Euro-Par 2009: http://www.europar2010.org/ Workshop URL: http://vhpc.org SUBMISSION DEADLINE: Abstracts: April 4, 2010 (extended) Full Paper: June 19, 2010 (extended) Scope: Virtualization has become a common abstraction layer in modern data centers, enabling resource owners to manage complex infrastructure independently of their applications. Conjointly virtualization is becoming a driving technology for a manifold of industry grade IT services. Piloted by the Amazon Elastic Computing Cloud services, the cloud concept includes the notion of a separation between resource owners and users, adding services such as hosted application frameworks and queuing. Utilizing the same infrastructure, clouds carry significant potential for use in high-performance scientific computing. The ability of clouds to provide for requests and releases of vast computing resource dynamically and close to the marginal cost of providing the services is unprecedented in the history of scientific and commercial computing. Distributed computing concepts that leverage federated resource access are popular within the grid community, but have not seen previously desired deployed levels so far. Also, many of the scientific datacenters have not adopted virtualization or cloud concepts yet. This workshop aims to bring together industrial providers with the scientific community in order to foster discussion, collaboration and mutual exchange of knowledge and experience. The workshop will be one day in length, composed of 20 min paper presentations, each followed by 10 min discussion sections. Presentations may be accompanied by interactive demonstrations. It concludes with a 30 min panel discussion by presenters. TOPICS Topics include, but are not limited to, the following subjects: - Virtualization in cloud, cluster and grid HPC environments - VM cloud, cluster load distribution algorithms - Cloud, cluster and grid filesystems - QoS and and service level guarantees - Cloud programming models, APIs and databases - Software as a service (SaaS) - Cloud provisioning - Virtualized I/O - VMMs and storage virtualization - MPI, PVM on virtual machines - High-performance network virtualization - High-speed interconnects - Hypervisor extensions - Tools for cluster and grid computing - Xen/other VMM cloud/cluster/grid tools - Raw device access from VMs - Cloud reliability, fault-tolerance, and security - Cloud load balancing - VMs - power efficiency - Network architectures for VM-based environments - VMMs/Hypervisors - Hardware support for virtualization - Fault tolerant VM environments - Workload characterizations for VM-based environments - Bottleneck management - Metering - VM-based cloud performance modeling - Cloud security, access control and data integrity - Performance management and tuning hosts and guest VMs - VMM performance tuning on various load types - Research and education use cases - Cloud use cases - Management of VM environments and clouds - Deployment of VM-based environments PAPER SUBMISSION Papers submitted to the workshop will be reviewed by at least two members of the program committee and external reviewers. 
Submissions should include abstract, key words, the e-mail address of the corresponding author, and must not exceed 10 pages, including tables and figures at a main font size no smaller than 11 point. Submission of a paper should be regarded as a commitment that, should the paper be accepted, at least one of the authors will register and attend the conference to present the work. Accepted papers will be published in the Springer LNCS series - the format must be according to the Springer LNCS Style. Initial submissions are in PDF, accepted papers will be requested to provided source files. Format Guidelines: http://www.springer.de/comp/lncs/authors.html Submission Link: http://edas.info/newPaper.php?c=8553 IMPORTANT DATES April 4 - Abstract submission due (extended) May 19 - Full paper submission (extended) July 14 - Acceptance notification August 3 - Camera-ready version due August 31 - September 3 - conference CHAIR Michael Alexander (chair), scaledinfra technologies GmbH, Austria Gianluigi Zanetti (co-chair), CRS4, Italy PROGRAM COMMITTEE Padmashree Apparao, Intel Corp., USA Volker Buege, University of Karlsruhe, Germany Roberto Canonico, University of Napoli Federico II, Italy Tommaso Cucinotta, Scuola Superiore Sant'Anna, Italy Werner Fischer, Thomas Krenn AG, Germany William Gardner, University of Guelph, Canada Wolfgang Gentzsch, DEISA. Max Planck Gesellschaft, Germany Derek Groen, UVA, The Netherlands Marcus Hardt, Forschungszentrum Karlsruhe, Germany Sverre Jarp, CERN, Switzerland Shantenu Jha, Louisiana State University, USA Xuxian Jiang, NC State, USA Kenji Kaneda, Google, Japan Yves Kemp, DESY Hamburg, Germany Ignacio Llorente, Universidad Complutense de Madrid, Spain Naoya Maruyama, Tokyo Institute of Technology, Japan Jean-Marc Menaud, Ecole des Mines de Nantes, France Anastassios Nano, National Technical University of Athens, Greece Oliver Oberst, Karlsruhe Institute of Technology, Germany Jose Renato Santos, HP Labs, USA Borja Sotomayor, University of Chicago, USA Yoshio Turner, HP Labs, USA Kurt Tuschku, University of Vienna, Austria Lizhe Wang, Indiana University, USA Chao-Tung Yang, Tunghai University, Taiwan DURATION: Workshop Duration is one day. GENERAL INFORMATION The workshop will be held as part of Euro-Par 2010, Island of Ischia-Naples, Italy. Euro-Par 2010: http://www.europar2010.org/ From kilian.cavalotti.work at gmail.com Wed Mar 31 23:36:37 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 1 Apr 2010 08:36:37 +0200 Subject: [Beowulf] AMD 6100 vs Intel 5600 In-Reply-To: <4BB3B5C6.8060001@cse.ucdavis.edu> References: <4BB369E1.60906@cora.nwra.com> <4BB3B5C6.8060001@cse.ucdavis.edu> Message-ID: On Wed, Mar 31, 2010 at 10:51 PM, Bill Broadley wrote: >> I would say that the 2x6-cores Magny-Cours probably has to be compared >> to Nehalem-EX. > > Why? Maybe first because that's where the core spaces from AMD and Intel intersect (8-cores Beckton and 8-cores Magny-Cours). I'm not sure it's really significant to compare performance between a 6-cores Westmere and a 12-cores Magny-Cours. I feel it makes more sense to compare apples to apples, ie. same core count. And then, also maybe because they are the same MP class, not dual-socket only. Meaning there are similarly equipped in terms of memory channels and inter-CPU links (QPI or HT), to be associated in platforms of 4 or more. > Various vendors try various strategies to differentiate products based > on features. 
?For the most part HPC types care about performance per $, > performance per watt, and reliability. ?I'd be pretty surprised to see large > HPC cluster built out of Nehalem-EX chips. Not entirely built out of Nehalem-EX, probably, but including a fair share of this newly coming (again) SMP machines, I have no doubt. Users, both academic and from the industry, have more and more needs for huge amounts of memory, that can not easily be met using the distributed memory approach. Nehalem-EX and Magny-Cours offers just that, hundreds of GB or RAM. I know some people drooling right now at the idea of putting their hands on a 1TB machine. I'm totally in line with you on the price/perf points you made, though. Cheers, -- Kilian