From richard.walsh at comcast.net Mon Feb 1 07:24:13 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Mon, 1 Feb 2010 15:24:13 +0000 (UTC) Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: Message-ID: <724359053.1362061265037853168.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> David Mathog wrote: >Jon Forrest wrote: > >> Are there any other issues I'm leaving out? > >Yes, the time and expense of rewriting your code from a CPU model to a >GPU model, and the learning curve for picking up this new skill. (Unless >you are lucky and somebody has already ported the software you use.) Coming in on this late, but to reduce this work load there is PGI's version 10.0 compiler suite which supports accelerator compiler directives. This will reduce the coding effort, but probably suffer from the classical "if it is easy, it won't perform as well" trade-off. My experience is limited, but a nice intro can be found at: http://www.pgroup.com/lit/articles/insider/v1n1a1.htm You might also inquire with PGI about their SC09 course and class notes or Google for them. rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shainer at mellanox.com Mon Feb 1 10:24:14 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Mon, 1 Feb 2010 10:24:14 -0800 Subject: [Beowulf] HPC Advisory Council: The 2010 (March 15-17) Switzerland HPC Conference Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F025D8266@mtiexch01.mti.com> Sending on behalf of the HPC Advisory Council. The HPC Advisory Council will hold the Switzerland HPC Workshop in March 2010 together with the Swiss National Supercomputing Centre (www.cscs.ch). The workshop will be dedicated to HPC training (interconnect architecture and advanced features, network management, HPC storage, CPU technologies, High performance visualization, accelerators and more). This is an excellent training and education opportunity for HPC IT professionals. The 3-day conference is free for attendees but registration is required. The workshop will include coffee breaks, lunch and evening events courtesy of the HPC Advisory Council. More info on the workshop can be found on the workshop web site - http://www.hpcadvisorycouncil.com/events/switzerland_workshop/. For registration, please use http://www.hpcadvisorycouncil.com/events/switzerland_workshop/attendee_reg.php. Thanks, Gilad From jlforrest at berkeley.edu Mon Feb 1 11:53:30 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Mon, 01 Feb 2010 11:53:30 -0800 Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: <724359053.1362061265037853168.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <724359053.1362061265037853168.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4B67313A.7090009@berkeley.edu> On 2/1/2010 7:24 AM, richard.walsh at comcast.net wrote: > Coming in on this late, but to reduce this work load there is PGI's version > 10.0 compiler suite which supports accelerator compiler directives. This > will reduce the coding effort, but probably suffer from the classical > "if it is > easy, it won't perform as well" trade-off. My experience is limited, but > a nice intro can be found at: I'm not sure how much traction such a thing will get. Let's say you have a big Fortran program that you want to port to CUDA. Let's assume you already know where the program spends its time, so you know which routines are good candidates for running on the GPU. Rather than rewriting the whole program in C[++], wouldn't it be easiest to leave all the non-CUDA parts of the program in Fortran, and then to call CUDA routines written in C[++]. Since the CUDA routines will have to be rewritten anyway, why write them in a language which would require purchasing yet another compiler? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From richard.walsh at comcast.net Mon Feb 1 12:54:45 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Mon, 1 Feb 2010 20:54:45 +0000 (UTC) Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: <4B67313A.7090009@berkeley.edu> Message-ID: <186160193.1523781265057685252.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Jon Forrest wrote: >On 2/1/2010 7:24 AM, richard.walsh at comcast.net wrote: > >> Coming in on this late, but to reduce this work load there is PGI's version >> 10.0 compiler suite which supports accelerator compiler directives. This >> will reduce the coding effort, but probably suffer from the classical >> "if it is >> easy, it won't perform as well" trade-off. My experience is limited, but >> a nice intro can be found at: > >I'm not sure how much traction such a thing will get. >Let's say you have a big Fortran program that you want >to port to CUDA. Let's assume you already know where the >program spends its time, so you know which routines >are good candidates for running on the GPU. > >Rather than rewriting the whole program in C[++], >wouldn't it be easiest to leave all the non-CUDA >parts of the program in Fortran, and then to call >CUDA routines written in C[++]. Since the CUDA >routines will have to be rewritten anyway, why >write them in a language which would require >purchasing yet another compiler? Mmm ... not sure I understand the response, but perhaps this response was to a different message ... ?? In any case, the PGI software supports accelerator directives for both C and Fortran, so for those languages I do not see a need to rewrite whole applications. The question presented is the same as always, what does the performance-programming effort function look like and how well does your code perform with directives to start with. The PGI models is also hardware generic and the code runs on the CPU in parallel when there is no GPU around I believe. What will gate interest is how well PGI compiler group does at delivering performance and how important portability is to the person developing the code. HMPP make offers a similar proposition ... rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From michf at post.tau.ac.il Mon Feb 1 15:56:44 2010 From: michf at post.tau.ac.il (Micha) Date: Tue, 02 Feb 2010 01:56:44 +0200 Subject: [Beowulf] Re: GPU Beowulf Clusters In-Reply-To: <186160193.1523781265057685252.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <186160193.1523781265057685252.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4B676A3C.20504@post.tau.ac.il> On 01/02/2010 22:54, richard.walsh at comcast.net wrote: > > Jon Forrest wrote: > > >On 2/1/2010 7:24 AM, richard.walsh at comcast.net wrote: > > > >> Coming in on this late, but to reduce this work load there is PGI's > version > >> 10.0 compiler suite which supports accelerator compiler directives. This > >> will reduce the coding effort, but probably suffer from the classical > >> "if it is > >> easy, it won't perform as well" trade-off. My experience is limited, but > >> a nice intro can be found at: > > > >I'm not sure how much traction such a thing will get. > >Let's say you have a big Fortran program that you want > >to port to CUDA. Let's assume you already know where the > >program spends its time, so you know which routines > >are good candidates for running on the GPU. > > > >Rather than rewriting the whole program in C[++], > >wouldn't it be easiest to leave all the non-CUDA > >parts of the program in Fortran, and then to call > >CUDA routines written in C[++]. Since the CUDA > >routines will have to be rewritten anyway, why > >write them in a language which would require > >purchasing yet another compiler? > > Mmm ... not sure I understand the response, but perhaps this response > was to a different message ... ?? In any case, the PGI software supports > accelerator directives for both C and Fortran, so for those languages I do > not see a need to rewrite whole applications. The question presented is > the same as always, what does the performance-programming effort function > look like and how well does your code perform with directives to start > with. The PGI models is also hardware generic and the code runs on > the CPU in parallel when there is no GPU around I believe. What will > gate interest is how well PGI compiler group does at delivering performance > and how important portability is to the person developing the code. > As far as I know pgi also has a Cuda Fortran similar to cuda c, not only a directive based approach, but I have to admit that I don't have any experience with it. As for why spend money on a compiler since the code has to be re-written. Even an expensive compiler is cheap with regards to a programmer's time. Even for the salary of a cheap programmer you can buy the compiler in at most two weeks salary's worth. On the other hand, you have a programmer that already knows fortran and a piece of code that is already written and debugged in fortran. Quite a few programs can produce a first unoptimized version with very little work. Just sorting through counter based bugs and memory order bugs can cost you a lot more than the compiler. Fortran is 1 based compared to c that is 0 based (actually fortran 90/95 can use any index range for matrices). Fortran is column order while c is row order. Do you know how much head ache that can bring into the porting? Translating matlab code into fortran is also much easier that into c due to these issues. > HMPP make offers a similar proposition ... > > rbw > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From vanw+beowulf at sabalcore.com Mon Feb 1 15:44:39 2010 From: vanw+beowulf at sabalcore.com (Kevin Van Workum) Date: Mon, 1 Feb 2010 18:44:39 -0500 Subject: [Beowulf] sysstat experience Message-ID: <643a61251002011544g6e44d79by86911a3a673d8f80@mail.gmail.com> Does anyone have any experience using the sysstat tool? What are your opinions? My basic concern is its safety. I'd also like to know if you think it gives good results. sysstat is at: http://pagesperso-orange.fr/sebastien.godard/ -- Kevin Van Workum, PhD Sabalcore Computing Inc. Run your code on 500 processors. Sign up for a free trial account. www.sabalcore.com 877-492-8027 ext. 11 -------------- next part -------------- An HTML attachment was scrubbed... URL: From tegner at renget.se Mon Feb 1 23:45:56 2010 From: tegner at renget.se (tegner at renget.se) Date: Tue, 2 Feb 2010 08:45:56 +0100 Subject: [Beowulf] (no subject) Message-ID: This will boil down to a questions eventually, but I need to give some background first. We are a small group doing CFD, and when we several years ago realized that beowulfs would be the right choice for us we decided to extend our computational capabilities gradually. Every year, or every second year we bought two gigabit switches and a bunch of nodes connected to these switches. One of the switches is used for mpi communications and one for connecting the nodes to a fileserver and a master node. As of today we have five "subclusters", all connected to the same filserver and master node (torque/maui is used to distribute the jobs on the different subclusters). This has worked out great for us, and we do believe the strategy of buying gradually has been advantageous to us (instead of doing larger purchases less often), and we want to continue extending our hardware in this fashion. Up till now we have not been hurt by the fact that we have a single fileserver (connected to a bunch of raided drives), but we anticipate there will be issues when we further extend the number of nodes. And we plan on building a separate "infiniband storage network" (consisting of a 24 DDR switch) and connect a number of "gluster nodes" to it. Each subcluster will then be connected to this "infiniband storage network" via one (or maybe several) ports. However, we will still limit the jobs to run within there separate subcluster and we are going to accept lower bandwidth between the subclusters. By doing this we gain the following: (i) We can get more computational nodes, since we are limiting the number of ports used to connect the switches to each other. (ii) For our application I/O is not as demanding as the "mpi-communiction" but we are still getting - hopefully - acceptable I/O performance. (iii) We can extend our storage by adding more "gluster nodes" to the "infiniband storage network" when needed. (iv) We can continue adding subclusters when we have the money. And we can also remove old ones when they "cost" too much (in terms of electricity/performance, maintenance etc.). Since we havent worked with infiniband before, the question is simply if there could be issues with this approach? Regards, and thanks, /jon From tegner at renget.se Tue Feb 2 00:36:36 2010 From: tegner at renget.se (tegner at renget.se) Date: Tue, 2 Feb 2010 09:36:36 +0100 Subject: [Beowulf] storage solution/investment strategy Message-ID: Please excuse me, I forgot to put in the subject. Probably best to just disregard my previous post (content is the same). /jon This will boil down to a questions eventually, but I need to give some background first. We are a small group doing CFD, and when we several years ago realized that beowulfs would be the right choice for us we decided to extend our computational capabilities gradually. Every year, or every second year we bought two gigabit switches and a bunch of nodes connected to these switches. One of the switches is used for mpi communications and one for connecting the nodes to a fileserver and a master node. As of today we have five "subclusters", all connected to the same filserver and master node (torque/maui is used to distribute the jobs on the different subclusters). This has worked out great for us, and we do believe the strategy of buying gradually has been advantageous to us (instead of doing larger purchases less often), and we want to continue extending our hardware in this fashion. Up till now we have not been hurt by the fact that we have a single fileserver (connected to a bunch of raided drives), but we anticipate there will be issues when we further extend the number of nodes. And we plan on building a separate "infiniband storage network" (consisting of a 24 DDR switch) and connect a number of "gluster nodes" to it. Each subcluster will then be connected to this "infiniband storage network" via one (or maybe several) ports. However, we will still limit the jobs to run within there separate subcluster and we are going to accept lower bandwidth between the subclusters. By doing this we gain the following: (i) We can get more computational nodes, since we are limiting the number of ports used to connect the switches to each other. (ii) For our application I/O is not as demanding as the "mpi-communiction" but we are still getting (hopefully) acceptable I/O performance. (iii) We can extend our storage by adding more "gluster nodes" to the "infiniband storage network" when needed. (iv) We can continue adding subclusters when we have the money. And we can also remove old ones when they "cost" too much (in terms of electricity/performance, maintenance etc.). Since we havent worked with infiniband before, the question is simply if there could be issues with this approach? Regards, and thanks, /jon From diep at xs4all.nl Tue Feb 2 01:41:44 2010 From: diep at xs4all.nl (Vincent Diepeveen) Date: Tue, 2 Feb 2010 10:41:44 +0100 Subject: [Beowulf] hardware question - which PSU for this? Message-ID: <971A1CC5-ABBD-4E02-8010-1260805E2DD1@xs4all.nl> hi, This seems ideal mainboard for beowulf clusters. built in infiniband it seems. http://cgi.ebay.com/Arima-AMD-Opteron-Quad-Core-Socket-F-3000-series- Server_W0QQitemZ390149471460QQcmdZViewItemQQptZCOMP_EN_Networking_Compon ents?hash=item5ad6b87ce4 they get offered regurarly and cheap. on ebay a while ago 10 of 'em got offered: http://cgi.ebay.com/ws/eBayISAPI.dll? ViewItem&item=360198390884&ssPageName=ADME:X:RTQ:US:1123 The problem is no manufacturer lists these boards and which PSU fits on it and what cpu's is unclear. that's not a good way to build a cluster huh? well suppose there is a psu that works, then that's ideal to build a cluster with. those barcelona cores do like $40 - $90 a piece on ebay now. So most expensive part of the machine is the DDR2- ECC-reg ram. Some 'cheap offers' on ebay regrettably are not ecc-reg despite what the description. For the coming time of course pricewise nothing beats 16 core AMD's. Those cpu's way faster than i had thought. Intel is too expensive second hand simply and always will be, as they are very good in marketing their hardware as being fast, despite that already for like half a decade they have no good 4 socket platform getting sold (if we forget about Dunnington - that thing is too expensive anyway still so let's forget about Dunnington). intel and amd seem to have canned all type of new cpu's. i read now nehalem-ex is just 6 cores, no longer 8 and 2.26Ghz, so that's not so interesting as IPC wise for well written software you can mathematical prove that AMD is nearly same speed. If intel has that problem i suppose amd won't have a 12 core version of their chip any soon, besides those are like $2k a piece the highend versions. Too much. At least for my software the barcelona's are way faster than i had thought. DDR3 is a lot faster latency of course than DDR2, but the total cpu speed is what counts in first place and the price you can get it for. Considering testreports of those 8 core cpu's i guess we won't see them soon. What was it 700mm^2? They can't produce those in times like this i guess. Too expensive, no company wants to buy that. So hence my quest for cheap alternatives :) sorry for off-topic posting. guess though more of you are looking for cheap clusters. Seems NASA has to size down also nowadays, so no more intel hardware for it in future either i bet. Vincent From john.hearns at mclaren.com Tue Feb 2 03:24:38 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 2 Feb 2010 11:24:38 -0000 Subject: [Beowulf] hardware question - which PSU for this? In-Reply-To: <971A1CC5-ABBD-4E02-8010-1260805E2DD1@xs4all.nl> References: <971A1CC5-ABBD-4E02-8010-1260805E2DD1@xs4all.nl> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F245EC1@milexchmb1.mil.tagmclarengroup.com> > > intel and amd seem to have canned all type of new cpu's. i read now > nehalem-ex is just 6 cores, no longer 8 and 2.26Ghz, Vincent, please can you provide a reference for this? My understanding is that Nehalem-EX will be available in eight core versions and a six core edition, which is attractive for HPC use. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From jlforrest at berkeley.edu Tue Feb 2 14:00:37 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 02 Feb 2010 14:00:37 -0800 Subject: [Beowulf] Transient NFS Problems in New Cluster Message-ID: <4B68A085.2070406@berkeley.edu> I have a new cluster running CentOS 5.3. The cluster uses a Sun 7310 storage server that provides NFS service over a private 1Gb/s ethernet with 9K jumbo frames to the cluster. We've noticed that a number of the compute nodes sometimes generate the automount[15023]: umount_autofs_indirect: ask umount returned busy /home message. When this happens the program running on the node dies. This has happened between 10 and 20 times. We're not sure what's going on on a node when this happens. Most of the time everything is fine and the home directories are automounted without problem. I've googled for this problem and I see that other people have seen it too, but I've never seen a resolution, especially not for RHEL5. The auto.master line for this mount is /home /etc/auto.home --timeout=1200 noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768 The network interface configuration is eth0 Link encap:Ethernet HWaddr 00:30:48:B9:F6:52 inet addr:10.1.255.233 Bcast:10.1.255.255 Mask:255.255.0.0 inet6 addr: fe80::230:48ff:feb9:f652/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1 RX packets:32999308 errors:0 dropped:0 overruns:0 frame:0 TX packets:27468315 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:24225053296 (22.5 GiB) TX bytes:73313582546 (68.2 GiB) Interrupt:74 Base address:0x2000 Any advice on what to do? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 From landman at scalableinformatics.com Tue Feb 2 14:29:13 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 02 Feb 2010 17:29:13 -0500 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68A085.2070406@berkeley.edu> References: <4B68A085.2070406@berkeley.edu> Message-ID: <4B68A739.3020805@scalableinformatics.com> Jon Forrest wrote: > I have a new cluster running CentOS 5.3. > The cluster uses a Sun 7310 storage server > that provides NFS service over a private > 1Gb/s ethernet with 9K jumbo frames to the > cluster. > > We've noticed that a number of the compute > nodes sometimes generate the > > automount[15023]: umount_autofs_indirect: ask umount returned busy /home [...] > Any advice on what to do? We still recommend turning off autofs for home directories. We've seen lots of problems with it on many clusters. Hard mounts are IMO better. That server should be able to handle it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jlforrest at berkeley.edu Tue Feb 2 15:27:20 2010 From: jlforrest at berkeley.edu (Jon Forrest) Date: Tue, 02 Feb 2010 15:27:20 -0800 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68A739.3020805@scalableinformatics.com> References: <4B68A085.2070406@berkeley.edu> <4B68A739.3020805@scalableinformatics.com> Message-ID: <4B68B4D8.3010807@berkeley.edu> On 2/2/2010 2:29 PM, Joe Landman wrote: > We still recommend turning off autofs for home directories. We've seen > lots of problems with it on many clusters. Hard mounts are IMO better. > That server should be able to handle it. These problems were also happening for another non-home mount, but I hear what you're saying. This is the only cluster we're seeing the problem on, but then this is the only cluster with a Sun storage server. All the others are using CentOS 5.X. Do you think this problem could be caused by the server? Also, what do you believe the fundamental cause is? Cordially, -- Jon Forrest Research Computing Support College of Chemistry 173 Tan Hall University of California Berkeley Berkeley, CA 94720-1460 510-643-1032 jlforrest at berkeley.edu From christiansuhendra at gmail.com Mon Feb 1 22:26:50 2010 From: christiansuhendra at gmail.com (christian suhendra) Date: Mon, 1 Feb 2010 18:26:50 -1200 Subject: [Beowulf] problem of mpich-1.2.7p1 Message-ID: hello guys i have installed mpich-1.2.7p1 on ubuntu 9.04, i have configured hte NFS and RSH.. i use device=ch_p4,, but when i ran my program it's like not working i've got this result : root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 canon Process 0 of 1 on cluster3 Total Time: 4.316000 msecs root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 4 canon Process 0 of 4 on cluster3 Total Time: 21.552000 msecs Process 2 of 4 on cluster2 Process 1 of 4 on cluster1 Process 3 of 4 on cluster1 root at cluster3:/mirror/mpich-1.2.7p1# the process only wotk in 1 node.. but when i test the machine it connected to all node.. root at cluster3:/mirror/mpich-1.2.7p1# /mirror/mpich-1.2.7p1/sbin/tstmachines -v LINUX Trying true on cluster1 ... Trying true on cluster2 ... Trying true on cluster3 ... Trying true on cluster4 ... Trying ls on cluster1 ... Trying ls on cluster2 ... Trying ls on cluster3 ... Trying ls on cluster4 ... Trying user program on cluster1 ... Trying user program on cluster2 ... Trying user program on cluster3 ... Trying user program on cluster4 ... i don't know where exactly the problem so that my program cannot run in all node.. please help me... my deadline its about 1 week later... i'm very excpeting your help... i attached my listing program so you can test on your system thank you very much... regards christian -------------- next part -------------- An HTML attachment was scrubbed... URL: From gus at ldeo.columbia.edu Tue Feb 2 17:48:07 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 02 Feb 2010 20:48:07 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: Message-ID: <4B68D5D7.1010907@ldeo.columbia.edu> Hi Christian Somehow your program was not attached to the message. In any case, you didn't say anything about your "machinefile" contents. You need to list the nodes you want to use there. The command line will be something like this: mpirun -np 4 -machinefile my_machinefile canon "man mpirun" may help you with the details. (I assume you are using the mpirun that comes with mpich1.) Having said that, I suggest that you move from MPICH-1 to OpenMPI or to MPICH2. MPICH-1 (mpich-1.2.7p1) is old, not maintained or supported anymore, and often times breaks in current Linux kernels. The MPICH developers also recommend upgrading to MPICH2. OpenMPI and MPICH2 are free, easy to install, stable, up to date, and more efficient than MPICH1. Upgrading to one of them is likely to avoid more trouble later, specially with your tight deadline. See: http://www.open-mpi.org/ http://www.mcs.anl.gov/research/projects/mpich2/ I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- christian suhendra wrote: > hello guys > i have installed mpich-1.2.7p1 on ubuntu 9.04, i have configured hte NFS > and RSH.. > i use device=ch_p4,, > but when i ran my program it's like not working i've got this result : > root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 canon > Process 0 of 1 on cluster3 > Total Time: 4.316000 msecs > root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 4 canon > Process 0 of 4 on cluster3 > Total Time: 21.552000 msecs > Process 2 of 4 on cluster2 > Process 1 of 4 on cluster1 > Process 3 of 4 on cluster1 > root at cluster3:/mirror/mpich-1.2.7p1# > > the process only wotk in 1 node.. > but when i test the machine it connected to all node.. > root at cluster3:/mirror/mpich-1.2.7p1# > /mirror/mpich-1.2.7p1/sbin/tstmachines -v LINUX > Trying true on cluster1 ... > Trying true on cluster2 ... > Trying true on cluster3 ... > Trying true on cluster4 ... > Trying ls on cluster1 ... > Trying ls on cluster2 ... > Trying ls on cluster3 ... > Trying ls on cluster4 ... > Trying user program on cluster1 ... > Trying user program on cluster2 ... > Trying user program on cluster3 ... > Trying user program on cluster4 ... > > i don't know where exactly the problem so that my program cannot run in > all node.. > please help me... > my deadline its about 1 week later... > i'm very excpeting your help... > > > i attached my listing program so you can test on your system > thank you very much... > > > > > regards > christian > > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Tue Feb 2 17:58:01 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 02 Feb 2010 20:58:01 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: <4B68D5D7.1010907@ldeo.columbia.edu> References: <4B68D5D7.1010907@ldeo.columbia.edu> Message-ID: <4B68D829.4010604@ldeo.columbia.edu> PS - And don't run the programs as root! Gus Correa Gus Correa wrote: > Hi Christian > > Somehow your program was not attached to the message. > > In any case, you didn't say anything about your "machinefile" contents. > You need to list the nodes you want to use there. > The command line will be something like this: > > mpirun -np 4 -machinefile my_machinefile canon > > "man mpirun" may help you with the details. > (I assume you are using the mpirun that comes with mpich1.) > > Having said that, I suggest that you move from MPICH-1 to > OpenMPI or to MPICH2. > MPICH-1 (mpich-1.2.7p1) is old, not maintained or supported anymore, > and often times breaks in current Linux kernels. > The MPICH developers also recommend upgrading to MPICH2. > > OpenMPI and MPICH2 are free, easy to install, stable, up to date, > and more efficient than MPICH1. > Upgrading to one of them is likely to avoid more trouble later, > specially with your tight deadline. > > See: > http://www.open-mpi.org/ > http://www.mcs.anl.gov/research/projects/mpich2/ > > > I hope this helps, > Gus Correa > --------------------------------------------------------------------- > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > --------------------------------------------------------------------- > > > christian suhendra wrote: >> hello guys >> i have installed mpich-1.2.7p1 on ubuntu 9.04, i have configured hte >> NFS and RSH.. >> i use device=ch_p4,, >> but when i ran my program it's like not working i've got this result : >> root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 canon >> Process 0 of 1 on cluster3 >> Total Time: 4.316000 msecs >> root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 4 canon >> Process 0 of 4 on cluster3 >> Total Time: 21.552000 msecs >> Process 2 of 4 on cluster2 >> Process 1 of 4 on cluster1 >> Process 3 of 4 on cluster1 >> root at cluster3:/mirror/mpich-1.2.7p1# >> >> the process only wotk in 1 node.. >> but when i test the machine it connected to all node.. >> root at cluster3:/mirror/mpich-1.2.7p1# >> /mirror/mpich-1.2.7p1/sbin/tstmachines -v LINUX >> Trying true on cluster1 ... >> Trying true on cluster2 ... >> Trying true on cluster3 ... >> Trying true on cluster4 ... >> Trying ls on cluster1 ... >> Trying ls on cluster2 ... >> Trying ls on cluster3 ... >> Trying ls on cluster4 ... >> Trying user program on cluster1 ... >> Trying user program on cluster2 ... >> Trying user program on cluster3 ... >> Trying user program on cluster4 ... >> >> i don't know where exactly the problem so that my program cannot run >> in all node.. >> please help me... >> my deadline its about 1 week later... >> i'm very excpeting your help... >> >> >> i attached my listing program so you can test on your system >> thank you very much... >> >> >> >> >> regards >> christian >> >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Tue Feb 2 18:31:49 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Tue, 02 Feb 2010 21:31:49 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: <4B68D829.4010604@ldeo.columbia.edu> References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> Message-ID: <4B68E015.6040202@ldeo.columbia.edu> Hi Christian What is the content of your file /mirror/mpich-1.2.7p1/share/machines.LINUX? Please send it on your next message, it may clarify. It looks like to me that your program is working correctly. (I am guessing a bit, because you didn't send the source code.) When you did "mpirun -np 1 canon" it ran one process on cluster3: See: >>> Process 0 of 1 on cluster3 >>> Total Time: 4.316000 msecs When you did "mpirun -np 4 canon" it ran two processes on cluster1, and one in cluster2 and cluster3. See: >>> Process 0 of 4 on cluster3 >>> Total Time: 21.552000 msecs >>> Process 2 of 4 on cluster2 >>> Process 1 of 4 on cluster1 >>> Process 3 of 4 on cluster1 Did you expect more output than this? Did you expect a different output? Did you expect it to use a different set of computers? Anyway, you would be better off upgrading to OpenMPI or MPICH2. The README file in the OpenMPI tarball has all information you need to install it. Chances are that MPICH1 will break in more complicated programs. And remember not to run user-level programs as root. That's not really safe. I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- Gus Correa wrote: > PS - And don't run the programs as root! > > Gus Correa > > Gus Correa wrote: >> Hi Christian >> >> Somehow your program was not attached to the message. >> >> In any case, you didn't say anything about your "machinefile" contents. >> You need to list the nodes you want to use there. >> The command line will be something like this: >> >> mpirun -np 4 -machinefile my_machinefile canon >> >> "man mpirun" may help you with the details. >> (I assume you are using the mpirun that comes with mpich1.) >> >> Having said that, I suggest that you move from MPICH-1 to >> OpenMPI or to MPICH2. >> MPICH-1 (mpich-1.2.7p1) is old, not maintained or supported anymore, >> and often times breaks in current Linux kernels. >> The MPICH developers also recommend upgrading to MPICH2. >> >> OpenMPI and MPICH2 are free, easy to install, stable, up to date, >> and more efficient than MPICH1. >> Upgrading to one of them is likely to avoid more trouble later, >> specially with your tight deadline. >> >> See: >> http://www.open-mpi.org/ >> http://www.mcs.anl.gov/research/projects/mpich2/ >> >> >> I hope this helps, >> Gus Correa >> --------------------------------------------------------------------- >> Gustavo Correa >> Lamont-Doherty Earth Observatory - Columbia University >> Palisades, NY, 10964-8000 - USA >> --------------------------------------------------------------------- >> >> >> christian suhendra wrote: >>> hello guys >>> i have installed mpich-1.2.7p1 on ubuntu 9.04, i have configured hte >>> NFS and RSH.. >>> i use device=ch_p4,, >>> but when i ran my program it's like not working i've got this result : >>> root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 1 canon >>> Process 0 of 1 on cluster3 >>> Total Time: 4.316000 msecs >>> root at cluster3:/mirror/mpich-1.2.7p1# mpirun -np 4 canon >>> Process 0 of 4 on cluster3 >>> Total Time: 21.552000 msecs >>> Process 2 of 4 on cluster2 >>> Process 1 of 4 on cluster1 >>> Process 3 of 4 on cluster1 >>> root at cluster3:/mirror/mpich-1.2.7p1# >>> >>> the process only wotk in 1 node.. >>> but when i test the machine it connected to all node.. >>> root at cluster3:/mirror/mpich-1.2.7p1# >>> /mirror/mpich-1.2.7p1/sbin/tstmachines -v LINUX >>> Trying true on cluster1 ... >>> Trying true on cluster2 ... >>> Trying true on cluster3 ... >>> Trying true on cluster4 ... >>> Trying ls on cluster1 ... >>> Trying ls on cluster2 ... >>> Trying ls on cluster3 ... >>> Trying ls on cluster4 ... >>> Trying user program on cluster1 ... >>> Trying user program on cluster2 ... >>> Trying user program on cluster3 ... >>> Trying user program on cluster4 ... >>> >>> i don't know where exactly the problem so that my program cannot run >>> in all node.. >>> please help me... >>> my deadline its about 1 week later... >>> i'm very excpeting your help... >>> >>> >>> i attached my listing program so you can test on your system >>> thank you very much... >>> >>> >>> >>> >>> regards >>> christian >>> >>> >>> >>> ------------------------------------------------------------------------ >>> >>> _______________________________________________ >>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >>> To change your subscription (digest mode or unsubscribe) visit >>> http://www.beowulf.org/mailman/listinfo/beowulf >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From atp at piskorski.com Tue Feb 2 21:03:05 2010 From: atp at piskorski.com (Andrew Piskorski) Date: Wed, 3 Feb 2010 00:03:05 -0500 Subject: [Beowulf] hardware question - which PSU for this? In-Reply-To: <971A1CC5-ABBD-4E02-8010-1260805E2DD1@xs4all.nl> References: <971A1CC5-ABBD-4E02-8010-1260805E2DD1@xs4all.nl> Message-ID: <20100203050305.GA93678@piskorski.com> On Tue, Feb 02, 2010 at 10:41:44AM +0100, Vincent Diepeveen wrote: > This seems ideal mainboard for beowulf clusters. built in infiniband > it seems. > > http://cgi.ebay.com/Arima-AMD-Opteron-Quad-Core-Socket-F-3000-series-Server_W0QQitemZ390149471460QQcmdZViewItemQQptZCOMP_EN_Networking_Components?hash=item5ad6b87ce4 Vincent, what makes you think that motherboard has Infiniband? Do you see connectors for it in the pictures or something? Arima's older SW500 quad socket 940 Opteron motherboard came in four versions, two of which had Infiniband (Mellanox MT25208). Thus, it's quite possible that this newer quad socket F board also had an Infiniband option, but I have no idea if this particular board has it or not. http://www.flextronics.com/computing/support/server/Product/ViewProduct.asp?View=SW500 > The problem is no manufacturer lists these boards and which PSU fits > on it and what cpu's is unclear. It's an Arima model 40GCMG020-D400-100. Perhaps it was intended as an OEM motherboard for Gateway or Dell; at least, I can't think of any other good reason why there's essentially no information whatsoever about it on web. Arima sold its computer business to Flextronics in 2007, perhaps that's related somehow. Chris Morrell (aka, "[XC] gomeler", a Georgia Tech student and sysadmin) seems to have made good progress in figuring out the functions of most of the pins on the motherboard's non-standard power connectors, but hasn't yet succeeded in getting his to boot: http://www.xtremesystems.org/forums/showpost.php?p=4194423&postcount=137 http://www.xtremesystems.org/forums/showpost.php?p=4224859&postcount=175 http://atrejus.net/arima/arima-gr.jpg http://forums.2cpu.com/showthread.php?s=e603598f0c265e1e0725a0b10b1a3757&p=772296#post772296 -- Andrew Piskorski http://www.piskorski.com/ From henning.fehrmann at aei.mpg.de Tue Feb 2 23:28:45 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Wed, 3 Feb 2010 08:28:45 +0100 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68A085.2070406@berkeley.edu> References: <4B68A085.2070406@berkeley.edu> Message-ID: <20100203072845.GA3809@gretchen.aei.mpg.de> On Tue, Feb 02, 2010 at 02:00:37PM -0800, Jon Forrest wrote: > I have a new cluster running CentOS 5.3. > The cluster uses a Sun 7310 storage server > that provides NFS service over a private > 1Gb/s ethernet with 9K jumbo frames to the > cluster. > > We've noticed that a number of the compute > nodes sometimes generate the > > automount[15023]: umount_autofs_indirect: ask umount returned busy /home > > message. When this happens the program running on the > node dies. This has happened between 10 and 20 times. > We're not sure what's going on on a node when this > happens. Most of the time everything is fine and > the home directories are automounted without problem. > > I've googled for this problem and I see that other people > have seen it too, but I've never seen a resolution, > especially not for RHEL5. I guess the problem has not directly something to do with RHEL5. You might want to post this question to autofs at linux.kernel.org They need to know the version of autofs and the kernel. > > The auto.master line for this mount is > > /home /etc/auto.home --timeout=1200 You could try to reduce the timeout. Nothing speaks against a timeout of 60s. Many things can happen in 1200s - especially on the server side. > noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768 You could try nolock on the client side and async on the server side. The user should take care that not two processes are writing into the same files to avoid race conditions. Cheers, Henning From prentice at ias.edu Wed Feb 3 06:56:36 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 03 Feb 2010 09:56:36 -0500 Subject: [Beowulf] GPU Beowulf Clusters In-Reply-To: <4B65C8B0.9060300@pathscale.com> References: <4B61CB86.9060105@berkeley.edu> <20100130143145.43de588a@vivalunalitshi.luna.local> <4B647949.9030700@berkeley.edu> <4B64B83B.9080303@pathscale.com> <4B64DD37.50201@berkeley.edu> <20100131193358.395acc13@vivalunalitshi.luna.local> <4B65C8B0.9060300@pathscale.com> Message-ID: <4B698EA4.3050601@ias.edu> C. Bergstr?m wrote: >> NVidia techs told me that the performance difference can be about 1:2. >> > That used to be true, but I thought they fixed that? (How old is your > information) I heard this myself many times SC09. And that was in reference to Fermi, so doubt it's changed much since then. -- Prentice From prentice at ias.edu Wed Feb 3 07:22:17 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 03 Feb 2010 10:22:17 -0500 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68A085.2070406@berkeley.edu> References: <4B68A085.2070406@berkeley.edu> Message-ID: <4B6994A9.3090304@ias.edu> Jon Forrest wrote: > I have a new cluster running CentOS 5.3. > The cluster uses a Sun 7310 storage server > that provides NFS service over a private > 1Gb/s ethernet with 9K jumbo frames to the > cluster. > > We've noticed that a number of the compute > nodes sometimes generate the > > automount[15023]: umount_autofs_indirect: ask umount returned busy /home > > message. When this happens the program running on the > node dies. This has happened between 10 and 20 times. > We're not sure what's going on on a node when this > happens. Most of the time everything is fine and > the home directories are automounted without problem. > > I've googled for this problem and I see that other people > have seen it too, but I've never seen a resolution, > especially not for RHEL5. > > The auto.master line for this mount is > > /home /etc/auto.home --timeout=1200 > noatime,nodiratime,rw,noacl,rsize=32768,wsize=32768 > > The network interface configuration is > Jon, I had this same exact problem a couple of weeks ago after changing the autmounting scheme on our network, requiring all nodes to reread the automounter configuration. It only happened on a few nodes. My only solution was reboot the nodes with the problem. After rebooting, 'service autofs reload' or 'service autofs restart' worked without a problem. I'm sure that's not the answer you were looking for, but that's all I got. Sorry. I suspect its a bug in the automount daemon, but I can't prove it. -- Prentice From prentice at ias.edu Wed Feb 3 07:27:35 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 03 Feb 2010 10:27:35 -0500 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68A739.3020805@scalableinformatics.com> References: <4B68A085.2070406@berkeley.edu> <4B68A739.3020805@scalableinformatics.com> Message-ID: <4B6995E7.5050604@ias.edu> Joe Landman wrote: > Jon Forrest wrote: >> I have a new cluster running CentOS 5.3. >> The cluster uses a Sun 7310 storage server >> that provides NFS service over a private >> 1Gb/s ethernet with 9K jumbo frames to the >> cluster. >> >> We've noticed that a number of the compute >> nodes sometimes generate the >> >> automount[15023]: umount_autofs_indirect: ask umount returned busy /home > > [...] > >> Any advice on what to do? > > We still recommend turning off autofs for home directories. We've seen > lots of problems with it on many clusters. Hard mounts are IMO better. > That server should be able to handle it. > How do you handle situations where home directories are spread across multiple servers? Do have a large /etc/fstab on ever NFS client? -- Prentice From prentice at ias.edu Wed Feb 3 07:29:52 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Wed, 03 Feb 2010 10:29:52 -0500 Subject: [Beowulf] Transient NFS Problems in New Cluster In-Reply-To: <4B68B4D8.3010807@berkeley.edu> References: <4B68A085.2070406@berkeley.edu> <4B68A739.3020805@scalableinformatics.com> <4B68B4D8.3010807@berkeley.edu> Message-ID: <4B699670.9050708@ias.edu> Jon Forrest wrote: > On 2/2/2010 2:29 PM, Joe Landman wrote: > >> We still recommend turning off autofs for home directories. We've seen >> lots of problems with it on many clusters. Hard mounts are IMO better. >> That server should be able to handle it. > > These problems were also happening for another > non-home mount, but I hear what you're saying. > > This is the only cluster we're seeing the problem > on, but then this is the only cluster with a > Sun storage server. All the others are using > CentOS 5.X. Do you think this problem could > be caused by the server? Also, what do you > believe the fundamental cause is? In my case, the fileserver was a NetApp, and the client was a rebuild of RHEL 2.8WS -- Prentice From gus at ldeo.columbia.edu Wed Feb 3 11:02:21 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 03 Feb 2010 14:02:21 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> Message-ID: <4B69C83D.2020301@ldeo.columbia.edu> Hi Christian The program attachment didn't come again. You may try to cut and paste the program to the bottom of the message. Now I see, you are worried about MPI performance, not the program correctness at this point. If your program does too little work, it is likely that the initialization/finalization and the whole MPI setup and communication take more time than the actual computation. If this is the case, and particularly if your network is slow (say Ethernet 100), you will see better performance for less nodes when the "problem size is small". There is nothing wrong with this. This phenomenon, and several variants of it, are called "Amdahl's Law": http://en.wikipedia.org/wiki/Amdahl's_law In general the "problem size" is controlled by one or a few numbers on your code or on your parameter files. Problem size may be controlled by, say, the size of an array or matrix, the number of iterations of a main loop, etc. Could you perhaps increase the problem size on your code, say boost it up 10 or 100 times, and see if the performance in many nodes still beats one node alone? I hope this helps, Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- christian suhendra wrote: > oh...the content of /mirror/mpich-1.2.7p1/share/ > machines.LINUX are the hostname of each node.. > here is the content: > cluster1 > cluster2 > cluster3 > cluster4 > > when i ran the canon on 1 node i got the total time is 4.316000 msecs > but when i ran canon in 4 node. > see: > mpirun -np 4 the total is 21.552000 msecs > > it takes a long time then i node..it supposed to be more faster then 1 > node/PC.. > i this case i juzt need the mpich or my program work in all of node so > that the total time would be more faster then run in 1 node.. > > i attached my program so you could investigated the problem, but i > thougt the real problem is on the configuration.. > > > thank you so much mr. gus... > i really need your help i don't know how to solve this problem even my > lecturer on my university doesn't know how to solve this..actually this > is my final project for my thesis.. > and i take this because i wants to be an expert on this field sometimes.. > > > regards > christian From kus at free.net Wed Feb 3 11:11:10 2010 From: kus at free.net (Mikhail Kuzminsky) Date: Wed, 03 Feb 2010 22:11:10 +0300 Subject: [Beowulf] top500 clusters using Message-ID: Last top500 lists contain a set of sites which are not from HPC world. I understand that there may be parallelized applications which may use whole cluster for one task, but this task isn't floating-point oriented. But (in top500) there is a set of "unnamed" sites like "IT service provider" etc. IMHO they may use clusters for Web-hosting etc, where may be load balancing is used. I.e. it's not "supercomputer" (I means computer, all CPUs/cores of which may be used for solving of one task). Am I right - or they really work w/applications, which may involve whole cluster for one task solving ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From Shainer at mellanox.com Wed Feb 3 11:34:49 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Wed, 3 Feb 2010 11:34:49 -0800 Subject: [Beowulf] top500 clusters using References: Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F025D8623@mtiexch01.mti.com> Not all of the Top500 systems are really HPC systems. Several systems are basically used for enterprise applications but before moving them into the "production" environment, the associate vendor run the Linpack benchmark and submit the system to the list. Check out http://www.hpcwire.com/features/17905159.html. Somewhat old but still valid for some cases. Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mikhail Kuzminsky Sent: Wednesday, February 03, 2010 11:11 AM To: beowulf at beowulf.org Subject: [Beowulf] top500 clusters using Last top500 lists contain a set of sites which are not from HPC world. I understand that there may be parallelized applications which may use whole cluster for one task, but this task isn't floating-point oriented. But (in top500) there is a set of "unnamed" sites like "IT service provider" etc. IMHO they may use clusters for Web-hosting etc, where may be load balancing is used. I.e. it's not "supercomputer" (I means computer, all CPUs/cores of which may be used for solving of one task). Am I right - or they really work w/applications, which may involve whole cluster for one task solving ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Wed Feb 3 13:27:00 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 3 Feb 2010 16:27:00 -0500 (EST) Subject: [Beowulf] top500 clusters using In-Reply-To: References: Message-ID: > Last top500 lists contain a set of sites which are not from HPC world. sure - it always has. top500 is not particularly HPC-specific, since linpack only weakly measures important factors like memory and interconnect bandwidth and latency. the main appeal of linpack is that it's pretty well-understood and uses hardware (FPUs) present in all conventional machines... > I understand that there may be parallelized applications which may use whole > cluster for one task, but this task isn't floating-point oriented. I would almost say that linpack is not particularly about FP (since you can derive integer rates from the scores if you want.) > etc. IMHO they may use clusters for Web-hosting etc, where may be load > balancing is used. I.e. it's not "supercomputer" (I means computer, all > CPUs/cores of which may be used for solving of one task). I think it's more common than you think for clusters to be load-balanced among many applications - or conversely that single-job clusters, which might strictly be called "capability", are rare. From gus at ldeo.columbia.edu Wed Feb 3 19:01:17 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Wed, 03 Feb 2010 22:01:17 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> Message-ID: <4B6A387D.4000900@ldeo.columbia.edu> Hi Christian Is the code trying to multiply two matrices using block decomposition? In MPICH2 you need to establish passwordless ssh (not rsh!) connection across your machines. You also need to start the mpd daemon ring (although there is now also the Hydra mechanism, but I am not familiar to it). Finally you launch your program with mpirun (now called also mpiexec), and you can pass your machine file in the command line (in MPICH1 you could do the same too). "man mpirun" will help you. In OpenMPI you also need to establish passwordless ssh connection across the machines. However, there is no daemon ring to start, making the process easier. Likewise, you can give a hostfile/machine file on the command line, or you can list the hosts one by one in the command line. In my opinion OpenMPI is easier to use and install (and more flexible, and has more features). I hope this helps, Gus Correa christian suhendra wrote: > please take a look my source code.. > > mr.gus i wanna ask something about the mpich2 > if i swith my mpich1 to mpich2 where do i have supposed to put the list > of machine of each node.. > if in mpich1 the list in machines.LINUX.. > in mpich2 where is it?? > > about amdahl law..i still confused about that..^_^ > maybe i have to read many times so i can understand.. > > thank you mr.gus.. > i still need your advice > > > > > regards > christian christian suhendra wrote: > hello mr.gus here is my source code.. > when i compile that it works.. > > #include > #include > #include > #include > > #define N 100 /* 512*/ > > void Print_Matrix(int x, int y, int myid,int S, int *M); > int readmat(char *fname, int *mat, int n); > int writemat(char *fname, int *mat, int n); > > int A[N][N],B[N][N],C[N][N]; > int main(int argc, char **argv){ > int myid, S, M, nproc,source,dest; > int i,j,k,m,l,repeat,temp,S2; > int namelen; > char processor_name[MPI_MAX_PROCESSOR_NAME]; > MPI_Status status; > double t1, t2,t3,t4; > > int *T,*TA,*TB,*t,*TC; > MPI_Comm GRID_COMM; > int ndims,dims[2]; > int periods[2],coords[2],grid_rank; > > MPI_Init(&argc, &argv); /* initialization MPI */ > MPI_Comm_rank(MPI_COMM_WORLD,&myid); > MPI_Comm_size(MPI_COMM_WORLD,&nproc); /* #procesor */ > MPI_Get_processor_name(processor_name,&namelen); > > printf("Process %d of %d on %s\n",myid, nproc, processor_name); > > if(myid==0) { > /* read data from files: "A_file", "B_file"*/ > if (readmat("A_file.txt", (int *) A,N) < 0) > exit( 1 + printf("file problem\n") ); > if (readmat("B_file.txt", (int *) B, N) < 0) > exit( 1 + printf("file problem\n") ); > /*catat waktu*/ > t1=MPI_Wtime(); > } > /* topologi*/ > M=(int)sqrt(nproc); > S=N/M; /*dimensi blok*/ > S2=S*S; /*dimensi blok*/ > dims[0]=dims[1]=M; /*dimensi topologi*/ > periods[0]=periods[1]=1; > MPI_Cart_create(MPI_COMM_WORLD,2,dims,periods,0,&GRID_COMM); > > MPI_Comm_rank(GRID_COMM,&grid_rank); > MPI_Cart_coords(GRID_COMM,grid_rank,2,coords); > myid=grid_rank; > source=coords[0]; > dest=coords[1]; > > /*place for matrix input and output*/ > TA=(int *)malloc(sizeof(MPI_INT)*S2); > TB=(int *)malloc(sizeof(MPI_INT)*S2); > TC=(int *)malloc(sizeof(MPI_INT)*S2); > > for(i=0; i TC[i]=0; > > /*start cannon*/ > if(myid==0) > { > T=(int *)malloc(sizeof(MPI_INT)*S2); > t3=MPI_Wtime(); /*timing*/ > for(k=0; k MPI_Cart_coords(GRID_COMM,k,2,coords); > if(k==0){ > t=TA; > for(i=k; i temp=(k*S)%N; > for(j=temp; j *t=A[i][j]; > t++; > } > } > } > else{ > t=T; > for(i=coords[0]*S; i<(coords[0]+1)*S; i++) > for(j=coords[1]*S; j<(coords[1]+1)*S; j++){ > *t=A[i][j]; > t++; > } > MPI_Send(T,S2,MPI_INT,k,0,GRID_COMM); > } > } > > for(k=0; k MPI_Cart_coords(GRID_COMM,k,2,coords); > if(k==0){ > t=TB; > for(i=k; i temp=(k*S)%N; > for(j=temp; j *t=B[i][j]; > t++; > } > } > } > else{ > t=T; > for(i=coords[0]*S; i<(coords[0]+1)*S; i++) > for(j=coords[1]*S; j<(coords[1]+1)*S; j++){ > *t=B[i][j]; > t++; > } > MPI_Send(T,S2,MPI_INT,k,1,GRID_COMM); > } > } > > coords[0]=source; > coords[1]=dest; > t4= MPI_Wtime(); > > /*matriks multiplication*/ > for(repeat=0; repeat for(i=0; i for(j=0; j for(k=0; k TC[i*S+j]+=TA[i*S+k]*TB[j+k*S]; > } > } > } > /*swivel block & fill the value*/ > MPI_Cart_shift(GRID_COMM,1,-1,&source,&dest); > > MPI_Sendrecv_replace(TA,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > > MPI_Cart_shift(GRID_COMM,0,-1,&source,&dest); > > MPI_Sendrecv_replace(TB,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > } > > for(i=0; i for(j=0; j C[i][j]=TC[i*S+j]; > } > > for(i=1; i l=0; > m=0; > > MPI_Recv(T,S2,MPI_INT,MPI_ANY_SOURCE,MPI_ANY_TAG,GRID_COMM,&status); > MPI_Cart_coords(GRID_COMM,status.MPI_TAG,2,coords); > > for(j=coords[0]*S; j<(coords[0]+1)*S; j++){ > for(k=coords[1]*S; k<(coords[1]+1)*S; k++){ > C[j][k]=T[l*S+m]; > m++; > } > l++; > m=0; > } > } > > t2= MPI_Wtime(); > printf("Total Time: %lf msecs \n",(t2 - t1) / 0.001); > //printf("Transmit Time: %lf msecs \n",(t4 - t3) / 0.001); > writemat("C_file_par", (int *) C, N); > > free(T); > free(TA); > free(TB); > free(TC); > } > else > { > MPI_Recv(TA,S2,MPI_INT,0,0,GRID_COMM,&status); > MPI_Recv(TB,S2,MPI_INT,0,1,GRID_COMM,&status); > > MPI_Cart_shift(GRID_COMM,1,-coords[0],&source,&dest); > MPI_Sendrecv_replace(TA,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > > MPI_Cart_shift(GRID_COMM,0,-coords[1],&source,&dest); > MPI_Sendrecv_replace(TB,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > for(repeat=0; repeat for(i=0; i for(j=0; j for(k=0; k TC[i*S+j]+=TA[i*S+k]*TB[j+k*S]; > } > } > } > > MPI_Cart_shift(GRID_COMM,1,-1,&source,&dest); > > MPI_Sendrecv_replace(TA,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > > MPI_Cart_shift(GRID_COMM,0,-1,&source,&dest); > > MPI_Sendrecv_replace(TB,S2,MPI_INT,dest,0,source,0,GRID_COMM,&status); > } > > MPI_Send(TC,S2,MPI_INT,0,myid,GRID_COMM); > free(TA); > free(TB); > free(TC); > } > MPI_Finalize(); > return(0); > } > /*function of read the input and put to the output/ > #define _mat(i,j) (mat[(i)*n + (j)]) > int readmat(char *fname, int *mat, int n){ > FILE *fp; > int i, j; > if ((fp = fopen(fname, "r")) == NULL) > return (-1); > for (i = 0; i < n; i++) > for (j = 0; j < n; j++) > if (fscanf(fp, "%d", &_mat(i,j)) == EOF){ > fclose(fp); > return (-1); > }; > fclose(fp); > return (0); > } > int writemat(char *fname, int *mat, int n){ > FILE *fp; > int i, j; > > if ((fp = fopen(fname, "w")) == NULL) > return (-1); > for (i = 0; i < n; fprintf(fp, "\n"), i++) > for (j = 0; j < n; j++) > fprintf(fp, "%d ", _mat(i, j)); > fclose(fp); > return (0); > } > void Print_Matrix(int x, int y, int myid, int S, int *M){ > int i,j; > printf("myid:%d\n",myid); > for(i=0; i { > for(j=0; j printf(" %d ",M[i*S+j]); > printf("\n"); > } > } > From tegner at renget.se Thu Feb 4 07:18:13 2010 From: tegner at renget.se (tegner at renget.se) Date: Thu, 4 Feb 2010 16:18:13 +0100 Subject: [Beowulf] Pxe boot over infiniband Message-ID: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> When googling for "pxe boot over infiniband" it seems this is not possible for all types of hardware. Is this correct? And if so, are there other solutions (except connecting all nodes to a gigabit switch as well)? Regards, /jon From mdidomenico4 at gmail.com Thu Feb 4 07:48:06 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 4 Feb 2010 10:48:06 -0500 Subject: [Beowulf] Pxe boot over infiniband In-Reply-To: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> References: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> Message-ID: at one point etherboot had images for mt23108 and mt25208 cards. last i can recall (3yrs ago) mellanox did some work in this area, but i'm not sure if it went very far when i was with qlogic we looked at the feasibility and found that most compute nodes had an ethernet connection anyhow, so the effort and interest was not there to build out the software On Thu, Feb 4, 2010 at 10:18 AM, wrote: > When googling for "pxe boot over infiniband" it seems this is not possible > for all types of hardware. Is this correct? And if so, are there other > solutions (except connecting all nodes to a gigabit switch as well)? > > Regards, > > /jon > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From john.hearns at mclaren.com Thu Feb 4 07:50:39 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 4 Feb 2010 15:50:39 -0000 Subject: [Beowulf] Pxe boot over infiniband In-Reply-To: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> References: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F31D053@milexchmb1.mil.tagmclarengroup.com> > > When googling for "pxe boot over infiniband" it seems this is not > possible > for all types of hardware. Is this correct? And if so, are there other > solutions (except connecting all nodes to a gigabit switch as well)? > Go back to basics - get a keyboard/monitor and a USB CD drive. Spend a fun day on your knees being roasted and deafened at the same time installing by hand from DVD. Or get a long, long patch lead from your install server and spend several happy hours running back and forth. Seriously though - an alternate approach might be to use the functionality in BMC cards to present a virtual floppy/CD/DVD drive and boot from that, either booting a minimal ramdisk with infiniband support, or just install from the virtual DVD. I've never done this, mind... Then again this means running network cables for your BMC/IPMI cards... The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gus at ldeo.columbia.edu Thu Feb 4 08:35:37 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 04 Feb 2010 11:35:37 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> Message-ID: <4B6AF759.9030208@ldeo.columbia.edu> Hi Christian If you already set up passwordless ssh across the nodes OpenMPI will probably get you up and running faster than MPICH2. OpenMPI is very easy to install, say, with gcc, g++, and gfortran (make sure you have them installed on your main machine, use Ubuntu apt-get, if you don't have them). Just configure with something like this: ./configure --prefix=/your/OpenMPI/install/dir CC=gcc CXX=g++ F77=gfortran FC=gfortran You can add optimization flags if you want, but their defaults are good. This will give you the C,C++,Fortran 77, and Fortran 90 MPI bindings. You can install OpenMPI a NFS mounted directory, and set your PATH LD_LIBRARY_PATH, and MANPATH on your .bashrc/.cshrc file to point also to the OpenMPI sub-directories. This way you do a single install, no need to install on the other computer nodes also. A more laborious alternative is to install on all nodes. For details check the README file and their FAQ: http://www.open-mpi.org/faq/ I prefer OpenMPI because it is easier to handle and is more flexible, but that is a matter of personal taste and needs. I hope this helps. Gus Correa PS - If this is a Beowulf discussion, please Cc. your messages to Bewoulf . christian suhendra wrote: > yes..the input is on the txt file..matrix A an matrix B > i didn't send you the file because its to large. and i can't attached it > from my PC.. > > > oh thanks.. > i will try to use mpich2 or openmpi.. > what do you prefer of this mpich2 or openmpi?? > i mean for easy configuration,,,.. > thank you very much Mr.gus > > > > regards > christian From hahn at mcmaster.ca Thu Feb 4 09:27:18 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 4 Feb 2010 12:27:18 -0500 (EST) Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: <4B6A387D.4000900@ldeo.columbia.edu> References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> Message-ID: > In MPICH2 you need to establish passwordless ssh (not rsh!) > connection across your machines. it should be said that mpich2 doesn't strictly require this. all mpi flavors can operate with other spawning methods - some have hooks to be spawned by the resource manager, for instance, which bypasses ssh/rsh type access. > In OpenMPI you also need to establish passwordless ssh connection same for OpenMPI: doesn't actually require passwordless ssh. but if you do want passwordless ssh, IMO the only sane solution is to configure hostbased trust. having an unencrypted private key in your home directory is hideous (moral equivalent of putting your password in a file, in the clear...) From Shainer at mellanox.com Thu Feb 4 09:56:38 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Thu, 4 Feb 2010 09:56:38 -0800 Subject: [Beowulf] Pxe boot over infiniband References: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F025D874C@mtiexch01.mti.com> There is a solution that covers at least all the Mellanox InfiniBand adapters. It is called FlexBoot and it is on the Mellanox web site - http://www.mellanox.com/content/pages.php?pg=products_dyn&product_family =34&menu_section=34#tab-two The Mellanox devices supported are: * ConnectX(r) / ConnectX(r)-2 * InfiniHost(r) III Ex * InfiniHost(r) III Lx Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of tegner at renget.se Sent: Thursday, February 04, 2010 7:18 AM To: beowulf at beowulf.org Subject: [Beowulf] Pxe boot over infiniband When googling for "pxe boot over infiniband" it seems this is not possible for all types of hardware. Is this correct? And if so, are there other solutions (except connecting all nodes to a gigabit switch as well)? Regards, /jon _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From mathog at caltech.edu Thu Feb 4 10:04:34 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 04 Feb 2010 10:04:34 -0800 Subject: [Beowulf] problem of mpich-1.2.7p1 Message-ID: Gus Correa wrote > If you already set up passwordless ssh across the nodes > OpenMPI will probably get you up and running faster than MPICH2. > > OpenMPI is very easy to install, say, with gcc, g++, and > gfortran (make sure you have them installed on your main machine, > use Ubuntu apt-get, if you don't have them). Well on Linux maybe, but since OpenMPI has been soundly kicking my butt trying to get it installed and working on a Solaris 5.8 Sparc system for the last day, I can't let that slide as a general statement. OpenMPI 1.4.1 needed a few minor code mods to build at all using gcc on this system (it expects some defines that aren't present, this is with the sunfreeware gcc versions), and those mods were just about counting CPUs, which wasn't an issue in this case because it is a single CPU system. These same issues were also reported by another fellow for 1.3.1 on a Solaris 8 system: http://www.open-mpi.org/community/lists/users/2009/02/7994.php The gcc version works so long as mpirun only sends jobs to itself. Sadly, try to send ANYTHING to a remote machine (linux Intel, in case that matters) and it treats one to: mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor This on a build with no warnings or errors. Definitely a problem on the Solaris side, since any of the linux machines can initiate an mpirun to another node, or all other nodes, that works with the example programs. So with gcc, OpenMPI not too useful for the front end of an MPI cluster. Today I'm trying again using Sun's Forte 7 tools, which requires a fairly complex configure line: ./configure --with-sge --prefix=/opt/ompi141 CFLAGS="-xarch=v8plusa" CXXFLAGS="-xarch=v8plusa" FFLAGS="-xarch=v8plusa" FCFLAGS="-xarch=v8plusa" CC=/opt/SUNWspro/bin/cc CXX=/opt/SUNWspro/bin/CC F77=/opt/SUNWspro/bin/f77 FC=/opt/SUNWspro/bin/f95 CCAS=/opt/SUNWspro/bin/cc CCASFLAGS="-xarch=v8plusa" >configure_4.log 2>&1 & Not sure yet if that is sufficient, as none of the preceding configure variants resulted in a set of Makefiles which would actually run to completion, and this one is still building. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From dnlombar at ichips.intel.com Thu Feb 4 10:36:12 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 4 Feb 2010 10:36:12 -0800 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> Message-ID: <20100204183612.GA18535@nlxdcldnl2.cl.intel.com> On Thu, Feb 04, 2010 at 10:27:18AM -0700, Mark Hahn wrote: > > but if you do want passwordless ssh, IMO the only sane solution is to > configure hostbased trust. having an unencrypted private key in your > home directory is hideous (moral equivalent of putting your password > in a file, in the clear...) Completely agree that host-based passwordless SSH is the best approach, especially when jobs are submitted via a resource manager.. Also agree that an empty passphrase is a particularly bad approach. But, when done via ssh-agent, I don't see partiularly onerous security issues for a usage where you're manually launching jobs from an interactive session unless you have no faith in the system's integrity at all... -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From hahn at mcmaster.ca Thu Feb 4 10:55:11 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 4 Feb 2010 13:55:11 -0500 (EST) Subject: [Beowulf] Pxe boot over infiniband In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F31D053@milexchmb1.mil.tagmclarengroup.com> References: <91b02d46ce89bb91cd4ad96b2e434e2a.squirrel@webmail01.one.com> <68A57CCFD4005646957BD2D18E60667B0F31D053@milexchmb1.mil.tagmclarengroup.com> Message-ID: >> When googling for "pxe boot over infiniband" it seems this is not >> possible >> for all types of hardware. Is this correct? And if so, are there other >> solutions (except connecting all nodes to a gigabit switch as well)? > > Go back to basics - get a keyboard/monitor and a USB CD drive. well, maybe a usb flash stick. I think that's what I'd do, and it would be pretty cheap, reliable, fast, etc. the only thing on the flash stick would be to fetch and boot the usual kernel/initrd configuration, so the flash image would never need to change (and would be read-only, so presumably not prone to wear. might need kexec if syslinux/etc can't be persuaded to behave quite like this. > Then again this means running network cables for your BMC/IPMI cards... virtual media via BMC is a good idea, but sounds fragile and vendor-specific to me. From hahn at mcmaster.ca Thu Feb 4 11:10:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 4 Feb 2010 14:10:06 -0500 (EST) Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: <20100204183612.GA18535@nlxdcldnl2.cl.intel.com> References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> <20100204183612.GA18535@nlxdcldnl2.cl.intel.com> Message-ID: >> but if you do want passwordless ssh, IMO the only sane solution is to >> configure hostbased trust. having an unencrypted private key in your >> home directory is hideous (moral equivalent of putting your password >> in a file, in the clear...) > > Completely agree that host-based passwordless SSH is the best approach, > especially when jobs are submitted via a resource manager.. > > Also agree that an empty passphrase is a particularly bad approach. > > But, when done via ssh-agent, I don't see partiularly onerous security issues > for a usage where you're manually launching jobs from an interactive session > unless you have no faith in the system's integrity at all... absolutely. I spoke sloppily - I use agent-based PK logins myself, and only wanted to badmouth password and unencrypted PK logins. I think it's really important even for end-users to understand the basics of ssh: - first stage is mutual authentication of _machines_. this is what all that "hostkey of xxx has changed; maybe a hack!". once this is done, hosts have an encrypted channel between authentic hosts. - second stage is user PK authentication: the client is challenged to prove knowlege of the private key, which can happen by an un-encrypted private key in ~/.ssh, or by prompting the user for the passphrase to an encrypted privkey, or by interacting with ssh-agent. - finally, as a last resort, username/password can be used - basically the worst case security-wise: maximal exposure to clocal keyboard logging and remote daemon compromise. A QUESTION: how many clusters used/managed by people on this list mandate the use of PK login (ie, rule out passwords)? I know some do, but we haven't, figuring there would be an outcry (not to mention making our systems harder to use for the technically weaker users.) we've thought of providing users with a customized package of windows ssh client with a unique encrypted PK preinstalled. might work... if you think of threat models, it's interesting to note that if an sshable account is attacked through windows-based clients, keylogging is probably the more likley issue. if compromise is of clients on a *nix system, I'm guessing the main risk is unencrypted PKs in /home/*/.ssh. server-side compromise seems to usually be of the daemon, which simply logs password-based logins (not outgoing connections in the versions I've seen, and no compromise of ssh-agent to collect passphrase+key combos...) From gus at ldeo.columbia.edu Thu Feb 4 12:09:43 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 04 Feb 2010 15:09:43 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> Message-ID: <4B6B2987.2090407@ldeo.columbia.edu> Yes, Mark, you are right. Passwordless ssh is not a strict requirement, although it is a simple way to make things work. Yes, host based trust is better than unencrypted private keys. I don't even know if the cluster in question is connected to the Internet, though. My suggestions were directed to someone who is still using MPICH-1, claimed to have trouble with it, not to be familiar to clusters and MPI, and to have a pressing deadline. Therefore, I thought it would be more helpful to give him simple and focused suggestions, rather than a full gamut of possibilities. I guess it would be a great help to him if you post simple instructions (or a link) on how to setup passwordless ssh through host based trust. Regards, Gus Correa Mark Hahn wrote: >> In MPICH2 you need to establish passwordless ssh (not rsh!) >> connection across your machines. > > it should be said that mpich2 doesn't strictly require this. > all mpi flavors can operate with other spawning methods - some have > hooks to be spawned by the resource manager, for instance, which > bypasses ssh/rsh type access. > >> In OpenMPI you also need to establish passwordless ssh connection > > same for OpenMPI: doesn't actually require passwordless ssh. > > but if you do want passwordless ssh, IMO the only sane solution is to > configure hostbased trust. having an unencrypted private key in your > home directory is hideous (moral equivalent of putting your password in > a file, in the clear...) > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gus at ldeo.columbia.edu Thu Feb 4 12:30:48 2010 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 04 Feb 2010 15:30:48 -0500 Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: References: Message-ID: <4B6B2E78.1080404@ldeo.columbia.edu> Hi David Sorry to hear that OpenMPI is a troublemaker in your Open Solaris machines. Have you questioned the OpenMPI list about that? I installed OpenMPI on Linux (Fedora, CentOS) without problems, on clusters with Infiniband, Gigabit Ethernet, old Ethernet 100, and on standalone machines, using Gnu, PGI, and Intel compilers. In my experience it installs easily and works well on Linux. Among other reasons, I recommended it to the person who asked for help because he said his is a Linux cluster (Ubuntu). We also have and use MPICH2 and MVAPICH2 here, though. Gus Correa David Mathog wrote: > Gus Correa wrote >> If you already set up passwordless ssh across the nodes >> OpenMPI will probably get you up and running faster than MPICH2. >> >> OpenMPI is very easy to install, say, with gcc, g++, and >> gfortran (make sure you have them installed on your main machine, >> use Ubuntu apt-get, if you don't have them). > > Well on Linux maybe, but since OpenMPI has been soundly kicking my butt > trying to get it installed and working on a Solaris 5.8 Sparc system for > the last day, I can't let that slide as a general statement. > > OpenMPI 1.4.1 needed a few minor code mods to build at all using gcc on > this system (it expects some defines that aren't present, this is with > the sunfreeware gcc versions), and those mods were just about counting > CPUs, which wasn't an issue in this case because it is a single CPU > system. These same issues were also reported by another fellow for 1.3.1 > on a Solaris 8 system: > > http://www.open-mpi.org/community/lists/users/2009/02/7994.php > > The gcc version works so long as mpirun only sends jobs to itself. > Sadly, try to send ANYTHING to a remote machine (linux Intel, in case > that matters) and it treats one to: > > mca_oob_tcp_msg_send_handler: writev failed: Bad file descriptor > > This on a build with no warnings or errors. Definitely a problem on the > Solaris side, since any of the linux machines can initiate an mpirun to > another node, or all other nodes, that works with the example programs. > So with gcc, OpenMPI not too useful for the front end of an MPI cluster. > > Today I'm trying again using Sun's Forte 7 tools, which requires a > fairly complex configure line: > > ./configure --with-sge --prefix=/opt/ompi141 CFLAGS="-xarch=v8plusa" > CXXFLAGS="-xarch=v8plusa" FFLAGS="-xarch=v8plusa" > FCFLAGS="-xarch=v8plusa" CC=/opt/SUNWspro/bin/cc > CXX=/opt/SUNWspro/bin/CC F77=/opt/SUNWspro/bin/f77 > FC=/opt/SUNWspro/bin/f95 CCAS=/opt/SUNWspro/bin/cc > CCASFLAGS="-xarch=v8plusa" >configure_4.log 2>&1 & > > Not sure yet if that is sufficient, as none of the preceding configure > variants resulted in a set of Makefiles which would actually run to > completion, and this one is still building. > > Regards, > > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hahn at mcmaster.ca Thu Feb 4 12:43:35 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 4 Feb 2010 15:43:35 -0500 (EST) Subject: [Beowulf] problem of mpich-1.2.7p1 In-Reply-To: <4B6B2987.2090407@ldeo.columbia.edu> References: <4B68D5D7.1010907@ldeo.columbia.edu> <4B68D829.4010604@ldeo.columbia.edu> <4B68E015.6040202@ldeo.columbia.edu> <4B69C83D.2020301@ldeo.columbia.edu> <4B6A387D.4000900@ldeo.columbia.edu> <4B6B2987.2090407@ldeo.columbia.edu> Message-ID: > simple instructions (or a link) on how to setup passwordless ssh > through host based trust. it's fairly simple. hosts need to know each other (ie, host keys in /etc/ssh/ssh_known_hosts), and each machine needs a list of trusted hosts in /etc/ssh/shosts.equiv. target machines need sshd_config to contain "HostbasedAuthentication yes". source machines need ssh_config to contain "EnableSSHKeysign yes" (I don't remember whether clients can do this via "ssh -oEnableSSHKeysign=yes" or not.) one nice thing about hostbased trust is that it can (and probably should be) asymmetric. to be useful, compute nodes probably need to trust admin and/or login nodes, but your login node doesn't have to trust compute nodes. of course, you should never use this for machines you don't, well, "trust" (such as random client machines outside your admin control...) unencrypted public keys are very easy, and they work - the problem is that it's like putting your password into a file called ".hacker.please.take" ;) From mdidomenico4 at gmail.com Thu Feb 4 14:11:52 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Thu, 4 Feb 2010 17:11:52 -0500 Subject: [Beowulf] 48/96 disk jbods? Message-ID: Can anyone recommend a decent 48 or 96 (if they exist) disk jbod on the market? I don't need the everything that goes along with a raid system or anything that's network attached. I just need to hang a lot of storage from a machine, sas/sata/fc doesn't matter I found this one on the web, but i'm curious if anyone knows of any others that might be out there. http://www.xtore-es.com/downloads/datasheet/XJ2000%20SAS_PUB-00036-B_Final.pdf thanks From landman at scalableinformatics.com Thu Feb 4 14:30:25 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 04 Feb 2010 17:30:25 -0500 Subject: [Beowulf] 48/96 disk jbods? In-Reply-To: References: Message-ID: <4B6B4A81.5020008@scalableinformatics.com> Michael Di Domenico wrote: > Can anyone recommend a decent 48 or 96 (if they exist) disk jbod on > the market? I don't need the everything that goes along with a raid > system or anything that's network attached. I just need to hang a lot > of storage from a machine, sas/sata/fc doesn't matter Self built or pre-built? If the latter, you can use our delta V units as this: http://www.scalableinformatics.com/delta-v . JBOD over iSCSI or similar. We software RAID them by default, though there is no reason we couldn't have many iSCSI targets over 10GbE/IB/ethernet. You don't see it there, but there is a DV5 ... 48 bay top load. The unit you pointed to has the unfortunate problem of requiring you take two devices out to get at one failed drive. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From farmantrout at burnsmcd.com Fri Feb 5 05:06:27 2010 From: farmantrout at burnsmcd.com (Armantrout, Fred) Date: Fri, 5 Feb 2010 07:06:27 -0600 Subject: [Beowulf] 48/96 disk jbods? In-Reply-To: <4B6B4A81.5020008@scalableinformatics.com> References: <4B6B4A81.5020008@scalableinformatics.com> Message-ID: <7F611EB6D6C2064883F59190F87FFD620BF58E58EE@BMCDMAIL01.burnsmcd.com> I have seen a large drive setup from HP called HP StorageWorks 600 Modular Disk System Info includes 5U rackmount form factor Supports seventy 3.5" LFF Universal hot pluggable SAS or SATA drives Two pull-out drive drawers support hot plug large form factor dual-ported SAS or archival-class SATA drives in just 5U of rack space (35 hot plug drives per drawer) -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman Sent: Thursday, February 04, 2010 4:30 PM To: Michael Di Domenico Cc: Beowulf Mailing List Subject: Re: [Beowulf] 48/96 disk jbods? Michael Di Domenico wrote: > Can anyone recommend a decent 48 or 96 (if they exist) disk jbod on > the market? I don't need the everything that goes along with a raid > system or anything that's network attached. I just need to hang a lot > of storage from a machine, sas/sata/fc doesn't matter Self built or pre-built? If the latter, you can use our delta V units as this: http://www.scalableinformatics.com/delta-v . JBOD over iSCSI or similar. We software RAID them by default, though there is no reason we couldn't have many iSCSI targets over 10GbE/IB/ethernet. You don't see it there, but there is a DV5 ... 48 bay top load. The unit you pointed to has the unfortunate problem of requiring you take two devices out to get at one failed drive. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From hearnsj at googlemail.com Sat Feb 6 02:55:54 2010 From: hearnsj at googlemail.com (John Hearns) Date: Sat, 6 Feb 2010 10:55:54 +0000 Subject: [Beowulf] Cisco OTV Message-ID: <9f8092cc1002060255t4d130c60sf3d2ffee7755d10@mail.gmail.com> http://www.theregister.co.uk/2010/02/05/cisco_report_otv_nexus_7000/ This looks interesting from the point of view of clusters in a cloud computing environment. Then again, your application is going to see much greater latencies if half your virtual machines get shifted to another data centre! (Having said that, AFAIK Amazon has an option where you can specify all the machins you rent are spatially together). From hearnsj at googlemail.com Sat Feb 6 13:03:32 2010 From: hearnsj at googlemail.com (John Hearns) Date: Sat, 6 Feb 2010 21:03:32 +0000 Subject: [Beowulf] Low cost IB cards Message-ID: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> There was discussion on this list on lwo cost IB cards. Can someone remind me of the vendor? Similarly for switches. From brockp at umich.edu Sat Feb 6 13:45:15 2010 From: brockp at umich.edu (Brock Palen) Date: Sat, 6 Feb 2010 16:45:15 -0500 Subject: [Beowulf] Low cost IB cards In-Reply-To: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> References: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> Message-ID: <00DF270B-AF99-4017-A07F-6EEDE752B77A@umich.edu> Colfax Direct: http://www.colfaxdirect.com/store/pc/home.asp Though had cases where switches with 'integrated subnet manager' really means 'no subnet manager' Brock Palen www.umich.edu/~brockp Center for Advanced Computing brockp at umich.edu (734)936-1985 On Feb 6, 2010, at 4:03 PM, John Hearns wrote: > There was discussion on this list on lwo cost IB cards. > Can someone remind me of the vendor? Similarly for switches. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From Michael.Frese at NumerEx-LLC.com Sat Feb 6 16:13:18 2010 From: Michael.Frese at NumerEx-LLC.com (Michael H. Frese) Date: Sat, 06 Feb 2010 17:13:18 -0700 Subject: [Beowulf] Low cost IB cards In-Reply-To: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.co m> References: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> Message-ID: <6.2.5.6.2.20100206170128.06d5db38@NumerEx-LLC.com> Be sure your OS -- as in "cat /etc/issue" -- has all the bells and whistles for IB. We bought those 4X SDR cards and an 8 port switch and tried and tried and tried for over a year off and on and finally failed on various flavors of Fedora to bring up OFA/OFED on IB. We finally succeeded with CentOS 5.3 or 4. Then we brought up CentOS on a mixed cluster and now its NFS is failing miserably to communicate with some of the older OS's, on which problem you hear more later.... Ugh! There didn't seem to be any hardware problems, though. Mike At 02:03 PM 2/6/2010, you wrote: >There was discussion on this list on lwo cost IB cards. >Can someone remind me of the vendor? Similarly for switches. >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From sabujp at gmail.com Sat Feb 6 16:23:13 2010 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Sat, 6 Feb 2010 18:23:13 -0600 Subject: [Beowulf] Low cost IB cards In-Reply-To: <00DF270B-AF99-4017-A07F-6EEDE752B77A@umich.edu> References: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> <00DF270B-AF99-4017-A07F-6EEDE752B77A@umich.edu> Message-ID: Dell. We got a Mellanox 36 port unmanaged infiniscale iv QDR IB switch with a single psu for ~$4.8k and an extra PSU for $640. We looked at all the turnkey vendors. Not a single one could compete with their prices. On Sat, Feb 6, 2010 at 3:45 PM, Brock Palen wrote: > Colfax Direct: > http://www.colfaxdirect.com/store/pc/home.asp From landman at scalableinformatics.com Sat Feb 6 16:46:47 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 06 Feb 2010 19:46:47 -0500 Subject: [Beowulf] Low cost IB cards In-Reply-To: <6.2.5.6.2.20100206170128.06d5db38@NumerEx-LLC.com> References: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> <6.2.5.6.2.20100206170128.06d5db38@NumerEx-LLC.com> Message-ID: <4B6E0D77.8000103@scalableinformatics.com> Michael H. Frese wrote: > Be sure your OS -- as in "cat /etc/issue" -- has all the bells and > whistles for IB. We bought those 4X SDR cards and an 8 port switch and > tried and tried and tried for over a year off and on and finally failed > on various flavors of Fedora to bring up OFA/OFED on IB. We finally Last I checked, OFED isn't supported by Fedora. Understand, Fedora *is* a rapidly moving target. I don't have anything against it, I just used it to replace a failing Ubuntu 9.10 load on a home franken-machine. But I know that lots of things won't work on it. Right now I am struggling with their ideology ... nouveau versus Nvidia. I want to remove the former and install the latter. And it ain't easy. OFED is picky about its kernels. No matter which distro you use, kernels matter. > succeeded with CentOS 5.3 or 4. Then we brought up CentOS on a mixed > cluster and now its NFS is failing miserably to communicate with some of > the older OS's, on which problem you hear more later.... Older RHEL kernels that Centos are based upon aren't terribly good at NFS. > > Ugh! > > There didn't seem to be any hardware problems, though. We typically build and install our own kernels. Its hard to stabilize RHEL kernels at high data rates from IO systems (and networks, but thats another story). This said, there are lots of crappy hardware bits out there. IB tends to be pretty good. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Sat Feb 6 18:41:22 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 6 Feb 2010 21:41:22 -0500 (EST) Subject: [Beowulf] Low cost IB cards In-Reply-To: <4B6E0D77.8000103@scalableinformatics.com> References: <9f8092cc1002061303u3c510a9al3296e65dc7ede011@mail.gmail.com> <6.2.5.6.2.20100206170128.06d5db38@NumerEx-LLC.com> <4B6E0D77.8000103@scalableinformatics.com> Message-ID: > Last I checked, OFED isn't supported by Fedora. Understand, Fedora *is* a > rapidly moving target. fedora is great for desktops and other less "entangled" machines; centos is more appropriate for the latter. I have no opinion about debian derivatives except that they seem redundant and often come with unwelcome and unwarranted attitude... > work on it. Right now I am struggling with their ideology ... nouveau versus > Nvidia. I want to remove the former and install the latter. And it ain't > easy. it _is_ easy with akmods. until vendors do what it takes for open-source drivers, getting along with binary blobs is good. as far as I can see, akmods are the right way to do it: just recompile the shim when necessary. regards, mark hahn. From gerry.creager at tamu.edu Mon Feb 8 13:05:52 2010 From: gerry.creager at tamu.edu (Gerald Creager) Date: Mon, 08 Feb 2010 15:05:52 -0600 Subject: [Beowulf] UPS signaling scripts? Message-ID: <4B707CB0.1000808@tamu.edu> Looking for a usable script that will allow me to listen to an APC data center UPS and power down a cluster when we go to UPS power. Anyone got a solution? My last experience with PowerChute was pretty sad, I'm afraid. gerry -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From orion at cora.nwra.com Mon Feb 8 13:28:17 2010 From: orion at cora.nwra.com (Orion Poplawski) Date: Mon, 08 Feb 2010 14:28:17 -0700 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B707CB0.1000808@tamu.edu> References: <4B707CB0.1000808@tamu.edu> Message-ID: <4B7081F1.5050803@cora.nwra.com> On 2/8/2010 2:05 PM, Gerald Creager wrote: > Looking for a usable script that will allow me to listen to an APC > data center UPS and power down a cluster when we go to UPS power. > Anyone got a solution? My last experience with PowerChute was pretty > sad, I'm afraid. > > gerry Have you tried apcupsd http://www.apcupsd.com/ ? From dnlombar at ichips.intel.com Mon Feb 8 13:32:44 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Mon, 8 Feb 2010 13:32:44 -0800 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B707CB0.1000808@tamu.edu> References: <4B707CB0.1000808@tamu.edu> Message-ID: <20100208213244.GB8320@nlxcldnl2.cl.intel.com> On Mon, Feb 08, 2010 at 01:05:52PM -0800, Gerald Creager wrote: > Looking for a usable script that will allow me to listen to an APC data > center UPS and power down a cluster when we go to UPS power. Anyone got > a solution? My last experience with PowerChute was pretty sad, I'm afraid. $ yum info apcupsd Loaded plugins: refresh-packagekit Available Packages Name : apcupsd Arch : x86_64 Version : 3.14.8 Release : 1.fc12 Size : 283 k Repo : updates Summary : APC UPS Power Control Daemon for Linux URL : http://www.apcupsd.com License : GPLv2 Description: Apcupsd can be used for controlling most APC UPSes. During a : power failure, apcupsd will inform the users about the power : failure and that a shutdown may occur. If power is not restored, : a system shutdown will follow when the battery is exausted, a : timeout (seconds) expires, or the battery runtime expires based : on internal APC calculations determined by power consumption : rates. If the power is restored before one of the above shutdown : conditions is met, apcupsd will inform users about this fact. : Some features depend on what UPS model you have (simple or smart). $ -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From brs at usf.edu Mon Feb 8 13:43:09 2010 From: brs at usf.edu (Brian Smith) Date: Mon, 08 Feb 2010 16:43:09 -0500 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B707CB0.1000808@tamu.edu> References: <4B707CB0.1000808@tamu.edu> Message-ID: <1265665389.2478.26.camel@localhost.localdomain> http://www.networkupstools.org/ This worked for me back in the day. Otherwise, bash+snmpget+cron is the alternative. -- Brian Smith Senior Systems Administrator IT Research Computing, University of South Florida 4202 E. Fowler Ave. ENB204 Office Phone: +1 813 974-1467 Organization URL: http://rc.usf.edu On Mon, 2010-02-08 at 15:05 -0600, Gerald Creager wrote: > Looking for a usable script that will allow me to listen to an APC data > center UPS and power down a cluster when we go to UPS power. Anyone got > a solution? My last experience with PowerChute was pretty sad, I'm afraid. > > gerry From smulcahy at atlanticlinux.ie Mon Feb 8 13:57:58 2010 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Mon, 08 Feb 2010 21:57:58 +0000 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B707CB0.1000808@tamu.edu> References: <4B707CB0.1000808@tamu.edu> Message-ID: <4B7088E6.1060905@atlanticlinux.ie> On 08/02/2010 21:05, Gerald Creager wrote: > Looking for a usable script that will allow me to listen to an APC data > center UPS and power down a cluster when we go to UPS power. Anyone got > a solution? My last experience with PowerChute was pretty sad, I'm afraid. > > gerry I've been using apcupsd for a few months and have found it to work well - I think I tested the shutdown part once during initial install, thankfully haven't needed it since. -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From prentice at ias.edu Mon Feb 8 14:07:59 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Mon, 08 Feb 2010 17:07:59 -0500 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B7088E6.1060905@atlanticlinux.ie> References: <4B707CB0.1000808@tamu.edu> <4B7088E6.1060905@atlanticlinux.ie> Message-ID: <4B708B3F.3050800@ias.edu> stephen mulcahy wrote: > On 08/02/2010 21:05, Gerald Creager wrote: >> Looking for a usable script that will allow me to listen to an APC data >> center UPS and power down a cluster when we go to UPS power. Anyone got >> a solution? My last experience with PowerChute was pretty sad, I'm >> afraid. >> >> gerry > > I've been using apcupsd for a few months and have found it to work well > - I think I tested the shutdown part once during initial install, > thankfully haven't needed it since. > > -stephen > I've also had excellent experience with apcupsd in the past. (I'm not using it now, since we're not using APC UPSes here). -- Prentice From henning.fehrmann at aei.mpg.de Mon Feb 8 22:14:14 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Tue, 9 Feb 2010 07:14:14 +0100 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <4B707CB0.1000808@tamu.edu> References: <4B707CB0.1000808@tamu.edu> Message-ID: <20100209061414.GA3585@gretchen.aei.mpg.de> Hi Gerry, On Mon, Feb 08, 2010 at 03:05:52PM -0600, Gerald Creager wrote: > Looking for a usable script that will allow me to listen to an APC > data center UPS and power down a cluster when we go to UPS power. > Anyone got a solution? My last experience with PowerChute was pretty > sad, I'm afraid. There is NUT - Network UPS Tools. It works fine for at least ~ 1700 clients. http://www.networkupstools.org/ The NUT server provides the information whether a UPS is running on battery and how much battery charge is approximately left. The clients get the information and start scripts - e.g. a shutdown routine. Cheers, Henning From forum.san at gmail.com Mon Feb 8 23:12:49 2010 From: forum.san at gmail.com (Sangamesh B) Date: Tue, 9 Feb 2010 12:42:49 +0530 Subject: [Beowulf] UPS signaling scripts? In-Reply-To: <20100209061414.GA3585@gretchen.aei.mpg.de> References: <4B707CB0.1000808@tamu.edu> <20100209061414.GA3585@gretchen.aei.mpg.de> Message-ID: Hi Gerry, What problem are you facing with APC Powerchute software? Because we also have started using this. As of now its working fine. Thank you, Sangamesh On Tue, Feb 9, 2010 at 11:44 AM, Henning Fehrmann < henning.fehrmann at aei.mpg.de> wrote: > Hi Gerry, > > > On Mon, Feb 08, 2010 at 03:05:52PM -0600, Gerald Creager wrote: > > Looking for a usable script that will allow me to listen to an APC > > data center UPS and power down a cluster when we go to UPS power. > > Anyone got a solution? My last experience with PowerChute was pretty > > sad, I'm afraid. > > There is NUT - Network UPS Tools. It works fine for at least > ~ 1700 clients. > > http://www.networkupstools.org/ > > The NUT server provides the information whether a UPS is running on > battery and how much battery charge is approximately left. > The clients get the information and start scripts - e.g. a shutdown > routine. > > Cheers, > Henning > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Thu Feb 11 13:51:24 2010 From: mathog at caltech.edu (David Mathog) Date: Thu, 11 Feb 2010 13:51:24 -0800 Subject: [Beowulf] Thinking about going used Message-ID: There are a lot of rack mount servers showing up on ebay and elsewhere after they come out of service at data centers after a few years. These might not cut it for the cutting edge folks, but would be fine for us to replace our existing (ancient) cluster. Any suggestions for models to look at for a good price/performance point for say 2-4 cores in the box, 1 or 2 SATA disks, and at least one 1000baseT interface, and a good reliability history? (The disks would probably be bought separately as a lot of these are sold diskless, and new disks aren't that expensive.) For instance, there have recently been a lot of Arima/Rioworks HDAMA based dual dual core Opteron systems advertised. Possibly because they are old enough that nobody wants to buy them ;-). Those would be fast enough for us, but the SATA controller seems to be just for RAID usage. Might work though if these support "raid 0" on a single disk. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From gerry.creager at tamu.edu Thu Feb 11 14:52:39 2010 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 11 Feb 2010 16:52:39 -0600 Subject: [Beowulf] Thinking about going used In-Reply-To: References: Message-ID: <4B748A37.6060005@tamu.edu> I've been getting Dell 1425's for <$100 and adding $150 worth of memory to make 'em a usable small server. I think they're OK for low-end compute nodes, but watching the list there are other machines more suitable. gerry David Mathog wrote: > There are a lot of rack mount servers showing up on ebay and elsewhere > after they come out of service at data centers after a few years. These > might not cut it for the cutting edge folks, but would be fine for us to > replace our existing (ancient) cluster. > > Any suggestions for models to look at for a good price/performance point > for say 2-4 cores in the box, 1 or 2 SATA disks, and at least one > 1000baseT interface, and a good reliability history? (The disks would > probably be bought separately as a lot of these are sold diskless, and > new disks aren't that expensive.) > > For instance, there have recently been a lot of Arima/Rioworks HDAMA > based dual dual core Opteron systems advertised. Possibly because they > are old enough that nobody wants to buy them ;-). Those would be fast > enough for us, but the SATA controller seems to be just for RAID usage. > Might work though if these support "raid 0" on a single disk. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From james.p.lux at jpl.nasa.gov Thu Feb 11 16:18:55 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Thu, 11 Feb 2010 16:18:55 -0800 Subject: [Beowulf] Thinking about going used In-Reply-To: <4B748A37.6060005@tamu.edu> References: <4B748A37.6060005@tamu.edu> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Gerald Creager > Sent: Thursday, February 11, 2010 2:53 PM > To: David Mathog > Cc: beowulf at beowulf.org > Subject: Re: [Beowulf] Thinking about going used > > I've been getting Dell 1425's for <$100 and adding $150 worth of memory > to make 'em a usable small server. I think they're OK for low-end > compute nodes, but watching the list there are other machines more suitable. > > gerry > At first, I thought you were referring to the inspiron 1425s, which are a laptop of sorts.. then I found the SC1425. For a demo cluster or fooling around, at $200-250/each (4GB RAM, 80GB disk, etc.) this kind of thing seems pretty attractive. Makes the under $2K non-trivial cluster doable. From rpnabar at gmail.com Thu Feb 11 23:41:36 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 12 Feb 2010 01:41:36 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? Message-ID: I came across this very interesting thread on a related mailing list that I thought would be quite relevant to those of us using Dell hardware. Apparently Dell has started hardware-blocking hard-drives that are not "Dell certified". http://lists.us.dell.com/pipermail/linux-poweredge/2010-February/041274.html Don't get mad at me, Dell; I believe the larger public interest outweighs any reluctance on my part to name vendor names. Conflict of interest statement: I've been very annoyed by Dell tech-support last week. I'm prejudiced. -- Rahul From rpnabar at gmail.com Fri Feb 12 00:06:31 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 12 Feb 2010 02:06:31 -0600 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: On Fri, Feb 12, 2010 at 1:41 AM, Rahul Nabar wrote: > I came across this very interesting thread on a related mailing ?list > that I thought would be quite relevant to those of us using Dell > hardware. Apparently Dell has started hardware-blocking hard-drives > that are not "Dell certified". I guess I should qualify the original statement in the interest of fairness: So far only with the Gen11 Dell servers. R710 etc.with H700 and a couple of other PERC RAID controllers seem to be the ones that have this policy. So, it seems only their new and cutting edge hardware has this. I still feel this isn't the right approach. -- Rahul From kilian.cavalotti.work at gmail.com Fri Feb 12 00:51:09 2010 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Fri, 12 Feb 2010 09:51:09 +0100 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: On Fri, Feb 12, 2010 at 9:06 AM, Rahul Nabar wrote: > On Fri, Feb 12, 2010 at 1:41 AM, Rahul Nabar wrote: >> I came across this very interesting thread on a related mailing ?list >> that I thought would be quite relevant to those of us using Dell >> hardware. Apparently Dell has started hardware-blocking hard-drives >> that are not "Dell certified". http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ Cheers, -- Kilian From smulcahy at atlanticlinux.ie Fri Feb 12 02:21:21 2010 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Fri, 12 Feb 2010 10:21:21 +0000 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: <4B752BA1.2040901@atlanticlinux.ie> Kilian CAVALOTTI wrote: > On Fri, Feb 12, 2010 at 9:06 AM, Rahul Nabar wrote: >> On Fri, Feb 12, 2010 at 1:41 AM, Rahul Nabar wrote: >>> I came across this very interesting thread on a related mailing list >>> that I thought would be quite relevant to those of us using Dell >>> hardware. Apparently Dell has started hardware-blocking hard-drives >>> that are not "Dell certified". > > http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ > > Cheers, I purchased some cheap Dell servers a few years ago Poweredge 1600SC's I think. I went to upgrade their CDROM drives with some DVDROM drives we had spare and the systems refused to boot. When I phoned Dell support they told me the servers wouldn't operate with non-Dell drives so I don't think this is a new policy - but maybe they had different policies for different servers. Anyways, I haven't had much interest in Dell servers since then - I'm ok with your support entitlement being degraded if the BIOS detects non-qualified parts in the system but refusing to operate a commodity PC with other commodity PC components isn't what I want from my system vendor. -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From reuti at staff.uni-marburg.de Fri Feb 12 02:21:43 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Fri, 12 Feb 2010 11:21:43 +0100 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> Hi, Am 12.02.2010 um 09:51 schrieb Kilian CAVALOTTI: > On Fri, Feb 12, 2010 at 9:06 AM, Rahul Nabar > wrote: >> On Fri, Feb 12, 2010 at 1:41 AM, Rahul Nabar >> wrote: >>> I came across this very interesting thread on a related mailing >>> list >>> that I thought would be quite relevant to those of us using Dell >>> hardware. Apparently Dell has started hardware-blocking hard-drives >>> that are not "Dell certified". > > http://www.channelregister.co.uk/2010/02/10/ > dell_perc_11th_gen_qualified_hdds_only/ are they just blocking non-qualified drives, but you could still install qualified ones bought not from Dell? And: will their qualified ones work on other controllers? This sounds like ProStor prevents usage of other disks in RDX drives' cartridges (AFAIK they are doing this by ATA passwords*), and you are limited to their cartridges. And in emergency case you can't use the disks with other controllers due to this. -- Reuti * http://www.heise.de/foren/S-Re-Erfahrungen-Tandberg-RDX-QuikStor/ forum-7273/msg-16513128/read/ > Cheers, > -- > Kilian > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From gerry.creager at tamu.edu Fri Feb 12 08:52:03 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Fri, 12 Feb 2010 10:52:03 -0600 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> References: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> Message-ID: <4B758733.3040101@tamu.edu> We discussed this in our HPC group meeting yesterday. I've long been dissatisfied with PERC controllers, but this is now a show-stopper for me. I might order Dell, but not the PERC, ever again. Who do they think they are? NetApp? gerry Reuti wrote: > Hi, > > Am 12.02.2010 um 09:51 schrieb Kilian CAVALOTTI: > >> On Fri, Feb 12, 2010 at 9:06 AM, Rahul Nabar wrote: >>> On Fri, Feb 12, 2010 at 1:41 AM, Rahul Nabar wrote: >>>> I came across this very interesting thread on a related mailing list >>>> that I thought would be quite relevant to those of us using Dell >>>> hardware. Apparently Dell has started hardware-blocking hard-drives >>>> that are not "Dell certified". >> >> http://www.channelregister.co.uk/2010/02/10/dell_perc_11th_gen_qualified_hdds_only/ >> > > are they just blocking non-qualified drives, but you could still install > qualified ones bought not from Dell? And: will their qualified ones work > on other controllers? > > This sounds like ProStor prevents usage of other disks in RDX drives' > cartridges (AFAIK they are doing this by ATA passwords*), and you are > limited to their cartridges. And in emergency case you can't use the > disks with other controllers due to this. > > -- Reuti > > * > http://www.heise.de/foren/S-Re-Erfahrungen-Tandberg-RDX-QuikStor/forum-7273/msg-16513128/read/ > > > >> Cheers, >> -- >> Kilian From rpnabar at gmail.com Fri Feb 12 09:36:14 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 12 Feb 2010 11:36:14 -0600 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: <4B758733.3040101@tamu.edu> References: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> <4B758733.3040101@tamu.edu> Message-ID: On Fri, Feb 12, 2010 at 10:52 AM, Gerry Creager wrote: > We discussed this in our HPC group meeting yesterday. I've long been > dissatisfied with PERC controllers, but this is now a show-stopper for me. I > might order Dell, but not the PERC, ever again. ?Who do they think they are? > NetApp? Don't you *have* to use the PERC? Will the HDDs talk with other controllers? Can I hear some more about your PERC dissatisfaction, Gerry? I just bought a few and might be better knowing what I'm up against! -- Rahul From john.hearns at mclaren.com Fri Feb 12 09:56:56 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Fri, 12 Feb 2010 17:56:56 -0000 Subject: [Beowulf] Register survey on HPC Message-ID: <68A57CCFD4005646957BD2D18E60667B0F4FDB76@milexchmb1.mil.tagmclarengroup.com> http://www.theregister.co.uk/2010/02/12/hpc_for_the_masses/ It's as close as the IT industry will ever get to "2 Fast, 2 Furious" - gangs of highly technical experts pushing their custom-built computers to the limit with an aim to win that ultimate prize, a place in the world supercomputing rankings. That's me that is! I'm suddenly young and happening. Yayyyy..... Challenge yer all to a race - 11pm tonight, car park behind the Mall. How many Tflops you got under the hood then? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From lindahl at pbm.com Fri Feb 12 10:12:24 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 12 Feb 2010 10:12:24 -0800 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: <4B758733.3040101@tamu.edu> References: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> <4B758733.3040101@tamu.edu> Message-ID: <20100212181224.GC5965@bx9.net> On Fri, Feb 12, 2010 at 10:52:03AM -0600, Gerry Creager wrote: > Who do they think they are? NetApp? That's what you get for buying commodity stuff that's not marketed to you. If you have a business-critical database that needs a lot of 9's of uptime, and to never, ever lose data, it makes sense to only buy disks which are the exact hardware/firmware that's been tested with that controller. Note that this is what Dell considers a high end controller. If you're an HPC shop, you don't want to pay for that kind of reliability. I'm having a similar fun situation with HP. They don't sell any MLC SSDs, because they aren't reliable enough to run a traditional database that gets lots of writes. Well, I know I don't write that much. Also, their support organization no longer understands the concept of parts that wear out. I managed to talk them into selling me caddies for MLC SSDs, but only because it was the end of their fiscal year. And don't get me started about the fun of getting support when I'm using their "high end" P410 RAID card as a JBOD... -- greg From kuenching at gmail.com Thu Feb 11 10:43:12 2010 From: kuenching at gmail.com (Tsz Kuen Ching) Date: Thu, 11 Feb 2010 13:43:12 -0500 Subject: [Beowulf] PVM 3.4.5-12 terminates when adding Host on Ubuntu 9.10 Message-ID: Whenever I attempt to add a host in PVM it ends up terminating the process in the master program. The process does run in the slave node, however because the PVM terminates I do not get access to the node. I'm currently using Ubuntu 9.10, and I used apt-get to install pvm ( pvmlib, pvmdev, pvm). Thus $PVM_ROOT is set automatically, and so is $PVM_ARCH As for the other variables, I have not looked for them. I can ssh into the the slave without the need of a password. Any Ideas or suggestions? This is what happens: user at laptop> pvm pvm> add slave-slave add slave-slave Terminated user at laptop> ... The logs are as followed: Laptop log --- [t80040000] 02/11 10:23:32 laptop (127.0.1.1:55884) LINUX 3.4.5 [t80040000] 02/11 10:23:32 ready Thu Feb 11 10:23:32 2010 [t80040000] 02/11 10:23:32 netoutput() sendto: errno=22 [t80040000] 02/11 10:23:32 em=0x2c24f0 [t80040000] 02/11 10:23:32 [49/?][6e/?][76/?][61/?][6c/?][69/?][64/?][20/?][61/?][72/?] [t80040000] 02/11 10:23:32 netoutput() sendto: Invalid argument [t80040000] 02/11 10:23:32 pvmbailout(0) slave-log --- [t80080000] 02/11 10:23:25 slave-slave (xxx.x.x.xxx:57344) LINUX64 3.4.5 [t80080000] 02/11 10:23:25 ready Thu Feb 11 10:23:25 2010 [t80080000] 02/11 10:28:26 work() run = STARTUP, timed out waiting for master [t80080000] 02/11 10:28:26 pvmbailout(0) -------------- next part -------------- An HTML attachment was scrubbed... URL: From himikehawaii1 at yahoo.com Fri Feb 12 10:51:28 2010 From: himikehawaii1 at yahoo.com (MDG) Date: Fri, 12 Feb 2010 10:51:28 -0800 (PST) Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: Message-ID: <708266.89474.qm@web54104.mail.re2.yahoo.com> Well here is a small taste of problems with Dell and PERC controllers, I have tried to use Dell Perc Controllers all the way back to Perc 2 Quad-Channels, I needed new batteries for the back up memory I ordered using Dell's own part number they sent the wrong part, I returned, I sent them the ACTUAL BATTERY, THEY SENT ME ANOTHER BATTERY AGAIN WRONG, this after dozens of phone calls.? Will I ever buy another Dell product? Not on your life, when they cannot even get a battery right when they have the actual battery in hand that is very bad service! They may make a good product till you need service then forget it, and you can also forget stripping out working parts from a dead server for use as they match all products to the original computer case's serial number, so if you want to keep legacy systems running but need to make a change such as a new CPU box then Dell not only has hard-code/password protected the use of non-Dell Hard drive but also refuses to service Dell equipment, such as sale you the back-up battery, unless you happen to have the old serial case number and you better be the original buyer otherwise their database says you are not the customer and another rejection.? Assuming they can ever find it even when they have the old one in hand so it is impossible not to have the correct part number the tag is on it, as well as they have the part in their hand, as well as when they pull the alleged correct part from stock, after weeks of calls, the replacement they send the wrong one along with the old one and they are obviously incorrect as not even a size, or connector that will plug into their own RAID disk controller, made by Adaptec, who I called after trying to get the battery, Adpatec, who generally I find excellent, says it was made solely for Dell so they cannot help, turns out also made for HP but with some firmware changes I lster found out.? Note, Dell will say it is a Perc but it may be an AMI or and AAdaptec, and parts are totally differant yet they are both called a Perc X model, so you may not even be buying what you think as well as their performance test will be on the higher preformasnce unit but you end up with the lower end part.? You can forget their "White Papers" performance statements as the part tested may not be what you bought. So my comment, forget buying Dell computers if you expect them to be cooperative.? Also many of the so called non-certified disks, or other parts are really from 3rd parts with Dell's label on the case if you take it out of the hard disk cage for example, are the exact same disk such as the top of the line Seagate large storage drives if you remove them from the Dell case.? So why would a Seagate XYZ (fake serial or model number) from Seagate of same size and model number not work with the exact same one Dell sales at a huge mark up.? Both are high performance drives, 10-15,000 high speed iSCSI or Sata, the same model number and there is no way Dell tests ever disk beyond basic test.? Dells original philosophy was buy "off the shelf part and assembly them for a low cost and good performance and serice, Michael Dell started this in his dorm room at college and grew it into a major computer company, then left but tried to come back to "fix" Dell's problems, I guess locking you to them for over-priced fixes, assuming the customer service can even locate the fix, is their idea of how to fix, I think I will just order my own parts from MOBO's to controllers at least then I know what is in them and where to get my parts at often a lower price and better performance"? It seems their philosophy since has been lets over-charge, under-service and now make it where our customers have no choice but to continue with us.? Basically starts to sound like Computer addiction, get a customer and make sure they have no choice but to buy there computer fix from Dell (assuming they can find the part) at an inflated price and service so poor? they cannot read their own part numbers.? So I will assume that if you ever have a drive problem you will end up as I did in an endless loop of calls, wrong parts, etc.? Good luck. Mike --- On Fri, 2/12/10, Rahul Nabar wrote: From: Rahul Nabar Subject: Re: [Beowulf] Re: Third-party drives not permitted on new Dell servers? To: "Gerry Creager" Cc: "Beowulf ML" Date: Friday, February 12, 2010, 7:36 AM On Fri, Feb 12, 2010 at 10:52 AM, Gerry Creager wrote: > We discussed this in our HPC group meeting yesterday. I've long been > dissatisfied with PERC controllers, but this is now a show-stopper for me. I > might order Dell, but not the PERC, ever again. ?Who do they think they are? > NetApp? Don't you *have* to use the PERC? Will the HDDs talk with other controllers? Can I hear some more about your PERC dissatisfaction, Gerry? I just bought a few and might be better knowing what I'm up against! -- Rahul _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From douglas.guptill at dal.ca Fri Feb 12 17:49:37 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Fri, 12 Feb 2010 21:49:37 -0400 Subject: [Beowulf] Register survey on HPC In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0F4FDB76@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0F4FDB76@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20100213014937.GB15187@sopalepc> On Fri, Feb 12, 2010 at 05:56:56PM -0000, Hearns, John wrote: > http://www.theregister.co.uk/2010/02/12/hpc_for_the_masses/ There is a sentence in there which, taken out of context and suitably edited, really tickles me: "Microsoft's Windows is ... seen as potentially useful by almost half of respondents..." Douglas. From gerry.creager at tamu.edu Sat Feb 13 08:09:48 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sat, 13 Feb 2010 10:09:48 -0600 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: <6B51FF25-AFF1-42F3-9829-1E127D6CFD2E@staff.uni-marburg.de> <4B758733.3040101@tamu.edu> Message-ID: <4B76CECC.3020002@tamu.edu> Rahul Nabar wrote: > On Fri, Feb 12, 2010 at 10:52 AM, Gerry Creager wrote: >> We discussed this in our HPC group meeting yesterday. I've long been >> dissatisfied with PERC controllers, but this is now a show-stopper for me. I >> might order Dell, but not the PERC, ever again. Who do they think they are? >> NetApp? > > > Don't you *have* to use the PERC? Will the HDDs talk with other controllers? > > Can I hear some more about your PERC dissatisfaction, Gerry? I just > bought a few and might be better knowing what I'm up against! You don't have to use PERC but you'll get the full guilt trip if you try ordering without them. At least in the past, PERC was a firmware-tweaked LSI RAID controller. I've learned how the LSI RAID controllers work, and I'm very happy with them: They are, among other things, one of a small set of real hardware RAID controllers. My second choice is 3Ware. PERCs have some tuning that makes sense to the Dell guys who did the tweaks, but which don't tend to make a lot of sense to me. I can't create multiple LUNs on a PERC array (or at least haven't figured out how, yet) and I've had problems with mounting root file systems on them. They don't gracefully give up one or two drives for boot access in, say, a RAID 0/1 config. Having said this, I'm not the most facile when configuring the PERC controllers are discussed, partly because I was scarred while young. Rather than learning about them, I just order around 'em now. gerry From hahn at mcmaster.ca Sat Feb 13 11:05:55 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Sat, 13 Feb 2010 14:05:55 -0500 (EST) Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: > hardware. Apparently Dell has started hardware-blocking hard-drives > that are not "Dell certified". I believe this should be met head-on: if a controller claiming to support SATA does not permit the use of any conforming SATA disk, then the controller is not conforming. we need to lobby the standards organizations to consistently apply their trademark. the situation is easy to understand: this is just one step beyond "warranty void if opened". but it's a step too far, since it's not merely a warranty situation (where the device will continue to work even if opened), but rather standard non-conformity for lock-in. I think the issue of standards needs to be talked about more in the industry press, actually. there's too much damaging fuzziness about "defacto" standards, what interop really means to the customer, and how you should run away when a vendor lists "supported" models. defacto standards means "not a standard, but the way X does it, and everyone thinks X is unchallengable." this completely negates the actual meaning of standard, which is that if you have two conforming devices, they will interoperate. a defacto standard means nothing more than "has worked with X in the tests we've done". no promise for any non-tested configuration. no promises if X decides to change. interop is what the customer desires: instead of an N^2 problem of deciding whether option A is compatible with option B, each option merely has to conform to the standard. 2N tests less work than N^2 (for sufficiently large N ;) when a vendor lists supported configs, they are implicitly saying that they have no faith in standard conformity, and are instead merely going to mark out a few places in the N^2 grid where they will take the blame for failure. this is profoundly anti-customer. we expect standards in most places (just think: tires, roads, gasoline - imagine if Ford only "supported" driving with Ford-brand gas, on "OEM" tires, on Ford-approved roads where the only other cars are Fords...) I don't know whether having standards organizations police their brands would be good enough. obviously, there is some conflict of interest, since most of the support for, say, SATA-IO is let by vendors (incl Dell), and is probably less arms-length than INCITS T10/13 committees. but I think this could be dealt with as a criminal matter as well, since claiming standard conformance is clearly a product-liability issue. my personal experience is with HP products: nearly all HP disk controllers refuse to work with products not bought through HP channels. there is some escape at the lowest end where HP doesn't bother to break any of the chipset-integrated controllers (afaik). to me, this difference indicts. as MAKERs say, if you can't open it, you don't own it. I think as customers we should demand standard-conformity, even though vendors have often gotten away with it in the past. the same vendors _do_ actually support standards- based interop for some products (ethernet, power cables, vga/dvi/hdmi, pci/pcie, usually even dimms). if a product doesn't conform to its specs, then it's broken. how many class-action suits would it take to get vendors to recognize this? From reuti at staff.uni-marburg.de Sun Feb 14 15:18:00 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Mon, 15 Feb 2010 00:18:00 +0100 Subject: [Beowulf] PVM 3.4.5-12 terminates when adding Host on Ubuntu 9.10 In-Reply-To: References: Message-ID: <0D2F92CE-AAEB-4D9E-9AC6-F591C9AE1773@staff.uni-marburg.de> Am 11.02.2010 um 19:43 schrieb Tsz Kuen Ching: > Whenever I attempt to add a host in PVM it ends up terminating the > process in the master program. The process does run in the slave > node, however because the PVM terminates I do not get access to the > node. > > I'm currently using Ubuntu 9.10, and I used apt-get to install pvm > ( pvmlib, pvmdev, pvm). > Thus $PVM_ROOT is set automatically, and so is $PVM_ARCH > As for the other variables, I have not looked for them. > > I can ssh into the the slave without the need of a password. Do you have any firwall on the machines which blocks certain ports? -- Reuti > > Any Ideas or suggestions? > > This is what happens: > > user at laptop> pvm > pvm> add slave-slave > add slave-slave > Terminated > user at laptop> ... > > The logs are as followed: > > Laptop log > --- > [t80040000] 02/11 10:23:32 laptop (127.0.1.1:55884) LINUX 3.4.5 > [t80040000] 02/11 10:23:32 ready Thu Feb 11 10:23:32 2010 > [t80040000] 02/11 10:23:32 netoutput() sendto: errno=22 > [t80040000] 02/11 10:23:32 em=0x2c24f0 > [t80040000] 02/11 10:23:32 [49/?][6e/?][76/?][61/?][6c/?][69/?][64/ > ?][20/?][61/?][72/?] > [t80040000] 02/11 10:23:32 netoutput() sendto: Invalid argument > [t80040000] 02/11 10:23:32 pvmbailout(0) > > slave-log > --- > [t80080000] 02/11 10:23:25 slave-slave (xxx.x.x.xxx:57344) LINUX64 > 3.4.5 > [t80080000] 02/11 10:23:25 ready Thu Feb 11 10:23:25 2010 > [t80080000] 02/11 10:28:26 work() run = STARTUP, timed out waiting > for master > [t80080000] 02/11 10:28:26 pvmbailout(0) > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From kus at free.net Mon Feb 15 09:32:40 2010 From: kus at free.net (Mikhail Kuzminsky) Date: Mon, 15 Feb 2010 20:32:40 +0300 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: Message-ID: And what is known about other vendors (Sun, HP, IBM) x86 standard 1u/2u servers ? SGI hardware even in their "big" UNIX SMPs like Power Challenge allowed using of 3rd party drives - although they were not supported officially. Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow From rpnabar at gmail.com Mon Feb 15 09:51:28 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 15 Feb 2010 11:51:28 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: This was the response from Dell, I especially like the analogy: [snip] >There are a number of benefits for using Dell qualified drives in particular ensuring a ***positive experience*** and protecting ***our data***. >While SAS and SATA are industry standards there are differences which occur in implementation. An analogy is that English is spoken in the UK, US >and Australia. While the language is generally the same, there are subtle differences in word usage which can lead to confusion. This exists in >storage subsystems as well. As these subsystems become more capable, faster and more complex, these differences in implementation can have >greater impact. [snip] I added the emphasis. I am in love Dell-disks that get me "the positive experience". :) -- Rahul From deadline at eadline.org Mon Feb 15 10:56:38 2010 From: deadline at eadline.org (Douglas Eadline) Date: Mon, 15 Feb 2010 13:56:38 -0500 (EST) Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: <39109.192.168.1.1.1266260198.squirrel@mail.eadline.org> There are two "ISO standard" English words I have for this kind of marketing response. -- Doug > This was the response from Dell, I especially like the analogy: > > [snip] >>There are a number of benefits for using Dell qualified drives in >> particular ensuring a ***positive experience*** and protecting ***our >> data***. >>While SAS and SATA are industry standards there are differences which >> occur in implementation. An analogy is that English is spoken in the UK, >> US >and Australia. While the language is generally the same, there are >> subtle differences in word usage which can lead to confusion. This exists >> in >storage subsystems as well. As these subsystems become more capable, >> faster and more complex, these differences in implementation can have >> >greater impact. > [snip] > > I added the emphasis. I am in love Dell-disks that get me "the > positive experience". :) > > -- > Rahul > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From rpnabar at gmail.com Mon Feb 15 13:56:28 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 15 Feb 2010 15:56:28 -0600 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance Message-ID: Are there any generic "scheduler visualization" tools out there? Sometimes I feel it'd be nice if I had a way to find out how my sceduling was performing. i.e. blocks of empty procs; how fragmented the job assignment was; large / small job split, utilization efficiency, backfill status; etc. I use openpbs (torque) + maui and it does have some text mode accounting reports. But sometimes they are hard to digest and a birds eye view might be easier via a visualization. I haven't found any toolkits yet. Of course, I could parse and plot myself with a bunch of sed / awk / gnuplot but I don't want to unnecessarily reinvent the wheel if I can avoid it. Also, I remembered seeing some cool visualizations (quite animated at that) at one of the supercomputing agencies a while ago but just can't seem to find which one it was now that I need it. Admittedly, some of the visualizations can sway more towards the "coolness" factor than actual insights but still it's worth a shot. Any pointers or scripts other Beowulfers might have are greatly appreciated. -- Rahul From hearnsj at googlemail.com Mon Feb 15 15:02:01 2010 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 15 Feb 2010 23:02:01 +0000 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: Message-ID: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> That's a good question. PBS are promoting PBS Analytics http://www.pbsgridworks.com/Product.aspx?id=7 and SGE has Arco http://wikis.sun.com/display/GridEngine/Installing+ARCo If I have it right, both of these work by accumulating information about completed jobs. I may be wrong, and would like to see Analytics working (hint - I have tried). Your question about displaying blocks of empty nodes is interesting. I guess in PBS you would run 'pbsnodes' and look for the free ones! From rpnabar at gmail.com Mon Feb 15 15:07:36 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 15 Feb 2010 17:07:36 -0600 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> References: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> Message-ID: On Mon, Feb 15, 2010 at 5:02 PM, John Hearns wrote: > That's a good question. > > PBS are promoting PBS Analytics http://www.pbsgridworks.com/Product.aspx?id=7 > and SGE has Arco http://wikis.sun.com/display/GridEngine/Installing+ARCo > If I have it right, both of these work by accumulating information > about completed jobs. > I may be wrong, and would like to see Analytics working (hint - I have tried). Thanks John! THose sound interesting. > Your question about displaying blocks of empty nodes is interesting. > I guess in PBS you would run 'pbsnodes' and look for the free ones! pbsnodes does have more than sufficient info. It's only the scripting around it that needs to be in place. -- Rahul From rpnabar at gmail.com Mon Feb 15 15:10:57 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 15 Feb 2010 17:10:57 -0600 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> References: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> Message-ID: On Mon, Feb 15, 2010 at 5:02 PM, John Hearns wrote: > That's a good question. > > PBS are promoting PBS Analytics http://www.pbsgridworks.com/Product.aspx?id=7 > and SGE has Arco http://wikis.sun.com/display/GridEngine/Installing+ARCo Too bad they both seem "paid". I was hoping to find something in the "free" domain. I doubt I can justify the $$ for a scheduler-visualizer especially if (like most things in the scheduling universe) the licenses tend to be stubbornly per-core. The per-core licensing has been the single biggest factor that prevents me from even evaluating out any of the "paid" schedulers. -- Rahul From beckerjes at mail.nih.gov Mon Feb 15 15:43:46 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Mon, 15 Feb 2010 18:43:46 -0500 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> Message-ID: <20100215234346.GJ12997@mail.nih.gov> On Mon, Feb 15, 2010 at 06:10:57PM -0500, Rahul Nabar wrote: >On Mon, Feb 15, 2010 at 5:02 PM, John Hearns wrote: >> That's a good question. >> >> PBS are promoting PBS Analytics http://www.pbsgridworks.com/Product.aspx?id=7 >> and SGE has Arco http://wikis.sun.com/display/GridEngine/Installing+ARCo > >Too bad they both seem "paid". I was hoping to find something in the >"free" domain. I doubt I can justify the $$ for a scheduler-visualizer ARCo can be be had for zero $ cost. It will, however, cost something in time and effort to configure, and requires a moderately beefy box on which to run. -- Jesse Becker NHGRI Linux support (Digicon Contractor) From landman at scalableinformatics.com Mon Feb 15 17:41:08 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 15 Feb 2010 20:41:08 -0500 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: <4B79F7B4.9020808@scalableinformatics.com> Rahul Nabar wrote: > This was the response from Dell, I especially like the analogy: > > [snip] >> There are a number of benefits for using Dell qualified drives in >> particular ensuring a ***positive experience*** and protecting >> ***our data***. While SAS and SATA are industry standards there are >> differences which occur in implementation. An analogy is that >> English is spoken in the UK, US >and Australia. While the language >> is generally the same, there are subtle differences in word usage >> which can lead to confusion. This exists in >storage subsystems as >> well. As these subsystems become more capable, faster and more >> complex, these differences in implementation can have >greater >> impact. > [snip] > > I added the emphasis. I am in love Dell-disks that get me "the > positive experience". :) Please indulge my taking a contrarian view based upon the products we sell/support/ship. I see significant derision heaped upon these decisions, which are called "marketing decisions" by Dell and others. It couldn't be possible, in most commenter's minds that they might actually have a point ... ... I am not defending Dell's language (I wouldn't use this or allow this to be used in our outgoing marketing/customer communications). Let me share an anecdote. I have elided the disk manufacturers name to protect the guilty. I will not give hints as to whom they are, though some may be able to guess ... I will not confirm. We ship units with 2TB (and 1.5TB) drives among others. We burn in and test these drives. We work very hard to insure compatibility, and to make sure that when users get the units, that the things work. We aren't perfect, and we do occasionally mess up. When we do, we own up to it and fix it right away. Its a different style of support. The buck stops with us. Period. So along comes a drive manufacturer, with some nice looking specs on 2TB (and some 1.5 and 1 TB) drives. They look great on paper. We get them into our labs, and play with them, and they seem to run really well. Occasional hiccup on building RAIDs, but you get that in large batches of drives. So now they are out in the field for months, under various loads. Some in our DeltaV's, some in our JackRabbits. The units in the DeltaV's seem to have a ridiculously high failure rate. This is not something we see in the lab. Even with constant stress, horrific sustained workloads ... they don't fail in ou testing. But get these same drives out into the users hands ... and whammo. Slightly different drives in our JackRabbit units, with a variety of RAID controllers. Same types of issues. Timeouts, RAID fall outs, etc. This is not something we see in the lab in our testing. We try emulating their environments, and we can't generate the failures. Worse, we get the drives back after exchanging them at our cost with new replacements, only to find out, upon running diagnostics, that the drives haven't failed according to the test tool. This failing drive vendor refuses to acknowledge firmware bugs, effectively refuses to release patches/fixes. Our other main drive vendor, while not currently with a 2TB drive unit, doesn't have anything like this manufacturers failure rate in the field. When drives die in the field, they really ... really die in the field. And they do fix their firmware. So we are now moving off this failing manufacturer (its a shame as they used to produce quality parts for RAID several years ago), and we are evaluating replacements for them. Firmware updates are a critical aspect of a replacement. If the vendor won't allow for a firmware update, we won't use them. So ... this anecdote complete, if someone called me up and said "Joe, I really want you to build us an siCluster for our storage, and I want you to use [insert failing manufacturer's name here] drives because we like them", what do you think my reaction should be? Should it be "sure, no problem, whatever you want" ... with the subsequent problems and pain, for which we would be blamed ... or should it be "no, these drives don't work well ... deep and painful experience at customer sites shows that they have bugs in their firmware which are problematic for RAID users ... we are attempting to get them to give us the updated firmware to help the existing users, but we would not consider shipping more units with these drives due to their issues." Is that latter answer, which is the correct answer, a marketing answer? Yeah, SATA and SAS are standards. Yeah, in theory, they all do work together. In reality, they really don't, and you have to test. Everyone does some aspect slightly different and usually in software, so they can fix it if they messed up. If their is a RAID timeout bug due to head settling timing, yeah, this is fixable. But if the disk manufacturer doesn't want to fix it ... its your companies name on the outside of that box. You are going to take the heat for their problems. Note: This isn't just SATA/SAS drives, there are a whole mess of things that *should* work well together, but do not. We had some exciting times in the recent past with SAS backplanes that refused to work with SAS RAID cards. We've had some excitment from 10GbE cards, IB cards, etc. that we shouldn't have had. I can't and won't sanction their tone to you ... they should have explained things correctly. Given that PERC are rebadged LSI, yeah, I know perfectly well a whole mess of drives that *do not* work correctly with them. So please don't take Dell to task for trying to help you avoid making what they consider a bad decision on specific components. There could be a marketing aspect to it, but support is a cost, and they want to minimize costs. Look at failure rates, and toss the suppliers who have very high ones. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Mon Feb 15 17:44:06 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Mon, 15 Feb 2010 20:44:06 -0500 (EST) Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: > And what is known about other vendors (Sun, HP, IBM) x86 standard 1u/2u > servers ? as I mentioned, vendors don't seem to bother rewriting third-party disk controller bioses to enforce brand lock-in. for instance, most 1U's with 1-4 SATA will drive those bays from the AMD/Intel chipset controller, and so not be crippled. supporting SAS or hardware raid5/6 in those slots would also be a tip-off that a "higher end" controller is in use (which means "less functional" and "non-conforming" in this context.) > SGI hardware even in their "big" UNIX SMPs like Power Challenge allowed using > of 3rd party drives - although they were not supported officially. well, those would be SCSI and FC drives, where vendors seem to be more timid in doing nonconformity-for-lock-in. I guess they figure that if you're going to pay for gold-plated drives, you'll probably buy them from the original vendor. the lock-in is focused on more commoditized drives, where the vendor typically charges a >100% premium to relabel a generic drive from wd/seagate/etc. From rpnabar at gmail.com Mon Feb 15 18:12:08 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 15 Feb 2010 20:12:08 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B79F7B4.9020808@scalableinformatics.com> References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman wrote: > Please indulge my taking a contrarian view based upon the products we > sell/support/ship. > > I can't and won't sanction their tone to you ... they should have explained > things correctly. ?Given that PERC are rebadged LSI, yeah, I know perfectly > well a whole mess of drives that *do not* work correctly with them. > > So please don't take Dell to task for trying to help you avoid making what > they consider a bad decision on specific components. ?There could be a > marketing aspect to it, but support is a cost, and they want to minimize > costs. ?Look at failure rates, and toss the suppliers who have very high > ones. To me the test is: Is there a price-markup on the specific part recommended. If a vendor just said "Drive X is compatible and tested; please use it" and then I peg Drive X against competing drives and see a significant price markup without commensurate observable statistics improvement then I smell a rat. I feel further that a Vendor could make itself more neutral in this exercise by just naming one or more compatible, validated drive-models rather than trying to sell those themselves after re-branding. That creates an obvious conflict of interest. It makes it difficult to deconvolute monopoly-pricing from a genuine desire to promote reliability. I'm not sure how much of a price markup there is on the approved Dell drives. -- Rahul From landman at scalableinformatics.com Mon Feb 15 18:30:04 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Mon, 15 Feb 2010 21:30:04 -0500 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: <4B7A032C.2080207@scalableinformatics.com> Rahul Nabar wrote: > On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman > wrote: >> Please indulge my taking a contrarian view based upon the products we >> sell/support/ship. >> >> I can't and won't sanction their tone to you ... they should have explained >> things correctly. Given that PERC are rebadged LSI, yeah, I know perfectly >> well a whole mess of drives that *do not* work correctly with them. >> >> So please don't take Dell to task for trying to help you avoid making what >> they consider a bad decision on specific components. There could be a >> marketing aspect to it, but support is a cost, and they want to minimize >> costs. Look at failure rates, and toss the suppliers who have very high >> ones. > > To me the test is: Is there a price-markup on the specific part > recommended. If a vendor just said "Drive X is compatible and tested; I can't speak to Dell's comments. I can speak to ours. If a customer asks us if we have tested a drive, we look it up and see if we have. If they want to try it, we offer them help. We have an interest in making sure that we work well with the drives in our units. This is part of the reason we make various decisions on configurations. Drive markup isn't a factor in configurations. Stuff working correctly is. Suppose Dell buys 50M drives per year. Shaving $1 per drive will net them $50M more to their bottom line. Which, in the larger scheme of things, doesn't do much to their bottom line. Far less than 1% motion on their P&L. > please use it" and then I peg Drive X against competing drives and see > a significant price markup without commensurate observable statistics > improvement then I smell a rat. I feel further that a Vendor could Hmmm.... if it won't impact their pricing that much to begin with, even if they could get it to a 1% cost of good sold reduction, it gets very hard to make an argument that they will perform these actions for economic reasons that simply won't have a significant economic impact upon their bottom line. OTOH, if you have manufacturer X with drive X.X with a known failure rate 10x that of manufacturer Y with a drive Y.Y, and your liabilities column on your balance sheet is drastically negatively impacted by support issues ... yeah, you are going to do all you can to minimize this liability side. You can't impact the drive costs much, but by careful selection of drive units you sure can reduce your support liabilities. > make itself more neutral in this exercise by just naming one or more > compatible, validated drive-models rather than trying to sell those > themselves after re-branding. That creates an obvious conflict of They do make margin on drives. If you object, you can always buy the unit bare, and perform your own validation. Which means you buy your own test drives, and spend your own time and effort to do this. Which means spending your own money to do this. Support costs money, and they are seeking to keep those costs under control. > interest. It makes it difficult to deconvolute monopoly-pricing from a > genuine desire to promote reliability. Hmmm .... Dell only has a monopoly if you let them. If you want to buy servers from other companies, by all means, buy them from other companies. Many universities I am aware of have signed agreements with Dell, HP, Sun, etc to buy exclusively from them. Whether or not these are legal in the face of universities requirements on maximizing value on their purchase is a completely separate discussion, one you ought to have with your purchasing departments if you feel that you are not getting the value you need from their actions. My old research group used to write sole source memos for every purchase, so that we could get what we wanted, and not what our purchasing department wanted to buy. > > I'm not sure how much of a price markup there is on the approved Dell drives. > You should be able to calculate it by configuring units with various numbers of drives. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Mon Feb 15 23:08:32 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 16 Feb 2010 02:08:32 -0500 (EST) Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B7A032C.2080207@scalableinformatics.com> References: <4B79F7B4.9020808@scalableinformatics.com> <4B7A032C.2080207@scalableinformatics.com> Message-ID: > Drive markup isn't a factor in configurations. Stuff working correctly is. > Suppose Dell buys 50M drives per year. Shaving $1 per drive will net them > $50M more to their bottom line. Which, in the larger scheme of things, > doesn't do much to their bottom line. Far less than 1% motion on their P&L. "working correctly" is, as you point out, difficult to prove (proving a negative versus "so far so good".) the real issue here is several-fold: 1. the markup is vastly more than $1/drive. I don't know Dell prices as well as another vendor where the markup is O(300%) (es.2 750G 408 $Cdn after public-sector discount, versus $135 from newegg.ca. ironically Seagate's warranty on this is 5 years; the vendor's is 1 year...) 2. big-name vendors are incredibly slow on their feet: usually lagging at least one whole disk generation. I understand that the bigger the supply chain, the more momentum and reluctance to stock new products. and testing takes time, but this is a huge disadvantage to customers - especially in a domain like storage which is on as steep price-performance curve as GPUs. 3. the main issue is still whether it's defensible to cripple a controller to refuse to interact with products that don't come through the vendor's supply chain (even if identical in make, model, fw rev) (yes, vendor-specific fw is a way to wiggle out of this...) > balance sheet is drastically negatively impacted by support issues ... yeah, > you are going to do all you can to minimize this liability side. You can't > impact the drive costs much, but by careful selection of drive units you sure > can reduce your support liabilities. I'm curious: you imply that vendors have no recourse to force the _disk_ vendors to supply parts which work right (as defined by standard). is that really true? I'd be surprised if Dell doesn't get pretty emphatic cooperation from their disk vendor(s). >> make itself more neutral in this exercise by just naming one or more >> compatible, validated drive-models rather than trying to sell those >> themselves after re-branding. That creates an obvious conflict of > > They do make margin on drives. If you object, you can always buy the unit > bare, and perform your own validation. Which means you buy your own test > drives, and spend your own time and effort to do this. Which means spending > your own money to do this. the fact is that disks are incredibly cheap and getting moreso. sure, replacing a bunch of them is noticable, but as a fraction of the the systems they support, they're a pittance. even as a fraction of the storage subsystem, they're very small (reading off the same public-sector price-list, I see a 10 TB SATA-based storage system for over $30k - that's not a premium product from this vendor and I can't see any way that the disks would cost more than $5k.) I think the real paradigm shift is that disks have become a consumable which you want to be able to replace in 1-2 product generations (2-3 years). along with this, disks just aren't that important, individually - even something _huge_ like seagate's firmware problem, for instance, only drove up random failures, no? but it also begs the question: what's really so different about disks? is the disk protocol really that much more subtle and prone to problems than, say, PCI-E? > Support costs money, and they are seeking to keep those costs under control. sure, the free market is all about finding ways to make money; that doesn't imply that customers, the market and the society as a whole has to permit every possible way ;) >> interest. It makes it difficult to deconvolute monopoly-pricing from a >> genuine desire to promote reliability. > > Hmmm .... Dell only has a monopoly if you let them. If you want to buy > servers from other companies, by all means, buy them from other companies. few markets are free; any individual purchasing decision is almost certainly under a massive market asymmetry. no project/consortium/university can actually require change in policy from an entity as large as Dell/HP/IBM/etc. > Many universities I am aware of have signed agreements with Dell, HP, Sun, > etc to buy exclusively from them. Whether or not these are legal in the face > of universities requirements on maximizing value on their purchase is a > completely separate discussion, one you ought to have with your purchasing > departments if you feel that you are not getting the value you need from hah. purchasing departments are interested primarily in survival like any organism. unfortunately, they tend not to have natural predators. but funding organizations often effectively require use of BigCo; even if not, there's always the no-one-got-fired-for-buying-IBM sort of institutional conservativism. > their actions. My old research group used to write sole source memos for > every purchase, so that we could get what we wanted, and not what our > purchasing department wanted to buy. right, but did you succeed in changing how BigCo handles its supply chain? >> I'm not sure how much of a price markup there is on the approved Dell >> drives. > > You should be able to calculate it by configuring units with various numbers > of drives. many vendors also publish price lists (though the cost will be different when calculated these two ways.) From lynesh at Cardiff.ac.uk Tue Feb 16 03:31:26 2010 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Tue, 16 Feb 2010 11:31:26 +0000 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: <9f8092cc1002151502i7f17c32dk9db4aaa14619e2c@mail.gmail.com> Message-ID: <1266319886.2321.9.camel@w609.insrv.cf.ac.uk> On Mon, 2010-02-15 at 17:10 -0600, Rahul Nabar wrote: > On Mon, Feb 15, 2010 at 5:02 PM, John Hearns wrote: > > That's a good question. > > > > PBS are promoting PBS Analytics http://www.pbsgridworks.com/Product.aspx?id=7 > > and SGE has Arco http://wikis.sun.com/display/GridEngine/Installing+ARCo > > Too bad they both seem "paid". I was hoping to find something in the > "free" domain. I doubt I can justify the $$ for a scheduler-visualizer > especially if (like most things in the scheduling universe) the > licenses tend to be stubbornly per-core. The per-core licensing has > been the single biggest factor that prevents me from even evaluating > out any of the "paid" schedulers. We are also looking at this area. We've looked at PBS Analytics (expensive for extra licenses on top of our existing PBSPro) and Cluster Resources MOAB Access Portal (expensive, with a problematic authentication model). Uni Dusseldorf have an in-house tool called myJAM which we are hoping to test in the near future. The very first script I wrote when we took delivery of our PBSPro cluster was a tool to output a summary of node use. It will probably work fine on Torque too. It's not a GUI, but at least it's something. http://webdocs.arcca.cf.ac.uk/external/scripts/qstate Thanks, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From michf at post.tau.ac.il Tue Feb 16 04:53:28 2010 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 16 Feb 2010 14:53:28 +0200 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B79F7B4.9020808@scalableinformatics.com> References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: <20100216145328.52f048a9@math018-24.tau.ac.il> On Mon, 15 Feb 2010 20:41:08 -0500 Joe Landman wrote: > Rahul Nabar wrote: > > This was the response from Dell, I especially like the analogy: > > > > [snip] > >> There are a number of benefits for using Dell qualified drives in > >> particular ensuring a ***positive experience*** and protecting > >> ***our data***. While SAS and SATA are industry standards there are > >> differences which occur in implementation. An analogy is that > >> English is spoken in the UK, US >and Australia. While the language > >> is generally the same, there are subtle differences in word usage > >> which can lead to confusion. This exists in >storage subsystems as > >> well. As these subsystems become more capable, faster and more > >> complex, these differences in implementation can have >greater > >> impact. > > [snip] > > > > I added the emphasis. I am in love Dell-disks that get me "the > > positive experience". :) > > Please indulge my taking a contrarian view based upon the products we > sell/support/ship. > > I see significant derision heaped upon these decisions, which are called > "marketing decisions" by Dell and others. It couldn't be possible, in > most commenter's minds that they might actually have a point ... > > ... I am not defending Dell's language (I wouldn't use this or allow > this to be used in our outgoing marketing/customer communications). > > Let me share an anecdote. I have elided the disk manufacturers name to > protect the guilty. I will not give hints as to whom they are, though > some may be able to guess ... I will not confirm. > > We ship units with 2TB (and 1.5TB) drives among others. We burn in and > test these drives. We work very hard to insure compatibility, and to > make sure that when users get the units, that the things work. We > aren't perfect, and we do occasionally mess up. When we do, we own up > to it and fix it right away. Its a different style of support. The > buck stops with us. Period. > > So along comes a drive manufacturer, with some nice looking specs on 2TB > (and some 1.5 and 1 TB) drives. They look great on paper. We get them > into our labs, and play with them, and they seem to run really well. > Occasional hiccup on building RAIDs, but you get that in large batches > of drives. > > So now they are out in the field for months, under various loads. Some > in our DeltaV's, some in our JackRabbits. The units in the DeltaV's > seem to have a ridiculously high failure rate. This is not something we > see in the lab. Even with constant stress, horrific sustained workloads > ... they don't fail in ou testing. But get these same drives out into > the users hands ... and whammo. > > Slightly different drives in our JackRabbit units, with a variety of > RAID controllers. Same types of issues. Timeouts, RAID fall outs, etc. > > This is not something we see in the lab in our testing. We try > emulating their environments, and we can't generate the failures. > > Worse, we get the drives back after exchanging them at our cost with new > replacements, only to find out, upon running diagnostics, that the > drives haven't failed according to the test tool. This failing drive > vendor refuses to acknowledge firmware bugs, effectively refuses to > release patches/fixes. > > Our other main drive vendor, while not currently with a 2TB drive unit, > doesn't have anything like this manufacturers failure rate in the field. > When drives die in the field, they really ... really die in the field. > And they do fix their firmware. > > So we are now moving off this failing manufacturer (its a shame as they > used to produce quality parts for RAID several years ago), and we are > evaluating replacements for them. Firmware updates are a critical > aspect of a replacement. If the vendor won't allow for a firmware > update, we won't use them. > > So ... this anecdote complete, if someone called me up and said "Joe, I > really want you to build us an siCluster for our storage, and I want you > to use [insert failing manufacturer's name here] drives because we > like them", what do you think my reaction should be? Should it be > "sure, no problem, whatever you want" ... with the subsequent problems > and pain, for which we would be blamed ... or should it be "no, these > drives don't work well ... deep and painful experience at customer sites > shows that they have bugs in their firmware which are problematic for > RAID users ... we are attempting to get them to give us the updated > firmware to help the existing users, but we would not consider shipping > more units with these drives due to their issues." > > Is that latter answer, which is the correct answer, a marketing answer? > But what if the customer tells you, ship me your system without a drive, I'll put whatever I want in there so you are not my point of contact for failing drives but you say, no, I won't allow them in my system and I won't even sell you a replacement of what I do allow in the system? > Yeah, SATA and SAS are standards. Yeah, in theory, they all do work > together. In reality, they really don't, and you have to test. > Everyone does some aspect slightly different and usually in software, so > they can fix it if they messed up. If their is a RAID timeout bug due > to head settling timing, yeah, this is fixable. But if the disk > manufacturer doesn't want to fix it ... its your companies name on the > outside of that box. You are going to take the heat for their problems. > > Note: This isn't just SATA/SAS drives, there are a whole mess of things > that *should* work well together, but do not. We had some exciting > times in the recent past with SAS backplanes that refused to work with > SAS RAID cards. We've had some excitment from 10GbE cards, IB cards, > etc. that we shouldn't have had. > > I can't and won't sanction their tone to you ... they should have > explained things correctly. Given that PERC are rebadged LSI, yeah, I > know perfectly well a whole mess of drives that *do not* work correctly > with them. > > So please don't take Dell to task for trying to help you avoid making > what they consider a bad decision on specific components. There could > be a marketing aspect to it, but support is a cost, and they want to > minimize costs. Look at failure rates, and toss the suppliers who have > very high ones. > > > From landman at scalableinformatics.com Tue Feb 16 05:32:30 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 16 Feb 2010 08:32:30 -0500 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <20100216145328.52f048a9@math018-24.tau.ac.il> References: <4B79F7B4.9020808@scalableinformatics.com> <20100216145328.52f048a9@math018-24.tau.ac.il> Message-ID: <4B7A9E6E.9010500@scalableinformatics.com> On 2/16/2010 7:53 AM, Micha Feigin wrote: >> So ... this anecdote complete, if someone called me up and said "Joe, I >> really want you to build us an siCluster for our storage, and I want you >> to use [insert failing manufacturer's name here] drives because we >> like them", what do you think my reaction should be? Should it be >> "sure, no problem, whatever you want" ... with the subsequent problems >> and pain, for which we would be blamed ... or should it be "no, these >> drives don't work well ... deep and painful experience at customer sites >> shows that they have bugs in their firmware which are problematic for >> RAID users ... we are attempting to get them to give us the updated >> firmware to help the existing users, but we would not consider shipping >> more units with these drives due to their issues." >> >> Is that latter answer, which is the correct answer, a marketing answer? >> >> > But what if the customer tells you, ship me your system without a drive, I'll > put whatever I want in there so you are not my point of contact for failing > drives but you say, no, I won't allow them in my system and I won't even sell > you a replacement of what I do allow in the system? > We have tried this before. Invariably they set up the disks wrong. And then we get the blame and a bad rap from the customer for selling them a slow unit. Even though it was their own fault it was slow. Or we got a number of different flavors of drive, no guarantees on same batch or firmware revision, or even same brand. And yes, we see and deal with those issues. We simply don't sell bare chassis any longer, because it is our name on the box. When you get it, it works and is fast. If you make changes, and you are welcome to, it will likely slow down. Several of our customers ignored our warnings on this, reloaded the unit and yelled at us over their slower performance. One of our earliest prospective customers, absolutely convinced they knew more than us about how to configure/build fast storage, reconfigured a working unit with fast IO system to be a much slower unit, and then proceeded to benchmark us versus one of our competitors. Yeah, we are sensitive to this. I won't defend Dell's position. I will point out that they have a point with some of their restrictions. You want to service the branded drives we ship, you are welcome to do this yourself. We have no issues with that. We do ensure consistent firmware revs between disks, so you'd need to do this yourself. You want to wipe the OS and the install and set it up, you are welcome to do this, though we caution that you are going to throw lots of performance away by doing so. My other point about this is that if you don't like Dell's policies, you have freedom to choose other vendors. But giving them heat over reducing their exposure to support liability strikes me as not a reasonable gripe. But then again, I am on the vendor side of the equation. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc., email: landman at scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 866 888 3112 cell : +1 734 612 4615 From john.hearns at mclaren.com Tue Feb 16 05:34:46 2010 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 16 Feb 2010 13:34:46 -0000 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <20100216145328.52f048a9@math018-24.tau.ac.il> References: <4B79F7B4.9020808@scalableinformatics.com> <20100216145328.52f048a9@math018-24.tau.ac.il> Message-ID: <68A57CCFD4005646957BD2D18E60667B0F573CE4@milexchmb1.mil.tagmclarengroup.com> > > But what if the customer tells you, ship me your system without a > drive, I'll > put whatever I want in there so you are not my point of contact for > failing > drives but you say, no, I won't allow them in my system and I won't > even sell > you a replacement of what I do allow in the system? I have been in a situation like this, but not with disks. Company X sells them a system, integrates it on site. However hardware maintenance SHOULD come from Company Y which is the supplier of the hardware. Customers never see it that way, and Company X gets the phone calls and is expected to book service calls, take hardware apart and run tests on behalf of Company Y. That's why people like Joe are very wary of letting customers have a free-for-all in choosing $high-performance-component to save a few dollars. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From deadline at eadline.org Tue Feb 16 07:28:09 2010 From: deadline at eadline.org (Douglas Eadline) Date: Tue, 16 Feb 2010 10:28:09 -0500 (EST) Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: Message-ID: <58062.192.168.1.1.1266334089.squirrel@mail.eadline.org> For SGE: http://xml-qstat.org/index.html -- Doug > Are there any generic "scheduler visualization" tools out there? > Sometimes I feel it'd be nice if I had a way to find out how my > sceduling was performing. i.e. blocks of empty procs; how fragmented > the job assignment was; large / small job split, utilization > efficiency, backfill status; etc. I use openpbs (torque) + maui and it > does have some text mode accounting reports. But sometimes they are > hard to digest and a birds eye view might be easier via a > visualization. > > I haven't found any toolkits yet. Of course, I could parse and plot > myself with a bunch of sed / awk / gnuplot but I don't want to > unnecessarily reinvent the wheel if I can avoid it. Also, I remembered > seeing some cool visualizations (quite animated at that) at one of the > supercomputing agencies a while ago but just can't seem to find which > one it was now that I need it. Admittedly, some of the visualizations > can sway more towards the "coolness" factor than actual insights but > still it's worth a shot. > > Any pointers or scripts other Beowulfers might have are greatly > appreciated. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Doug From mathog at caltech.edu Tue Feb 16 08:49:07 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 16 Feb 2010 08:49:07 -0800 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? Message-ID: Joe Landman wrote > So along comes a drive manufacturer, with some nice looking specs on 2TB > (and some 1.5 and 1 TB) drives. They look great on paper. We get them > into our labs, and play with them, and they seem to run really well. > Occasional hiccup on building RAIDs, but you get that in large batches > of drives. > > So now they are out in the field for months, under various loads. Some > in our DeltaV's, some in our JackRabbits. The units in the DeltaV's > seem to have a ridiculously high failure rate. This is not something we > see in the lab. Even with constant stress, horrific sustained workloads > ... they don't fail in ou testing. But get these same drives out into > the users hands ... and whammo. > > Slightly different drives in our JackRabbit units, with a variety of > RAID controllers. Same types of issues. Timeouts, RAID fall outs, etc. > > This is not something we see in the lab in our testing. We try > emulating their environments, and we can't generate the failures. > > Worse, we get the drives back after exchanging them at our cost with new > replacements, only to find out, upon running diagnostics, that the > drives haven't failed according to the test tool. This failing drive > vendor refuses to acknowledge firmware bugs, effectively refuses to > release patches/fixes. While there is no doubting that these drives didn't work reliably in your arrays, that doesn't necessarily mean they were "defective". Just playing devil's advocate here, but it could be the array controller is using some feature where there is a bit of wiggle room in the standard, so that both the disk and the controller are "conforming", but they still won't work together reliably. In a situation like that I would expect the vendor to disclose the issue, so it would be clear why the disks had to come from A and not B. As long as the vendor explained the problem clearly most customers would be fine buying the preferred disks. It's when the vendor says "you have to use OUR disks" and doesn't tell you why, and when, as far as you can tell, these are the same devices that you could buy directly from the manufacturer without the 5X markup, that things smell bad. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From landman at scalableinformatics.com Tue Feb 16 09:09:45 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 16 Feb 2010 12:09:45 -0500 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: Message-ID: <4B7AD159.50000@scalableinformatics.com> David Mathog wrote: > Joe Landman wrote > >> So along comes a drive manufacturer, with some nice looking specs on 2TB >> (and some 1.5 and 1 TB) drives. They look great on paper. We get them >> into our labs, and play with them, and they seem to run really well. >> Occasional hiccup on building RAIDs, but you get that in large batches >> of drives. >> >> So now they are out in the field for months, under various loads. Some >> in our DeltaV's, some in our JackRabbits. The units in the DeltaV's >> seem to have a ridiculously high failure rate. This is not something we >> see in the lab. Even with constant stress, horrific sustained workloads >> ... they don't fail in ou testing. But get these same drives out into >> the users hands ... and whammo. >> >> Slightly different drives in our JackRabbit units, with a variety of >> RAID controllers. Same types of issues. Timeouts, RAID fall outs, etc. >> >> This is not something we see in the lab in our testing. We try >> emulating their environments, and we can't generate the failures. >> >> Worse, we get the drives back after exchanging them at our cost with new >> replacements, only to find out, upon running diagnostics, that the >> drives haven't failed according to the test tool. This failing drive >> vendor refuses to acknowledge firmware bugs, effectively refuses to >> release patches/fixes. > > While there is no doubting that these drives didn't work reliably in > your arrays, that doesn't necessarily mean they were "defective". Just > playing devil's advocate here, but it could be the array controller is > using some feature where there is a bit of wiggle room in the standard, > so that both the disk and the controller are "conforming", but they > still won't work together reliably. In a situation like that I would > expect the vendor to disclose the issue, so it would be clear why the > disks had to come from A and not B. As long as the vendor explained the > problem clearly most customers would be fine buying the preferred disks. I agree that some devices work well with others. This is what we see. Some do not. We have a few boxful's of 1TB drives that don't play well with others. And yes, standards do leave wiggle room. Interop testing days are critical. A connect-a-thon very helpful. But the point is, just because it says SATA, you shouldn't expect that it will work with all SATA controllers. No ... seriously. Likewise this is true with many other components. Some stuff doesn't play well with others. I didn't sanction the language used, I thought it wrong. But from a support scenario, it can be (and often is) a nightmare. We take ownership of as little or as much of what our customers want us to do. If your name is on the box, no-one appreciates a finger pointing exercise rather than a path to solution. > It's when the vendor says "you have to use OUR disks" and doesn't tell > you why, and when, as far as you can tell, these are the same devices > that you could buy directly from the manufacturer without the 5X markup, > that things smell bad. I agree with this paragraph. We won't name specific names in public, we do speak about our drive issues in private with our customers. 5X markup? We must be doing something wrong :/ > > Regards, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Tue Feb 16 09:38:06 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 16 Feb 2010 11:38:06 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B79F7B4.9020808@scalableinformatics.com> References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman wrote: > Please indulge my taking a contrarian view based upon the products we > sell/support/ship. > I can't and won't sanction their tone to you ... they should have explained > things correctly. ?Given that PERC are rebadged LSI, yeah, I know perfectly > well a whole mess of drives that *do not* work correctly with them. > > So please don't take Dell to task for trying to help you avoid making what > they consider a bad decision on specific components. ?There could be a > marketing aspect to it, but support is a cost, and they want to minimize > costs. ?Look at failure rates, and toss the suppliers who have very high > ones. Another worry is what happens in the long run if the vendor either folds shop or stops selling and / or supporting that particular model of drive. Frequently the lifecycle of these devices is longer than the warranty. The inability to shop around for drives could be an issue. Especially with this rigid approach of firmware rejecting a foreign component and not just a warning. Perhaps this is a paranoid scenario since these are big vendors and not likely to go bankrupt. -- Rahul From landman at scalableinformatics.com Tue Feb 16 09:44:59 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 16 Feb 2010 12:44:59 -0500 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: <4B7AD99B.90605@scalableinformatics.com> Rahul Nabar wrote: > On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman > wrote: > >> Please indulge my taking a contrarian view based upon the products we >> sell/support/ship. >> I can't and won't sanction their tone to you ... they should have explained >> things correctly. Given that PERC are rebadged LSI, yeah, I know perfectly >> well a whole mess of drives that *do not* work correctly with them. >> >> So please don't take Dell to task for trying to help you avoid making what >> they consider a bad decision on specific components. There could be a >> marketing aspect to it, but support is a cost, and they want to minimize >> costs. Look at failure rates, and toss the suppliers who have very high >> ones. > > Another worry is what happens in the long run if the vendor either > folds shop or stops selling and / or supporting that particular model > of drive. Frequently the lifecycle of these devices is longer than the > warranty. The inability to shop around for drives could be an issue. > Especially with this rigid approach of firmware rejecting a foreign > component and not just a warning. This is an issue with any proprietary technology. We talk about this in terms of "freedom from bricking". For example, with Sun, there are quite a few (now quite nervous) Thumper/Thor owners. Thumper has been EOLed, and the future of Thor is uncertain at best. We have customers ask us constantly if we take trade-ins, and others asking us if they can buy the trade-ins for spares. This is a real issue. But it is tangential to the specific issue as initially discussed. > > Perhaps this is a paranoid scenario since these are big vendors and > not likely to go bankrupt. Erm ... uh ... Sun, SGI, LNXI, ... Big != Safe. > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Tue Feb 16 09:52:02 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 16 Feb 2010 09:52:02 -0800 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: <4B7AD159.50000@scalableinformatics.com> Message-ID: On 2/16/10 9:09 AM, "Joe Landman" wrote: > > > 5X markup? We must be doing something wrong :/ > Depends on what the price includes. I could easily see a commodity drive in a case lot being dropped on the loading dock at, say, $100 each, and the drive with installation, system integrator testing, downstream support, etc. being $500. Doesn't take many hours on the phone tracking down an idiosyncracy or setup to cost $500 in labor. A lot of times, a company will price things to spread the NRE (all that testing of drives in the lab in various configurations, working out the driver parameters that make it work best, etc... Which can easily be in the tens, if not hundreds, of $K range) across the sell price of the many boxes. Then you wind up with folks asking about "how come the *same drive* from X costs twice as much from Y?", while conveniently forgetting that X isn't charging you for the $100K NRE or the $1000/incident support fee or the... From james.p.lux at jpl.nasa.gov Tue Feb 16 10:32:01 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 16 Feb 2010 10:32:01 -0800 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Rahul Nabar > Sent: Tuesday, February 16, 2010 9:38 AM > To: landman at scalableinformatics.com > Cc: Mikhail Kuzminsky; beowulf at beowulf.org > Subject: Re: [Beowulf] Third-party drives not permitted on new Dell servers? > > On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman > wrote: > > > Please indulge my taking a contrarian view based upon the products we > > sell/support/ship. > > I can't and won't sanction their tone to you ... they should have explained > > things correctly. ?Given that PERC are rebadged LSI, yeah, I know perfectly > > well a whole mess of drives that *do not* work correctly with them. > > > > So please don't take Dell to task for trying to help you avoid making what > > they consider a bad decision on specific components. ?There could be a > > marketing aspect to it, but support is a cost, and they want to minimize > > costs. ?Look at failure rates, and toss the suppliers who have very high > > ones. > > Another worry is what happens in the long run if the vendor either > folds shop or stops selling and / or supporting that particular model > of drive. Frequently the lifecycle of these devices is longer than the > warranty. The inability to shop around for drives could be an issue. > Especially with this rigid approach of firmware rejecting a foreign > component and not just a warning. > "lifecycle" has a lot of meanings, and I think that's where the problems arise, in some cases. For the most part, PC hardware these days is designed based on a 3 year replacement schedule, so a vendor will set up their warranty terms, model introduction and retirement schedule based on that, regardless of whether the actual life is much (sometimes much, much) longer (as all of us with old NT4.0 and DOS 3.x machines lurking in the lab know). Unfortunately, the HPC (Beowulf) world is driven by the economics of the ordinary consumer/office desktop computer. That's what lets you build a teraflop machine without incurring the debt of a small country: you can leverage the mass production for consumers which drives the prices down, but also has very short product cycles. The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment a "5-year" piece of gear, but MACRS means that by the end of year 3, you've already depreciated over 70% of the purchase price. Considering that purchase price is about half the "total cost of ownership" (at most), it makes sense to buy new gear that often (since the help-desk, configuration management, networking, etc, support costs remain fixed per month). The hardware cost is often a small fraction of the overall cost to put a computer on someone's desk. If you look at a typical desktop PC scenario, you might have a $2500 computer, where the first year depreciation is $42/mo, the second year is $67/mo, and the third year is $40/mo. On top of that, support costs might be $100-200/mo. In the fourth year, under MACRS, the depreciation is $24/mo. So you could get a brand new computer (which will be easier to support, is faster, etc.) for a big $22/month hit on your budget(which is <10% of the total monthly tab, counting the support costs). It's a no brainer to turn over the computers that fast, especially if you are trying to save on support costs by having a limited number of different models in the installed base at any given time (which is what large companies do). From lindahl at pbm.com Tue Feb 16 10:57:15 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 16 Feb 2010 10:57:15 -0800 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> <4B7A032C.2080207@scalableinformatics.com> Message-ID: <20100216185715.GA16132@bx9.net> On Tue, Feb 16, 2010 at 02:08:32AM -0500, Mark Hahn wrote: > I'm curious: you imply that vendors have no recourse to force the _disk_ > vendors to supply parts which work right (as defined by standard). > is that really true? I'd be surprised if Dell doesn't get pretty emphatic > cooperation from their disk vendor(s). If you recall our past discussions here about "raid duty" disks, there are a couple of things not in the standard which are significant: vibration resistance, and the disk's maximum retry time vs. raid controllers deciding there's a timout. Most standards have problems like that. You can't imagine the interesting time PathScale had getting our InfiniBand products to cooperate well with Mellanox parts. -- greg From spandey at csse.unimelb.edu.au Sun Feb 14 14:33:28 2010 From: spandey at csse.unimelb.edu.au (Suraj Pandey) Date: Mon, 15 Feb 2010 09:33:28 +1100 Subject: [Beowulf] [hpc-announce] CCGrid 2010: Call for Research/Product Demos Message-ID: <623657FD-64B6-4146-A186-4ADB6E59647F@csse.unimelb.edu.au> [Please accept our apologies if you receive multiple copies of this email] ------------------------------------------------------------------------ CCGrid 2010: Call for Research/Product Demos ------------------------------------------------------------------------ The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2010) May 17-20, Melbourne, Victoria, Australia. http://www.manjrasoft.com/ccgrid2010/callfordemos.html We invite research demonstrations from laboratories or research groups (academic, government, or industrial) to showcase new innovations and technologies in Cloud Computing and HPC/scientific applications at the CCGrid 2010 conference. CCGrid is a highly successful and well-recognized International conference for presenting the latest breakthroughs in Cluster, Cloud and Grid technologies. Although we are looking for demos on the emerging Cloud Computing and GPU-based computing areas, we welcome demos from all areas within the scope of CCGrid 2010. The topics of interest including the following, but not limited to are: - Scientific, Engineering, Commercial or e-Science Applications using Cloud Computing - Middleware - Demonstrable Open-Challenges - Resource Management - Scheduling and Load Balancing - Programming Models, Tools, and Environments - Performance Evaluation and Modeling The proposal should include: - Up to 2 page description of the demo and the set up, and - A short (half-page) description of the research lab Accepted research demonstrations will be invited to present in the conference. These abstracts will not be included in the conference proceedings, but will be published on the conference website. Live demos are expected to be up and running during the Research Demonstration Session. We strongly discourage recorded presentations. In addition, authors of accepted demonstrations are expected to communicate more general information about the specific demo and other work being performed at the lab using their own posters (poster boards are provided). We will be providing wireless Internet access ONLY. This means, the hardware (compute and data resources) and software needed for the demo should reside either in public Clouds such as Amazon or private infrastructure of research labs/enterprises. A Best Research Demo Award will be presented to the winning demo/team selected by the award committee at the CCGrid 2010 conference. Please submit a pdf version of the proposal to: spandey at csse.unimelb.edu.au by the due date. Demo Co-ordinators -------------------- Pavan Balaji, Argonne National Laboratory A,B.M. Russel, VeRSI Suraj Pandey, University of Melbourne Important Dates -------------------- Deadline for submissions: March 31st, 2010 Notification of Acceptance: April 5th, 2010 ------------------------------------------------------------------------ Sincerely, Suraj Pandey Phd Candidate, Cloud Computing and Distributed Systems (CLOUDS) Lab Dept. of Computer Science and Software Engineering The University of Melbourne ICT Building, 111 Barry Street, Carlton, Melbourne, VIC 3053, Australia Phone: +61-3-8344-1355 (Off) Fax: +61-3-9348-1184 email: spandey at csse.unimelb.edu.au url: http://www.csse.unimelb.edu.au/~spandey ----------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From Z.Wu at leeds.ac.uk Tue Feb 16 03:53:59 2010 From: Z.Wu at leeds.ac.uk (Zhili Wu) Date: Tue, 16 Feb 2010 11:53:59 +0000 Subject: [Beowulf] SMLA'10 Workshop CFP: Submission Deadline Extended to March 10, 2010 In-Reply-To: <15AB31C404448F4A918D077CA7E74C6E012B1989ADE7@HERMES7.ds.leeds.ac.uk> References: <15AB31C404448F4A918D077CA7E74C6E012B1989ADE7@HERMES7.ds.leeds.ac.uk> Message-ID: <15AB31C404448F4A918D077CA7E74C6E012B1BC626EB@HERMES7.ds.leeds.ac.uk> [Please accept our apologies if you receive multiple copies of this email] CALL FOR PAPERS The 2010 International Workshop on Scalable Machine Learning and Applications (SMLA-10) To be held in conjunction with CIT'10 (Supported by IEEE Computer Society), June 29 - July 1, 2010, Bradford, UK http://smlc09.leeds.ac.uk/smla/ http://www.scim.brad.ac.uk/~ylwu/CIT2010/ SCOPE: Machine learning and data mining have been playing an increasing role in many real scenarios, such as web mining, language processing, image search, financial engineering, etc. In these application domains, data are surpassing the scale of terabyte in an ever faster pace, but the techniques for processing and mining them often lag behind in far too many aspects. To deal with billions of web pages, images, transaction records and capacity-intensive audio and video data stream, machine learning and data mining techniques and their underlying computing infrastructure are facing great challenges. In this SMLA workshop we are willing to bring together researchers and practitioners for getting advancement in scalable machine learning and applications. On one hand we expect works on how to dramatically empower existing machine learning and data mining methods via grid/cloud or other novel computing models. On the other hand we value the effort of building or extending machine learning and data mining methods that are scalable to huge datasets. Papers can be related to any subset of the following topics, or any unconventional direction to scale up machine learning and data mining methods: -- Cloud Computing -- Large Scale Data Mining -- Fast Support Vector Machines -- Data Abstraction, Dimension Reduction -- User Personalization and Recommendation -- Natural Language Processing -- Ontology and Semantic Technologies -- Parallelization of Machine Learning Methods -- Fast Machine Learning Model Tuning and Selection -- Large Scale Webpage Topic, Genre, Sentiment Classification -- Financial Engineering STEERING COMMITTEE Chih-Jen Lin, Natinal Taiwan University, Taiwan Serge Sharoff, University of Leeds, UK Katja Markert, University of Leeds, UK Ivor Wai-Hung Tsang, Nanyang Technological University, Singapore PROGRAM CHAIRS Zhili Wu, University of Leeds, UK Xiaolong Jin, University of Bradford, UK PUBLICITY CHAIRS Evi Syukur, University of New South Wales, Australia Lei Liu, University of Bradford, UK PROGRAM COMMITTEE Please refer to http://smlc09.leeds.ac.uk/smla/committee.htm for a complete list of program committee PAPER SUBMISSION: Authors are invited to submit manuscripts reporting original unpublished research and recent developments in the topics related to the workshop. The length of the papers should not exceed 6 pages + 2 pages for overlength charges (IEEE Computer Society Proceedings Manuscripts style: two columns, single-spaced), including figures and references, using 10 fonts, and number each page. Papers should be submitted electronically in PDF format (or postscript) by sending it as an e-mail attachment to Zhili Wu (z.wu at leeds.ac.uk). All papers will be peer reviewed and the comments will be provided to the authors. The accepted papers will be published together with those of other CIT'10 workshops by the IEEE Computer Society Press. *********************************************************************** Distinguished selected papers, after further extensions, will be published in CIT 2010's special issues of the following prestigious SCI-indexed journals: -- The Journal of Supercomputing ?C Springer -- Journal of Computer and System Sciences ?C Elsevier -- Concurrency and Computation: Practice and Experience - John Wiley & Sons *********************************************************************** IMPORTANT DATES: Paper submission: March 10, 2010, GMT-11 Notification of Acceptance: April 01, 2010 Camera-ready due: April 18, 2010 Author registration: April 18, 2010 Conference: June 29 - July 1, 2010 *********************************************************************** From vallard at benincosa.com Tue Feb 16 07:43:22 2010 From: vallard at benincosa.com (Vallard Benincosa) Date: Tue, 16 Feb 2010 07:43:22 -0800 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: <58062.192.168.1.1.1266334089.squirrel@mail.eadline.org> References: <58062.192.168.1.1.1266334089.squirrel@mail.eadline.org> Message-ID: <56E58A53-65F9-4177-95FA-54F35064C24C@benincosa.com> pbstop is a great curses tool for TORQUE to get an idea of where jobs are allocated in a visual way On Feb 16, 2010, at 7:28 AM, "Douglas Eadline" wrote: > For SGE: > > http://xml-qstat.org/index.html > > -- > Doug > >> Are there any generic "scheduler visualization" tools out there? >> Sometimes I feel it'd be nice if I had a way to find out how my >> sceduling was performing. i.e. blocks of empty procs; how fragmented >> the job assignment was; large / small job split, utilization >> efficiency, backfill status; etc. I use openpbs (torque) + maui and >> it >> does have some text mode accounting reports. But sometimes they are >> hard to digest and a birds eye view might be easier via a >> visualization. >> >> I haven't found any toolkits yet. Of course, I could parse and plot >> myself with a bunch of sed / awk / gnuplot but I don't want to >> unnecessarily reinvent the wheel if I can avoid it. Also, I >> remembered >> seeing some cool visualizations (quite animated at that) at one of >> the >> supercomputing agencies a while ago but just can't seem to find which >> one it was now that I need it. Admittedly, some of the visualizations >> can sway more towards the "coolness" factor than actual insights but >> still it's worth a shot. >> >> Any pointers or scripts other Beowulfers might have are greatly >> appreciated. >> >> -- >> Rahul >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin >> Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > > -- > Doug > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From oneal at dbi.udel.edu Tue Feb 16 10:20:39 2010 From: oneal at dbi.udel.edu (Doug O'Neal) Date: Tue, 16 Feb 2010 13:20:39 -0500 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B7AD159.50000@scalableinformatics.com> Message-ID: On 02/16/2010 12:52 PM, Lux, Jim (337C) wrote: > > > On 2/16/10 9:09 AM, "Joe Landman" wrote: >> >> 5X markup? We must be doing something wrong :/ >> > > > Depends on what the price includes. I could easily see a commodity drive in > a case lot being dropped on the loading dock at, say, $100 each, and the > drive with installation, system integrator testing, downstream support, etc. > being $500. Doesn't take many hours on the phone tracking down an > idiosyncracy or setup to cost $500 in labor. But when you're installing anywhere from eight to forty-eight drives in a single system the required hours to make up that $400/drive overhead does get larger. And if you spread the system integrator testing over eight drives per unit and hundreds to thousands of units the cost per drive shouldn't be measured in hundreds of dollars. From prentice at ias.edu Tue Feb 16 11:18:55 2010 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 16 Feb 2010 14:18:55 -0500 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <39109.192.168.1.1.1266260198.squirrel@mail.eadline.org> References: <39109.192.168.1.1.1266260198.squirrel@mail.eadline.org> Message-ID: <4B7AEF9F.8050407@ias.edu> Actually, I think the ISO standard calls for concatenating those two words into one. Douglas Eadline wrote: > There are two "ISO standard" English words I have for this kind of > marketing response. > > -- > Doug > > >> This was the response from Dell, I especially like the analogy: >> >> [snip] >>> There are a number of benefits for using Dell qualified drives in >>> particular ensuring a ***positive experience*** and protecting ***our >>> data***. >>> While SAS and SATA are industry standards there are differences which >>> occur in implementation. An analogy is that English is spoken in the UK, >>> US >and Australia. While the language is generally the same, there are >>> subtle differences in word usage which can lead to confusion. This exists >>> in >storage subsystems as well. As these subsystems become more capable, >>> faster and more complex, these differences in implementation can have >>>> greater impact. >> [snip] >> >> I added the emphasis. I am in love Dell-disks that get me "the >> positive experience". :) >> >> -- >> Rahul -- Prentice From gerry.creager at tamu.edu Tue Feb 16 11:33:35 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 16 Feb 2010 13:33:35 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: <4B7AF30F.1010804@tamu.edu> On 2/16/10 12:32 PM, Lux, Jim (337C) wrote: >> -----Original Message----- >> From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Rahul Nabar >> Sent: Tuesday, February 16, 2010 9:38 AM >> To: landman at scalableinformatics.com >> Cc: Mikhail Kuzminsky; beowulf at beowulf.org >> Subject: Re: [Beowulf] Third-party drives not permitted on new Dell servers? >> >> On Mon, Feb 15, 2010 at 7:41 PM, Joe Landman >> wrote: >> >>> Please indulge my taking a contrarian view based upon the products we >>> sell/support/ship. >>> I can't and won't sanction their tone to you ... they should have explained >>> things correctly. Given that PERC are rebadged LSI, yeah, I know perfectly >>> well a whole mess of drives that *do not* work correctly with them. >>> >>> So please don't take Dell to task for trying to help you avoid making what >>> they consider a bad decision on specific components. There could be a >>> marketing aspect to it, but support is a cost, and they want to minimize >>> costs. Look at failure rates, and toss the suppliers who have very high >>> ones. >> >> Another worry is what happens in the long run if the vendor either >> folds shop or stops selling and / or supporting that particular model >> of drive. Frequently the lifecycle of these devices is longer than the >> warranty. The inability to shop around for drives could be an issue. >> Especially with this rigid approach of firmware rejecting a foreign >> component and not just a warning. >> > > "lifecycle" has a lot of meanings, and I think that's where the problems arise, in some cases. > For the most part, PC hardware these days is designed based on a 3 year replacement schedule, so a vendor will set up their warranty terms, model introduction and retirement schedule based on that, regardless of whether the actual life is much (sometimes much, much) longer (as all of us with old NT4.0 and DOS 3.x machines lurking in the lab know). > > Unfortunately, the HPC (Beowulf) world is driven by the economics of the ordinary consumer/office desktop computer. That's what lets you build a teraflop machine without incurring the debt of a small country: you can leverage the mass production for consumers which drives the prices down, but also has very short product cycles. > > The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment a "5-year" piece of gear, but MACRS means that by the end of year 3, you've already depreciated over 70% of the purchase price. Considering that purchase price is about half the "total cost of ownership" (at most), it makes sense to buy new gear that often (since the help-desk, configuration management, networking, etc, support costs remain fixed per month). The hardware cost is often a small fraction of the overall cost to put a computer on someone's desk. If you look at a typical desktop PC scenario, you might have a $2500 computer, where the first year depreciation is $42/mo, the second year is $67/mo, and the third year is $40/mo. On top of that, support costs might be $100-200/mo. In the fourth year, under MACRS, the depreciation is $24/mo. So you could get a brand new computer (which will be easier to support, is faster, etc.) for a big $22/month hit on your budget (! > which is<10% of the total monthly tab, counting the support costs). It's a no brainer to turn over the computers that fast, especially if you are trying to save on support costs by having a limited number of different models in the installed base at any given time (which is what large companies do). > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf OK, my brain hurts, now. You've been in Management too long, Jim! From rpnabar at gmail.com Tue Feb 16 11:35:52 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 16 Feb 2010 13:35:52 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B7AEF9F.8050407@ias.edu> References: <39109.192.168.1.1.1266260198.squirrel@mail.eadline.org> <4B7AEF9F.8050407@ias.edu> Message-ID: On Tue, Feb 16, 2010 at 1:18 PM, Prentice Bisbal wrote: > Actually, I think the ISO standard calls for concatenating those two > words into one. > > Douglas Eadline wrote: >> There are two "ISO standard" English words I have for this kind of >> marketing response. >> I think Doug was utilizing the wiggle-room in the ISO standards for the word(s). ;) -- Rahul From rpnabar at gmail.com Tue Feb 16 11:42:49 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 16 Feb 2010 13:42:49 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: On Tue, Feb 16, 2010 at 12:32 PM, Lux, Jim (337C) wrote: > Unfortunately, the HPC (Beowulf) world is driven by the economics of the ordinary consumer/office desktop computer. ?That's what lets you build a teraflop machine without incurring the debt of a small country: you can leverage the mass production for consumers which drives the prices down, but also has very short product cycles. > > The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment a "5-year" piece of gear, but On the other hand many of the Beowulfers are in the govt. / university / higher-ed. domain where things run somewhat "tax free"? Not sure if then these IRS writeoffs then factor much into decision making or not. All the more reason to avoid getting locked-in to 3-year vendor cycles. -- Rahul From james.p.lux at jpl.nasa.gov Tue Feb 16 11:48:51 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 16 Feb 2010 11:48:51 -0800 Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B7AD159.50000@scalableinformatics.com> Message-ID: James Lux, P.E. Task Manager, SOMD Software Defined Radios Flight Communications Systems Section Jet Propulsion Laboratory 4800 Oak Grove Drive, Mail Stop 161-213 Pasadena, CA, 91109 +1(818)354-2075 phone +1(818)393-6875 fax > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Doug O'Neal > Sent: Tuesday, February 16, 2010 10:21 AM > To: beowulf at beowulf.org > Subject: [Beowulf] Re: Third-party drives not permitted on new Dell servers? > > On 02/16/2010 12:52 PM, Lux, Jim (337C) wrote: > > > > > > On 2/16/10 9:09 AM, "Joe Landman" wrote: > >> > >> 5X markup? We must be doing something wrong :/ > >> > > > > > > Depends on what the price includes. I could easily see a commodity drive in > > a case lot being dropped on the loading dock at, say, $100 each, and the > > drive with installation, system integrator testing, downstream support, etc. > > being $500. Doesn't take many hours on the phone tracking down an > > idiosyncracy or setup to cost $500 in labor. > > But when you're installing anywhere from eight to forty-eight drives in a > single system the required hours to make up that $400/drive overhead does > get larger. And if you spread the system integrator testing over eight > drives > per unit and hundreds to thousands of units the cost per drive shouldn't be > measured in hundreds of dollars. > True, IFF the costing strategy is based on that sort of approach. Various companies can and do price the NRE and support tail cost in a variety of ways. They might have a "notional" system size and base the pricing model on that: Say they, through research, find that most customers are buying, say, 32 systems at a crack. Now the support tail (which is basically "per system") is spread across only 32 drives, not thousands. If you happen to buy 64 systems, then you basically are paying twice. Most companies don't have infinite granularity in this sort of thing, and try to pick a few breakpoints that make sense. (NRE = non recurring engineering) As far as the NRE goes, say they get a batch of a dozen drives each of half a dozen kinds. They have to set up half a dozen test systems (either in parallel or sequentially), run the tests on all of them, and wind up with maybe 2 or 3 leading candidates that they decide to list on their "approved disk" list. The cost of testing the disks that didn't make the cut has to be added to the cost of the disks that did. There's a lot that goes into pricing that isn't obvious at first glance, or even second glance, especially if you're looking at a single instance (your own purchase) and trying to work backwards from there. There are weird anomalies that crop up in supposedly commodity items from things like fuel prices (e.g. you happened to buy that container load of disks when fuel prices were high, so shipping cost more). A couple years ago, there were huge fluctuations in the price of copper, so there would be 2:1 differences in the retail cost of copper wire and tubing at the local Home Depot and Lowes, basically depending on when they happened to have bought the stuff wholesale. (this is the kind of thing that arbitrageurs look for, of course) Some of it is "paying for convenience", too. Rather than do all the testing yourself, or writing a detailed requirements and procurement document for a third party, both of which cost you some non-zero amount of time and money, you just pay the increased price to a vendor who's done it for you. It's like eating sausage. You can buy already made sausage, and the sausage maker has done the experimenting with seasoning and process controls to come out with something that people like. Or, you can spend the time to make it yourself, potentially saving some money and getting a more customized sausage taste, BUT, you're most likely going to have some less-than-ideal sausage in the process. The more computers or sausage you're consuming, the more likely it is that you could do better with a customized approach, but, even there, you may be faced with resource limits (e.g. you could spend your time getting a better deal on the disks or you could spend your time doing research with the disks. Ultimately, the research MUST get done, so you have to trade off how much you're willing to spend.) From james.p.lux at jpl.nasa.gov Tue Feb 16 11:52:23 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 16 Feb 2010 11:52:23 -0800 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: > -----Original Message----- > From: Rahul Nabar [mailto:rpnabar at gmail.com] > Sent: Tuesday, February 16, 2010 11:43 AM > To: Lux, Jim (337C) > Cc: landman at scalableinformatics.com; Mikhail Kuzminsky; beowulf at beowulf.org > Subject: Re: [Beowulf] Third-party drives not permitted on new Dell servers? > > On Tue, Feb 16, 2010 at 12:32 PM, Lux, Jim (337C) > wrote: > > Unfortunately, the HPC (Beowulf) world is driven by the economics of the ordinary consumer/office > desktop computer. ?That's what lets you build a teraflop machine without incurring the debt of a small > country: you can leverage the mass production for consumers which drives the prices down, but also has > very short product cycles. > > > > The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment > a "5-year" piece of gear, but > > On the other hand many of the Beowulfers are in the govt. / university > / higher-ed. domain where things run somewhat "tax free"? Not sure if > then these IRS writeoffs then factor much into decision making or not. > All the more reason to avoid getting locked-in to 3-year vendor > cycles. > Yes, but my point was that Beowulfers are a relatively small consumer of what the industry produces, and what *industry* decides to produce is driven by what most people face in terms of lifecycle practices. So, even if you don't live with the IRS rules (and even if you're tax free, that doesn't mean that your accountants don't follow Generally Accepted Accounting Practices, which map pretty much one to one), you wind up being affected by them. From gerry.creager at tamu.edu Tue Feb 16 11:57:07 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 16 Feb 2010 13:57:07 -0600 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: References: <4B79F7B4.9020808@scalableinformatics.com> Message-ID: <4B7AF893.5000103@tamu.edu> On 2/16/10 1:42 PM, Rahul Nabar wrote: > On Tue, Feb 16, 2010 at 12:32 PM, Lux, Jim (337C) > wrote: >> Unfortunately, the HPC (Beowulf) world is driven by the economics of the ordinary consumer/office desktop computer. That's what lets you build a teraflop machine without incurring the debt of a small country: you can leverage the mass production for consumers which drives the prices down, but also has very short product cycles. >> >> The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment a "5-year" piece of gear, but > > On the other hand many of the Beowulfers are in the govt. / university > / higher-ed. domain where things run somewhat "tax free"? Not sure if > then these IRS writeoffs then factor much into decision making or not. > All the more reason to avoid getting locked-in to 3-year vendor > cycles. At MY University, and in all my grants, I plan a 3-year depreciation cycle (regardless of the grief I gave Jim privately). You've gotta have a plan. That said, however, I also get to beat the inventory folk senseless when they tell me I've got a $100k computer on inventory that's 8 years old and is listed as a '386. I tend to look for best price/performance ratios. I don't want something that will perform poorly because it impacts my research as well as that of the folks who use our clusters. I don't have to go for the cheapest, especially if it's incompatible. But, I don't go for the most expensive, without due diligence, just 'cause it costs more than every one else. gerry From james.p.lux at jpl.nasa.gov Tue Feb 16 13:59:27 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 16 Feb 2010 13:59:27 -0800 Subject: [Beowulf] Third-party drives not permitted on new Dell servers? In-Reply-To: <4B7AF893.5000103@tamu.edu> References: <4B79F7B4.9020808@scalableinformatics.com> <4B7AF893.5000103@tamu.edu> Message-ID: > >> The 3 year cycle is driven by in large part by IRS depreciation rules which call computer equipment > a "5-year" piece of gear, but > > > > On the other hand many of the Beowulfers are in the govt. / university > > / higher-ed. domain where things run somewhat "tax free"? Not sure if > > then these IRS writeoffs then factor much into decision making or not. > > All the more reason to avoid getting locked-in to 3-year vendor > > cycles. > > At MY University, and in all my grants, I plan a 3-year depreciation > cycle (regardless of the grief I gave Jim privately). You've gotta have > a plan. That said, however, I also get to beat the inventory folk > senseless when they tell me I've got a $100k computer on inventory > that's 8 years old and is listed as a '386. How else would you be able to find ads for surplus equipment: widget $5; original government cost $50,000. (note carefully government cost != government value) And yes.. every year, we get to fight that same battle. What do you mean you can't find that 8" floppy drive! It's in the property database as costing over $5000 (when we acquired it in 1981). I'll bet we have some Apple IIs around still in inventory. Jim From brian.ropers.huilman at gmail.com Wed Feb 17 07:25:34 2010 From: brian.ropers.huilman at gmail.com (Brian D. Ropers-Huilman) Date: Wed, 17 Feb 2010 09:25:34 -0600 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: Message-ID: On Mon, Feb 15, 2010 at 15:56, Rahul Nabar wrote: > Are there any generic "scheduler visualization" tools out there? We've been developing in-house tools for this (both representing usage overall, but recently focusing on a scheduler view) for a while. We have a set of command-line tools to produce reports and graphs. Here's one graph for a snapshot of a view for the month of December on one of our clusters with 256 nodes and 2048 cores: http://www.msi.umn.edu/~bropers/calhoun_december.png The top is each job as it fits on the various nodes and time and the bottom is utilization. We have recently added the ability to show these graphs per user or group with their jobs highlighted and the rest dimmed and we've also added queue wait, number of jobs, and several other graphs, similar to the utilization, at the bottom: http://www.msi.umn.edu/~bropers/group_calhoun_jan2010.png We run torque with Moab and this is a result of parsing the torque logs. We are still going through and validating the code and adding features and may be at a point, somewhere in the future, where we'd be comfortable releasing it. If you want to discuss more, please contact me off-list. -- Brian D. Ropers-Huilman, Director Systems Administration and Technical Operations Minnesota Supercomputing Institute 599 Walter Library +1 612-626-5948 (V) 117 Pleasant Street S.E. +1 612-624-8861 (F) University of Minnesota Twin Cities Campus Minneapolis, MN 55455-0255 http://www.msi.umn.edu/ From hahn at mcmaster.ca Wed Feb 17 10:52:20 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 17 Feb 2010 13:52:20 -0500 (EST) Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: Message-ID: > http://www.msi.umn.edu/~bropers/calhoun_december.png we've done this kind of color-job band before, and found that it was difficult to read. another approach is to show jobs as logical blocks, rather than cpus mapped directly to y-axis: https://www.sharcnet.ca/dynamic_images/clusterJobsPlot.saw.png admittedly, that's not terribly pretty. and MPI implementations that busy-wait make the %cpu report less useful than it might be. > We run torque with Moab and this is a result of parsing the torque > logs. We are still going through and validating the code and adding we run LSF, a home-grown scheduler and Maui on ~21 clusters, and feed job data into a central DB which permanently records all history. graphs like above (and others that show various usage metrics by user/group/cluster/jobsize/jobtype) are derived from the DB. -mark hahn. From jac67 at georgetown.edu Wed Feb 17 11:18:41 2010 From: jac67 at georgetown.edu (Jess Cannata) Date: Wed, 17 Feb 2010 14:18:41 -0500 Subject: [Beowulf] Upcoming GPU Programming Seminar at Georgetown University Message-ID: <4B7C4111.8010008@georgetown.edu> For those of you in the D.C. area: *** Next week Wednesday (24 Feb) we are hosting a free, three hour GPU programming seminar at Georgetown University. Details can be found at http://training.arc.georgetown.edu/gpu_seminar-feb2010.html It is open to anyone interested in GPU programming. Feel free to forward to anyone who may be interested. Please RSVP to me if you plan on attending. Thanks. -- Jess Cannata High Performance Computing Georgetown University 202-687-3661 From kuenching at gmail.com Wed Feb 17 11:23:57 2010 From: kuenching at gmail.com (Tsz Kuen Ching) Date: Wed, 17 Feb 2010 14:23:57 -0500 Subject: [Beowulf] PVM 3.4.5-12 terminates when adding Host on Ubuntu 9.10 In-Reply-To: <0D2F92CE-AAEB-4D9E-9AC6-F591C9AE1773@staff.uni-marburg.de> References: <0D2F92CE-AAEB-4D9E-9AC6-F591C9AE1773@staff.uni-marburg.de> Message-ID: Hello, Thanks for the reply, I have asked around and found out that there are no firewall on the machine which blocks certain ports. Does anyone else have an idea or answer? On Sun, Feb 14, 2010 at 6:18 PM, Reuti wrote: > Am 11.02.2010 um 19:43 schrieb Tsz Kuen Ching: > > > Whenever I attempt to add a host in PVM it ends up terminating the process >> in the master program. The process does run in the slave node, however >> because the PVM terminates I do not get access to the node. >> >> I'm currently using Ubuntu 9.10, and I used apt-get to install pvm ( >> pvmlib, pvmdev, pvm). >> Thus $PVM_ROOT is set automatically, and so is $PVM_ARCH >> As for the other variables, I have not looked for them. >> >> I can ssh into the the slave without the need of a password. >> > > Do you have any firwall on the machines which blocks certain ports? > > -- Reuti > > > >> Any Ideas or suggestions? >> >> This is what happens: >> >> user at laptop> pvm >> pvm> add slave-slave >> add slave-slave >> Terminated >> user at laptop> ... >> >> The logs are as followed: >> >> Laptop log >> --- >> [t80040000] 02/11 10:23:32 laptop (127.0.1.1:55884) LINUX 3.4.5 >> [t80040000] 02/11 10:23:32 ready Thu Feb 11 10:23:32 2010 >> [t80040000] 02/11 10:23:32 netoutput() sendto: errno=22 >> [t80040000] 02/11 10:23:32 em=0x2c24f0 >> [t80040000] 02/11 10:23:32 >> [49/?][6e/?][76/?][61/?][6c/?][69/?][64/?][20/?][61/?][72/?] >> [t80040000] 02/11 10:23:32 netoutput() sendto: Invalid argument >> [t80040000] 02/11 10:23:32 pvmbailout(0) >> >> slave-log >> --- >> [t80080000] 02/11 10:23:25 slave-slave (xxx.x.x.xxx:57344) LINUX64 3.4.5 >> [t80080000] 02/11 10:23:25 ready Thu Feb 11 10:23:25 2010 >> [t80080000] 02/11 10:28:26 work() run = STARTUP, timed out waiting for >> master >> [t80080000] 02/11 10:28:26 pvmbailout(0) >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at googlemail.com Wed Feb 17 22:57:12 2010 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 18 Feb 2010 06:57:12 +0000 Subject: [Beowulf] Top500 power consumption Message-ID: <9f8092cc1002172257m4dfa8022s9f47a7502cbead4@mail.gmail.com> As I remember, the Top500 site now lists power consumption of systems, there cenrtainly is an section on the site from a few years ago discussing this. However I could not extract any figures. Does anyone know the magic buttons to press? I did find the Green500 site, which isn't very well populated with systems. From hearnsj at googlemail.com Wed Feb 17 23:11:08 2010 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 18 Feb 2010 07:11:08 +0000 Subject: [Beowulf] Connecting QSFP to SFP+ Message-ID: <9f8092cc1002172311p6c39aea8ga99cc6ac8093b7fd@mail.gmail.com> I realise this is a forlorn quest. Can anyone think of a way of converting between a QSFP plug and an SFP+ socket? Ie. if I wanted to connect a QSFP cable into a 10gigabit ethernet switch which has an SFP port do I stand a cat's chance in hell? The gentleman from Mellanox who gave me a lot of help here may be permitted to have a small smile. And yes, the Q is QSFP = quad, so there are four of the things. I guess I should just bite the bullet and go 40Gbps http://www.networkworld.com/news/2009/021009-voltaire-switch.html (no, I'm not being really serious here From Shainer at mellanox.com Wed Feb 17 23:19:57 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Wed, 17 Feb 2010 23:19:57 -0800 Subject: [Beowulf] Connecting QSFP to SFP+ References: <9f8092cc1002172311p6c39aea8ga99cc6ac8093b7fd@mail.gmail.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F026629CB@mtiexch01.mti.com> There are several options, one for example is to use a hybrid cable - QSFP to SFP+, which you can find in the market today. Ping me for more info or other options. Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of John Hearns Sent: Wednesday, February 17, 2010 11:11 PM To: Beowulf Mailing List Subject: [Beowulf] Connecting QSFP to SFP+ I realise this is a forlorn quest. Can anyone think of a way of converting between a QSFP plug and an SFP+ socket? Ie. if I wanted to connect a QSFP cable into a 10gigabit ethernet switch which has an SFP port do I stand a cat's chance in hell? The gentleman from Mellanox who gave me a lot of help here may be permitted to have a small smile. And yes, the Q is QSFP = quad, so there are four of the things. I guess I should just bite the bullet and go 40Gbps http://www.networkworld.com/news/2009/021009-voltaire-switch.html (no, I'm not being really serious here _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From dnlombar at ichips.intel.com Thu Feb 18 06:47:54 2010 From: dnlombar at ichips.intel.com (David N. Lombard) Date: Thu, 18 Feb 2010 06:47:54 -0800 Subject: [Beowulf] Visualization toolkit to monitor scheduler performance In-Reply-To: References: Message-ID: <20100218144754.GB15298@nlxcldnl2.cl.intel.com> On Wed, Feb 17, 2010 at 07:25:34AM -0800, Brian D. Ropers-Huilman wrote: > On Mon, Feb 15, 2010 at 15:56, Rahul Nabar wrote: > > Are there any generic "scheduler visualization" tools out there? > > We've been developing in-house tools for this (both representing usage > overall, but recently focusing on a scheduler view) for a while. We > have a set of command-line tools to produce reports and graphs. Here's > one graph for a snapshot of a view for the month of December on one of > our clusters with 256 nodes and 2048 cores: > > http://www.msi.umn.edu/~bropers/calhoun_december.png ... > http://www.msi.umn.edu/~bropers/group_calhoun_jan2010.png > > We run torque with Moab and this is a result of parsing the torque > logs. We are still going through and validating the code and adding > features and may be at a point, somewhere in the future, where we'd be > comfortable releasing it. Brian, That's an outstanding job! There's a tremendous amount of information that's quite easy to understand. I do hope you release it soon. -- David N. Lombard, Intel, Irvine, CA I do not speak for Intel Corporation; all comments are strictly my own. From reuti at staff.uni-marburg.de Thu Feb 18 08:33:37 2010 From: reuti at staff.uni-marburg.de (Reuti) Date: Thu, 18 Feb 2010 17:33:37 +0100 Subject: [Beowulf] PVM 3.4.5-12 terminates when adding Host on Ubuntu 9.10 In-Reply-To: References: <0D2F92CE-AAEB-4D9E-9AC6-F591C9AE1773@staff.uni-marburg.de> Message-ID: <74775D56-4E92-4846-B9B4-F5E3B70D759A@staff.uni-marburg.de> Hi, Am 17.02.2010 um 20:23 schrieb Tsz Kuen Ching: > Thanks for the reply, I have asked around and found out that there > are no firewall on the machine which blocks certain ports. > > Does anyone else have an idea or answer? > > On Sun, Feb 14, 2010 at 6:18 PM, Reuti > wrote: > Am 11.02.2010 um 19:43 schrieb Tsz Kuen Ching: > > > Whenever I attempt to add a host in PVM it ends up terminating the > process in the master program. The process does run in the slave > node, however because the PVM terminates I do not get access to the > node. > > I'm currently using Ubuntu 9.10, and I used apt-get to install pvm > ( pvmlib, pvmdev, pvm). > Thus $PVM_ROOT is set automatically, and so is $PVM_ARCH > As for the other variables, I have not looked for them. > > I can ssh into the the slave without the need of a password. > > Do you have any firwall on the machines which blocks certain ports? > > -- Reuti > > > > Any Ideas or suggestions? > > This is what happens: > > user at laptop> pvm > pvm> add slave-slave > add slave-slave > Terminated > user at laptop> ... > > The logs are as followed: > > Laptop log > --- > [t80040000] 02/11 10:23:32 laptop (127.0.1.1:55884) LINUX 3.4.5 Does the laptop have a real address instead of 127.0.1.1, from which it can be accessed from slave-slave? Instead of using ssh, you can also startup pvm without any rsh/ssh by specifying: so=ms in the hostfile for this particular slave-slave and type a command by hand on slave-slave. -- Reuto > [t80040000] 02/11 10:23:32 ready Thu Feb 11 10:23:32 2010 > [t80040000] 02/11 10:23:32 netoutput() sendto: errno=22 > [t80040000] 02/11 10:23:32 em=0x2c24f0 > [t80040000] 02/11 10:23:32 [49/?][6e/?][76/?][61/?][6c/?][69/?][64/ > ?][20/?][61/?][72/?] > [t80040000] 02/11 10:23:32 netoutput() sendto: Invalid argument > [t80040000] 02/11 10:23:32 pvmbailout(0) > > slave-log > --- > [t80080000] 02/11 10:23:25 slave-slave (xxx.x.x.xxx:57344) LINUX64 > 3.4.5 > [t80080000] 02/11 10:23:25 ready Thu Feb 11 10:23:25 2010 > [t80080000] 02/11 10:28:26 work() run = STARTUP, timed out waiting > for master > [t80080000] 02/11 10:28:26 pvmbailout(0) > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From rpnabar at gmail.com Thu Feb 18 09:37:36 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 18 Feb 2010 11:37:36 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? Message-ID: Discussions that I read on this list in the last couple of months tempt me to do away with hardware RAID entirely for a new mini-storage-project I have to do. I am thinking of going for a JBOD with Linux Software RAID via mdadm. Hardware RAID just doesn't have the original awesomeness that it had me mesmerized with. Any recommendations for a good JBOD? The requirements are simple. 5 Terabytes total capacity. SATA drives. Don't need high performance: these are for archival home dirs. No active jobs run from this storage. Reliability and low price are key. Some kind of Direct-Attached Storage box. RAID5 or RAID6 maybe. Already have a pretty fast 8 core server with lots of RAM that I can hook this up to. Neither bandwidth nor IOPS need to be terribly high. Most of the data here is pretty static and not often moved around. One of the things I notice is that 5 Terabytes seems too low-end these days. Can't find many solutions tailored to this size. Most come with 12 or 16 bays etc. which seems excessive for this application. -- Rahul From gerry.creager at tamu.edu Thu Feb 18 10:12:05 2010 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 18 Feb 2010 12:12:05 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: Message-ID: <4B7D82F5.9050503@tamu.edu> For what you're describing, I'd consider CoRAID's AoE technology and system, and use their RAID6 capability. Otherwise, get yourself a box with up to 8 slots, preferably with hot-swap capability, and forge ahead. gerry Rahul Nabar wrote: > Discussions that I read on this list in the last couple of months > tempt me to do away with hardware RAID entirely for a new > mini-storage-project I have to do. I am thinking of going for a JBOD > with Linux Software RAID via mdadm. Hardware RAID just doesn't have > the original awesomeness that it had me mesmerized with. > > Any recommendations for a good JBOD? The requirements are simple. 5 > Terabytes total capacity. SATA drives. Don't need high performance: > these are for archival home dirs. No active jobs run from this > storage. Reliability and low price are key. Some kind of > Direct-Attached Storage box. RAID5 or RAID6 maybe. Already have a > pretty fast 8 core server with lots of RAM that I can hook this up to. > Neither bandwidth nor IOPS need to be terribly high. Most of the data > here is pretty static and not often moved around. > > One of the things I notice is that 5 Terabytes seems too low-end these > days. Can't find many solutions tailored to this size. Most come with > 12 or 16 bays etc. which seems excessive for this application. > -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From chekh at pcbi.upenn.edu Thu Feb 18 10:14:01 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Thu, 18 Feb 2010 13:14:01 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: Message-ID: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> On Thu, 18 Feb 2010 11:37:36 -0600 Rahul Nabar wrote: > Discussions that I read on this list in the last couple of months > tempt me to do away with hardware RAID entirely for a new > mini-storage-project I have to do. I am thinking of going for a JBOD > with Linux Software RAID via mdadm. Hardware RAID just doesn't have > the original awesomeness that it had me mesmerized with. > > Any recommendations for a good JBOD? The requirements are simple. 5 > Terabytes total capacity. SATA drives. Don't need high performance: > these are for archival home dirs. No active jobs run from this > storage. Reliability and low price are key. Some kind of > Direct-Attached Storage box. RAID5 or RAID6 maybe. Already have a > pretty fast 8 core server with lots of RAM that I can hook this up to. > Neither bandwidth nor IOPS need to be terribly high. Most of the data > here is pretty static and not often moved around. > > One of the things I notice is that 5 Terabytes seems too low-end these > days. Can't find many solutions tailored to this size. Most come with > 12 or 16 bays etc. which seems excessive for this application. > Does it need to be rack-mount? What kind of interface? If performance is really not an issue, a consumer-level NAS box is probably your cheapest option. A QNap TS-410 is ~$450 plus 4 x 2TB drives... -- Alex Chekholko chekh at pcbi.upenn.edu From fly at anydata.co.uk Thu Feb 18 10:28:42 2010 From: fly at anydata.co.uk (Fred Youhanaie) Date: Thu, 18 Feb 2010 18:28:42 +0000 Subject: [Beowulf] Top500 power consumption In-Reply-To: <9f8092cc1002172257m4dfa8022s9f47a7502cbead4@mail.gmail.com> References: <9f8092cc1002172257m4dfa8022s9f47a7502cbead4@mail.gmail.com> Message-ID: <4B7D86DA.9000506@anydata.co.uk> On 23/12/42 20:59, John Hearns wrote: > As I remember, the Top500 site now lists power consumption of systems, > there cenrtainly is an section on the site from a few years ago > discussing this. However I could not extract any figures. Does anyone > know the magic buttons to press? > I did find the Green500 site, which isn't very well populated with systems. > Hi John Try the XML or Excel link from any of the specific lists, e.g. from http://top500.org/lists/2009/11 use http://top500.org/static/lists/xml/TOP500_200911_all.xml or http://top500.org/static/lists/2009/11/TOP500_200911.xls HTH Cheers f. From rpnabar at gmail.com Thu Feb 18 10:53:22 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 18 Feb 2010 12:53:22 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> Message-ID: On Thu, Feb 18, 2010 at 12:14 PM, Alex Chekholko wrote: > On Thu, 18 Feb 2010 11:37:36 -0600 > Does it need to be rack-mount? ?What kind of interface? Preferably rack-mount. But cost is a compelling argument . I could be convinced if a non-rack unit was significantly cheaper. I was thinking SAS / SCSI / iSCSI is probably easiest and cheapest. > If performance is really not an issue, a consumer-level NAS box is > probably your cheapest option. ?A QNap TS-410 is ~$450 plus 4 x 2TB > drives... Hmm...NAS. I was more thinking in terms of a DAS. Don't the NAS's come with their own CPU's / RAM and stuff? (Like the Sun Thumper) -- Rahul From rpnabar at gmail.com Thu Feb 18 11:00:51 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 18 Feb 2010 13:00:51 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7D82F5.9050503@tamu.edu> References: <4B7D82F5.9050503@tamu.edu> Message-ID: On Thu, Feb 18, 2010 at 12:12 PM, Gerald Creager wrote: > For what you're describing, I'd consider CoRAID's AoE technology and system, > and use their RAID6 capability. Otherwise, get yourself a box with up to 8 > slots, preferably with hot-swap capability, and forge ahead. > Thanks Gerry! That one looks promising. One thing that always confuses me: Let's say I do software RAID. So just buy a JBOD device. It still has some "controller" on it, correct? Is the quality of this controller very critical or not so much? And if a unit advertises itself as Hardware RAID can I still use it as JBOD mode? There seems more hits for "RAID" out there than for "JBOD". Maybe I am looking up the wrong keyword. -- Rahul From mwill at penguincomputing.com Thu Feb 18 11:07:16 2010 From: mwill at penguincomputing.com (Michael Will) Date: Thu, 18 Feb 2010 11:07:16 -0800 Subject: [Beowulf] Any recommendations for a good JBOD? References: <4B7D82F5.9050503@tamu.edu> Message-ID: <433093DF7AD7444DA65EFAFE3987879CCBD88B@orca.penguincomputing.com> Often a jbod is just sas/sata attached, and the real controller is in the host that attaches to it. It could then be a hardware raid controller from adaptec, or one of the very fast lsi sas/sata hca's which you could then use with software raid and/or LVM... Michael -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Rahul Nabar Sent: Thu 2/18/2010 11:00 AM To: gerry.creager at tamu.edu Cc: Beowulf Mailing List Subject: Re: [Beowulf] Any recommendations for a good JBOD? On Thu, Feb 18, 2010 at 12:12 PM, Gerald Creager wrote: > For what you're describing, I'd consider CoRAID's AoE technology and system, > and use their RAID6 capability. Otherwise, get yourself a box with up to 8 > slots, preferably with hot-swap capability, and forge ahead. > Thanks Gerry! That one looks promising. One thing that always confuses me: Let's say I do software RAID. So just buy a JBOD device. It still has some "controller" on it, correct? Is the quality of this controller very critical or not so much? And if a unit advertises itself as Hardware RAID can I still use it as JBOD mode? There seems more hits for "RAID" out there than for "JBOD". Maybe I am looking up the wrong keyword. -- Rahul _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Thu Feb 18 11:21:35 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 18 Feb 2010 14:21:35 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: Message-ID: <86E799F6-FA24-4B70-8BFD-1E907026DD9F@scalableinformatics.com> 5TB is fairly low end. Our 6 and 9 TB DV units do this with 12 drives. Uses mdadm and our tools atop it. Don't have pricing in front of me but they are quite inexpensive. Iscsi nfs cifs yadda yadda. 2 gbe ports you can drive at full speed. Please pardon brevity and typos ... Sent from my iPhone On Feb 18, 2010, at 12:37 PM, Rahul Nabar wrote: > Discussions that I read on this list in the last couple of months > tempt me to do away with hardware RAID entirely for a new > mini-storage-project I have to do. I am thinking of going for a JBOD > with Linux Software RAID via mdadm. Hardware RAID just doesn't have > the original awesomeness that it had me mesmerized with. > > Any recommendations for a good JBOD? The requirements are simple. 5 > Terabytes total capacity. SATA drives. Don't need high performance: > these are for archival home dirs. No active jobs run from this > storage. Reliability and low price are key. Some kind of > Direct-Attached Storage box. RAID5 or RAID6 maybe. Already have a > pretty fast 8 core server with lots of RAM that I can hook this up to. > Neither bandwidth nor IOPS need to be terribly high. Most of the data > here is pretty static and not often moved around. > > One of the things I notice is that 5 Terabytes seems too low-end these > days. Can't find many solutions tailored to this size. Most come with > 12 or 16 bays etc. which seems excessive for this application. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From beckerjes at mail.nih.gov Thu Feb 18 11:26:04 2010 From: beckerjes at mail.nih.gov (Jesse Becker) Date: Thu, 18 Feb 2010 14:26:04 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7D82F5.9050503@tamu.edu> References: <4B7D82F5.9050503@tamu.edu> Message-ID: <20100218192604.GP15788@mail.nih.gov> On Thu, Feb 18, 2010 at 01:12:05PM -0500, Gerald Creager wrote: >For what you're describing, I'd consider CoRAID's AoE technology and >system, and use their RAID6 capability. Otherwise, get yourself a box >with up to 8 slots, preferably with hot-swap capability, and forge ahead. I'll second this recommendation. The Coraid servers are fairly inexpensive, variously support 4, 16 or 24 drives depending on model, and will accept any drives you care to throw in it. Coraid has been very good about this in the past, although they do maintain a list of problematic drives they recommend against using. That said, they will sell you a 'certified' drive if you want one. Performance is decent, especially given the price/capacity ratio. It does not need to be fully populated either, so you can grow into the system over time. The AoE protocol is well supported in Linux, (and theoretically other OSes, but I've not tested those). I also agree with using the built-in RAID abilties instead of using it as a JBOD--the rebuild times are murder. Coraid also provide tools to "extract" your data from bare drives in an emergency situation as well. >gerry > >Rahul Nabar wrote: >> Discussions that I read on this list in the last couple of months >> tempt me to do away with hardware RAID entirely for a new >> mini-storage-project I have to do. I am thinking of going for a JBOD >> with Linux Software RAID via mdadm. Hardware RAID just doesn't have >> the original awesomeness that it had me mesmerized with. >> >> Any recommendations for a good JBOD? The requirements are simple. 5 >> Terabytes total capacity. SATA drives. Don't need high performance: >> these are for archival home dirs. No active jobs run from this >> storage. Reliability and low price are key. Some kind of >> Direct-Attached Storage box. RAID5 or RAID6 maybe. Already have a >> pretty fast 8 core server with lots of RAM that I can hook this up to. >> Neither bandwidth nor IOPS need to be terribly high. Most of the data >> here is pretty static and not often moved around. >> >> One of the things I notice is that 5 Terabytes seems too low-end these >> days. Can't find many solutions tailored to this size. Most come with >> 12 or 16 bays etc. which seems excessive for this application. >> > >-- >Gerry Creager -- gerry.creager at tamu.edu >Texas Mesonet -- AATLT, Texas A&M University >Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 >Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 >_______________________________________________ >Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Jesse Becker NHGRI Linux support (Digicon Contractor) From chekh at pcbi.upenn.edu Thu Feb 18 11:55:37 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Thu, 18 Feb 2010 14:55:37 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> Message-ID: <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> On Thu, 18 Feb 2010 12:53:22 -0600 Rahul Nabar wrote: > On Thu, Feb 18, 2010 at 12:14 PM, Alex Chekholko wrote: > > On Thu, 18 Feb 2010 11:37:36 -0600 > > Does it need to be rack-mount? ?What kind of interface? > > Preferably rack-mount. But cost is a compelling argument . I could be > convinced if a non-rack unit was significantly cheaper. > > I was thinking SAS / SCSI / iSCSI is probably easiest and cheapest. > Do you already have a suitable SAS or SCSI controller in the host machine? If not, then you have to factor in the cost of the controller. If you want iSCSI, then you're looking at a low-end SAN as opposed to a DAS. But the SAN/NAS distinction is blurry these days, as many devices can give you either block or file-level access. > > If performance is really not an issue, a consumer-level NAS box is > > probably your cheapest option. ?A QNap TS-410 is ~$450 plus 4 x 2TB > > drives... > > Hmm...NAS. I was more thinking in terms of a DAS. Don't the NAS's come > with their own CPU's / RAM and stuff? (Like the Sun Thumper) Yes, they do. But if you want to access 5TB via iSCSI (or NFS), that's likely the cheapest option. The cheapest 4-bay NAS I can find via the comparison charts here: http://www.smallnetbuilder.com/nas is only $281 (plus drives). -- Alex Chekholko chekh at pcbi.upenn.edu From gerry.creager at tamu.edu Thu Feb 18 12:15:45 2010 From: gerry.creager at tamu.edu (Gerald Creager) Date: Thu, 18 Feb 2010 14:15:45 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <433093DF7AD7444DA65EFAFE3987879CCBD88B@orca.penguincomputing.com> References: <4B7D82F5.9050503@tamu.edu> <433093DF7AD7444DA65EFAFE3987879CCBD88B@orca.penguincomputing.com> Message-ID: <4B7D9FF1.6040202@tamu.edu> Friends don't let friends use Adaptec controllers if they really want RAID. gerry Michael Will wrote: > Often a jbod is just sas/sata attached, and the real controller is in > the host that attaches to it. It could then be a hardware raid controller > from adaptec, or one of the very fast lsi sas/sata hca's which you could > then use with software raid and/or LVM... > > Michael > > > -----Original Message----- > From: beowulf-bounces at beowulf.org on behalf of Rahul Nabar > Sent: Thu 2/18/2010 11:00 AM > To: gerry.creager at tamu.edu > Cc: Beowulf Mailing List > Subject: Re: [Beowulf] Any recommendations for a good JBOD? > > On Thu, Feb 18, 2010 at 12:12 PM, Gerald Creager > wrote: > > For what you're describing, I'd consider CoRAID's AoE technology and > system, > > and use their RAID6 capability. Otherwise, get yourself a box with up > to 8 > > slots, preferably with hot-swap capability, and forge ahead. > > > > Thanks Gerry! That one looks promising. > > One thing that always confuses me: > > Let's say I do software RAID. So just buy a JBOD device. It still has > some "controller" on it, correct? Is the quality of this controller > very critical or not so much? And if a unit advertises itself as > Hardware RAID can I still use it as JBOD mode? There seems more hits > for "RAID" out there than for "JBOD". Maybe I am looking up the wrong > keyword. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From kuenching at gmail.com Thu Feb 18 10:43:29 2010 From: kuenching at gmail.com (Tsz Kuen Ching) Date: Thu, 18 Feb 2010 13:43:29 -0500 Subject: [Beowulf] PVM 3.4.5-12 terminates when adding Host on Ubuntu 9.10 In-Reply-To: <74775D56-4E92-4846-B9B4-F5E3B70D759A@staff.uni-marburg.de> References: <0D2F92CE-AAEB-4D9E-9AC6-F591C9AE1773@staff.uni-marburg.de> <74775D56-4E92-4846-B9B4-F5E3B70D759A@staff.uni-marburg.de> Message-ID: Hello, Thanks for your help! It works now, after changing the host file to point at it's own IP address instead of the default localhost, things worked fine. - Kuen On Thu, Feb 18, 2010 at 11:33 AM, Reuti wrote: > Hi, > > Am 17.02.2010 um 20:23 schrieb Tsz Kuen Ching: > > > Thanks for the reply, I have asked around and found out that there are no >> firewall on the machine which blocks certain ports. >> >> Does anyone else have an idea or answer? >> >> On Sun, Feb 14, 2010 at 6:18 PM, Reuti >> wrote: >> Am 11.02.2010 um 19:43 schrieb Tsz Kuen Ching: >> >> >> Whenever I attempt to add a host in PVM it ends up terminating the process >> in the master program. The process does run in the slave node, however >> because the PVM terminates I do not get access to the node. >> >> I'm currently using Ubuntu 9.10, and I used apt-get to install pvm ( >> pvmlib, pvmdev, pvm). >> Thus $PVM_ROOT is set automatically, and so is $PVM_ARCH >> As for the other variables, I have not looked for them. >> >> I can ssh into the the slave without the need of a password. >> >> Do you have any firwall on the machines which blocks certain ports? >> >> -- Reuti >> >> >> >> Any Ideas or suggestions? >> >> This is what happens: >> >> user at laptop> pvm >> pvm> add slave-slave >> add slave-slave >> Terminated >> user at laptop> ... >> >> The logs are as followed: >> >> Laptop log >> --- >> [t80040000] 02/11 10:23:32 laptop (127.0.1.1:55884) LINUX 3.4.5 >> > > Does the laptop have a real address instead of 127.0.1.1, from which it can > be accessed from slave-slave? Instead of using ssh, you can also startup pvm > without any rsh/ssh by specifying: so=ms in the hostfile for this particular > slave-slave and type a command by hand on slave-slave. > > -- Reuto > > > > [t80040000] 02/11 10:23:32 ready Thu Feb 11 10:23:32 2010 >> [t80040000] 02/11 10:23:32 netoutput() sendto: errno=22 >> [t80040000] 02/11 10:23:32 em=0x2c24f0 >> [t80040000] 02/11 10:23:32 >> [49/?][6e/?][76/?][61/?][6c/?][69/?][64/?][20/?][61/?][72/?] >> [t80040000] 02/11 10:23:32 netoutput() sendto: Invalid argument >> [t80040000] 02/11 10:23:32 pvmbailout(0) >> >> slave-log >> --- >> [t80080000] 02/11 10:23:25 slave-slave (xxx.x.x.xxx:57344) LINUX64 3.4.5 >> [t80080000] 02/11 10:23:25 ready Thu Feb 11 10:23:25 2010 >> [t80080000] 02/11 10:28:26 work() run = STARTUP, timed out waiting for >> master >> [t80080000] 02/11 10:28:26 pvmbailout(0) >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eugen at leitl.org Fri Feb 19 01:30:12 2010 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 19 Feb 2010 10:30:12 +0100 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> Message-ID: <20100219093011.GT17686@leitl.org> On Thu, Feb 18, 2010 at 12:53:22PM -0600, Rahul Nabar wrote: > On Thu, Feb 18, 2010 at 12:14 PM, Alex Chekholko wrote: > > On Thu, 18 Feb 2010 11:37:36 -0600 > > Does it need to be rack-mount? ?What kind of interface? > > Preferably rack-mount. But cost is a compelling argument . I could be > convinced if a non-rack unit was significantly cheaper. > > I was thinking SAS / SCSI / iSCSI is probably easiest and cheapest. > > > > If performance is really not an issue, a consumer-level NAS box is > > probably your cheapest option. ?A QNap TS-410 is ~$450 plus 4 x 2TB > > drives... > > Hmm...NAS. I was more thinking in terms of a DAS. Don't the NAS's come > with their own CPU's / RAM and stuff? (Like the Sun Thumper) I know this isn't your design space, but Matt on zfs-discuss posted the following part list http://www.acmemicro.com/estore/merchant.ihtml?pid=5440&lastcatid=53&step=4 http://www.newegg.com/Product/Product.aspx?Item=N82E16820139043 http://www.acmemicro.com/estore/merchant.ihtml?pid=4518&step=4 http://www.acmemicro.com/estore/merchant.ihtml?pid=6708&step=4 http://www.newegg.com/Product/Product.aspx?Item=N82E16819117187 http://www.newegg.com/Product/Product.aspx?Item=N82E16835203002 later he added I'm just going to use the single 4x SAS. 1200MB/sec should be a great plenty for 24 drives total. I'm going to be mounting 2x SSD for ZIL and 2x SSD for ARC, then 20-2TB drives. I'm guessing that with a random I/O workload, I'll never hit the 1200MB/sec peak that the 4x SAS can sustain. Also - for the ZIL I will be using 2x 32GB Intel X25-E SLC drives, and for the ARC I'll be using 2x 160GB Intel X25M MLC drives. I'm hoping that the cache will allow me to saturate gigabit and eventually infiniband. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From rpnabar at gmail.com Fri Feb 19 06:22:13 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 19 Feb 2010 08:22:13 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: On Thu, Feb 18, 2010 at 1:55 PM, Alex Chekholko wrote: > On Thu, 18 Feb 2010 12:53:22 -0600 > Rahul Nabar wrote: > >> >> I was thinking SAS / SCSI / iSCSI is probably easiest and cheapest. > Do you already have a suitable SAS or SCSI controller in the host > machine? ?If not, then you have to factor in the cost of the controller. No. true. I have to factor in that price. But almost any kind of disk array I can think of will need a controller, correct? Or are there any JBOD formats that can be attached without putting in a controller in the server. Are SAS / SCSI controllers generic? Or are they paired to the JBOD one buys? In the past I've only had hardware RAID so the card usually came from te vendor that was selling the storage array. Also those cards did RAID+Controller whereas this time I'll be shopping around for a controller-only type of card since I want to do RAID via mdadm. > If you want iSCSI, then you're looking at a low-end SAN as opposed to a > DAS. ?But the SAN/NAS distinction is blurry these days, as many devices > can give you either block or file-level access. Yes, true. I'm dropping iSCSC entirely. Don't have the $$ to do a SAN with fibre switches etc. >> > If performance is really not an issue, a consumer-level NAS box is >> > probably your cheapest option. ?A QNap TS-410 is ~$450 plus 4 x 2TB >> > drives... >> >> Hmm...NAS. I was more thinking in terms of a DAS. Don't the NAS's come >> with their own CPU's / RAM and stuff? (Like the Sun Thumper) > > Yes, they do. ?But if you want to access 5TB via iSCSI (or NFS), that's > likely the cheapest option. That's quite non-intuitive to me. If it' a NAS they must need procs+RAM+NICs on board. How does that get cheaper than an equivalent "dumb" JBOD which outsources all these 3 functions to the attached host server? Maybe I am missing a part of the argument. I already have a server that the JBOD can be attached to so that cost to me is a sunk cost. I just need to consider the incrementals above that. -- Rahul From hahn at mcmaster.ca Fri Feb 19 09:29:43 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 19 Feb 2010 12:29:43 -0500 (EST) Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: >>> I was thinking SAS / SCSI / iSCSI is probably easiest and cheapest. the concept of scsi/sas being cheap is rather amusing. >> Do you already have a suitable SAS or SCSI controller in the host >> machine? ?If not, then you have to factor in the cost of the controller. > > No. true. I have to factor in that price. But almost any kind of disk > array I can think of will need a controller, correct? Or are there any unless it already has the controller, of course. most motherboards these days come with at least 6x 3 Gb sata ports, for instance. > JBOD formats that can be attached without putting in a controller in > the server. I was thinking of esata, and a 5-disk external enclosure with port-multiplier. if your system already has sata, you might need to add an esata header (or possibly a controller if the existing controller doesn't support port multipliers.) 5 disks on PM would be a pretty simple way to add JBOD for a md-based raid5. >> If you want iSCSI, then you're looking at a low-end SAN as opposed to a >> DAS. ?But the SAN/NAS distinction is blurry these days, as many devices >> can give you either block or file-level access. > > Yes, true. I'm dropping iSCSC entirely. Don't have the $$ to do a SAN > with fibre switches etc. iSCSI doesn't require SAN infrastructure, of course. that's kind of the point: you plug it into your existing ethernet fabric. for the low-overhead application you describe, it's a reasonable fit, except that even low-end iSCSI/NAS boxes tend to ramp up in price. that is, comparable to what you'd pay for a cheap uATX system (which would be of about the same speed, power, space and performance, not surprisingly.) >> Yes, they do. ?But if you want to access 5TB via iSCSI (or NFS), that's >> likely the cheapest option. > > That's quite non-intuitive to me. If it' a NAS they must need > procs+RAM+NICs on board. How does that get cheaper than an equivalent > "dumb" JBOD which outsources all these 3 functions to the attached > host server? Maybe I am missing a part of the argument. procs+ram+nic can easily total less than $100; enclosures can be very cheap as well. that's what's so appealing about that approach: it's fully user-servicable, and you don't have to depend on some random vendor to maintain firmware, supported-disk lists, etc. of course, that's also the main downside: you have just adopted another system, albeit embedded, to maintain. > I already have a server that the JBOD can be attached to so that cost > to me is a sunk cost. I just need to consider the incrementals above > that. right - 10 years ago, the cost overhead of the system was larger. nowadays, integration and moore's law has made small systems very cheap. this is good, since disks are incredibly cheap as well. (bad if you're in the storage business, where it looks a little funny to justify thousands of dollars of controller/etc infrastructure when the disks cost $100 or so. disk arrays can still make sense, of course, but availability of useful cheap commodity systems has changed the equation. regards, mark hahn. From bdobbins at gmail.com Fri Feb 19 10:25:07 2010 From: bdobbins at gmail.com (Brian Dobbins) Date: Fri, 19 Feb 2010 13:25:07 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? Message-ID: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Hi guys, I'm beginning to look into configurations for a new cluster and with the AMD 12-core and Intel 8-core chips 'here' (or coming soonish), I'm curious if anyone has any data on the effects of the messaging rate of the IB cards. With a 4-socket node having between 32 and 48 cores, lots of computing can get done fast, possibly stressing the network. I know Qlogic has made a big deal about the InfiniPath adapter's extremely good message rate in the past... is this still an important issue? How do the latest Mellanox adapters compare? (Qlogic documents a ~30M messages processsed per second rate on its QLE7342, but I didn't see a number on the Mellanox ConnectX-2... and more to the point, do people see this effecting them?) On a similar note, does a dual-port card provide an increase in on-card processing, or 'just' another link? (The increased bandwidth is certainly nice, even in a flat switched network, I'm sure!) I'm primarily concerned with weather and climate models here - WRF, CAM, CCSM, etc., and clearly the communication rate will depend to a large degree on the resolutions used, but any information, even 'gut instincts' people have are welcome. The more info the merrier. Thanks very much, - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Fri Feb 19 10:47:07 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 19 Feb 2010 13:47:07 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Message-ID: <4B7EDCAB.5090403@scalableinformatics.com> Brian Dobbins wrote: > > Hi guys, > > I'm beginning to look into configurations for a new cluster and with > the AMD 12-core and Intel 8-core chips 'here' (or coming soonish), I'm > curious if anyone has any data on the effects of the messaging rate of > the IB cards. With a 4-socket node having between 32 and 48 cores, lots > of computing can get done fast, possibly stressing the network. The big issue will be contention for the resource. As you scale up the number of requesters, if the number of resources don't also scale up (even vitualized non-blocking HCA/NICs are good here), you could hit a problem at some point. > I know Qlogic has made a big deal about the InfiniPath adapter's > extremely good message rate in the past... is this still an important > issue? How do the latest Mellanox adapters compare? (Qlogic documents > a ~30M messages processsed per second rate on its QLE7342, but I didn't > see a number on the Mellanox ConnectX-2... and more to the point, do > people see this effecting them?) We see this on the storage side. Massive oversubscription of resources leads to contention issues for links, to ib packet requeue failures among other things. > > On a similar note, does a dual-port card provide an increase in > on-card processing, or 'just' another link? (The increased bandwidth is > certainly nice, even in a flat switched network, I'm sure!) Depends. If the card can talk to the PCIe bus at full speed, you might be able to saturate the link with a single QDR port. If your card is throttled for some reason (we have seen this) then adding the extra port might or might not help. If you are at the design stage, I'd suggest "go wide" as you can ... as many IB HCAs as you can get to keep the number of ports/core as high as reasonable. Of course I'd have to argue the same thing on the storage side :) > I'm primarily concerned with weather and climate models here - WRF, > CAM, CCSM, etc., and clearly the communication rate will depend to a > large degree on the resolutions used, but any information, even 'gut > instincts' people have are welcome. The more info the merrier. > > Thanks very much, > - Brian > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From bdobbins at gmail.com Fri Feb 19 12:23:20 2010 From: bdobbins at gmail.com (Brian Dobbins) Date: Fri, 19 Feb 2010 15:23:20 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B7EDCAB.5090403@scalableinformatics.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B7EDCAB.5090403@scalableinformatics.com> Message-ID: <2b5e0c121002191223w4ef0db3cu24816ff671b62abc@mail.gmail.com> Hi Joe, I'm beginning to look into configurations for a new cluster and with the >> AMD 12-core and Intel 8-core chips 'here' (or coming soonish), I'm curious >> if anyone has any data on the effects of the messaging rate of the IB cards. >> With a 4-socket node having between 32 and 48 cores, lots of computing can >> get done fast, possibly stressing the network. >> > > The big issue will be contention for the resource. As you scale up the > number of requesters, if the number of resources don't also scale up (even > vitualized non-blocking HCA/NICs are good here), you could hit a problem at > some point. My knowledge of the latest low-level hardware is sadly out of date - does a virtualized non-blocking HCA mean that I can have one HCA which virtualizes into four (one per socket say), and each of those four has its own memory-mapped buffer so that I don't get cache invalidation / contention on multi-socket boxes, or am I totally off-base here? I'm all for scaling up NICs as I scale up cores, but each additional NIC / HCA port means more switch ports, which adds up fast. In fact, if I have a standard 2-socket node now, with 8 cores in it and a DDR IB port, and then get a 2-socket node with 24 cores in it and a QDR IB port,... how's the math work? I've got 3x the cores, 1x the adapters, but that adapter has 2x the speed. Blah. I know Qlogic has made a big deal about the InfiniPath adapter's extremely >> good message rate in the past... is this still an important issue? How do >> the latest Mellanox adapters compare? (Qlogic documents a ~30M messages >> processsed per second rate on its QLE7342, but I didn't see a number on the >> Mellanox ConnectX-2... and more to the point, do people see this effecting >> them?) >> > We see this on the storage side. Massive oversubscription of resources > leads to contention issues for links, to ib packet requeue failures among > other things. So (ignoring disk latencies and just focusing on link contention), is there any difference between using 2x the storage nodes or the same number of storage nodes, but with 2x the NICs? On a similar note, does a dual-port card provide an increase in on-card >> processing, or 'just' another link? (The increased bandwidth is certainly >> nice, even in a flat switched network, I'm sure!) >> > > Depends. If the card can talk to the PCIe bus at full speed, you might be > able to saturate the link with a single QDR port. If your card is throttled > for some reason (we have seen this) then adding the extra port might or > might not help. If you are at the design stage, I'd suggest "go wide" as > you can ... as many IB HCAs as you can get to keep the number of ports/core > as high as reasonable. > Oh dear. I need to go re-learn a lot of things. So if I want multiple full-speed QDR cards in a node, I need that node to have independent PCIe buses, and each card to be placed on a separate bus. Of course I'd have to argue the same thing on the storage side :) No argument from me there! Thanks again, as always, for your input. - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From mdidomenico4 at gmail.com Fri Feb 19 12:49:11 2010 From: mdidomenico4 at gmail.com (Michael Di Domenico) Date: Fri, 19 Feb 2010 15:49:11 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Message-ID: the folks on the linux-rdma mailing list can probably share some slides with you about app load over different cards. if you dont get a response, i can drop a few names of people who definitely have the info, but i dont want to do it at large on the list The last set of slides i can (thinking way back when i was still with qlogic) recall, yes the ipath cards could do 30m mesg/sec whereas the mlnx cards were half to a two thirds lower. this does have an affect on core count traffic, but only under certain application loads mlnx and qlogic made a trade off with their card designs, the qlogic cards have a real high msg/rate, but the rdma bandwidth performance can suffer in certain cases, where as mlnx has a higher rdma bandwidth but a lower mesg rate getting the balance right is a mastery of art. and the tipping point slides easily and with every application it's been a while since i was at qlogic and have forgotten a lot of the sales mumbo... On Fri, Feb 19, 2010 at 1:25 PM, Brian Dobbins wrote: > > Hi guys, > > ? I'm beginning to look into configurations for a new cluster and with the > AMD 12-core and Intel 8-core chips 'here' (or coming soonish), I'm curious > if anyone has any data on the effects of the messaging rate of the IB > cards.? With a 4-socket node having between 32 and 48 cores, lots of > computing can get done fast, possibly stressing the network. > > ? I know Qlogic has made a big deal about the InfiniPath adapter's extremely > good message rate in the past... is this still an important issue?? How do > the latest Mellanox adapters compare?? (Qlogic documents a ~30M messages > processsed per second rate on its QLE7342, but I didn't see a number on the > Mellanox ConnectX-2... and more to the point, do people see this effecting > them?) > > ? On a similar note, does a dual-port card provide an increase in on-card > processing, or 'just' another link?? (The increased bandwidth is certainly > nice, even in a flat switched network, I'm sure!) > > ? I'm primarily concerned with weather and climate models here - WRF, CAM, > CCSM, etc., and clearly the communication rate will depend to a large degree > on the resolutions used, but any information, even 'gut instincts' people > have are welcome.? The more info the merrier. > > ? Thanks very much, > ? - Brian > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > From Shainer at mellanox.com Fri Feb 19 13:09:40 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Fri, 19 Feb 2010 13:09:40 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com> When you look on low level, marketing driven benchmarks, you should be careful. Mellanox latest message rate numbers with ConnectX-2 more than doubled versus the old cards, and are for real message rate - separate messages on the wire. The competitor numbers are with using message coalescing, so it is not real separate messages on the wire, or not really message rate. Gilad -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Michael Di Domenico Sent: Friday, February 19, 2010 12:49 PM To: beowulf at beowulf.org Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? the folks on the linux-rdma mailing list can probably share some slides with you about app load over different cards. if you dont get a response, i can drop a few names of people who definitely have the info, but i dont want to do it at large on the list The last set of slides i can (thinking way back when i was still with qlogic) recall, yes the ipath cards could do 30m mesg/sec whereas the mlnx cards were half to a two thirds lower. this does have an affect on core count traffic, but only under certain application loads mlnx and qlogic made a trade off with their card designs, the qlogic cards have a real high msg/rate, but the rdma bandwidth performance can suffer in certain cases, where as mlnx has a higher rdma bandwidth but a lower mesg rate getting the balance right is a mastery of art. and the tipping point slides easily and with every application it's been a while since i was at qlogic and have forgotten a lot of the sales mumbo... On Fri, Feb 19, 2010 at 1:25 PM, Brian Dobbins wrote: > > Hi guys, > > ? I'm beginning to look into configurations for a new cluster and with the > AMD 12-core and Intel 8-core chips 'here' (or coming soonish), I'm curious > if anyone has any data on the effects of the messaging rate of the IB > cards.? With a 4-socket node having between 32 and 48 cores, lots of > computing can get done fast, possibly stressing the network. > > ? I know Qlogic has made a big deal about the InfiniPath adapter's extremely > good message rate in the past... is this still an important issue?? How do > the latest Mellanox adapters compare?? (Qlogic documents a ~30M messages > processsed per second rate on its QLE7342, but I didn't see a number on the > Mellanox ConnectX-2... and more to the point, do people see this effecting > them?) > > ? On a similar note, does a dual-port card provide an increase in on-card > processing, or 'just' another link?? (The increased bandwidth is certainly > nice, even in a flat switched network, I'm sure!) > > ? I'm primarily concerned with weather and climate models here - WRF, CAM, > CCSM, etc., and clearly the communication rate will depend to a large degree > on the resolutions used, but any information, even 'gut instincts' people > have are welcome.? The more info the merrier. > > ? Thanks very much, > ? - Brian > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Fri Feb 19 13:57:30 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 19 Feb 2010 13:57:30 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Message-ID: <20100219215730.GK2857@bx9.net> On Fri, Feb 19, 2010 at 01:25:07PM -0500, Brian Dobbins wrote: > I know Qlogic has made a big deal about the InfiniPath adapter's extremely > good message rate in the past... is this still an important issue? Yes, for many codes. If I recall stuff I published a while ago, WRF sent a surprising number of short messages. But really, the right approach for you is to do some benchmarking. Arguing about microbenchmarks is pointless; they only give you clues that help explain your real application results. I believe that both QLogic and Mellanox have test clusters you can borrow. Tom Elken ought to have some WRF data he can share with you, showing message sizes as a function of cluster size for one of the usual WRF benchmark datasets. > On a similar note, does a dual-port card provide an increase in on-card > processing, or 'just' another link? (The increased bandwidth is certainly > nice, even in a flat switched network, I'm sure!) Published microbenchmarks in for Mellanox parts the SDR/DDR generation showed that only large messages got a benefit. I've never seen any application benchmarks comparing 1 and 2 port cards. -- greg (formerly the system architect of InfiniPath's SDR and DDR generations) From lindahl at pbm.com Fri Feb 19 14:05:38 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 19 Feb 2010 14:05:38 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com> Message-ID: <20100219220538.GL2857@bx9.net> > Mellanox latest message rate numbers with ConnectX-2 more than > doubled versus the old cards, and are for real message rate - > separate messages on the wire. The competitor numbers are with using > message coalescing, so it is not real separate messages on the wire, > or not really message rate. Gilad, I think you forgot which side you're supposed to be supporting. The only people I have ever seen publish message rate with coalesced messages are DK Panda (with Mellanox cards) and Mellanox. QLogic always hated coalesced messages, and if you look back in the archive for this mailing list, you'll see me denouncing coalesced messages as meanless about 1 microsecond after the first result was published by Prof. Panda. Looking around the Internet, I don't see any numbers ever published by PathScale/QLogic using coalesced messages. At the end of the day, the only reason microbenchmarks are useful is when they help explain why one interconnect does better than another on real applications. No customer should ever choose which adapter to buy based on microbenchmarks. -- greg (formerly employed by QLogic) From lindahl at pbm.com Fri Feb 19 14:17:21 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 19 Feb 2010 14:17:21 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B7EDCAB.5090403@scalableinformatics.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B7EDCAB.5090403@scalableinformatics.com> Message-ID: <20100219221721.GM2857@bx9.net> On Fri, Feb 19, 2010 at 01:47:07PM -0500, Joe Landman wrote: > The big issue will be contention for the resource. Joe, What "the resource" is depends on implementation. All network cards have the limit of the line rate of the network. As far as I can tell, the Mellanox IB cards have a limited number of engines that process messages. For short messages from a lot of CPUs, they don't have enough. For long messages, they have plenty, & hit the line rate. Don't storage systems typically send mostly long messages? The InfiniPath (now True Scale) design uses a pipelined approach. You can analytically compute the performance on short messages by knowing 2 numbers: the line rate, and the "dead time" between back-to-back packets, which is determined by the length of the longest pipeline stage. I was thrilled when we figured out that our performance graph was exactly determined by that equation. And the pipeline is a resource that you can't oversubscribe. -- greg (formerly... yada yada) From Shainer at mellanox.com Fri Feb 19 14:36:34 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Fri, 19 Feb 2010 14:36:34 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com><9FA59C95FFCBB34EA5E42C1A8573784F02662C76@mtiexch01.mti.com> <20100219220538.GL2857@bx9.net> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> Nice to hear from you Greg, hope all is well. I don't forget anything, at least for now. OSU has different benchmarks so you can measure message coalescing or real message rate. Funny to read that Q hated coalescing when they created the first benchmark for that ...:-) but lets not argue on that. Nowadays it seems that QLogic promotes the message rate as non coalescing data and I almost got bought by their marketing machine till I looked on at the data on the wire... interesting what the bits and bytes and symbols can tell you... -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Greg Lindahl Sent: Friday, February 19, 2010 2:06 PM To: beowulf at beowulf.org Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? > Mellanox latest message rate numbers with ConnectX-2 more than > doubled versus the old cards, and are for real message rate - > separate messages on the wire. The competitor numbers are with using > message coalescing, so it is not real separate messages on the wire, > or not really message rate. Gilad, I think you forgot which side you're supposed to be supporting. The only people I have ever seen publish message rate with coalesced messages are DK Panda (with Mellanox cards) and Mellanox. QLogic always hated coalesced messages, and if you look back in the archive for this mailing list, you'll see me denouncing coalesced messages as meanless about 1 microsecond after the first result was published by Prof. Panda. Looking around the Internet, I don't see any numbers ever published by PathScale/QLogic using coalesced messages. At the end of the day, the only reason microbenchmarks are useful is when they help explain why one interconnect does better than another on real applications. No customer should ever choose which adapter to buy based on microbenchmarks. -- greg (formerly employed by QLogic) _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From patrick at myri.com Fri Feb 19 14:56:18 2010 From: patrick at myri.com (Patrick Geoffray) Date: Fri, 19 Feb 2010 17:56:18 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <20100218192604.GP15788@mail.nih.gov> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> Message-ID: <4B7F1712.3040303@myri.com> On 2/18/2010 2:26 PM, Jesse Becker wrote: > On Thu, Feb 18, 2010 at 01:12:05PM -0500, Gerald Creager wrote: >> For what you're describing, I'd consider CoRAID's AoE technology and > I'll second this recommendation. The Coraid servers are fairly +1. The AoE spec is very simple, I wish it would have more traction outside CoRaID. On the opposite, iSCSI is a utter mess with all the bad technical choices. Patrick From rpnabar at gmail.com Fri Feb 19 15:05:01 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 19 Feb 2010 17:05:01 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7F1712.3040303@myri.com> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> Message-ID: On Fri, Feb 19, 2010 at 4:56 PM, Patrick Geoffray wrote: > On 2/18/2010 2:26 PM, Jesse Becker wrote: >> >> On Thu, Feb 18, 2010 at 01:12:05PM -0500, Gerald Creager wrote: >>> >>> For what you're describing, I'd consider CoRAID's AoE technology and > >> I'll second this recommendation. The Coraid servers are fairly > > +1. The AoE spec is very simple, I wish it would have more traction outside > CoRaID. On the opposite, iSCSI is a utter mess with all the bad technical > choices. Thanks for the pointers! I had never heard of AoE before! -- Rahul From rpnabar at gmail.com Fri Feb 19 15:18:17 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 19 Feb 2010 17:18:17 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: On Fri, Feb 19, 2010 at 11:29 AM, Mark Hahn wrote: Thanks Mark! > right - 10 years ago, the cost overhead of the system was larger. > nowadays, integration and moore's law has made small systems very cheap. > this is good, since disks are incredibly cheap as well. (bad if you're > in the storage business, where it looks a little funny to justify thousands > of dollars of controller/etc infrastructure when the disks cost $100 or so. > disk arrays can still make sense, of course, but availability of useful > cheap commodity systems has changed the equation. I totally agree! Especially for my "tiny" storage capacity (~5 Terabytes) the cost of the non-disk accessories (enclosure+controllers+cables) is turning out to be several fold that of the disks themselves! That was pretty surprising to me! > point: you plug it into your existing ethernet fabric. ?for the low-overhead > application you describe, it's a reasonable fit, except that even low-end > iSCSI/NAS boxes tend to ramp up in price. >?that is, comparable to what you'd > pay for a cheap uATX system (which would be of about the same speed, > power, space and performance, not surprisingly.) I'm curious, what's the selling point for iSCSI then? The prices are quite ramped up and the performance not stellar. Do any of you in the HPC world buy i-SCSI at all? > procs+ram+nic can easily total less than $100; enclosures can be very cheap Really?! Hmm....What kind of "procs+ram+nic" combo can one get for less than $100. That is pretty surprising for me! That again makes me wonder about the performance of these NAS boxes. Standard server + DAS seems safer than a low-end NAS. -- Rahul From mathog at caltech.edu Fri Feb 19 15:43:39 2010 From: mathog at caltech.edu (David Mathog) Date: Fri, 19 Feb 2010 15:43:39 -0800 Subject: [Beowulf] case (de)construction question Message-ID: Many rack cases have threaded standoff's directly attached to the case metal. On the outside of the case one sees a hexagonal nut, and on the inside the cylindrical standoff - with no sign of the hexagonal nut. We even have one type of case with a removable motherboard tray, which is quite thin, and even here this type of standoff is employed. The question is, how are these things put together, and more specifically, how are they to be taken apart? Some of the standoffs are in the way of a larger power supply that needs to go into one of these cases. Is there a more elegant way of removing these than by grinding them off or drilling them out? I have already tried unscrewing one, on the theory the standoff might be threaded into the hex nut, but it wouldn't budge. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From hahn at mcmaster.ca Fri Feb 19 15:56:11 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 19 Feb 2010 18:56:11 -0500 (EST) Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: > I'm curious, what's the selling point for iSCSI then? The prices are > quite ramped up and the performance not stellar. Do any of you in the > HPC world buy i-SCSI at all? ease, I suppose. ethernet is omnipresent, so anything which uses ethernet has a big advantage. the sticking point is really that the ethernet phenomenon has to some extent stalled at 1 Gb - that's where the mass market is, and it's not that clear whether or how much drive there will be for adoption above 1Gb. in homes, 1Gb is mostly overkill. non-rich office settings (such as a lot of academia) still have 100bT to the desktop. 1Gb is plenty good enough for a lot of HPC - obviously almost anything serial, small parallel or even large-loose parallel. even though a single modern disk sustains greater BW than 1Gb, I find that most users do not have an expectation of really high storage bandwidth. sure, people who do good checkpointing from large parallel apps cry for it. some specialized fields want bandwidth even for serial jobs. but a typical job in my organization (disparate academic HPC) does IO infrequently and not very much. when our clusters have IO problems, it's more often metadata for shared filesystems, rather than bandwidth. however, I'm not actually claiming iSCSI is prevalent. the protocol is relatively heavy-weight, and it's really only providing SAN access, not shared, file-level access, which is ultimately what most want... >> procs+ram+nic can easily total less than $100; enclosures can be very cheap > > Really?! Hmm....What kind of "procs+ram+nic" combo can one get for > less than $100. That is pretty surprising for me! That again makes me atom motherboards (which include the CPU) start at $58 on newegg, though the lowest-end items aren't all that appealing. the first Gb models start at $80, so add a $20 dimm, and you're done. (admittedly, small boards like those, being ITX, typically have 2x sata ports, not 6x. but adding a cheap sata controller or just starting with a cheap uATX board doesn't really blow the budget...) > wonder about the performance of these NAS boxes. Standard server + DAS > seems safer than a low-end NAS. the cheapest NAS boxes are build with arm/mips SOCs much like a wifi router. (third party/free linux firmware for routers often also includes such NAS boxes.) recently a generation of Atom-based NAS has come onto the market, which perform better, somewhat hotter, etc. these appear to be the first cheap NAS to come close to saturating Gb - the embedded cpu models tended to poke along at 20-40 MB/s. regards, mark hahn. From chekh at pcbi.upenn.edu Fri Feb 19 16:07:30 2010 From: chekh at pcbi.upenn.edu (Alex Chekholko) Date: Fri, 19 Feb 2010 19:07:30 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> Message-ID: <20100219190730.8489fb15.chekh@pcbi.upenn.edu> On Fri, 19 Feb 2010 17:05:01 -0600 Rahul Nabar wrote: > On Fri, Feb 19, 2010 at 4:56 PM, Patrick Geoffray wrote: > > On 2/18/2010 2:26 PM, Jesse Becker wrote: > >> > >> On Thu, Feb 18, 2010 at 01:12:05PM -0500, Gerald Creager wrote: > >>> > >>> For what you're describing, I'd consider CoRAID's AoE technology and > > > >> I'll second this recommendation. The Coraid servers are fairly > > > > +1. The AoE spec is very simple, I wish it would have more traction outside > > CoRaID. On the opposite, iSCSI is a utter mess with all the bad technical > > choices. > > Thanks for the pointers! I had never heard of AoE before! This is all well and good until you compare the prices of the respective solutions. E.g. what's the cheapest 5TB (usable) AoE box you can buy? Regards, -- Alex Chekholko chekh at pcbi.upenn.edu From james.p.lux at jpl.nasa.gov Fri Feb 19 16:37:43 2010 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Fri, 19 Feb 2010 16:37:43 -0800 Subject: [Beowulf] case (de)construction question In-Reply-To: References: Message-ID: > -----Original Message----- > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of David Mathog > Sent: Friday, February 19, 2010 3:44 PM > To: beowulf at beowulf.org > Subject: [Beowulf] case (de)construction question > > Many rack cases have threaded standoff's directly attached to the case > metal. On the outside of the case one sees a hexagonal nut, and on the > inside the cylindrical standoff - with no sign of the hexagonal nut. We > even have one type of case with a removable motherboard tray, which is > quite thin, and even here this type of standoff is employed. > > The question is, how are these things put together, Those are "self clinching fasteners" of one sort or another. PEMs from Penn Engineering (http://www.pemnet.com/) are the ones I've used " PEM(r) self-clinching concealed-head studs and standoffs install permanently in steel or aluminum sheets as thin as .062" / 1.6mm to provide strong and reusable threads for mating hardware in a wide range of thin-metal assembly applications. Their concealed-head feature contributes particular design benefits by allowing the side of the sheet opposite installation to remain smooth and untouched." http://www.pemnet.com/fastening_products/pdf/chdata.pdf reveals all. Those ones actually install in a blind hole, but there are other ones that insert in a through hole. http://www.pemnet.com/fastening_products/pdf/fhdata.pdf and more > specifically, how are they to be taken apart? They aren't designed to be dismantled. You can sometimes use a suitable press with appropriate dies to press the nut out, but the whole pressing the stud in process deforms the metal. Some of the standoffs are > in the way of a larger power supply that needs to go into one of these > cases. Is there a more elegant way of removing these than by grinding > them off or drilling them out? Saw and grind. Dremel tools are your friend. I have already tried unscrewing one, on > the theory the standoff might be threaded into the hex nut, but it > wouldn't budge. No, it's actually one continuous piece of metal. The hexagonal form factor is to allow the part to be clamped in the pressing process without spinning. > > Thanks, > > David Mathog > mathog at caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From landman at scalableinformatics.com Fri Feb 19 18:52:47 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 19 Feb 2010 21:52:47 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <20100219190730.8489fb15.chekh@pcbi.upenn.edu> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> <20100219190730.8489fb15.chekh@pcbi.upenn.edu> Message-ID: <4B7F4E7F.6000102@scalableinformatics.com> Alex Chekholko wrote: >> Thanks for the pointers! I had never heard of AoE before! > > This is all well and good until you compare the prices of the respective solutions. > > E.g. what's the cheapest 5TB (usable) AoE box you can buy? I believe somewhat more than a relatively fast iSCSI/SRP/NFS/CIFS box with 6.75TB usable (but we are biased). AoE hasn't really found a niche for many reasons. Not the least of which is the paucity of quality software target implementations, a single vendor hardware supplier, and the significant resource consuming initiator. We have lots of experience setting up AoE systems for users and customers, lots of experience fixing systems, and altering designs so that the AoE initiators don't bring down head nodes, login nodes, etc that they are attached to. It is best, when building an AoE system design, to isolate the unit mounting the AoE targets from important services that have to be up. Whether or not the AoE protocol is superior to iSCSI is moot. The AoE initiator for windows isn't terribly stable. Last I checked there was something for Solaris though I don't know the state. iSCSI is everywhere, it is available on pretty much all platforms ... it has achieved ubiquity. Quality targets and initiators exist as software stacks you can use. They interoperate reasonably well, though the Windows 2.08 stack doesn't seem to follow the standard terribly well in terms of reconnecting to a single target. These minor nits aside, it works, reasonably well, and without significant pain. This aside, both AoE and iSCSI provide block device services. Both systems can present a block device with a RAID backing store. Patrick and others will talk about the beauty of the standards, but this is unfortunately irrelevant in the market. The market isn't a meritocracy. Actually, with the advent of USB3 and related devices, I'd expect AoE and lower end raid to be effectively completely subsumed by this. USB3 has ample bandwidth to connect a low end RAID unit, more than GbE. The comments on the Atom based micro itx MBs are quite relevant there. Especially if you can get one with multiple SATA and USB3 ports (especially if they can be USB3 targets). If you want the lowest end RAID right now, get an eSATA device and enclosure (not cheap but doable). It will work, albeit not as fast as you might like. Use mdadm, and be done with it. Remember, if you want to use mdadm, you do need block devices to work with. Whether these block devices are gigabit, USB, or eSATA attached is irrelevant. Better/more expensive systems will put the RAID on the unit and export you either a block device, a file system, or both. Even better systems will give you thin provisioning, snapshotting, and all these other very nice features. You make your choices and pay your money. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Fri Feb 19 18:55:00 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 19 Feb 2010 21:55:00 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7F1712.3040303@myri.com> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> Message-ID: <4B7F4F04.9010503@scalableinformatics.com> Patrick Geoffray wrote: >> I'll second this recommendation. The Coraid servers are fairly > > +1. The AoE spec is very simple, I wish it would have more traction > outside CoRaID. On the opposite, iSCSI is a utter mess with all the bad -1 on the AoE initiator implementation. Seriously. We have customers using it, and its not pretty. +1 for iSCSI for its ubiquity. Doesn't have to be great, just has to work. It does. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From tjrc at sanger.ac.uk Sat Feb 20 00:09:48 2010 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Sat, 20 Feb 2010 08:09:48 +0000 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: On 19 Feb 2010, at 11:56 pm, Mark Hahn wrote: > however, I'm not actually claiming iSCSI is prevalent. the protocol > is relatively heavy-weight, and it's really only providing SAN access, > not shared, file-level access, which is ultimately what most want... iSCSI seems fairly common in the virtualisation world, especially for lower-performance failover VMware clusters and the like. Fairly commonly see a fibrechannel-connected active cluster, and iSCSI backup for disaster recovery, with some sort of block-level replication going on between the two. We've been looking at iSCSI for our VMware setup in the future. As people have been saying, performance isn't stellar, an to get it decent you still end up paying a lot of money on networking kit, because ideally in that scenario you still want the storage traffic on a separate network from everything else (certainly from the Vmotion and fault tolerance traffic, and probably from the application traffic as well) so you frequently see designs with ESX servers with 6-8 NICs in them. 2 for application traffic, 2 for VMware's internal traffic, at least 2 for storage traffic. But for HPC? Nah - can't really see its utility. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From eugen at leitl.org Sat Feb 20 02:28:09 2010 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 20 Feb 2010 11:28:09 +0100 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7F1712.3040303@myri.com> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> Message-ID: <20100220102809.GR17686@leitl.org> On Fri, Feb 19, 2010 at 05:56:18PM -0500, Patrick Geoffray wrote: > >I'll second this recommendation. The Coraid servers are fairly > > +1. The AoE spec is very simple, I wish it would have more traction > outside CoRaID. On the opposite, iSCSI is a utter mess with all the bad > technical choices. Is it very painful to get a zfs as an AoE target talk to VMWare? For some reason only iSCSI is always being mentioned here. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From chenyon1 at iit.edu Fri Feb 19 07:45:51 2010 From: chenyon1 at iit.edu (Yong Chen) Date: Fri, 19 Feb 2010 09:45:51 -0600 Subject: [Beowulf] [hpc-announce] Submission due: 3/3/2010 CFP: Intl. Workshop on Parallel Programming Models and Systems Software for HEC (P2S2) Message-ID: [Apologies if you got multiple copies of this email. If you'd like to opt out of these announcements, information on how to unsubscribe is available at the bottom of this email.] CALL FOR PAPERS =============== Third International Workshop on Parallel Programming Models and Systems Software for High-end Computing (P2S2) Sept. 13th, 2010 To be held in conjunction with ICPP-2010: The 39th International Conference on Parallel Processing, Sept. 13-16, 2010, San Diego, CA, USA Website: http://www.mcs.anl.gov/events/workshops/p2s2 SCOPE ----- The goal of this workshop is to bring together researchers and practitioners in parallel programming models and systems software for high-end computing systems. Please join us in a discussion of new ideas, experiences, and the latest trends in these areas at the workshop. TOPICS OF INTEREST ------------------ The focus areas for this workshop include, but are not limited to: * Systems software for high-end scientific and enterprise computing architectures o Communication sub-subsystems for high-end computing o High-performance file and storage systems o Fault-tolerance techniques and implementations o Efficient and high-performance virtualization and other management mechanisms for high-end computing * Programming models and their high-performance implementations o MPI, Sockets, OpenMP, Global Arrays, X10, UPC, Chapel, Fortress and others o Hybrid Programming Models * Tools for Management, Maintenance, Coordination and Synchronization o Software for Enterprise Data-centers using Modern Architectures o Job scheduling libraries o Management libraries for large-scale system o Toolkits for process and task coordination on modern platforms * Performance evaluation, analysis and modeling of emerging computing platforms PROCEEDINGS ----------- Proceedings of this workshop will be published in CD format and will be available at the conference (together with the ICPP conference proceedings) . SUBMISSION INSTRUCTIONS ----------------------- Submissions should be in PDF format in U.S. Letter size paper. They should not exceed 8 pages (all inclusive). Submissions will be judged based on relevance, significance, originality, correctness and clarity. Please visit workshop website at: http://www.mcs.anl.gov/events/workshops/p2s2/ for the submission link. JOURNAL SPECIAL ISSUE --------------------- The best papers of P2S2'10 will be included in a special issue of the International Journal of High Performance Computing Applications (IJHPCA) on Programming Models, Software and Tools for High-End Computing. IMPORTANT DATES --------------- Paper Submission: March 3rd, 2010 Author Notification: May 3rd, 2010 Camera Ready: June 14th, 2010 PROGRAM CHAIRS -------------- * Pavan Balaji, Argonne National Laboratory * Abhinav Vishnu, Pacific Northwest National Laboratory PUBLICITY CHAIR --------------- * Yong Chen, Illinois Institute of Technology STEERING COMMITTEE ------------------ * William D. Gropp, University of Illinois Urbana-Champaign * Dhabaleswar K. Panda, Ohio State University * Vijay Saraswat, IBM Research PROGRAM COMMITTEE ----------------- * Ahmad Afsahi, Queen's University * George Almasi, IBM Research * Taisuke Boku, Tsukuba University * Ron Brightwell, Sandia National Laboratory * Franck Cappello, INRIA, France * Yong Chen, Illinois Institute of Technology * Ada Gavrilovska, Georgia Tech * Torsten Hoefler, Indiana University * Zhiyi Huang, University of Otago, New Zealand * Hyun-Wook Jin, Konkuk University, Korea * Zhiling Lan, Illinois Institute of Technology * Doug Lea, State University of New York at Oswego * Jiuxing Liu, IBM Research * Guillaume Mercier, INRIA, France * Scott Pakin, Los Alamos National Laboratory * Fabrizio Petrini, IBM Research * Bronis de Supinksi, Lawrence Livermore National Laboratory * Sayantan Sur, IBM Research * Rajeev Thakur, Argonne National Laboratory * Vinod Tipparaju, Oak Ridge National Laboratory * Jesper Traff, NEC, Europe * Weikuan Yu, Auburn University If you have any questions, please contact us at p2s2-chairs at mcs.anl.gov ======================================================================== You can unsubscribe from the hpc-announce mailing list here: https://lists.mcs.anl.gov/mailman/listinfo/hpc-announce ======================================================================== From eugen at leitl.org Sat Feb 20 02:42:40 2010 From: eugen at leitl.org (Eugen Leitl) Date: Sat, 20 Feb 2010 11:42:40 +0100 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: <20100220104240.GS17686@leitl.org> On Fri, Feb 19, 2010 at 05:18:17PM -0600, Rahul Nabar wrote: > I totally agree! Especially for my "tiny" storage capacity (~5 > Terabytes) the cost of the non-disk accessories > (enclosure+controllers+cables) is turning out to be several fold that > of the disks themselves! That was pretty surprising to me! I presume you're buying things off-the-shelf, but even then a borderline useful empty 4-drive Qnap is around 700 EUR. A much more poweful DIY system with 8 SATA hotplug shouldn't be more than 1 kEUR. In comparison, filling it up with cheap 2 TByte disks is at least twice that. In practice you can just get a quote for a BTO supermicro server or rackmount for only a little more. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From landman at scalableinformatics.com Sat Feb 20 03:35:06 2010 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 20 Feb 2010 06:35:06 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: References: <20100218131401.18a528a0.chekh@pcbi.upenn.edu> <20100218145537.6ee9145b.chekh@pcbi.upenn.edu> Message-ID: <4B7FC8EA.4080409@scalableinformatics.com> Tim Cutts wrote: > on between the two. We've been looking at iSCSI for our VMware setup in > the future. As people have been saying, performance isn't stellar, an > to get it decent you still end up paying a lot of money on networking > kit, because ideally in that scenario you still want the storage traffic Hmmm... Our sub $10k USD units sport 10GbE/IB interfaces and do more than 500 MB/s over iSCSI, sustained. We've been demonstrating this for ~1.25 years. No more expensive than other technologies. Much less expensive than some others. Yeah, I know, the vast majority of iSCSI users use it over GbE. Most of the units on the market from others have trouble keeping that single GbE pipe filled, and they are expensive kit. > But for HPC? Nah - can't really see its utility. It is in use and its use is growing. Look at iSCSI as glue logic between systems. You can use SAS to connect arrays, then you need an expander chip in the backplane for the array (talk about a performance killer ...). You can use FC to connect arrays, with the same types of issues (no expander chip, but a massive oversubscription of bandwidth, and really expensive networking kit). Or you can use point to point iSCSI, or iSCSI over a switch. We've done and seen both the point to point and networked version. Works great for its use cases. For massively oversubscribed SAS/FC networks, the iSCSI systems come in less expensive. For point to point connections (using iSCSI as a wire protocol to directly attach RAIDed arrays or JBOD arrays, or mixtures, to a server) it works great, and you can do multipathing without additional switches ... as well as doing fail-over connections, including background replication over a private iSCSI to iSCSI network to have HA storage. It has lots of utility. Its relatively cheap, and if you get from the right vendors, its as fast as some others "fast" and expensive solutions. Of course, we are biased given what we sell. > > Tim > > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From gerry.creager at tamu.edu Sat Feb 20 05:32:04 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sat, 20 Feb 2010 07:32:04 -0600 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <20100220102809.GR17686@leitl.org> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> <20100220102809.GR17686@leitl.org> Message-ID: <4B7FE454.5020704@tamu.edu> Eugen Leitl wrote: > On Fri, Feb 19, 2010 at 05:56:18PM -0500, Patrick Geoffray wrote: > >>> I'll second this recommendation. The Coraid servers are fairly >> +1. The AoE spec is very simple, I wish it would have more traction >> outside CoRaID. On the opposite, iSCSI is a utter mess with all the bad >> technical choices. > > Is it very painful to get a zfs as an AoE target talk to VMWare? > For some reason only iSCSI is always being mentioned here. We've never tried zfs on CoRAID but I've done XFS with no problems. We also have some iSCSI NAS, and I have to disagree with the difficulty of use. We find iSCSI as easy to use as AoE. gerry From henning.fehrmann at aei.mpg.de Mon Feb 22 06:04:14 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Mon, 22 Feb 2010 15:04:14 +0100 Subject: [Beowulf] RAM ECC errors Message-ID: <20100222140414.GA19032@gretchen.aei.mpg.de> Hello, we started monitoring the rate of correctable errors appearing in the RAM. We also observed few uncorrectable errors. The corresponding kernel module 'edac_core' can cause a Kernel Panic when such an event occurs, which makes sense to avoid corrupted results. Is there a way to get some useful information before the kernel panics? In particular are we looking for the process list to find out which user was running what before the UE errors occurred. Thank you. Cheers, Henning From mathog at caltech.edu Mon Feb 22 12:30:38 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 22 Feb 2010 12:30:38 -0800 Subject: [Beowulf] Re: RAM ECC errors (Henning Fehrmann) Message-ID: Henning Fehrmann wrote: > we started monitoring the rate of correctable errors appearing in the RAM. > We also observed few uncorrectable errors. The corresponding kernel > module 'edac_core' can cause a Kernel Panic when such an event occurs, > which makes sense to avoid corrupted results. Are you saying that now that you are monitoring you are seeing kernel panics which did not appear before? > > Is there a way to get some useful information before the kernel panics? You can get some information through netconsole, but you know that already. > In particular are we looking for the process list to find out which > user was running what before the UE errors occurred. Well, you could log process start/stops and flush them to disk or syslog them, so that at least when the system crashes it would be possible to derive a list of everything that was still running. Doubt this will help much though, since the most likely culprit is a bad stick of memory, in which case the netconsole or IPMI or MCE messages may be enough to figure out which stick is the problem. That is, whichever process triggered it is probably an innocent bystander. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From patrick at myri.com Mon Feb 22 12:58:15 2010 From: patrick at myri.com (Patrick Geoffray) Date: Mon, 22 Feb 2010 15:58:15 -0500 Subject: [Beowulf] Any recommendations for a good JBOD? In-Reply-To: <4B7F4E7F.6000102@scalableinformatics.com> References: <4B7D82F5.9050503@tamu.edu> <20100218192604.GP15788@mail.nih.gov> <4B7F1712.3040303@myri.com> <20100219190730.8489fb15.chekh@pcbi.upenn.edu> <4B7F4E7F.6000102@scalableinformatics.com> Message-ID: <4B82EFE7.1070101@myri.com> Joe, On 2/19/2010 9:52 PM, Joe Landman wrote: > This aside, both AoE and iSCSI provide block device services. Both > systems can present a block device with a RAID backing store. Patrick > and others will talk about the beauty of the standards, but this is > unfortunately irrelevant in the market. The market isn't a meritocracy. Indeed, this is unfortunate. To be clear, I did not comment on the implementations, only the protocols. And on that point, iSCSI is definitively one of the worst spec I have ever read. The fact that iSCSI has any traction is a good example that marketers are in command, not engineers. I would not say that the quality of the protocol is irrelevant, it has a direct impact on the robustness and performance of the implementations. For example, any kind of offload requires the iSCSI header to be at the beginning of a packet, so you need framing. However, iSCSI is build on top of TCP, a streaming protocol. So, if you go through a router and change MTU, you are toast. The worse part is that the iSCSI standard acknowledges that but the solution they propose is ridiculous (PDU markers). FCoE shares some of the simplicity of AoE (right on top of Ethernet), but then the marketers came and messed it up with all kind of extensions to justify selling you new (and expensive) gear. > Actually, with the advent of USB3 and related devices, I'd expect AoE > and lower end raid to be effectively completely subsumed by this. USB3 > has ample bandwidth to connect a low end RAID unit, more than GbE. The You can't switch USB the same way you switch Ethernet. Patrick From mathog at caltech.edu Mon Feb 22 16:25:15 2010 From: mathog at caltech.edu (David Mathog) Date: Mon, 22 Feb 2010 16:25:15 -0800 Subject: [Beowulf] Re: case (de)construction question Message-ID: The support staff at PennEngineering said that it would only take a couple of hundred pounds of force to push out a standoff, and to try tapping it with a hammer. There is no hammer in the machine room, but there is some unistrut, so... 1. On top of a rubber wheeled cart (to cut down on the shock to everything else) put the case flat on 3 parallel pieces of unistrut, with the hexagonal back of the standoff centered on one of the slot holes in one piece of unistrut. 2. Dropped a closed ended 3 ft. piece of unistrut on the standoff side a few times from a height of about 6 inches. 3. When the standoff was punched down flush with the inside pulled it out from the back with a pair of pliers. (I had put a screw in the standoff first, the blows alone would have shoved the standoff all the way through.) The case did warp out slightly (2mm?) on the bottom around where the fastener had been. To fix that, flipped it over, stood the standoff up over the hole (hex nut side down), taped it in this vertical position with some masking tape, and once again employed the 3ft unistrut as a hammer. A couple of quick taps and it was flat enough so that it would slide into the rack. Drilling the standoff out wouldn't have deformed the case, but at least this way there were no little metal shavings to worry about. Using a couple of fender washers instead of the unistrut might have reduced the size of the deformation. Or maybe not, as the dimple was round and about 2 times wider than the unistrut slot. The standoff seems little the worse for wear. The nut part was chewed up slightly by the pliers, but the cylinder part and the groove appear to be undamaged. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From carsten.aulbert at aei.mpg.de Mon Feb 22 22:33:38 2010 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Tue, 23 Feb 2010 07:33:38 +0100 Subject: [Beowulf] Re: RAM ECC errors (Henning Fehrmann) In-Reply-To: References: Message-ID: <201002230733.39676.carsten.aulbert@aei.mpg.de> Hi David replying also on Henning's behalf On Monday 22 February 2010 21:30:38 David Mathog wrote: > > Are you saying that now that you are monitoring you are seeing kernel > panics which did not appear before? > No, but there seem to be a switch in the kernel module that allows to trigger a kernel panic upon discovering uncorrectable errors. > You can get some information through netconsole, but you know that already. > Yup already running, question is if a kernel panic would also be fully visible via netconsole - we are glad that we rarely have those ;) > Well, you could log process start/stops and flush them to disk or syslog > them, so that at least when the system crashes it would be possible to > derive a list of everything that was still running. Doubt this will > help much though, since the most likely culprit is a bad stick of > memory, in which case the netconsole or IPMI or MCE messages may be > enough to figure out which stick is the problem. That is, whichever > process triggered it is probably an innocent bystander. Yes, but the memory of any process might get corrupted, thus this is more to learn which user is currently running jobs. Which in turn enables us to notify these users that this particular machine running these jobs had a problem and the user might need to re-run her jobs to prevent "false" data entering her job. Cheers Carsten -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 1871 bytes Desc: not available URL: From pauljohn32 at gmail.com Sat Feb 20 10:49:45 2010 From: pauljohn32 at gmail.com (Paul Johnson) Date: Sat, 20 Feb 2010 12:49:45 -0600 Subject: [Beowulf] which mpi library should I focus on? Message-ID: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> i've not written MPI programs before. I've written plenty of C and Java, however, and I think I can learn. I'm trying to decide whether to concentrate on OpenMPI or MPICH2 as I get started. In the Internet, I find plenty of people who are fiercely devoted to MPICH2, and I also find plenty of people who say OpenMPI is now the "preferred application". I am afraid I will start a flame war between these two sides, but I do need some advice. My immediate goal is to write programs that can parallelize statistical analysis (mostly medium sized calculations that have to be run 1000s of times from various starting random number seeds). Many of the projects that we do will be in R, which has several packages that can be built for MPI framework (such as SNOW or such), but the installer must select which MPI library to build those packages against. What are the reasons to prefer one or the other? pj -- Paul E. Johnson Professor, Political Science 1541 Lilac Lane, Room 504 University of Kansas From hahn at mcmaster.ca Tue Feb 23 06:09:15 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 23 Feb 2010 09:09:15 -0500 (EST) Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> Message-ID: > i've not written MPI programs before. I've written plenty of C and > Java, however, and I think I can learn. I'm trying to decide whether > to concentrate on OpenMPI or MPICH2 as I get started. In the > Internet, I find plenty of people who are fiercely devoted to MPICH2, > and I also find plenty of people who say OpenMPI is now the "preferred > application". why do you think it would make any difference? it's also normally pretty trivial to switch. > What are the reasons to prefer one or the other? none - it's a matter of taste, especially since your application will not be sensitive to minor differences. From atchley at myri.com Tue Feb 23 06:20:34 2010 From: atchley at myri.com (Scott Atchley) Date: Tue, 23 Feb 2010 09:20:34 -0500 Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> Message-ID: On Feb 20, 2010, at 1:49 PM, Paul Johnson wrote: > What are the reasons to prefer one or the other? Why choose? You can install both and test with your application to see if there is a performance difference (be sure to keep your runtime environment paths correct - don't mix libraries and MPI binaries). Your MPI code should adhere to the standard and both should run it correctly. Scott From brockp at umich.edu Tue Feb 23 06:25:45 2010 From: brockp at umich.edu (Brock Palen) Date: Tue, 23 Feb 2010 09:25:45 -0500 Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> Message-ID: > why do you think it would make any difference? it's also normally > pretty trivial to switch. Very true, we use modules and can swap easily, and rebuild > >> What are the reasons to prefer one or the other? > > none - it's a matter of taste, especially since your application will > not be sensitive to minor differences. (shameless plug) if you want, listen to our podcast on OpenMPI http://www.rce-cast.com/index.php/Podcast/rce01-openmpi.html The MPICH2 show is recorded (edited it last night, almost done!), and will be released this Saturday Midnight Eastern. If you want to hear the rough cut, to compare to OpenMPI, email me and I will send you the unfinished mp3. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From douglas.guptill at dal.ca Tue Feb 23 07:46:39 2010 From: douglas.guptill at dal.ca (Douglas Guptill) Date: Tue, 23 Feb 2010 11:46:39 -0400 Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> Message-ID: <20100223154639.GB695@sopalepc> On Tue, Feb 23, 2010 at 09:25:45AM -0500, Brock Palen wrote: > (shameless plug) if you want, listen to our podcast on OpenMPI > http://www.rce-cast.com/index.php/Podcast/rce01-openmpi.html > > The MPICH2 show is recorded (edited it last night, almost done!), and > will be released this Saturday Midnight Eastern. > If you want to hear the rough cut, to compare to OpenMPI, email me and I > will send you the unfinished mp3. That sounds like a nice pair. OpenMPI vs MPICH2. Douglas. From mathog at caltech.edu Tue Feb 23 09:05:30 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 23 Feb 2010 09:05:30 -0800 Subject: [Beowulf] Re: RAM ECC errors Message-ID: Carsten Aulbert wrote > > Are you saying that now that you are monitoring you are seeing kernel > > panics which did not appear before? > > > > No, but there seem to be a switch in the kernel module that allows to trigger > a kernel panic upon discovering uncorrectable errors. By "switch" do you mean: A. There is an option that may be set when that module is loaded which will then cause it to panic on an uncorrectable error, where normally it would not. B. There has been a change in the module code between kernel versions that causes it to panic now on events where it formerly did not panic. > > You can get some information through netconsole, but you know that already. > > > > Yup already running, question is if a kernel panic would also be fully visible > via netconsole - we are glad that we rarely have those ;) I have seen one kernel panic since turning on netconsole, and it did log across the network and showed up in /var/log/messages as it was supposed to, with the same information presented as in the tests. Limited data, but it would seem the answer is "at least sometimes". > Yes, but the memory of any process might get corrupted, thus this is more to > learn which user is currently running jobs. Which in turn enables us to notify > these users that this particular machine running these jobs had a problem and > the user might need to re-run her jobs to prevent "false" data entering her > job. If the node blows up presumably the output of all the jobs currently running there will clearly indicate that there was a failure - so you should not have to notify those users since they will see the problem in their results. (Unless MPI, or PVM, or whatever is being used to spread jobs around, ignores fatal errors, which should never be the case.) For jobs which completed earlier on the same node, this would have been before an uncorrectable error took place, so the results should be OK. Or am I missing something? Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rpnabar at gmail.com Tue Feb 23 11:23:59 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 23 Feb 2010 13:23:59 -0600 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops Message-ID: Over the years I have scrupulously adhered to the conventional wisdom that "spanning tree" is turned off on HPC switches. So that protocols don't time out in the time STP needs to acquire its model of network topology. But that does assume that there are no loops in the switch connectivity that can cause broadcast storms etc. Thereby constraining the network design to a loopless configuration. Most cases this is fine but..... In the interest of latency minimum switch hops make sense and for that loops might sometimes provide the best solution. Just wondering what people think. Does STP enabled have other drawbacks aside from the initial lag on port activation? Or maybe all the latency advantage is always wiped out if the STP being on itself has some massive overhead. Do you always configure switches to not have loops? Or are loops ok and then I turn STP ON but just use PortFast to get away with the best of both worlds. -- Rahul From gerry.creager at tamu.edu Tue Feb 23 12:02:56 2010 From: gerry.creager at tamu.edu (Gerry Creager) Date: Tue, 23 Feb 2010 14:02:56 -0600 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: References: Message-ID: <4B843470.6010906@tamu.edu> On 2/23/10 1:23 PM, Rahul Nabar wrote: > Over the years I have scrupulously adhered to the conventional wisdom > that "spanning tree" is turned off on HPC switches. So that protocols > don't time out in the time STP needs to acquire its model of network > topology. But that does assume that there are no loops in the switch > connectivity that can cause broadcast storms etc. Thereby constraining > the network design to a loopless configuration. Most cases this is > fine but..... > > In the interest of latency minimum switch hops make sense and for that > loops might sometimes provide the best solution. Just wondering what > people think. Does STP enabled have other drawbacks aside from the > initial lag on port activation? Or maybe all the latency advantage is > always wiped out if the STP being on itself has some massive overhead. > > Do you always configure switches to not have loops? Or are loops ok > and then I turn STP ON but just use PortFast to get away with the best > of both worlds. It's my firm opinion that loops and STP are evil for HPC installations. Period. gerry From hahn at mcmaster.ca Tue Feb 23 12:05:39 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 23 Feb 2010 15:05:39 -0500 (EST) Subject: [Beowulf] Re: RAM ECC errors (Henning Fehrmann) In-Reply-To: <201002230733.39676.carsten.aulbert@aei.mpg.de> References: <201002230733.39676.carsten.aulbert@aei.mpg.de> Message-ID: > No, but there seem to be a switch in the kernel module that allows to trigger > a kernel panic upon discovering uncorrectable errors. I suspect you mean /sys/module/edac_mc/panic_on_ue (ue = uncorrected error). I consider this very much the norm: it would be very strange to run with ECC memory, and ECC enabled, and not actually halt on UE. UE represents a failure of the memory system, not just a transient event, but something which must be physically fixed. even for HA situations, I'd be pretty skeptical about using a memory channel which had any UE's on it. CE (corrected errors) OTOH, are very different. they're almost just a heartbeat of your ECC subsystem. yes, a CE indicates some event that needed correcting, but at a modest rate, CEs are acceptable. there are failure modes, though, where enough CEs eventually cause a UE: tracking CE rate is important for that reason. (other UE modes don't have this warning sign...) you can set CEs to log through kernel->syslog via edac tunables in /sys. > Yes, but the memory of any process might get corrupted, thus this is more to if UE is set to panic, nothing will get corrupted (that's really the point eh?) From rpnabar at gmail.com Tue Feb 23 12:10:10 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 23 Feb 2010 14:10:10 -0600 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: <4B843470.6010906@tamu.edu> References: <4B843470.6010906@tamu.edu> Message-ID: On Tue, Feb 23, 2010 at 2:02 PM, Gerry Creager wrote: > > It's my firm opinion that loops and STP are evil for HPC installations. > Period. Thanks Gerry! This seems like one of the rare HPC-topics where such a clear answer is present! :) "It depends" is more usual for me to hear. I bet you have excellent reasons for hating STP so much, but can I know why? Just curious. Bad implementations, performance hits? Pardon my ignorance, if I am asking something that's plain obviously "evil". -- Rahul From lindahl at pbm.com Tue Feb 23 13:10:15 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 13:10:15 -0800 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: References: Message-ID: <20100223211015.GB8195@bx9.net> On Tue, Feb 23, 2010 at 01:23:59PM -0600, Rahul Nabar wrote: > In the interest of latency minimum switch hops make sense and for that > loops might sometimes provide the best solution. STP disables all loops. All you gain is a bit of redundancy, but the price is high. -- greg From rpnabar at gmail.com Tue Feb 23 13:15:28 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 23 Feb 2010 15:15:28 -0600 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: <20100223211015.GB8195@bx9.net> References: <20100223211015.GB8195@bx9.net> Message-ID: On Tue, Feb 23, 2010 at 3:10 PM, Greg Lindahl wrote: > On Tue, Feb 23, 2010 at 01:23:59PM -0600, Rahul Nabar wrote: > >> In the interest of latency minimum switch hops make sense and for that >> loops might sometimes provide the best solution. > > STP disables all loops. All you gain is a bit of redundancy, but the > price is high. I see! That makes sense. Too bad. I wish there was some non-STP way of dealing with loops then. -- Rahul From lindahl at pbm.com Tue Feb 23 13:29:04 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 13:29:04 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> Message-ID: <20100223212904.GD8195@bx9.net> On Fri, Feb 19, 2010 at 02:36:34PM -0800, Gilad Shainer wrote: > Nice to hear from you Greg, hope all is well. I hope all is well with you, Gilad. From what I can tell, you're again visiting that alternate Universe that you sometimes visit -- is it nice there? > I don't forget anything, at least for now. OSU has different benchmarks > so you can measure message coalescing or real message rate. Funny to > read that Q hated coalescing when they created the first benchmark for > that The benchmark that we created is not a coalescing benchmark. Coalescing produces a meaningless answer from the message rate benchmark. Real apps don't get much of a benefit from message coalescing, but (if they send smallish messages) they get a big benefit from a good non-coalesced message rate. If you look back in the archives of this list, you can find me saying that. And some other people were involved in the discussion, too. Or, we can take your word for what happened, instead of looking at history. I know which method you prefer. -- greg From lindahl at pbm.com Tue Feb 23 13:33:49 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 13:33:49 -0800 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: References: <20100223211015.GB8195@bx9.net> Message-ID: <20100223213349.GE8195@bx9.net> On Tue, Feb 23, 2010 at 03:15:28PM -0600, Rahul Nabar wrote: > On Tue, Feb 23, 2010 at 3:10 PM, Greg Lindahl wrote: > > On Tue, Feb 23, 2010 at 01:23:59PM -0600, Rahul Nabar wrote: > > > >> In the interest of latency minimum switch hops make sense and for that > >> loops might sometimes provide the best solution. > > > > STP disables all loops. All you gain is a bit of redundancy, but the > > price is high. > > I see! That makes sense. Too bad. I wish there was some non-STP way > of dealing with loops then. Managed switches often include a non-STP way of finding and suppressing broadcast storms -- I know HP and Cisco have that. I don't know if it's any better than STP, though. In the InfiniBand world loops are encouraged & provide a nice performance benefit -- the routes are worked out globally by the Subnet Manager. Also, there is ethernet switch silicon that has an alternate routing mechanism that's as good as IB -- but I don't remember if it's standardized or compatible between different silicon vendors. -- greg From hahn at mcmaster.ca Tue Feb 23 13:57:23 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue, 23 Feb 2010 16:57:23 -0500 (EST) Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <20100223212904.GD8195@bx9.net> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> Message-ID: > Coalescing produces a meaningless answer from the message rate > benchmark. Real apps don't get much of a benefit from message > coalescing, but (if they send smallish messages) they get a big > benefit from a good non-coalesced message rate. in the interests of less personal/posturing/pissing, let me ask: where does the win from coalescing come from? I would have thought that coalescing is mainly a way to reduce interrupts, a technique that's familiar from ethernet interrupt mitigation, NAPI, even basic disk scheduling. to me it looks like the key factor would be "propagation of desire" - when the app sends a message and will do nothing until the reply, it probably doesn't make sense to coalesce that message. otoh it's interesting if user-level can express non-urgency as well. my guess is the other big thing is LogP-like parameters (gap -> piggybacking). assuming MPI is the application-level interface, are there interesting issues related to knowing where to deliver messages? I don't have a good understanding about where things stand WRT things like QP usage (still N*N? is N node count or process count?) or unexpected messages. now that I'm inventorying ignorance, I don't really understand why RDMA always seems to be presented as a big hardware issue. wouldn't it be pretty easy to define an eth or IP-level protocol to do remote puts, gets, even test-and-set or reduce primitives, where the interrupt handler could twiddle registered blobs of user memory on the target side? regards, mark hahn. From patrick at myri.com Tue Feb 23 14:10:29 2010 From: patrick at myri.com (Patrick Geoffray) Date: Tue, 23 Feb 2010 17:10:29 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> Message-ID: <4B845255.9050407@myri.com> Brian, On 2/19/2010 1:25 PM, Brian Dobbins wrote: > the IB cards. With a 4-socket node having between 32 and 48 cores, lots > of computing can get done fast, possibly stressing the network. > > I know Qlogic has made a big deal about the InfiniPath adapter's > extremely good message rate in the past... is this still an important > issue? How do the latest Mellanox adapters compare? I have been quite vocal in the past against the merit of high packet rate, but I have learned to appreciate it. There is a set of applications that can benefit from it, especially at scale. Actually, packet rate is much more important outside of HPC (where application throughput is what money buys). However, I would pay attention to a different problem with many-core machines. Each user-space process uses a dedicated set of NIC resources, and this can be a problem with 48 cores per node (it affects all vendors, even if they swear otherwise). You may want to consider multiple NICs, unless you know that only a subset of the cores are communicating through the network (hybrid MPI/Open-MP model for example) or that the multiplexing overhead is not a big deal for you. > On a similar note, does a dual-port card provide an increase in > on-card processing, or 'just' another link? (The increased bandwidth is > certainly nice, even in a flat switched network, I'm sure!) You need PCIe Gen2 x16 to saturate a 32 Gb/s QDR link. There is no such NIC on the market AFAIK (only Gen1 x16 or Gen2 x8). But even then, you won't have any PCIe bandwidth left to drive a second port on the same NIC. There may be other rationales for a second port, but bandwidth is not one of them. Patrick From Shainer at mellanox.com Tue Feb 23 14:24:09 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Tue, 23 Feb 2010 14:24:09 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0266305A@mtiexch01.mti.com> > On a similar note, does a dual-port card provide an increase in > on-card processing, or 'just' another link? (The increased bandwidth is > certainly nice, even in a flat switched network, I'm sure!) Today one port IB (assuming QDR) can saturate the PCIe Gen2 interface that is supported. Using 2 ports can give you high availability or fail over. Some folks are using it to increase the message rate as well, but I have not seen numbers to confirm. Other folks are using dual one-port adapters, and this gives them 2x the BW. Gilad From lindahl at pbm.com Tue Feb 23 14:32:21 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 14:32:21 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> Message-ID: <20100223223221.GA24604@bx9.net> On Tue, Feb 23, 2010 at 04:57:23PM -0500, Mark Hahn wrote: > in the interests of less personal/posturing/pissing, let me ask: > where does the win from coalescing come from? I would have thought > that coalescing is mainly a way to reduce interrupts, a technique > that's familiar from ethernet interrupt mitigation, NAPI, even basic disk > scheduling. The coalescing we're talking about here is more like TCP's Nagle algorithm: The sending side defers sending a packet so that it can send a single larger packet instead of several small ones. In HPC we mostly hate the Nagle algorithm, because it isn't omniscient: it tends to always delay our messages hoping to get a 2nd one to the same target, but we rarely send a 2nd message to the same target that could be combined. People don't write much MPI code that works like that; it's always better to do the combining yourself. > to me it looks like the key factor would be "propagation of desire" - > when the app sends a message and will do nothing until the reply, > it probably doesn't make sense to coalesce that message. Yes, that's one way to think about it. > assuming MPI is the application-level interface, are there interesting > issues related to knowing where to deliver messages? I don't have a > good understanding about where things stand WRT things like QP usage > (still N*N? is N node count or process count?) or unexpected messages. A traditional MPI implementation uses N QPs x N processes, so the global number of QPs is N^2. InfiniPath's pm library for MPI uses a much smaller endpoint than a QP. Using a ton of QPs does slow down things (hurts scaling), and that's why SRQ (shared receive queues) was added to IB. MVAPICH has several different ways it can handle messages, configured (last I looked) at compile time: checking memory for delivered messages for tiny clusters, ordinary QPs at medium size, SRQ at large cluster sizes. The reason it switches is scalability; SQRs scale better but are fairly expensive in the Mellanox silicon. Since latency/bandwidth benchmarks are generally run at only 2 nodes, well, you can fill in the rest of this paragraph. InfiniPath's pm library uses a lighter-weight thing that's somewhat like an SRQ -- at all cluster sizes. This is why it scales so nicely. It wasn't a novel invention -- the T3E MPI implementation used a similar gizmo. > now that I'm inventorying ignorance, I don't really understand why RDMA > always seems to be presented as a big hardware issue. wouldn't it be > pretty easy to define an eth or IP-level protocol to do remote puts, > gets, even test-and-set or reduce primitives, where the interrupt handler > could twiddle registered blobs of user memory on the target side? That approach is called Active Messages, and can be bolted on to pretty much every messaging implementation. Doesn't OpenMX provide that kind of interface? The NoSQL distributed computing thingie we built for Blekko's search engine uses active messages. -- greg From bdobbins at gmail.com Tue Feb 23 14:35:41 2010 From: bdobbins at gmail.com (Brian Dobbins) Date: Tue, 23 Feb 2010 17:35:41 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B845255.9050407@myri.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> Message-ID: <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> Hi Patrick, I have been quite vocal in the past against the merit of high packet rate, > but I have learned to appreciate it. There is a set of applications that can > benefit from it, especially at scale. Actually, packet rate is much more > important outside of HPC (where application throughput is what money buys). > The 'especially at scale' bit seems to me to be the critical issue - weighing the price/performance as the ratio of small-scale to large-scale runs changes, assuming that an adapter with better large-scale performance has a significant cost differential. If only we knew what that ratio would be ahead of time, this would be easier. :-) However, I would pay attention to a different problem with many-core > machines. Each user-space process uses a dedicated set of NIC resources, and > this can be a problem with 48 cores per node (it affects all vendors, even > if they swear otherwise). You may want to consider multiple NICs, unless you > know that only a subset of the cores are communicating through the network > (hybrid MPI/Open-MP model for example) or that the multiplexing overhead is > not a big deal for you. Well, clearly we hope to move more towards hybrid methods -all that's old is new again?- but, again, it's *currently* hard to quantify the variables involved. Time to transition, performance differences, user effort, etc. But getting back to a technical vein, is the multiplexing an issue due to atomic locks on mapped memory pages? Or just because each copy reserves its own independent buffers? What are the critical issues? > You need PCIe Gen2 x16 to saturate a 32 Gb/s QDR link. There is no such NIC > on the market AFAIK (only Gen1 x16 or Gen2 x8). But even then, you won't > have any PCIe bandwidth left to drive a second port on the same NIC. There > may be other rationales for a second port, but bandwidth is not one of them. > I thought PCIe Gen2 x 8 @ 500 Mhz gives 8GB/s? I know there are 250 and 500 Mhz variants in addition to the lane sizes, so while a 250 Mhz x8 link wouldn't provide enough bandwidth to a dual-port card, the 500 Mhz one should. But I'm woefully out of date on my hardware knowledge, it seems. Of course, EDR (eight data-rate) IB is on the roadmap for 2011, so if we're in no rush that could help, too. Cheers, - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdobbins at gmail.com Tue Feb 23 14:42:11 2010 From: bdobbins at gmail.com (Brian Dobbins) Date: Tue, 23 Feb 2010 17:42:11 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> Message-ID: <2b5e0c121002231442g75c44504p88a35314994d412b@mail.gmail.com> > I thought PCIe Gen2 x 8 @ 500 Mhz gives 8GB/s? I know there are 250 and > 500 Mhz variants in addition to the lane sizes, so while a 250 Mhz x8 link > wouldn't provide enough bandwidth to a dual-port card, the 500 Mhz one > should. But I'm woefully out of date on my hardware knowledge, it seems. > Of course, EDR (eight data-rate) IB is on the roadmap for 2011, so if we're > in no rush that could help, too. > Er,.. I thought about this a half-second after I sent it. First, that's 5.0 Ghz, not 500 Mhz, and second, that's probably aggregate bi-directional throughout at 8 GB/s, isn't it? So, I got it, 4 GB/s per direction, which means dual-port won't get us much in terms of unidirectional throughput. - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Tue Feb 23 14:55:13 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 14:55:13 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> Message-ID: <20100223225513.GC24604@bx9.net> On Tue, Feb 23, 2010 at 05:35:41PM -0500, Brian Dobbins wrote: > Well, clearly we hope to move more towards hybrid methods -all that's old > is new again?- If you want bad performance, sure. If you want good performance, you want a device which supports talking to a lot of cores, and then multiple devices per node, before you go hybrid. The first two don't require changing your code. The last does. The main reason to use hybrid is if there isn't enough parallelism in your code/dataset to use the cores independently. > But getting back to a technical vein, is the multiplexing an issue due to > atomic locks on mapped memory pages? Or just because each copy reserves its > own independent buffers? What are the critical issues? It's all implementation-dependent. A card might have an on-board memory limit, or a limited number of "engines" which process messages. Even if it has a option to store some data in main memory, often that results in a scalability hit. -- greg From Shainer at mellanox.com Tue Feb 23 14:51:48 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Tue, 23 Feb 2010 14:51:48 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com><4B845255.9050407@myri.com><2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> <2b5e0c121002231442g75c44504p88a35314994d412b@mail.gmail.com> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F0266305F@mtiexch01.mti.com> PCIe Gen2 at 5GT, and 8/10 bit encoding and current chipsets efficiencies gives you around 3.3GB/s per direction, so one IB QDR port can handle that. For more BW out of the host, you can use more adapters (single port ones are the cost effective solution for that). Gilad From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Brian Dobbins Sent: Tuesday, February 23, 2010 2:42 PM To: Patrick Geoffray Cc: beowulf at beowulf.org Subject: Re: [Beowulf] Q: IB message rate & large core counts (per node)? I thought PCIe Gen2 x 8 @ 500 Mhz gives 8GB/s? I know there are 250 and 500 Mhz variants in addition to the lane sizes, so while a 250 Mhz x8 link wouldn't provide enough bandwidth to a dual-port card, the 500 Mhz one should. But I'm woefully out of date on my hardware knowledge, it seems. Of course, EDR (eight data-rate) IB is on the roadmap for 2011, so if we're in no rush that could help, too. Er,.. I thought about this a half-second after I sent it. First, that's 5.0 Ghz, not 500 Mhz, and second, that's probably aggregate bi-directional throughout at 8 GB/s, isn't it? So, I got it, 4 GB/s per direction, which means dual-port won't get us much in terms of unidirectional throughput. - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shainer at mellanox.com Tue Feb 23 15:08:48 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Tue, 23 Feb 2010 15:08:48 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? References: <20100219220538.GL2857@bx9.net><9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02663069@mtiexch01.mti.com> > The benchmark that we created is not a coalescing benchmark. > > Coalescing produces a meaningless answer from the message rate > benchmark. Real apps don't get much of a benefit from message > coalescing, but (if they send smallish messages) they get a big > benefit from a good non-coalesced message rate. Agree, most benefit will come from non-coalesced message rate - and for the broader audience - from the ability to send one MPI message within a single network packet. Message coalescing is when you incorporate multiple MPI messages in a single network packet. > If you look back in the archives of this list, you can find me saying > that. And some other people were involved in the discussion, too. > > Or, we can take your word for what happened, instead of looking at > history. I know which method you prefer. You can look at history or not, your decision. My claim (and more important for the present) is that the so called "non-coalescing" results published on latest InfiniPath based products are actually with coalescing. The other universe you called it? From bdobbins at gmail.com Tue Feb 23 15:23:59 2010 From: bdobbins at gmail.com (Brian Dobbins) Date: Tue, 23 Feb 2010 18:23:59 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <20100223225513.GC24604@bx9.net> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> <20100223225513.GC24604@bx9.net> Message-ID: <2b5e0c121002231523w7789d3cqc1fc03c4f0563a04@mail.gmail.com> Hi Greg, > Well, clearly we hope to move more towards hybrid methods -all that's > old > > is new again?- > > If you want bad performance, sure. If you want good performance, you > want a device which supports talking to a lot of cores, and then > multiple devices per node, before you go hybrid. The first two don't > require changing your code. The last does. > > The main reason to use hybrid is if there isn't enough parallelism in > your code/dataset to use the cores independently. > Actually, it's often *for* performance that we look towards hybrid methods, albeit in an indirect way - with RAM amounts per node increasing at the same or lesser rate than cores, and with each MPI task on *some* of our codes having a pretty hefty memory footprint, using fewer MPI processes and more threads per task lets us fully utilize nodes that would otherwise have cores sitting idle due to a lack of available memory. Sure, we could rewrite the code to tackle this, too, but in general it seems easier to add threading in than to rework a complicated parallel decomposition, shared buffers, etc. In a nutshell, even if a hybrid mode *costs* me 10-20% over a direct mode with an equal number of processors, if it allows me to use 50% more cores in a node, it works out well for us. But yes, ignoring RAM constraints, non-hybrid parallelism tends to be nicer at the moment. > > But getting back to a technical vein, is the multiplexing an issue due to > > atomic locks on mapped memory pages? Or just because each copy reserves > its > > own independent buffers? What are the critical issues? > > It's all implementation-dependent. A card might have an on-board > memory limit, or a limited number of "engines" which process > messages. Even if it has a option to store some data in main memory, > often that results in a scalability hit. > Thanks. I guess I need to read up on quite a bit more and set up some tests. Cheers, - Brian -------------- next part -------------- An HTML attachment was scrubbed... URL: From mathog at caltech.edu Tue Feb 23 16:40:01 2010 From: mathog at caltech.edu (David Mathog) Date: Tue, 23 Feb 2010 16:40:01 -0800 Subject: [Beowulf] Arima motherboards with SATA2 drives Message-ID: Have any of you seen a patched BIOS for the Arima HDAM* motherboards that resolves the issue of the Sil 3114 SATA controller locking up when it sees a SATA II disk? (Even a disk jumpered to Sata I speeds.) Silicon Image released a BIOS fix for this, but since all of these motherboards use a Phoenix BIOS, it is not like an AMI or Award BIOS, where there are published methods for swapping out the broken chunk of BIOS (5.0.49) for the one with the fix (5.4.0.3). Sure, one could work around this on a single disk system, at least, with an IDE to SATA2 converter, or a PCI(X) Sata(2) controller, but reflashing the BIOS would be easier. Or it would be if Flextronics, who bought this product line from Arima, would issue another BIOS update :-(. Thanks, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From lindahl at pbm.com Tue Feb 23 20:42:30 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 23 Feb 2010 20:42:30 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <2b5e0c121002231523w7789d3cqc1fc03c4f0563a04@mail.gmail.com> References: <2b5e0c121002191025p521196bdm941cd3f018e8b305@mail.gmail.com> <4B845255.9050407@myri.com> <2b5e0c121002231435q4f6275cbs9ad010b3a50e8c86@mail.gmail.com> <20100223225513.GC24604@bx9.net> <2b5e0c121002231523w7789d3cqc1fc03c4f0563a04@mail.gmail.com> Message-ID: <20100224044230.GA19953@bx9.net> On Tue, Feb 23, 2010 at 06:23:59PM -0500, Brian Dobbins wrote: > Actually, it's often *for* performance that we look towards hybrid > methods, albeit in an indirect way - with RAM amounts per node increasing at > the same or lesser rate than cores, and with each MPI task on *some* of our > codes having a pretty hefty memory footprint, using fewer MPI processes and > more threads per task lets us fully utilize nodes that would otherwise have > cores sitting idle due to a lack of available memory. If you have data structures which are identical in every process, you might be able to mmap them shared, reducing your memory usage even for the hybrid case. I have only once before heard this kind of thing reported, and perhaps it was you (can't remember who). I don't think it's a big driver for hybrid programming. Anyone else? -- greg From ebiederm at xmission.com Tue Feb 23 21:03:33 2010 From: ebiederm at xmission.com (Eric W. Biederman) Date: Tue, 23 Feb 2010 21:03:33 -0800 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: <20100223213349.GE8195@bx9.net> (Greg Lindahl's message of "Tue\, 23 Feb 2010 13\:33\:49 -0800") References: <20100223211015.GB8195@bx9.net> <20100223213349.GE8195@bx9.net> Message-ID: Greg Lindahl writes: > On Tue, Feb 23, 2010 at 03:15:28PM -0600, Rahul Nabar wrote: >> On Tue, Feb 23, 2010 at 3:10 PM, Greg Lindahl wrote: >> > On Tue, Feb 23, 2010 at 01:23:59PM -0600, Rahul Nabar wrote: >> > >> >> In the interest of latency minimum switch hops make sense and for that >> >> loops might sometimes provide the best solution. >> > >> > STP disables all loops. All you gain is a bit of redundancy, but the >> > price is high. >> >> I see! That makes sense. Too bad. I wish there was some non-STP way >> of dealing with loops then. > > Managed switches often include a non-STP way of finding and > suppressing broadcast storms -- I know HP and Cisco have that. > I don't know if it's any better than STP, though. > > In the InfiniBand world loops are encouraged & provide a nice > performance benefit -- the routes are worked out globally by the > Subnet Manager. Also, there is ethernet switch silicon that has an > alternate routing mechanism that's as good as IB -- but I don't > remember if it's standardized or compatible between different silicon > vendors. For the most trivial of loops there is link aggregation. For more interesting loops you can run many ethernet switches as wire speed ip routers talking a routing protocol like ospf. Eric From rpnabar at gmail.com Tue Feb 23 21:32:02 2010 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 23 Feb 2010 23:32:02 -0600 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: References: <20100223211015.GB8195@bx9.net> <20100223213349.GE8195@bx9.net> Message-ID: On Tue, Feb 23, 2010 at 11:03 PM, Eric W. Biederman wrote: > > For the most trivial of loops there is link aggregation. Yup, that's true. Am already using link aggregation but never thought of that as a loop before. But makes sense. > For more interesting loops you can run many ethernet switches > as wire speed ip routers talking a routing protocol like ospf. Thanks! I will look that up. Maybe it serves my need. -- Rahul From forum.san at gmail.com Tue Feb 23 21:40:12 2010 From: forum.san at gmail.com (Sangamesh B) Date: Wed, 24 Feb 2010 11:10:12 +0530 Subject: [Beowulf] which mpi library should I focus on? In-Reply-To: <20100223154639.GB695@sopalepc> References: <13e802631002201049j59e06a9vd8e1e7a05e8a47e5@mail.gmail.com> <20100223154639.GB695@sopalepc> Message-ID: Hi, I hope you are developing MPI codes and wants to run in cluster environment. If so, I prefer you to use Open MPI. Because, Open MPI is well developed and its stable Has a very good FAQ section, where you will get clear your doubts easily. It has a in-built tight-integration method with cluster schedulers- SGE, PBS, LSF etc. It has an option to choose ETHERNET or INFINIBAND network connectivity during run-time. Thanks, Sangamesh On Tue, Feb 23, 2010 at 9:16 PM, Douglas Guptill wrote: > On Tue, Feb 23, 2010 at 09:25:45AM -0500, Brock Palen wrote: > > > (shameless plug) if you want, listen to our podcast on OpenMPI > > http://www.rce-cast.com/index.php/Podcast/rce01-openmpi.html > > > > The MPICH2 show is recorded (edited it last night, almost done!), and > > will be released this Saturday Midnight Eastern. > > If you want to hear the rough cut, to compare to OpenMPI, email me and I > > will send you the unfinished mp3. > > That sounds like a nice pair. OpenMPI vs MPICH2. > > Douglas. > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From henning.fehrmann at aei.mpg.de Tue Feb 23 23:20:43 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Wed, 24 Feb 2010 08:20:43 +0100 Subject: [Beowulf] Re: RAM ECC errors In-Reply-To: References: Message-ID: <20100224072043.GA7998@gretchen.aei.mpg.de> Hi David, Thank you for the response. > Carsten Aulbert wrote > > > Are you saying that now that you are monitoring you are seeing kernel > > > panics which did not appear before? > > > > > > > No, but there seem to be a switch in the kernel module that allows to > trigger > > a kernel panic upon discovering uncorrectable errors. > > By "switch" do you mean: > A. There is an option that may be set when that module is loaded which > will then cause it to panic on an uncorrectable error, where normally it > would not. > B. There has been a change in the module code between kernel versions > that causes it to panic now on events where it formerly did not panic. It is A. There is a module parameter for edac_core: edac_mc_panic_on_ue=1. We have not tested it yet since uncorrectable errors rarely occur. > > > > You can get some information through netconsole, but you know that > already. > > > > > > > Yup already running, question is if a kernel panic would also be fully > visible > > via netconsole - we are glad that we rarely have those ;) > > I have seen one kernel panic since turning on netconsole, and it did log > across the network and showed up in /var/log/messages as it was supposed > to, with the same information presented as in the tests. Limited data, > but it would seem the answer is "at least sometimes". I got a hint from one of the kernel developer. Including the show show_state() function into panic.c right before dump_stack() should give process information via printk which could be collected with netconsole. We are still waiting for an UE event. > > > Yes, but the memory of any process might get corrupted, thus this is > more to > > learn which user is currently running jobs. Which in turn enables us > to notify > > these users that this particular machine running these jobs had a > problem and > > the user might need to re-run her jobs to prevent "false" data > entering her > > job. > > If the node blows up presumably the output of all the jobs currently > running there will clearly indicate that there was a failure - so you > should not have to notify those users since they will see the problem in > their results. (Unless MPI, or PVM, or whatever is being used to spread > jobs around, ignores fatal errors, which should never be the case.) For > jobs which completed earlier on the same node, this would have been > before an uncorrectable error took place, so the results should be OK. Yes, this is correct. A panic should be enough to avoid corrupted data. Often, jobs are failing for other reasons. A process list might help us to exclude other possibilities for job failure. It makes the work a bit more convenient. Cheers, Henning From henning.fehrmann at aei.mpg.de Tue Feb 23 23:30:31 2010 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Wed, 24 Feb 2010 08:30:31 +0100 Subject: [Beowulf] Re: RAM ECC errors (Henning Fehrmann) In-Reply-To: <22487_1266955822_4B84362E_22487_708_1_Pine.LNX.4.64.1002231419410.15558@coffee.psychology.mcmaster.ca> References: <201002230733.39676.carsten.aulbert@aei.mpg.de> <22487_1266955822_4B84362E_22487_708_1_Pine.LNX.4.64.1002231419410.15558@coffee.psychology.mcmaster.ca> Message-ID: <20100224073031.GB7998@gretchen.aei.mpg.de> Hi Mark, On Tue, Feb 23, 2010 at 03:05:39PM -0500, Mark Hahn wrote: > >No, but there seem to be a switch in the kernel module that allows to trigger > >a kernel panic upon discovering uncorrectable errors. > > I suspect you mean /sys/module/edac_mc/panic_on_ue > (ue = uncorrected error). I consider this very much the norm: > it would be very strange to run with ECC memory, and ECC enabled, > and not actually halt on UE. UE represents a failure of the memory > system, not just a transient event, but something which must be > physically fixed. even for HA situations, I'd be pretty skeptical > about using a memory channel which had any UE's on it. Strangely enough, panic_on_ue is off by default. > > CE (corrected errors) OTOH, are very different. they're almost just > a heartbeat of your ECC subsystem. yes, a CE indicates some event > that needed correcting, but at a modest rate, CEs are acceptable. > there are failure modes, though, where enough CEs eventually cause a > UE: tracking CE rate is important for that reason. (other UE modes > don't have this warning sign...) On some apparently broken hardware we have a rate of nearly one event per second. I assume the probability of having uncorrectable errors is few orders of magnitude smaller than the rate of correctable errors since more event have to occur simultaneously. And hopefully, the rate of a silent corruption is still smaller. > > you can set CEs to log through kernel->syslog via edac tunables in /sys. > > >Yes, but the memory of any process might get corrupted, thus this is more to > > if UE is set to panic, nothing will get corrupted (that's really the point eh?) Correct, but it helps rule out other reasons for job failures. Cheers, Henning From robh at dongle.org.uk Wed Feb 24 04:12:07 2010 From: robh at dongle.org.uk (Robert Horton) Date: Wed, 24 Feb 2010 12:12:07 +0000 Subject: [Beowulf] Spanning Tree Protocol and latency: allowing loops in switching networks for minimizing switch hops In-Reply-To: References: Message-ID: <1267013527.7577.35.camel@moelwyn.maths.qmul.ac.uk> On Tue, 2010-02-23 at 13:23 -0600, Rahul Nabar wrote: > In the interest of latency minimum switch hops make sense and for that > loops might sometimes provide the best solution. Using STP won't give you a latency advantage; it just disables some links in a network with loops so you have a single spanning tree. Rob From h-bugge at online.no Wed Feb 24 05:00:27 2010 From: h-bugge at online.no (=?iso-8859-1?Q?H=E5kon_Bugge?=) Date: Wed, 24 Feb 2010 14:00:27 +0100 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <20100223223221.GA24604@bx9.net> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> <20100223223221.GA24604@bx9.net> Message-ID: <1564E081-759B-44BA-A482-519AB5D0DF14@online.no> Hi Greg, On Feb 23, 2010, at 23:32 , Greg Lindahl wrote: > A traditional MPI implementation uses N QPs x N processes, so the > global number of QPs is N^2. InfiniPath's pm library for MPI uses a > much smaller endpoint than a QP. Using a ton of QPs does slow down > things (hurts scaling), and that's why SRQ !!:gs/SRQ/XRC/ SRQ reduces the number of receive queues used, and thereby reduces the footprint of the receive buffers. As such, SRQ does not change number of QPs used. Actually, by using bucketed receive queues (one receive queue per bucket size), you need more QPs (I believe Open MPI use 4 QPs per connection using SRQ). XRC on the other hand, reduces the number of QPs per node from NxPPN to N. I have seen impacts of this running only 8 processes per node. In that particular case, the application ran faster using IPoIB for only 128 processes. I assume XRC would have alleviated this effect, but I had no opportunity to evaluate XRC at the time. Hence, I would advice anyone performing benchmarking around this ssue to also include XRC and/or lazy connection establishment to see if the number of QPs in use affects performance. Thanks, H?kon From hahn at mcmaster.ca Wed Feb 24 07:36:17 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 24 Feb 2010 10:36:17 -0500 (EST) Subject: [Beowulf] Re: RAM ECC errors (Henning Fehrmann) In-Reply-To: <14923_1266996853_o1O7Y1Vi014830_20100224073031.GB7998@gretchen.aei.mpg.de> References: <201002230733.39676.carsten.aulbert@aei.mpg.de> <22487_1266955822_4B84362E_22487_708_1_Pine.LNX.4.64.1002231419410.15558@coffee.psychology.mcmaster.ca> <14923_1266996853_o1O7Y1Vi014830_20100224073031.GB7998@gretchen.aei.mpg.de> Message-ID: > Strangely enough, panic_on_ue is off by default. this seems to be version-dependent (we have a bunch of HP XC clusters that have panic_on_ue (and log_ce) enabled by default. I didn't check the sources to see whether HP had patched this, though. > On some apparently broken hardware we have a rate of nearly one event > per second. I assume the probability of having uncorrectable errors is it's certainly possible to have periodic CEs (some page that gets accessed by a periodic timer, etc). but more likely, this is just the edac module's polling rate (/sys/devices/system/edac/mc/poll_msec = 1000, right?) my experience is that if you're getting CE logs at 1 Hz, then you're actual CE rate is potentially a lot higher: there's a ce_noinfo_count control which indicates when there are too many CEs per poll. if you really wanted to find the rate, you could crank up poll_msec, but my experience is that >1 Hz probably calls for a physical fix. OTOH, I do observe machines where a reboot seems to make the CEs go away. that's worrisome. on other machines, reseating dimms does the trick (also a bit worrisome, or at least annoying.) I have a script that I run to summarize this stuff and decode the channel/row to dimm numbers. for a cluster, I normally run "pdsh -a collect_edac_stats | sort" to look for problem nodes... From brice.goglin at gmail.com Tue Feb 23 15:16:39 2010 From: brice.goglin at gmail.com (Brice Goglin) Date: Wed, 24 Feb 2010 00:16:39 +0100 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <20100223223221.GA24604@bx9.net> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> <20100223223221.GA24604@bx9.net> Message-ID: <4B8461D7.1000505@gmail.com> Greg Lindahl wrote: >> now that I'm inventorying ignorance, I don't really understand why RDMA >> always seems to be presented as a big hardware issue. wouldn't it be >> pretty easy to define an eth or IP-level protocol to do remote puts, >> gets, even test-and-set or reduce primitives, where the interrupt handler >> could twiddle registered blobs of user memory on the target side? >> > > That approach is called Active Messages, and can be bolted on to > pretty much every messaging implementation. Doesn't OpenMX provide > that kind of interface? > Open-MX offers what MX offers: no explicit RDMA interface, only 2-sided. But something similar to a remote get is used internally for large messages. It wouldn't be hard to mplement some RDMA-like features in such a software-only model like Mark said above. Brice From atchley at myri.com Wed Feb 24 18:49:38 2010 From: atchley at myri.com (Scott Atchley) Date: Wed, 24 Feb 2010 21:49:38 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <4B8461D7.1000505@gmail.com> References: <20100219220538.GL2857@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02662CA3@mtiexch01.mti.com> <20100223212904.GD8195@bx9.net> <20100223223221.GA24604@bx9.net> <4B8461D7.1000505@gmail.com> Message-ID: On Feb 23, 2010, at 6:16 PM, Brice Goglin wrote: > Greg Lindahl wrote: >>> now that I'm inventorying ignorance, I don't really understand why RDMA >>> always seems to be presented as a big hardware issue. wouldn't it be >>> pretty easy to define an eth or IP-level protocol to do remote puts, >>> gets, even test-and-set or reduce primitives, where the interrupt handler >>> could twiddle registered blobs of user memory on the target side? >>> >> >> That approach is called Active Messages, and can be bolted on to >> pretty much every messaging implementation. Doesn't OpenMX provide >> that kind of interface? >> > > Open-MX offers what MX offers: no explicit RDMA interface, only 2-sided. > But something similar to a remote get is used internally for large > messages. It wouldn't be hard to mplement some RDMA-like features in > such a software-only model like Mark said above. > > Brice Don't forget the unexpected handler which can provide some Active Message behavior. Scott From kus at free.net Thu Feb 25 10:19:20 2010 From: kus at free.net (Mikhail Kuzminsky) Date: Thu, 25 Feb 2010 21:19:20 +0300 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: Message-ID: BTW, is Cray SeaStar2+ better than IB - for nodes w/many cores ? And I didn't see latencies comparison for SeaStar vs IB. Mikhail From lindahl at pbm.com Thu Feb 25 12:50:44 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Thu, 25 Feb 2010 12:50:44 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02663069@mtiexch01.mti.com> References: <20100223212904.GD8195@bx9.net> <9FA59C95FFCBB34EA5E42C1A8573784F02663069@mtiexch01.mti.com> Message-ID: <20100225205044.GC9879@bx9.net> On Tue, Feb 23, 2010 at 03:08:48PM -0800, Gilad Shainer wrote: > My claim (and more important for the present) is that the so called > "non-coalescing" results published on latest InfiniPath based > products are actually with coalescing. The other universe you called > it? You claimed several things, but fine, I'm happy just talking about only this one. As I said before, I can't speak about the latest True Scale numbers because I don't work for QLogic anymore. But I am willing to accept your pledge that you'll quit your job if you're wrong. -- greg From hahn at mcmaster.ca Thu Feb 25 13:53:59 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 25 Feb 2010 16:53:59 -0500 (EST) Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: References: Message-ID: > BTW, is Cray SeaStar2+ better than IB - for nodes w/many cores ? BW seems like a SeaStar2+ advantage, though I haven't seen any discussion of latency or message rate. > And I didn't see latencies comparison for SeaStar vs IB. my guess is that for short routes, seastar is competitive, but can fall behind on large machines (long routes). regardless of how tight the seastar per-hop latency is, IB has 2.5x the per-hop fanout (2or 3 outgoing 9.6 GB links versus 18 outgoing 4 GB/s links). higher radix means an advantage that increases with size. From richard.walsh at comcast.net Thu Feb 25 18:27:25 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 26 Feb 2010 02:27:25 +0000 (UTC) Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: Message-ID: <438542495.7562021267151245516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Mark Hahn wrote: >Regardless of how tight the seastar per-hop latency is, >IB has 2.5x the per-hop fanout (2or 3 outgoing 9.6 GB links >versus 18 outgoing 4 GB/s links). higher radix means an >advantage that increases with size. Doesn't this assume worst case all-to-all type communication patterns. If you are just trading ghost cell data with your neighbors and you have placed your job smartly on the torus the fan out advantage mentioned is irrelevant. No? rbw _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Thu Feb 25 21:10:41 2010 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 26 Feb 2010 00:10:41 -0500 (EST) Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: <438542495.7562021267151245516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <438542495.7562021267151245516.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: >> Regardless of how tight the seastar per-hop latency is, >> IB has 2.5x the per-hop fanout (2or 3 outgoing 9.6 GB links >> versus 18 outgoing 4 GB/s links). higher radix means an >> advantage that increases with size. > > Doesn't this assume worst case all-to-all type communication > patterns. I'm assuming random point-to-point communication, actually. > If you are just trading ghost cell data with your neighbors > and you have placed your job smartly on the torus the fan out > advantage mentioned is irrelevant. No? if your comms are nearest-neighbor, then yes, a nearest-neighbor fabric is your friend ;) how often does that actually happen? to work out so neatly would preclude, for instance, adaptive meshes, right? it seems like mostly I see jobs with no obvious regular structure to their communication. From plegresl at gmail.com Thu Feb 25 22:25:22 2010 From: plegresl at gmail.com (Patrick LeGresley) Date: Thu, 25 Feb 2010 22:25:22 -0800 Subject: [Beowulf] Computational / GPGPU Engineer at Life Technologies Message-ID: <1423F812-0234-4CB2-9324-483EF3358BBA@gmail.com> We have an opening in our genetic systems division for someone with HPC and especially GPU experience. The requisition number is 2793BR and you can get to the jobsite here: http://www.lifetechnologies.com/careers.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.walsh at comcast.net Fri Feb 26 09:36:33 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 26 Feb 2010 17:36:33 +0000 (UTC) Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: Message-ID: <265537950.7749891267205793764.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Mark Hahn wrote: >> Doesn't this assume worst case all-to-all type communication >> patterns. > >I'm assuming random point-to-point communication, actually. A sub-case of all-to-all (possibly all-to-all). So you are assuming random point-to-point is a common pattern in HPC ... mmm ... I would call it a worse case pattern, something more typical of graph searching codes like they run at the NSA. Sure a high radix switch (or better yet a global memory address space, Cray X1E) is good and designed for this worst-case, but not sure this is the common case data reference pattern in HPC ... if it were they would be selling more global memory systems at Cray and SGI (not just to the NSA). There you might also want a machine like the Cray XMT where the memory is flat and stalled threads can be switched out for another thread. >> If you are just trading ghost cell data with your neighbors >> and you have placed your job smartly on the torus the fan out >> advantage mentioned is irrelevant. No? > >if your comms are nearest-neighbor, then yes, a nearest-neighbor >fabric is your friend ;) I think that if you look at the HPC space globally there is still a lot of locality that you can rely on. Familiar with the "7 dwarves" paper from Berkeley? >how often does that actually happen? to work out so neatly would >preclude, for instance, adaptive meshes, right? it seems like mostly >I see jobs with no obvious regular structure to their communication. Really ... must be doing a lot of turbulent flow simulations with shedding vortices, crash simulations with self-penetrating meshes ... tough stuff for your average cluster or even your above average cluster. Even AMR codes usually attempt to discover new neighbors and localize them. Not disrespecting switches, but they are in a sense designed for worse case scenarios (the design asserts that "there are no neighborhoods") ... a torus design appeals to the middle ground were locality is not banished. rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From kus at free.net Fri Feb 26 10:29:12 2010 From: kus at free.net (Mikhail Kuzminsky) Date: Fri, 26 Feb 2010 21:29:12 +0300 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: Message-ID: In message from Mark Hahn (Thu, 25 Feb 2010 16:53:59 -0500 (EST)): >BW seems like a SeaStar2+ advantage ... >(2or 3 outgoing 9.6 GB links versus 18 >outgoing 4 GB/s links). Really SeaStar2+ BW advantage isn't such expressive :-) 1) 9.6 GB/s is peak value, sustained is 6 GB/s 2) And this values are for bisectional bandwidth, therefore, as I understand, it's only 3 GB/s for transmission of one message. Mikhail From richard.walsh at comcast.net Fri Feb 26 11:16:07 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Fri, 26 Feb 2010 19:16:07 +0000 (UTC) Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: <4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com> Message-ID: <187918651.7793881267211767624.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Larry Stewart wrote: >Designing the communications network for this worst-case pattern has a >number of benefits: > > * it makes the machine less sensitive to the actual communications pattern > * it makes performance less variable run-to-run, when the job controller > chooses different subsets of the system I agree with this and pretty much all of your other comments, but wanted to make the point that a worst-case, hardware-only solution is not required or necessarily where all of the research and development effort should be placed for HPC as a whole. And let's not forgot that unless they are supported by some coincidental volume requirement in another non-HPC market, they will cost more (sometimes a lot). If worst-case hardware solutions were required then clusters would not have pushed out their HPC predecessors, and novel high-end designs would not find it so hard to break into the market. Lower cost hardware solutions often stimulate the more software-intelligent use of the additional resources that come along for the ride. With clusters you paid less for interconnects, memory interfaces, and packaged software, and got to spend the savings on more memory, more memory bandwidth (aggregate), and more processing power. This in turn had an effect on the problems tackled, weak scaling an application was an approach to use the memory while managing the impact of a cheaper interconnect. So, yes let's try to banish latency with cool state-of-the-art interconnects engineered for worst-case, not common-case, scenarios (we have been hearing about the benefits of high radix switches), but remember that interconnect cost and data locality and partitioning will always matter and may make the worse-case interconnect unnecessary >There's a paper in the IBM Journal of Research and Development about this, >they wound up using simulated annealing to find good placement on the most >regular machine around, because the "obvious" assignments weren't optimal. Can you point me at this paper ... sounds very interesting ... ?? >Personally, I believe our thinking about interconnects has been poisoned by thinking >that NICs are I/O devices. We would be better off if they were coprocessors. Threads >should be able to send messages by writing to registers, and arriving packets should >activate a hyperthread that has full core capabilities for acting on them, and with the >ability to interact coherently with the memory hierarchy from the same end as other >processors. We had started kicking this around for the SiCortex gen-3 chip, but were >overtaken by events. Yes to all this ... now that everyone has made the memory controller an integral part of the processor. We can move on to the NIC ... ;-) ... rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: From Shainer at mellanox.com Fri Feb 26 17:00:37 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Fri, 26 Feb 2010 17:00:37 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? References: Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02663413@mtiexch01.mti.com> > > > BTW, is Cray SeaStar2+ better than IB - for nodes w/many cores ? > > BW seems like a SeaStar2+ advantage, though I haven't seen any > discussion of latency or message rate. > SeaStar2+ BW to the node is 6.5GB bi-dir, similar to IB QDR 4x port BW through PCIe Gen2. SeaStar BW to the network is 9.6GB bi-dir, little bit higher than the 8GB IB BW between switches, but less than IB 12x port which deliver 24GB/s. I have not see latency numbers out there as well, but some indications showed that IB latency is lower than SeaStar. > > And I didn't see latencies comparison for SeaStar vs IB. > > my guess is that for short routes, seastar is competitive, > but can fall behind on large machines (long routes). > > regardless of how tight the seastar per-hop latency is, > IB has 2.5x the per-hop fanout (2or 3 outgoing 9.6 GB links > versus 18 outgoing 4 GB/s links). higher radix means an > advantage that increases with size. From lindahl at pbm.com Fri Feb 26 17:31:20 2010 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 26 Feb 2010 17:31:20 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02663413@mtiexch01.mti.com> References: <9FA59C95FFCBB34EA5E42C1A8573784F02663413@mtiexch01.mti.com> Message-ID: <20100227013120.GI13186@bx9.net> On Fri, Feb 26, 2010 at 05:00:37PM -0800, Gilad Shainer wrote: > I have not see latency numbers out there as well, > but some indications showed that IB latency is lower than SeaStar. I'm surprised that you aren't familiar with http://icl.cs.utk.edu/hpcc/hpcc_results.cgi It has plenty of latency results. I think this latency measurement could be improved, but at least it involves all the cores on a node, unlike the 1-core-per-node number usually quoted. -- greg p.s. I'm still waiting for confirmation of your offer to resign if you're wrong about QLogic's message rate. From Shainer at mellanox.com Fri Feb 26 22:43:22 2010 From: Shainer at mellanox.com (Gilad Shainer) Date: Fri, 26 Feb 2010 22:43:22 -0800 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? References: <9FA59C95FFCBB34EA5E42C1A8573784F02663413@mtiexch01.mti.com> <20100227013120.GI13186@bx9.net> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02663423@mtiexch01.mti.com> > > I have not see latency numbers out there as well, > > but some indications showed that IB latency is lower than SeaStar. > > I'm surprised that you aren't familiar with > > http://icl.cs.utk.edu/hpcc/hpcc_results.cgi > > It has plenty of latency results. I think this latency measurement > could be improved, but at least it involves all the cores on a node, > unlike the 1-core-per-node number usually quoted. Typical answer. Why not to compare different settings and create claims from that. If you continue to search you might find a paper or two which will give the node to node latency, but since it is not on Cray web site, I will leave this out of my response. > > -- greg > > p.s. I'm still waiting for confirmation of your offer to resign if > you're wrong about QLogic's message rate. > p.s I am still waiting for you to grow up but it more and more seems like an impossible mission. I can assure you that it does not hurt, so don't be afraid. It could have helped you in the past, you know... From richard.walsh at comcast.net Sat Feb 27 20:17:17 2010 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sun, 28 Feb 2010 04:17:17 +0000 (UTC) Subject: [Beowulf] Q: IB message rate & large core counts (per node)? In-Reply-To: <532126883.8209551267330513688.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <2043745298.8209961267330637770.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> All, In case anyone else has trouble keeping the numbers straight between IB (SDR, DDR, QDR, EDR) and PCI-Express, (1.0, 2.0, 30) here are a couple of tables in Excel I just worked up to help me remember. If anyone finds errors in it please let me know so that I can fix them. Regards, rbw -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: IBvsPCIE.xlsx Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet Size: 40695 bytes Desc: not available URL: From lstewart2 at gmail.com Fri Feb 26 10:20:49 2010 From: lstewart2 at gmail.com (Lawrence Stewart) Date: Fri, 26 Feb 2010 13:20:49 -0500 Subject: [Beowulf] Q: IB message rate & large core counts (per node) ? In-Reply-To: <265537950.7749891267205793764.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> References: <265537950.7749891267205793764.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> Message-ID: <4A1DA2D8-E75F-46C0-9CDA-64BD204A0CCA@gmail.com> On Feb 26, 2010, at 12:36 PM, richard.walsh at comcast.net wrote: > > Mark Hahn wrote: > > >> Doesn't this assume worst case all-to-all type communication > >> patterns. > > > >I'm assuming random point-to-point communication, actually. > > A sub-case of all-to-all (possibly all-to-all). So you are assuming > random point-to-point is a common pattern in HPC ... mmm ... I > would call it a worse case pattern, something more typical of > graph searching codes like they run at the NSA. Sure a high > radix switch (or better yet a global memory address space, Cray > X1E) is good and designed for this worst-case, but not sure this > is the common case data reference pattern in HPC ... if it were > they would be selling more global memory systems at Cray and > SGI (not just to the NSA). Designing the communications network for this worst-case pattern has a number of benefits: * it makes the machine less sensitive to the actual communications pattern * it makes performance less variable run-to-run, when the job controller chooses different subsets of the system > > There you might also want a machine like the Cray XMT where > the memory is flat and stalled threads can be switched out for > another thread. > > >> If you are just trading ghost cell data with your neighbors > >> and you have placed your job smartly on the torus the fan out > >> advantage mentioned is irrelevant. No? Smart placement is a lot harder than it appears. * The actual communications pattern often doesn't match preconceptions * Communications from concurrently running applications can interfere. There's a paper in the IBM Journal of Research and Development about this, they wound up using simulated annealing to find good placement on the most regular machine around, because the "obvious" assignments weren't optimal. ... In addition to this stuff, the quality of the interconnect has other effects * a fast, low latency interconnect lets the application scale effectively to larger numbers of nodes before performance rolls off * an interconnect with low latency short messages provides a decent base for PGAS languages like UPC and CoArray Fortran or for lightweight communications APIs like SHMEM or active messages. Personally, I believe our thinking about interconnects has been poisoned by thinking that NICs are I/O devices. We would be better off if they were coprocessors. Threads should be able to send messages by writing to registers, and arriving packets should activate a hyperthread that has full core capabilities for acting on them, and with the ability to interact coherently with the memory hierarchy from the same end as other processors. We had started kicking this around for the SiCortex gen-3 chip, but were overtaken by events. -Larry -------------- next part -------------- An HTML attachment was scrubbed... URL: