From lindahl at pbm.com Tue Sep 1 08:35:26 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 1 Sep 2009 08:35:26 -0700 Subject: [Beowulf] petabyte for $117k Message-ID: <20090901153526.GC4682@bx9.net> http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ Kinda neat -- how does the price compare to the various 48-drive systems available? -- g From laytonjb at att.net Tue Sep 1 08:52:58 2009 From: laytonjb at att.net (Jeff Layton) Date: Tue, 1 Sep 2009 08:52:58 -0700 (PDT) Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <63149.9536.qm@web80708.mail.mud.yahoo.com> I saw that as well (storagemojo blog). Looks interesting, but I need to read the pdf since there are some pieces I'm missing. Cool concept. There are some others like it as well - low-performance storage but lots of capacity ("cheap and deep"). I think it can make a lot of sense in many situations (just my 2 cents). Jeff From rpnabar at gmail.com Tue Sep 1 09:03:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 1 Sep 2009 11:03:52 -0500 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract Message-ID: We are planning a cluster expansion that is quite a bit larger than the ones I've handled previously. I was wondering about the vendor terms and whether there is anything specific that would be good to request from them. In the past we've stuck to standard vendor contracts; something like: "1 year warranty; 2 year extended warranty. Next Business Day on site." But, in view of the larger size of the current order, I wanted to be sure I covered all my bases. Are there any other terms that people try to add or negotiate into the contract? We assemble our own cluster in-house, so there is no "performance guarantee" etc. from the vendor side on the OS, packages, etc. In the past I've heard rumors of other arrangements: e.g. having the vendor stock some spares on-site, delayed incremental payment terms subject to conditions on performance, or pre-certifying a person on our side to bypass the routine helpdesk procedures for warranties. Are such non-standard modifications common? Anybody want to jog my memory about items that I might want to add? Of course, the vendor may not actually agree to them all, but I'm just looking for items to put on the discussion table.
-- Rahul From eugen at leitl.org Tue Sep 1 09:10:29 2009 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 1 Sep 2009 18:10:29 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <20090901161029.GM4508@leitl.org> On Tue, Sep 01, 2009 at 08:35:26AM -0700, Greg Lindahl wrote: > http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ > > Kinda neat -- how does the price compare to the various 48-drive > systems available? "Seagate ST31500341AS 1.5TB Barracuda 7200.11 SATA 3Gb/s 3.5″" Aargh! Those should definitely be replaced with 2 TByte WD RE4 drives. Today I've built a 32 TByte raw storage Supermicro box with X8DDAi (dual-socket Nehalem, 24 GByte RAM, IPMI), two LSI SAS3081E-R, and OpenSolaris sees all (WD2002FYPS) drives so far (the board refuses to boot from DVD when more than 12 drives are in, though, probably due to some BIOS brain damage, so you have to manually build a raidz-2 with all 16 drives in it once Solaris has booted up). The drives are about 3170 EUR sans VAT total for all 16, the box itself around 3000 EUR sans VAT. I presume Linux with RAID 6 would work too (haven't checked yet), and if you need more you can use a cluster FS. Maybe not as cheap as a Backblaze, but off-the-shelf (BTO) and you get what you pay for. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From landman at scalableinformatics.com Tue Sep 1 09:23:56 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 01 Sep 2009 12:23:56 -0400 Subject: [Beowulf] petabyte for $117k In-Reply-To: <63149.9536.qm@web80708.mail.mud.yahoo.com> References: <20090901153526.GC4682@bx9.net> <63149.9536.qm@web80708.mail.mud.yahoo.com> Message-ID: <4A9D4A9C.6010300@scalableinformatics.com> Jeff Layton wrote: > I saw that as well (storagemojo blog). Looks interesting, but I need to > read the pdf since there are some pieces I'm missing. > > Cool concept. There are some others like it as well - low-performance > storage but lots of capacity ("cheap and deep"). I think it can make > a lot of sense in many situations (just my 2 cents). Cool! We've been doing something like this (concept) for a while with Delta-V (http://scalableinformatics.com/delta-v), though we add higher performance layers atop this, and a number of automation bits, not to mention iSCSI, SRP, AoE, iSER, ..., NFS/SMB/... They go the minimal-price route on everything: desktop drives, non-ECC memory, desktop motherboard, etc. Low performance, but probably good for cold data of large size. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From bill at cse.ucdavis.edu Tue Sep 1 16:28:10 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 01 Sep 2009 16:28:10 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <4A9DAE0A.1010500@cse.ucdavis.edu> Greg Lindahl wrote: > http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ > > Kinda neat -- how does the price compare to the various 48-drive > systems available?
I'm very curious to hear how they are in production. I've had vibration of large sets of drives basically render the consumer drives useless. Timeouts, highly variable performance, drives constantly dropping out of raids. It became especially fun when the heavy I/O of a rebuild knocks additional drives out of the array. I'd also worry that running the consumer drives well out of spec (both in duty cycle and vibration) might significantly shorten their lives. Are the 7200.11 1.5TB seagate's particularly vibration resistant? Maybe those $0.23 nylon standoffs work better than I'd expect. From eugen at leitl.org Wed Sep 2 01:23:46 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 2 Sep 2009 10:23:46 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9DAE0A.1010500@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> Message-ID: <20090902082346.GN4508@leitl.org> On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > I'm very curious to hear how they are in production. I've had vibration of My thoughts exactly. > large sets of drives basically render the consumer drives useless. Timeouts, > highly variable performance, drives constantly dropping out of raids. It > became especially fun when the heavy I/O of a rebuild knocks additional drives > out of the array. Also my experience down to a T. > I'd also worry that running the consumer drives well out of spec (both in duty > cycle and vibration) might significantly shorten their lives. > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? No. They're awful, as the entire 7200.11 line (I've had failures in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably small drive populations). > Maybe those $0.23 nylon standoffs work better than I'd expect. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From bill at cse.ucdavis.edu Wed Sep 2 02:10:26 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 02 Sep 2009 02:10:26 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <4A9E3682.4050006@cse.ucdavis.edu> Eugen Leitl wrote: > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > >> I'm very curious to hear how they are in production. I've had vibration of > > My thoughts exactly. The lid screws down to apply pressure to a piece of foam. Foam presses down on 45 drives. 5 drives (1.4 lb each) sit on each port multipliers. 6 nylon mounts support each multiplier supporting 7 pounds of drives. Seems like the damping from the nylon mounts would be minimal under that much pressure. So on the bottom of the case you have 63 pounds of drives, a significant fraction of which is rotating mass. I wonder if the port multipliers are really designed to actually support the drives. Doesn't seem like the SATA power/data connects I've seen are designed for that. Not to mention it's hard to imagine the typical flexible thin sheet metal lid applying even pressure to 45 drives through the foam. Seems like most of the pressure would be on the drives on the outside edge, leaving the inside drives relatively undamped. My experience is that nylon mounts don't help much. 
Sure poor manufacturing tolerances, and near zero load/tension often prevent things like fans from tightly coupling with a chassis. But to decouple drive vibration from a chassis seems to require something much more aggressive. Something like a very soft/gooey rubber with a fair bit of travel and give under minimal pressure (4-6 ounces). >> large sets of drives basically render the consumer drives useless. Timeouts, >> highly variable performance, drives constantly dropping out of raids. It >> became especially fun when the heavy I/O of a rebuild knocks additional drives >> out of the array. > > Also my experience down to a T. Strangely their design works out to 0.11 per GB. I tweaked their design to my liking. I upgraded to the $200 baracuda 2TB drive, 6GB of DDR3-1333 ECC memory (from 4GB ddr2), ECC capable motherboard (with dual gigE), and a Nehalem based xeon (ECC capable). The result was $0.13 per GB. Seems like a rather worthwhile investment from a reliability perspective, let alone performance. Anyone familiar with what the sun thumper does to minimize vibration? >> I'd also worry that running the consumer drives well out of spec (both in duty >> cycle and vibration) might significantly shorten their lives. >> >> Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). A rather scary 26% of over 1500 reviews on newegg give that seagate 1 star out of 5. The consumer 1TB WD drive has 1100 reviews and 74% are 5 stars and only 8% are 1 star. From mm at yuhu.biz Wed Sep 2 02:44:12 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed, 2 Sep 2009 12:44:12 +0300 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <200909021244.12736.mm@yuhu.biz> On Wednesday 02 September 2009 11:23:46 Eugen Leitl wrote: > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > > I'm very curious to hear how they are in production. I've had vibration > > of > > My thoughts exactly. > > > large sets of drives basically render the consumer drives useless. > > Timeouts, highly variable performance, drives constantly dropping out of > > raids. It became especially fun when the heavy I/O of a rebuild knocks > > additional drives out of the array. > > Also my experience down to a T. > > > I'd also worry that running the consumer drives well out of spec (both in > > duty cycle and vibration) might significantly shorten their lives. > > > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). I have the same experience with those drives. Not really reliable. > > > Maybe those $0.23 nylon standoffs work better than I'd expect. 
-- Best regards, Marian Marinov From carsten.aulbert at aei.mpg.de Wed Sep 2 02:50:56 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed, 2 Sep 2009 11:50:56 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9E3682.4050006@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <20090902082346.GN4508@leitl.org> <4A9E3682.4050006@cse.ucdavis.edu> Message-ID: <200909021150.56529.carsten.aulbert@aei.mpg.de> On Wednesday 02 September 2009 11:10:26 Bill Broadley wrote: > > Anyone familiar with what the sun thumper does to minimize vibration? Each disk is contained in a cage and this one is secured per slot. Pretty standard layout, but then I've never really checked if there were vibrational "hot" spots in these boxes. But there is not much rubber inside the box IIRC. Carsten From landman at scalableinformatics.com Wed Sep 2 04:28:24 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 02 Sep 2009 07:28:24 -0400 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9E3682.4050006@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> <4A9E3682.4050006@cse.ucdavis.edu> Message-ID: <4A9E56D8.8070103@scalableinformatics.com> Bill Broadley wrote: > The lid screws down to apply pressure to a piece of foam. Foam presses down > on 45 drives. 5 drives (1.4 lb each) sit on each port multipliers. 6 nylon > mounts support each multiplier supporting 7 pounds of drives. Seems like the > damping from the nylon mounts would be minimal under that much pressure. So > on the bottom of the case you have 63 pounds of drives, a significant fraction > of which is rotating mass. ... Which is being driven at multiple points by a 120Hz signal (7200RPM). It seems like a basic physics calculation to get the resulting eigen-modes and eigen-frequencies. Looking at the design, I was concerned about the nylon standoffs (high frequency coupling, including octaves of 120Hz) coupling enough vibration into the units. > I wonder if the port multipliers are really designed to actually support the > drives. Doesn't seem like the SATA power/data connects I've seen are designed > for that. Technically, they should not be used for structural support. For supporting cables? Sure. > Not to mention it's hard to imagine the typical flexible thin sheet metal lid > applying even pressure to 45 drives through the foam. Seems like most of the > pressure would be on the drives on the outside edge, leaving the inside drives > relatively undamped. > > My experience is that nylon mounts don't help much. Sure poor manufacturing > tolerances, and near zero load/tension often prevent things like fans from > tightly coupling with a chassis. But to decouple drive vibration from a > chassis seems to require something much more aggressive. Something like a > very soft/gooey rubber with a fair bit of travel and give under minimal > pressure (4-6 ounces). Yeah. Looking at this unit, the big issue looks to be vibration. In which case, you probably want better (more vibration tolerant) drives than the ones spec'ed. And a better mounting design. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From amjad11 at gmail.com Wed Sep 2 05:14:15 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 2 Sep 2009 17:14:15 +0500 Subject: [Beowulf] CPU shifts?? and time problems Message-ID: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Hi All, I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on ROCKS-5 with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel on it. Please reply following: 1) When I run my serial code on the dual-core head node (or parallel code with -np 1); it gives results in about 2 minutes. What I observe is that "System Monitor" application show that some times CPU1 become busy 80+% and CPU2 around 10% busy. After some time CPU1 gets share around 10% busy while the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue till end. Why this is so? Does this busy-ness shifts/swaping harms performance/speed? 2) When I run my parallel code with -np 2 on the dual-core headnode only; it gives results in about 1 minute. What I observe is that "System Monitor" application show that all the time CPU1 and CPU2 remain busy 100%. 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core headnode only; it gives results in about 2 and 3.20 minutes respectively. What I observe is that "System Monitor" application show that all the time CPU1 and CPU2 remain busy 100%. 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I observe is that "System Monitor" application show CPU usage fluctuations somewhat as in point number 1 above (CPU1 remains dominant busy most of the time), in case of -np 4. Does this means that an MPI-process is shifting to different cores/cpus/nodes? Does these shiftings harm performance/speed? 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare to -np 2 on the headnode? Obviously its due to communication overhead! but how to get better performance--lesser run time? My code is not too complicated only 2 values are sent and 2 values are received by each process after each stage. Regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Wed Sep 2 06:57:38 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 2 Sep 2009 09:57:38 -0400 (EDT) Subject: [Beowulf] CPU shifts?? and time problems In-Reply-To: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> References: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Message-ID: On Wed, 2 Sep 2009, amjad ali wrote: > Hi All, > I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on ROCKS-5 > with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel > on it. > Please reply following: > > 1) When I run my serial code on the dual-core head node (or parallel code > with -np 1); it gives results in about 2 minutes. What I observe is that > "System Monitor" application show that some times CPU1 become busy 80+% and > CPU2 around 10% busy. After some time CPU1 gets share around 10% busy while > the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue > till end. Why this is so? Does this busy-ness shifts/swaping harms > performance/speed? the kernel decides where to run processes based on demand. 
if the machine were otherwise idle, your process would stay on the same CPU. depending on the particular kernel release, the kernel uses various heuristics to decide how much to "resist" moving the process among cpus. the cost of moving among cpus depends entirely on how much your code depends on the resources tied to one cpu or the other. for instance, if your code has a very small memory footprint, moving will have only trivial cost. if your process has a larger working set size, but fits in onchip cache, it may be relatively expensive to move to a different processor in the system that doesn't share cache. consider a 6M L3 in a 2-socket system, for instance: the inter-socket bandwidth will be approximately memory speed, which on a core2 system is something like 6 GB/s. so migration will incur about a 1ms overhead (possibly somewhat hidden by concurrency.) in your case (if I have the processor spec right), you have 2 cores sharing a single 4M L2. L1 cache is unshared, but trivial in size, so migration cost should be considered near-zero. the numactl command lets you bind a cpu to a processor if you wish. this is normally valuable on systems with more complex topologies, such as combinations of shared and unshared caches, especially when divided over multiple sockets, and with NUMA memory (such as opterons and nehalems.) > 2) When I run my parallel code with -np 2 on the dual-core headnode only; > it gives results in about 1 minute. What I observe is that "System Monitor" > application show that all the time CPU1 and CPU2 remain busy 100%. no problem there. normally, though, it's best to _not_ run extraneous processes, and instead only look at the elapsed time that the job takes to run. that is the metric that you should care about. > 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core > headnode only; it gives results in about 2 and 3.20 minutes respectively. > What I observe is that "System Monitor" application show that all the time > CPU1 and CPU2 remain busy 100%. sure. with 4 cpus, you're overloading the cpus, but they timeslice fairly efficiently, so you don't lose. once you get to 8 cpus, you lose because the overcommitted processes start interfering (probably their working set is blowing the L2 cache.) > 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 > cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I well, then I think it's a bit hyperbolic to call it a parallel code ;) seriously, all you've learned here is that your interconnect is causing your code to not scale. the problem could be your code or the interconnect. > observe is that "System Monitor" application show CPU usage fluctuations > somewhat as in point number 1 above (CPU1 remains dominant busy most of the > time), in case of -np 4. Does this means that an MPI-process is shifting to > different cores/cpus/nodes? Does these shiftings harm performance/speed? MPI does not shift anything. the kernel may rebalance runnable processes within a single node, but not across nodes. it's difficult to tell how much your monitoring is harming the calculation or perturbing the load-balance. > 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare to > -np 2 on the headnode? Obviously its due to communication overhead! but how > to get better performance--lesser run time? My code is not too complicated > only 2 values are sent and 2 values are received by each process after each > stage. then do more work between sends and receives. 
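for a 1-D halo exchange the usual shape of that overlap is roughly the sketch below. this is illustrative only -- the array name u, the extent n, the 3-point stencil and the neighbour ranks are assumptions for the example, not your actual code:

   ! sketch: overlap interior work with the halo exchange.
   ! u(0:n+1) carries one ghost cell per side; left/right are the
   ! neighbour ranks (MPI_PROC_NULL at the ends of the domain).
   SUBROUTINE halo_step(u, n, left, right)
     USE mpi
     IMPLICIT NONE
     INTEGER, INTENT(IN) :: n, left, right
     REAL(KIND=8), INTENT(INOUT) :: u(0:n+1)
     INTEGER :: req(4), ierr

     ! post all four transfers up front
     CALL MPI_IRECV(u(0),   1, MPI_REAL8, left,  10, MPI_COMM_WORLD, req(1), ierr)
     CALL MPI_IRECV(u(n+1), 1, MPI_REAL8, right, 20, MPI_COMM_WORLD, req(2), ierr)
     CALL MPI_ISEND(u(1),   1, MPI_REAL8, left,  20, MPI_COMM_WORLD, req(3), ierr)
     CALL MPI_ISEND(u(n),   1, MPI_REAL8, right, 10, MPI_COMM_WORLD, req(4), ierr)

     ! ... update the interior points 2..n-1 here: they need no ghost
     ! data, so this arithmetic runs while the messages are in flight ...

     CALL MPI_WAITALL(4, req, MPI_STATUSES_IGNORE, ierr)

     ! ... only now touch points 1 and n, which use u(0) and u(n+1) ...
   END SUBROUTINE halo_step

the more arithmetic you can put between the posts and the waitall, the less the gigabit latency shows up in the elapsed time.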
hard to say without knowing exactly what the communication pattern is. I think you should first validate your cluster to see that the Gb is running as fast as expected. actually, that everything is running right. that said, Gb is almost not a cluster interconnect at all, since it's so much slower than the main competitors (IB mostly, to some extent 10GE). fatter nodes (dual-socket quad-core, for instance) would at least decrease the effect of slow interconnect. you might also try instaling openMX, which is an ethernet protocol optimized for MPI (rather than your current MPI which is presumably layered on top of the usual TCP stack, which is optimized for wide-area streaming transfers.) heck, you can probably obtain some speedup by tweaking your coalesce settings via ethtool. From kilian.cavalotti.work at gmail.com Wed Sep 2 08:07:14 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 2 Sep 2009 17:07:14 +0200 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract In-Reply-To: References: Message-ID: Hi Rahul, On Tue, Sep 1, 2009 at 6:03 PM, Rahul Nabar wrote: > In the past we've stuck to standard vendor contracts; something like: > "1 year warranty; 2 year extended warranty. Next Business Day on > site." You could also consider H+4 on-site intervention for critical parts, like switches, master nodes, or whatever piece of hardware your whole cluster operation depends on. > In the past I've heard rumors of other arrangements: e.g. having the > vendor stock some spares on-site, This one is pretty common, I think. And a very good idea generally speaking. You can ask for a quote including a couple hard drives of each type, a few NICs/HBAs and memory DIMMs. You'll probably pay for those spares anyway, but having them handy will be a nice time saver in case of an emergency, rather than having to wait for the delivery of a replacement part. As long as you make sure to resplenish your spares stock as it is being used: you call support to have them ship a replacement, as you would normally do, although you already have the part. Cheers, -- Kilian From coutinho at dcc.ufmg.br Wed Sep 2 10:43:46 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 2 Sep 2009 14:43:46 -0300 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: 2009/9/2 Eugen Leitl > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > > > I'm very curious to hear how they are in production. I've had vibration > of > > My thoughts exactly. > > > large sets of drives basically render the consumer drives useless. > Timeouts, > > highly variable performance, drives constantly dropping out of raids. It > > became especially fun when the heavy I/O of a rebuild knocks additional > drives > > out of the array. > > Also my experience down to a T. > > > I'd also worry that running the consumer drives well out of spec (both in > duty > > cycle and vibration) might significantly shorten their lives. > > > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). > > According to this site, the main difference between Seagate desktop and ES series is that the latter are more vibration resistant. 
http://techreport.com/articles.x/10748 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Wed Sep 2 13:02:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 2 Sep 2009 13:02:57 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <20090902200257.GF7504@bx9.net> On Wed, Sep 02, 2009 at 02:43:46PM -0300, Bruno Coutinho wrote: > According to this site, the main difference between Seagate desktop and ES > series is that the latter are more vibration resistant. > http://techreport.com/articles.x/10748 This is interesting -- a non-firmware difference between normal and "enterprise" disks. The Barracuda.ES2 datasheet confirms the 12.5 rad/sec^2 number, but the 5.5 rad/sec^2 number is much harder to find; this doc has it: http://www.seagate.com/docs/pdf/whitepaper/mb578_7200.pdf but you'll have to look at it cached. As for people's vibrations comments: they own a bunch of them and they work... but that is only a single point of evidence and not a history of working with a variety of disks models over time. The guy said he could write a whole post about vibration; I think it would be very interesting. -- greg From bill at cse.ucdavis.edu Wed Sep 2 13:28:18 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 02 Sep 2009 13:28:18 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902200257.GF7504@bx9.net> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> <20090902200257.GF7504@bx9.net> Message-ID: <4A9ED562.8080609@cse.ucdavis.edu> Greg Lindahl wrote: > As for people's vibrations comments: they own a bunch of them and they > work... For now, I've seen similar setups last 6-12 months before a drive drops, then a rebuild triggers drop #2. > but that is only a single point of evidence and not a history > of working with a variety of disks models over time. The guy said he > could write a whole post about vibration; I think it would be very > interesting. Indeed, very. If they were significantly cheaper than a better design I could see the justification. But for $0.11 vs $0.13 per GB I don't see it. Certainly as a potential customer for N copies of my data I'd certainly rather pay $0.13 + overhead for reliable (ecc + raid edition drives) storage then $0.11 + overhead for unreliable storage (no ecc and consumer drives) for my precious bits. It's especially scary since they don't seem to have any replication, or at least that replication is incompatible with their statement "In rough terms, every time one of our customers buys a hard drive, Backblaze needs another hard drive." From rpnabar at gmail.com Wed Sep 2 14:17:36 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 16:17:36 -0500 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 10:07 AM, Kilian CAVALOTTI wrote: > > You could also consider H+4 on-site intervention for critical parts, > like switches, master nodes, or whatever piece of hardware your whole > cluster operation depends on. Good idea! I will do that for some of the critical, non-replicated items. > This one is pretty common, I think. And a very good idea generally > speaking. You can ask for a quote including a couple hard drives of > each type, a few NICs/HBAs and memory DIMMs. 
You'll probably pay for > those spares anyway, but having them handy will be a nice time saver > in case of an emergency, rather than having to wait for the delivery > of a replacement part. As long as you make sure to resplenish your > spares stock as it is being used: you call support to have them ship a > replacement, as you would normally do, although you already have the > part. Thanks for those comments. Will help me get a better setup in the contract. -- Rahul From rpnabar at gmail.com Wed Sep 2 14:25:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 16:25:00 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes Message-ID: What are good choices for a switch in a Beouwulf setup currently? The last time we went in for a Dell PowerConnect and later realized that this was pretty basic. I only have gigabit on the compute nodes. So no Infiniband / Myrinet etc. issues. The point is that I will have about 300 compute nodes. Should I go for one large switch or several stacked ones? In the past I had resorted to just interconnecting two or more 48 port switches with multiple ethernet cables but this is quite crude I believe. Of course, the smaller switches tend to be cheaper so in the past it was making more sense to hook them up together even at the price of taking a performance hit. The main traffic sources are MPI and NFS. (NFS is quite inefficient so this time around I might play with another FS but still something that allows global cross mounts from ~300 compute nodes) There is a variety of codes we run; some latency sensitive and others bandwidth sensitive. Finally, what are the switch parameters I ought to be comparing. If 300 eth ports are chattering at once do I look at the max rated switching capacity or something similar? -- Rahul From hahn at mcmaster.ca Wed Sep 2 15:41:07 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 2 Sep 2009 18:41:07 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: > allows global cross mounts from ~300 compute nodes) There is a variety > of codes we run; some latency sensitive and others bandwidth > sensitive. if you're sensitive either way, you're going to be unhappy with Gb. IMO, you'd be best to configure your scheduler to never spread an MPI job across switches, and then just match the backbone to the aggregate IO bandwidth your NFS storage can support. something like 10G uplinks from 48pt switches would probably work well. From rpnabar at gmail.com Wed Sep 2 20:29:07 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 22:29:07 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 5:41 PM, Mark Hahn wrote: >> allows global cross mounts from ~300 compute nodes) There is a variety >> of codes we run; some latency sensitive and others bandwidth >> sensitive. > > if you're sensitive either way, you're going to be unhappy with Gb. I am still testing sensitivity but I suspect I am sensitive either way. > IMO, you'd be best to configure your scheduler to never spread an MPI > job across switches, Good idea. I was thinking about it. Might need to tweak my PBS scheduler. >and then just match the backbone to the aggregate IO > bandwidth your NFS storage can support. That brings me to another important question. Any hints on speccing the head-node? 
Especially the kind of storage I put in on the head node. I need around 1 Terabyte of storage. In the past I've uses RAID5+SAS in the server. Mostly for running jobs that access their I/O via files stored centrally. For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I boost the RAM up? Or any other comments. It is tricky to spec the central node. Or is it more advisable to go for storage-box external to the server for NFS-stores and then figure out a fast way of connecting it to the server. Fiber perhaps? -- Rahul From amjad11 at gmail.com Wed Sep 2 20:33:38 2009 From: amjad11 at gmail.com (amjad ali) Date: Thu, 3 Sep 2009 08:33:38 +0500 Subject: [Beowulf] CPU shifts?? and time problems In-Reply-To: References: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Message-ID: <428810f20909022033t24c3c421na25d13ec1b2c0697@mail.gmail.com> Hi, please see below On Wed, Sep 2, 2009 at 6:57 PM, Mark Hahn wrote: > On Wed, 2 Sep 2009, amjad ali wrote: > > Hi All, >> I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on >> ROCKS-5 >> with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel >> on it. >> Please reply following: >> >> 1) When I run my serial code on the dual-core head node (or parallel code >> with -np 1); it gives results in about 2 minutes. What I observe is that >> "System Monitor" application show that some times CPU1 become busy 80+% >> and >> CPU2 around 10% busy. After some time CPU1 gets share around 10% busy >> while >> the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue >> till end. Why this is so? Does this busy-ness shifts/swaping harms >> performance/speed? >> > > the kernel decides where to run processes based on demand. if the machine > were otherwise idle, your process would stay on the same CPU. depending on > the particular kernel release, the kernel uses various heuristics to decide > how much to "resist" moving the process among cpus. > > the cost of moving among cpus depends entirely on how much your code > depends > on the resources tied to one cpu or the other. for instance, if your code > has a very small memory footprint, moving will have only trivial cost. > if your process has a larger working set size, but fits in onchip cache, > it may be relatively expensive to move to a different processor in the > system that doesn't share cache. consider a 6M L3 in a 2-socket system, > for instance: the inter-socket bandwidth will be approximately memory > speed, > which on a core2 system is something like 6 GB/s. so migration will incur > about a 1ms overhead (possibly somewhat hidden by concurrency.) > > in your case (if I have the processor spec right), you have 2 cores > sharing a single 4M L2. L1 cache is unshared, but trivial in size, > so migration cost should be considered near-zero. > > the numactl command lets you bind a cpu to a processor if you wish. > this is normally valuable on systems with more complex topologies, > such as combinations of shared and unshared caches, especially when divided > over multiple sockets, and with NUMA memory (such as opterons and nehalems.) > > 2) When I run my parallel code with -np 2 on the dual-core headnode only; >> it gives results in about 1 minute. What I observe is that "System >> Monitor" >> application show that all the time CPU1 and CPU2 remain busy 100%. >> > > no problem there. normally, though, it's best to _not_ run extraneous > processes, and instead only look at the elapsed time that the job takes to > run. 
that is the metric that you should care about. > > 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core >> headnode only; it gives results in about 2 and 3.20 minutes respectively. >> What I observe is that "System Monitor" application show that all the time >> CPU1 and CPU2 remain busy 100%. >> > > sure. with 4 cpus, you're overloading the cpus, but they timeslice fairly > efficiently, so you don't lose. once you get to 8 cpus, you lose because > the overcommitted processes start interfering (probably their working set > is blowing the L2 cache.) > > 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 >> cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I >> > > well, then I think it's a bit hyperbolic to call it a parallel code ;) > seriously, all you've learned here is that your interconnect is causing > your code to not scale. the problem could be your code or the > interconnect. > > observe is that "System Monitor" application show CPU usage fluctuations >> somewhat as in point number 1 above (CPU1 remains dominant busy most of >> the >> time), in case of -np 4. Does this means that an MPI-process is shifting >> to >> different cores/cpus/nodes? Does these shiftings harm performance/speed? >> > > MPI does not shift anything. the kernel may rebalance runnable processes > within a single node, but not across nodes. it's difficult to tell how much > your monitoring is harming the calculation or perturbing the load-balance. > > 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare >> to >> -np 2 on the headnode? Obviously its due to communication overhead! but >> how >> to get better performance--lesser run time? My code is not too complicated >> only 2 values are sent and 2 values are received by each process after >> each >> stage. >> > > then do more work between sends and receives. hard to say without knowing > exactly what the communication pattern is. 
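One cheap way I can check where the time goes is to accumulate MPI_WTIME() separately around the communication calls and around the local work, and reduce the two totals at the end of the run. The helper below is only a sketch -- t_comm and t_comp are per-rank accumulators that would have to be added, not variables from the existing code:

   ! sketch: report the worst-case split between communication and
   ! computation, given two accumulators filled with MPI_WTIME() deltas
   SUBROUTINE report_times(t_comm, t_comp)
     USE mpi
     IMPLICIT NONE
     REAL(KIND=8), INTENT(IN) :: t_comm, t_comp
     REAL(KIND=8) :: t_loc(2), t_max(2)
     INTEGER :: myrank, ierr

     t_loc(1) = t_comm
     t_loc(2) = t_comp
     CALL MPI_REDUCE(t_loc, t_max, 2, MPI_REAL8, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
     CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
     IF (myrank == 0) PRINT *, 'max comm time =', t_max(1), ' max compute time =', t_max(2)
   END SUBROUTINE report_times

If the communication total dominates on the cluster but not on the single node, that would confirm the interconnect as the bottleneck.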
Here is my subroutine:

IF (myrank /= p-1) CALL MPI_ISEND(u_local(Np,E),1,MPI_REAL8,myrank+1,55,MPI_COMM_WORLD,right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(u_local(1,B),1,MPI_REAL8,myrank-1,66,MPI_COMM_WORLD,left(1), ierr)
IF (myrank /= 0)   CALL MPI_IRECV(u_left_exterior,1, MPI_REAL8, myrank-1, 55, MPI_COMM_WORLD,right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(u_right_exterior,1, MPI_REAL8, myrank+1, 66, MPI_COMM_WORLD,left(2), ierr)

u0_local=RESHAPE(u_local,(/Np*K_local/))
du0_local=RESHAPE(du_local,(/Nfp*Nfaces*K_local/))
q_local = rx_local*MATMUL(Dr,u_local)

DO I = shift*Nfp*Nfaces+1+1 , shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1
IF (myrank == 0) du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2), status, ierr)

IF (myrank /= 0) THEN
   du0_local(I) = (u0_local(vmapM_local(I))-u_left_exterior)/2.0_8
ENDIF
I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank == p-1) du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8
IF (myrank /= p-1) THEN
   du0_local(I) = (u0_local(vmapM_local(I))-u_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   du0_local(mapI) = 0.0_8
IF (myrank == p-1) du0_local(mapO) = 0.0_8

du_local = RESHAPE(du0_local,(/Nfp*Nfaces,K_local/))
q_local = q_local-MATMUL(LIFT,Fscale_local*(nx_local*du_local))

IF (myrank /= p-1) CALL MPI_ISEND(q_local(Np,E),1,MPI_REAL8,myrank+1,551,MPI_COMM_WORLD,right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(q_local(1,B),1,MPI_REAL8,myrank-1,661,MPI_COMM_WORLD,left(1), ierr)
IF (myrank /= 0)   CALL MPI_IRECV(q_left_exterior,1, MPI_REAL8, myrank-1, 551, MPI_COMM_WORLD,right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(q_right_exterior,1, MPI_REAL8, myrank+1, 661, MPI_COMM_WORLD,left(2), ierr)

q0_local=RESHAPE(q_local,(/Np*K_local/))
dq0_local=RESHAPE(dq_local,(/Nfp*Nfaces*K_local/))
rhsu_local= rx_local*MATMUL(Dr,q_local) + u_local*(u_local-a1)*(1.0_8-u_local) - v_local

DO I = shift*Nfp*Nfaces+1+1 , shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   dq0_local(I) = (q0_local(vmapM_local(I))-q0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2), status, ierr)

IF (myrank /= 0) THEN
   dq0_local(I) = (q0_local(vmapM_local(I))-q_left_exterior)/2.0_8
ENDIF
I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank /= p-1) THEN
   dq0_local(I) = (q0_local(vmapM_local(I))-q_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   dq0_local(mapI) = q0_local(vmapI)+0.225_8
IF (myrank == p-1) dq0_local(mapO) = q0_local(vmapO)

dq_local = RESHAPE(dq0_local,(/Nfp*Nfaces,K_local/))
rhsu_local= rhsu_local-MATMUL(LIFT,(Fscale_local*(nx_local*dq_local)))

END SUBROUTINE
=====================================================================
Is it sufficiently good, or are there some serious problems with the communication pattern? The arrays here are not very big, because Nfp*Nfaces = 2 and K_local <= 30.

> I think you should first validate your cluster to see that the Gb is running as fast as expected. actually, that everything is running right.
> that said, Gb is almost not a cluster interconnect at all, since it's so > much slower than the main competitors (IB mostly, to some extent 10GE). > fatter nodes (dual-socket quad-core, for instance) would at least decrease > the effect of slow interconnect. > > you might also try instaling openMX, which is an ethernet protocol > optimized for MPI (rather than your current MPI which is presumably layered > on top of the usual TCP stack, which is optimized for wide-area > streaming transfers.) heck, you can probably obtain some speedup by > tweaking your coalesce settings via ethtool. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlb17 at duke.edu Wed Sep 2 20:54:17 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 2 Sep 2009 23:54:17 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, 2 Sep 2009 at 10:29pm, Rahul Nabar wrote > That brings me to another important question. Any hints on speccing > the head-node? Especially the kind of storage I put in on the head > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. Mostly for running jobs that access their I/O > via files stored centrally. > > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I > boost the RAM up? Or any other comments. It is tricky to spec the > central node. > > Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? Speccing storage for a 300 node cluster is a non-trivial task and is heavily dependent on your expected access patterns. Unless you anticipate vanishingly little concurrent access, you'll be very hard pressed to service a cluster that large with a basic Linux NFS server. About a year ago I had ~300 nodes pointed at a NetApp FAS3020 with 84 spindles of 10K RPM FC-AL disks. A single user could *easily* flatten the NetApp (read: 100% CPU and multi-second/minute latencies for everybody else) without even using the whole cluster. Whatever you end up with for storage, you'll need to be vigilant regarding user education. Jobs should store as much in-process data as they can on the nodes (assuming you're not running diskless nodes) and large jobs should stagger their access to the central storage as best they can. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From landman at scalableinformatics.com Wed Sep 2 21:15:41 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 03 Sep 2009 00:15:41 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: <4A9F42ED.5050600@scalableinformatics.com> Rahul Nabar wrote: > That brings me to another important question. Any hints on speccing > the head-node? Especially the kind of storage I put in on the head For a cluster of this size, divide and conquer. Head node to handle cluster admin. Create login nodes for users to access to handle builds, job submission, etc. > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. Mostly for running jobs that access their I/O > via files stored centrally. Hmmm... We don't recommend burdening the head node with storage apart for very small clusters, where it is a bit more cost effective. 
Depending upon how your nodes do IO for your jobs, this will dictate how you need your IO designed. If all nodes will do IO, then you need something that can handle *huge* transients from time to time. If one node does IO, you need just a good fast connection. Is GbE enough? How much IO are we talking about? Bad storage design can make a nice new 300 node cluster seem very slow. > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I > boost the RAM up? Or any other comments. It is tricky to spec the > central node. Head node: from a management perspective (name service, dhcp/tftp/pxe, authentication/gateway, status monitor, etc) can be relatively light weight. Login node(s): should have sufficient RAM/CPU for builds. Storage node(s): should be built with thought towards the IO patterns expected. > Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? Start with your IO patterns, your IO volume, and how many are running at once. Once you have this, move on to figuring out capacity needs, availability needs (replication, fast home vs fast scratch + slow home) Avoid worrying about the technologies you should consider until you have a better handle on how it will be used. The use cases will suggest the technologies you should consider. We are biased (given what we build, sell and support) of course. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Wed Sep 2 23:18:20 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 3 Sep 2009 02:18:20 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: > That brings me to another important question. Any hints on speccing > the head-node? I think you imply a single, central admin/master/head node. this is a very bad idea. first, it's generally a bad idea to have users on a fileserver. next, it's best to keep cluster-infrastructure (monitoring, management, pxe, scheduling) on a dedicated admin machine. for 300 compute nodes, it might be a good idea to provide more than one login node (for editing, compilation, etc). > Especially the kind of storage I put in on the head > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. 1 TB is, I assume you know, half a disk these days (ie, trivial). for a 300-node cluster, I'd configure at least 10x and probably 100x that much. (my user community is pretty diverse, though, and with a wide range of IO habits.) > Mostly for running jobs that access their I/O > via files stored centrally. it would be wise to get some sort of estimates of the actual numbers - even the total size of all files accessed by a job and its average runtime would let you figure an average data rate. > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I I don't think I'd use such a nice machine for any of fileserver, admin or login nodes. for admin, it's not needed. for login it'll be unused a lot of the time. for fileservers, you want to sweat the IO system, not the CPU or memory. > boost the RAM up? Or any other comments. It is tricky to spec the > central node. spec'ing a single one may be, but a single one is a bad idea... 
> Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? 10G (Cu or SiO2, doesn't matter) is the right choice for an otherwise-gigabit cluster. From rpnabar at gmail.com Thu Sep 3 03:59:37 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 05:59:37 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Thu, Sep 3, 2009 at 1:18 AM, Mark Hahn wrote: Thanks a lot for all the great comments, guys! > I think you imply a single, central admin/master/head node. this is a very > bad idea. first, it's generally a bad idea to have users on a fileserver. > next, it's best to keep cluster-infrastructure > (monitoring, management, pxe, scheduling) on a dedicated admin machine. > for 300 compute nodes, it might be a good idea to provide more than one > login node (for editing, compilation, etc). Absolutely. I ought to use the term "head node(s)". What I want to spec and figure out is how many central machines are warranted and how I should configure each of them differently. > > 1 TB is, I assume you know, half a disk these days (ie, trivial). > for a 300-node cluster, I'd configure at least 10x and probably 100x that > much. (my user community is pretty diverse, though, > and with a wide range of IO habits.) We have a different long-term store, so this machine is only holding running, staging and other jobs. Users are warned that data is not backed up and is subject to periodic flushing. Yet, you are right: I was being overly stingy. Bad estimate. I have a similar smaller cluster and I double-checked usage on that one just now. If I scale that up to 300 nodes I should probably be shooting for 4.5 to 5 Terabytes of storage. > > I don't think I'd use such a nice machine for any of fileserver, admin or > login nodes. for admin, it's not needed. for login it'll be unused a lot of > the time. for fileservers, you want to sweat the IO system, not the CPU or > memory. Yes, I used it for lack of knowledge of a more suitable but punier candidate. Any suggestions on a more puny machine? Besides, overspeccing the processor on the central node doesn't change my cost much relative to the entire cluster. > > 10G (Cu or SiO2, doesn't matter) is the right choice for an > otherwise-gigabit cluster. > 10G storage-node-to-switch, and alternatively 10G storage-box-to-switch, correct? -- Rahul From rpnabar at gmail.com Thu Sep 3 04:06:05 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 06:06:05 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 10:54 PM, Joshua Baker-LePain wrote: > On Wed, 2 Sep 2009 at 10:29pm, Rahul Nabar wrote >> > Speccing storage for a 300 node cluster is a non-trivial task and is heavily > dependent on your expected access patterns. Unless you anticipate > vanishingly little concurrent access, you'll be very hard pressed to service > a cluster that large with a basic Linux NFS server. Thanks Joshua! The question is, what are my alternatives? Software: change from NFS to xxx? Hardware: go for an external NetApp storage box? Others.......? > > Whatever you end up with for storage, you'll need to be vigilant regarding > user education.
Jobs should store as much in-process data as they can on > the nodes (assuming you're not running diskless nodes) and large jobs should > stagger their access to the central storage as best they can. Nope, not diskless nodes. Nodes have a local OS and /scratch space, but user files and executable installations reside on a central NFS store. Luckily my usage patterns are such that there is no new code development on this particular cluster. I have tight control over which exact codes are running (DACAPO, VASP, GPAW). Thus, so long as I compile, wrapper-script and optimize each of the codes, the users can do no harm. -- Rahul From rpnabar at gmail.com Thu Sep 3 04:14:10 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 06:14:10 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <4A9F42ED.5050600@scalableinformatics.com> References: <4A9F42ED.5050600@scalableinformatics.com> Message-ID: On Wed, Sep 2, 2009 at 11:15 PM, Joe Landman wrote: > Rahul Nabar wrote: > For a cluster of this size, divide and conquer. Head node to handle cluster > admin. Create login nodes for users to access to handle builds, job > submission, etc. > Hmmm... We don't recommend burdening the head node with storage apart for > very small clusters, where it is a bit more cost effective. Thanks Joe! My total number of users is relatively small: ~50, with rarely more than 20 concurrently logged-in users. Of course, each user might have multiple shell sessions. So the experts would recommend three separate central nodes? Login node Management node (dhcp / schedulers etc.) Storage node Or more? > Depending upon how your nodes do IO for your jobs, this will dictate how you > need your IO designed. If all nodes will do IO, then you need something > that can handle *huge* transients from time to time. If one node does IO, > you need just a good fast connection. Is GbE enough? How much IO are we > talking about? I did my economics, and on the compute nodes I am stuck with GbE, nothing more. If this becomes a totally unworkable proposition I'll be forced to split into smaller clusters. 10GbE, Myrinet and Infiniband just do not make economic sense for us. On the central nodes, though, I can afford to have better interconnects. Should I? Of what type? -- Rahul From landman at scalableinformatics.com Thu Sep 3 05:58:14 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 03 Sep 2009 08:58:14 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> Message-ID: <4A9FBD66.1060608@scalableinformatics.com> Rahul Nabar wrote: > On Wed, Sep 2, 2009 at 11:15 PM, Joe > Landman wrote: >> Rahul Nabar wrote: > >> For a cluster of this size, divide and conquer. Head node to handle cluster >> admin. Create login nodes for users to access to handle builds, job >> submission, etc. > >> Hmmm... We don't recommend burdening the head node with storage apart for >> very small clusters, where it is a bit more cost effective. > > Thanks Joe! My total number of users is relatively small: ~50, with > rarely more than 20 concurrently logged-in users. Of course, each user > might have multiple shell sessions. > > So the experts would recommend three separate central nodes? > > Login node > Management node (dhcp / schedulers etc.) > Storage node You can add more login nodes as you need. Management nodes for the cluster stack (if any) can be fairly simple.
The storage node is a function of your IO patterns. For really large clusters, you'd separate out the scheduler and some of the other functions as well. Mark Hahn and some of the other folks on the list run some of the really large clusters out there. They have some good advice for those scaling up. > Or more? > >> Depending upon how your nodes do IO for your jobs, this will dictate how you >> need your IO designed. If all nodes will do IO, then you need something >> that can handle *huge* transients from time to time. If one node does IO, >> you need just a good fast connection. Is GbE enough? How much IO are we >> talking about? > > I did my economics and on the compute nodes I am stuck to GbE nothing > more. If this becomes a totally unworkable proposition I'll be forced > to split into smaller clusters. 10GbE, Myrinet, Infiniband just do not > make economic sense for us. On the central nodes, though, I can afford > to have better interconnects. Should I? Of what type? It might be worth asking what your targeted per node budget is. 24 port SDR IB switches are available, and relatively inexpensive. 24 port SDR PCIe cards are available and relatively inexpensive. Jeff Layton (a great resource BTW) wrote about them last year. We've used them in a number of designs. Not the rock bottom in latency, but we have customers using our storage over NFS over RDMA at 500+ MB/s with them, so its not too bad. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 3 06:14:04 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 08:14:04 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D056731@milexchmb1.mil.tagmclarengroup.com> References: <4A9F42ED.5050600@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D056731@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 3, 2009 at 6:27 AM, Hearns, John wrote: > In that case, consider that most motherboards have dual gig Ethernet > onboard. Yes. I've made sure mine do. Twin gigabit sockets. > You could at least specify that the cabling plant is put in for the > second Ethernet network I've always used both but in the past via bonding. > and aim to use that for the storage traffic (or the MPI traffic). A Would it be better to use one exclusively for MPI? I'm not sure how one goes about this yet! The separation is maintained at switches too? An MPI switch separate from the rest-of-traffic switch? > second stack of > Ethernet switches should not stretch your budget too much. True. I am very tempted to go that route. > Then on the main storage node you could put in a 10gig interface - or > indeed several 10gig > interfaces and spread the load Yes. That is exactly the sort of thing I am wanting to do. That's why I am asking around. e.g. how many could I put in a reasonable config.? It is easier to go fancy on the main node since the cost does not get a multiplier. 
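From what I've read so far, the host-level separation would be something
like this (a sketch only, assuming Open MPI and that the two onboard ports
come up as eth0 for NFS/admin and eth1 for MPI; addresses made up --
corrections welcome):

  # /etc/fstab on a compute node: reach the NFS store via the storage subnet
  10.1.1.250:/home   /home   nfs   rw,hard,intr   0 0

  # keep Open MPI's TCP point-to-point traffic on the MPI interface
  mpirun --mca btl_tcp_if_include eth1 -np 64 ./my_mpi_app

Bonding the two ports instead would buy aggregate bandwidth but no
isolation; two separate networks (and the second switch stack) would keep
a heavy NFS burst out of the same queues as the MPI traffic.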
-- Rahul From rpnabar at gmail.com Thu Sep 3 06:23:36 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 08:23:36 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <4A9FBD66.1060608@scalableinformatics.com> References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> Message-ID: On Thu, Sep 3, 2009 at 7:58 AM, Joe Landman wrote: > For really large clusters, you'd separate out the scheduler and some of the > other functions as well. ?Mark Hahn and some of the other folks on the list > run some of the really large clusters out there. ?They have some good advice > for those scaling up. Thanks Joe! I really look forward to Mark Hahn and the others guiding me. This expansion is on the larger side for me. > It might be worth asking what your targeted per node budget is. I do not have a target per node but more of a $/performance budget. And for my codes I've found that Infy etc. just don't cut it. The additional cost does not squeeze out the extra performance. I've benchmarked several chips and configs and the current winner for our codes seems to be a Intel Nehalem E5520. Less than 3000 $/node. >24 port SDR > IB switches are available, and relatively inexpensive. Is there a approximate $$$ figure someone can throw out? These numbers have been pretty hard to get. >24 port SDR PCIe > cards are available and relatively inexpensive. Ditto. Any $ figures? All my calculations boosted up the $ price of a node to a point where the performance would have to be very stellar to warrant the spending. And really, the plain-vanilla Nehalem ethernet config is not doing too badly for us yet. My main concern now is scaling. -- Rahul From gus at ldeo.columbia.edu Thu Sep 3 08:19:33 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 03 Sep 2009 11:19:33 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> Message-ID: <4A9FDE85.5020008@ldeo.columbia.edu> Rahul Nabar wrote: >> 24 port SDR >> IB switches are available, and relatively inexpensive. > > Is there a approximate $$$ figure someone can throw out? These numbers > have been pretty hard to get. > >> 24 port SDR PCIe >> cards are available and relatively inexpensive. > > Ditto. Any $ figures? > > All my calculations boosted up the $ price of a node to a point where > the performance would have to be very stellar to warrant the spending. > And really, the plain-vanilla Nehalem ethernet config is not doing too > badly for us yet. My main concern now is scaling. > Hi Rahul See these small SDR switches: http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 And SDR HCA card: http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12 We bought DDR though, but our cluster is small, one 36-port switch only. For a 300-node cluster you need to consider optical fiber for the IB uplinks, and switches with that capability, or buy the appropriate adapters. The regular IB cables are length-challenged, most likely can only be used for node-to-switch connections. Also, for Opteron, Supermicro (and probably others) has motherboards with onboard IB adapters, on 1U dual-node chassis. I wonder if there is something similar for Nehalem. 
I don't know about your computational chemistry codes,
but for climate/oceans/atmosphere (and probably for CFD)
IB makes a real difference w.r.t. Gbit Ethernet.
For us there was no point in trading a larger number of nodes for IB.
OTOH, if your codes run mostly intra-node,
there is no advantage in buying a fast interconnect,
but I would doubt your Chem codes are happy with 8 processes per job only.

Also, with IB, you could dedicate one of your nodes' Gbit Ether ports
to I/O only, with all MPI traffic using IB.

My $0.02
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

From rpnabar at gmail.com  Thu Sep  3 09:28:39 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 3 Sep 2009 11:28:39 -0500
Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes
In-Reply-To: <4A9FDE85.5020008@ldeo.columbia.edu>
References: <4A9F42ED.5050600@scalableinformatics.com>
	<4A9FBD66.1060608@scalableinformatics.com>
	<4A9FDE85.5020008@ldeo.columbia.edu>
Message-ID: 

On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa wrote:
> See these small SDR switches:
>
> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13
> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10
>
> And SDR HCA card:
>

Thanks Gus! This info was very useful. A 24-port switch is $2400 and
the card $125. Thus each compute node would be approximately $300 more
expensive. (How about Infiniband cables? Are those special, and how
expensive? I did google but was overwhelmed by the variety available.)

This isn't bad at all, I think. If I base it on my current node price
it would require only about a 20% performance boost to justify this
investment. I feel Infiniband could deliver that. When I had calculated it
the economics was totally off; maybe I had wrong figures.

The price-scaling seems tough though. Stacking 24-port switches might
get a bit too cumbersome for 300 servers. But when I look at
corresponding 48 or 96 port switches the per-port price seems to shoot
up. Is that typical?

> For a 300-node cluster you need to consider
> optical fiber for the IB uplinks,

You mean compute-node-to-switch and switch-to-switch connections?
Again, any $$$ figures, ballpark?

> I don't know about your computational chemistry codes,
> but for climate/oceans/atmosphere (and probably for CFD)
> IB makes a real difference w.r.t. Gbit Ethernet.

I have a hunch (just a hunch) that the computational chemistry codes
we use haven't been optimized to take full advantage of the latency
benefits etc. Some of the stuff they do is pretty bizarre and
inefficient if you look at their source codes (e.g. writing to large
I/O files all the time). I know this ought to be fixed, but that seems
a problem for another day!
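(To spell out the rough ~$300/node arithmetic above -- the cable price is
a guess on my part:

  24-port SDR switch   ~$2400  ->  ~$100 per port
  SDR HCA                           ~$125
  copper cable (guess)              ~$50
  -----------------------------------------
  premium per node                  ~$275

plus whatever spine/uplink switching a ~300-port fabric needs on top of
the leaf switches, which is presumably where the per-port price starts to
climb.)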
-- Rahul From rpnabar at gmail.com Thu Sep 3 09:43:28 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 11:43:28 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D056A7B@milexchmb1.mil.tagmclarengroup.com> References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> <4A9FDE85.5020008@ldeo.columbia.edu> <68A57CCFD4005646957BD2D18E60667B0D056A7B@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 3, 2009 at 11:41 AM, Hearns, John wrote: > > At this point you really, really should be getting a vendor or a few > vendors > in to do these pricings for you. That's part and parcel of pitching for > a > cluster of this size - you really, really should not be doing the donkey > work like this. > You should say 'we want to run codes X, Y,Z' we prefer processors > 'A,B,C' > we need 'N to M' terabytes of storage. Let your vendor work for the > money. That's true. That's how I have specced the rest of the cluster anyways. I'll go to my vendors and check what I get. -- Rahul From Shainer at mellanox.com Thu Sep 3 09:51:01 2009 From: Shainer at mellanox.com (Gilad Shainer) Date: Thu, 3 Sep 2009 09:51:01 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes References: <4A9F42ED.5050600@scalableinformatics.com><4A9FBD66.1060608@scalableinformatics.com><4A9FDE85.5020008@ldeo.columbia.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> > > See these small SDR switches: > > > > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct= 13 > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 > > > > And SDR HCA card: > > > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) You can find copper cables from around $30, so the $300 will include the cable too > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. You can always run your app on available users system and see the performance boost that you will be able to get. For example, you can use the center (free of charge) - http://www.hpcadvisorycouncil.com/cluster_center.php > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? It is the same as buying blades. If you get the switches fully populated, than it will be cost effective. There is a 324 port switch, which should be a good option too. > > For a 300-node cluster you need to consider > > optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? It all depends on the speed. If you are using IB SDR or DDR, copper cables will be enough. For QDR you can use passive copper up 7-8 meters, and active up to 12m, before you need to move to fiber. 
> > I don't know about your computational chemistry codes, > > but for climate/oceans/atmosphere (and probably for CFD) > > IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! On the same web site I have listed above, there are some best practices with apps performance, You can check them out and see if some of them are more relevant. From gus at ldeo.columbia.edu Thu Sep 3 10:25:01 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 03 Sep 2009 13:25:01 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> <4A9FDE85.5020008@ldeo.columbia.edu> Message-ID: <4A9FFBED.6070300@ldeo.columbia.edu> Rahul Nabar wrote: > On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa wrote: >> See these small SDR switches: >> >> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13 >> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 >> >> And SDR HCA card: >> > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) > Hi Rahul IB cables (0.5-8m,$40-$109): http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=2 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1&idcategory=2 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2&idcategory=2 etc ... > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. > > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. It probably will. I will defer any comments to the network pros on the list. Here is a suggestion. I would guess that if you don't intend to run the codes, say, on more than 24-36 nodes at once, you might as well not stack all the small IB switches. I.e., you could divide the cluster IB-wise into smaller units, of perhaps 36 nodes or so, with 2-3 switches serving each unit. Not sure how to handle the IB subnet(s) manager in such a configuration, but there may be ways around. This scheme may take some scheduler configuration to handle MPI job submission, but it may save you money and hardware/cabling complexity, and still let you run MPI programs with a substantial number of processes. You can still fully connect the 300 nodes through Gbit Ether, for admin and I/O purposes, stacking 48-port GigE switches. IB is a separate (set of) network(s), which I assume will be dedicated to MPI only. You may want to check the 36-port IB switches also, but IIRR they are only DDR and QDR, not SDR, and somewhat more expensive. > But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? 
> I was told the current IB switch price threshold is 36-port. Above that it gets too expensive, the cost-effective solution is stacking smaller switches. I'm just passing the information/gossip along. >> For a 300-node cluster you need to consider >> optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? > I would guess you may need optical fiber for switch-switch connections. Depending on the distance, of course, say, across two racks, if this type of connection is needed. Regular IB cables are probably able handle the node-switch links, if the switches are distributed across the racks. >> I don't know about your computational chemistry codes, >> but for climate/oceans/atmosphere (and probably for CFD) >> IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! > Not only your Chem codes. Brute force I/O is rampant here also. Some codes take pains to improve MPI communication on the domain decomposition side, with asynchronous communication, etc, then squander it all by letting everybody do I/O in unison. (Hence, keep in mind Joshua's posting about educating users and adjusting codes to do I/O gently.) I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From rpnabar at gmail.com Thu Sep 3 15:56:43 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 17:56:43 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: On Thu, Sep 3, 2009 at 3:16 PM, Greg Kurtzer wrote: Thanks for the comments Greg! > If you were using Perceus..... No. I've never used Perceus before and although it sounds interesting this seems like a bad time to try something new! > > The file system needs to be built to handle the load of the apps. 300 > nodes means you can go from the low end (Linux RAID and NFS) to a > higher end NFS solution, or upper end of a parallel file system or > maybe even one of each (NFS and parallel) as they solve some different > requirements. What exactly do you mean by a "parallel" file system? Something like GPFS? That's IBM proprietory though isn't it? On the other hand NFS seems pretty archaic. I've seen quite a few installations use Lustre. I am planning to play with that. Something in the OpenSource world to keep costs down. -- Rahul From skylar at cs.earlham.edu Thu Sep 3 16:25:36 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Thu, 03 Sep 2009 16:25:36 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: <4AA05070.2030900@cs.earlham.edu> Rahul Nabar wrote: > What exactly do you mean by a "parallel" file system? Something like > GPFS? 
That's IBM proprietory though isn't it? On the other hand NFS > seems pretty archaic. I've seen quite a few installations use Lustre. > I am planning to play with that. Something in the OpenSource world to > keep costs down. > > GPFS actually supports exporting the filesystem using clustered NFS. You have to run your clients over NFSv4, though. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From eugen at leitl.org Fri Sep 4 01:17:22 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 4 Sep 2009 10:17:22 +0200 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo Message-ID: <20090904081722.GN4508@leitl.org> http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html Some perspective to this DIY storage server mentioned at Storagemojo Thursday, September 3. 2009 I've received yesterday some mails/tweets with hints to a "Thumper for poor" DIY chassis. Those mails asked me for an opinion towards this piece of hardware and if it's a competition to our X4500/X4540. Those questions arised after Robin Harris wrote his article "Cloud storage for $100 a terabyte", which referred to the company Backblaze, which constructed a storage server on its own and described it on their blog in the article "Petabytes on a budget: How to build cheap cloud storage". Sorry, that this article took so long and there may be a higher rate of typos, as my sinusitis came back with a vengeance ... right in the second week of my vacation. But now this rather long article is ready :-) At first: No, it isn't a system comparable to an X4540 ... even without the considerations of DIY versus Tier-1 vendor. I have a rather long opinion about it, but let's say one thing at first: I see several problems, but i think it fits their need, so it's an optimal design for them and they designed it to be the optimum for them. I assume, many problems are addressed in the application logic. The nice thing at custom-build is the fact, that you can build a system exactly for your needs. And the Backblaze system is a system reduced to the minimum. This device is that cheap because it cuts several corners. That's okay for them. But for general purpose this creates problems. I want to share my concerns just to show you, that you can't compare this to a X4540 device. And even more important: I have to deny the conclusions of the Backblaze people. This isn't a good design, even when you just need cheap storage, when you don't own a middleware that does a lot of stuff that ZFS would do in the filesystem for example in the hardware. On the other side it supports my arguments in regard of the waning importance of RAID controllers. The more intelligent your application is, the less intelligent your storage needs to be. So ... what are my objections to this DIY device: * The DIY Thumper has no power-distribution grid. So when one PSU fails, all devices connected at this power supply will fail. In the case PSU2 fails, the system board is away, thus the machine fails. Game over ... until power comes back. * Connected to the last problem: Given the disk layout, the power-distribution isn't correct. They use it with RAID6, but RAID6 just protects you against 2 failures. 
I don't see a sensible layout in three RAID6 groups, that would allow the system to loose 25 disks at once. A more reasonable RAID Level would be RAID10, but there you have 5 disks without a partner in the other PSU failure domain. * I don't know if i consider a foam sleeve around the disks and some nylon screws as enough vibration dampening, especially when your hard disks. I'm looking forward to the next article they announced which was announced to cover this topic. It will be even more interesting to hear more about it in the future because of the performance and the longevity of the disks in such an environment. Just an example for the real world: Once we found out that disks near to a fan were a tad slower than the ones far away from the fan. This led to changes to the vibration handling in that system. * This baby cries for ZFS. So much capacity, no battery backup RAID controller, only 10^14 disks. But i see the reason, why this choice wasn't feasible for them: Since a few weeks ago, the OpenSolaris SATA framework hadn't support for port multiplier. This was introduced with the putback of PSARC/2009/394 to OpenSolaris. But now it's integrated. And given, that this baby just speaks HTTPS to the outside and the software relies on Tomcat, it should be a piece of cake to move to Opensolaris and ZFS now. * This design isn't really performance oriented. As they use Port multiplier to couple their disks to cheap SATA PCIe/PCI controller, one 3 GBit/s interface has to feed 5 disks. One ST31500341AS delivers round about 120 MByte/s (saw several benchmarks suggesting such a value). Five of them deliver 600 MByte/s, a little bit less than 6 GBit/s. So each SATA channel is oversubscribed by a factor of two. * Even more important, three of the connections to the port extenders are coupled to a standard PCI-Port. One PCI-Conventional 3.0 port (didn't find an information what the board provides, thus i assumed the fastest, source is the german wikipedia page about PCI) is capable to deliver round-about 4 Gigabit/second (to be exact 4,226 GBit/s). Thus you connected 18 GBit/s worth of hard disks at 4 GBit/s worth of connectivity. * I have similar objections for the PCIe connection for SATA-cards. Those ports are PCIe at 1x. One PCIe 1x port has a theoretical throughput of 250 MByte/s. So such a port would be fully loaded by just two hard disks. But this baby connects ten disks to a single lane of PCIe. * Of course those hard disks doesn't run at max speed all the time, i assume the load pattern will be very random in the special use case of Backblaze. But this leads to a high mechanical load to the disks and to some additional objections. Based on the manual of the hard disk, i see two problems here: o The ST31500341AS is a desktop disk. Not even one of this nearline disks like we use in the X4500/X4540. When you look in the disk manual, all reliability calculations were done on the basis of 2400 hours of operation per year. But a year has 8760 hours. o The reliability considerations of Seagate assume a desktop usage pattern, not a server usage pattern. o o Seagate writes in their manual itself: "The AFR and MTBF will be degraded if used in an enterprise application". But given the long credits list at their end, i assume they've read the manual and considered this in their choice of hard disks. o There is another important point about the reliability of the disks: The AFR and the MTBF for the 7200.11 is valid for a surrounding temperature of 25 degrees celcius. 
Running it above this temperature reduces the MTBF and increases the AFR. Other harddisks build with enterprise usage in mind use another normal temperature vastly higher. * But due to the usage of RAID-6 those disks will see a high throughput in any case. RAID6 relies on a READ/MODIFY/WRITE cycle due to the nature of RAID6. So you write vastly more than just the modified data to disc. This may even interfere with the sparse throughput of the system. We've introduced RAIDZ, RAIDZ2 and RAIDZ3 to circumvent this kind of problems * No battery backup for the caches, but RAID6 ... well ... "Warning ... write holes ahead" * This system uses a Desktop Board, the DG43NB, thus system resources are a little bit sparse on this board. Just 1 processor and just 4 GB of RAM. I find the later one a little bit problematic. For general purpose a lot of more memory would be feasible. There are good reasons to have 32 GB or 64 GB in a X4540. Without a large amount of cache, you aren't able to shave off a little bit of the IOPS load to get back to a moderate load, thus the choice of a desktop disks gets even more problematic here. I think, Robin Harris is correct with his comment, that this system is a DC-3. It flies, it can transport goods and passengers from A to B in a reasonable, but not fast speed but don't forget your parachutes ;-) It's the same with this storage, this hw needs the parachute in form of the software in front of the device. But, and this is one of the key take aways for you ... even when other systems are more expensive, they are not overpriced. At first don't compare the mentioned list prices with the street prices for components. Second: Of course you can save an dollar at one or the other place, but: The seagate hard disk costs you 100 Euro at a big german computer online-shop, the HUA721010KLA330 (aka Hitachi Ultrastar A7K1000 1TB) costs you roundabout 200 Euro after a search at Google. Just using other (in my opinion correct for general purpose) disks, would double the price despite offering less storage. And even this price isn't indicative, as most often there are special agreements between drive manufacturers and system manufactures because of quality standards, quality management and conditions. The technical differences of the UltraStar: 1 errors in 10^15, qualified for 24/7 operations by the manufacturer, qualified for a enterprise work pattern (and even here only a lighter one) and 1.2 Million Hours MTBF normalized on 40 degrees (AFAIK) instead of 0.7 million Hours at 25 degrees. Quality costs. Period. The same for a desktop board in the DIY-"Thumper" instead of a custom build board for optimal performance (a SATA controller for each disk or using 8x lane PCIe for 8 disks instead of 1x lane PCIe for 10 disk e.g.). I'm pretty sure Sun could build an equally priced system, when you take the bare metal of the X4500 chassis and rip out all the specialities of the X4500/X4540 systems. But such a system with so many corners left wouldn't a be a system you expect from Sun. And yes, the X4540 has less capacity at the moment, but i think it's not far too fetched, that the X4540 gets 2TB drives as soon as they reached the same quality standards and qualification as the current drives givinh the X4540 a capacity of 96 TB. To close this article: It's about making decision. Application and hardware has to be seen as one. 
When your application is capable of overcoming the limitations and
problems of such ultra-cheap storage (and the software of Backblaze seems
to have these capabilities), such a DIY thing may be a good solution for
you. If you have to run normal applications without these capabilities,
the general-purpose system looks like a much better road in my opinion.

Posted by Joerg Moellenkamp in English, Solaris, Sun, Technology, The IT
Business at 15:22

From deadline at eadline.org  Fri Sep  4 12:49:24 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Fri, 4 Sep 2009 15:49:24 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
Message-ID: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>

It is not a real "book" (it is short), but some
people may be interested in:

HPC For Dummies

http://www.sun.com/x64/ebooks/hpc.jsp

You have to register for a copy.
The author is some HPC hack.

--
Doug

From gus at ldeo.columbia.edu  Fri Sep  4 13:11:53 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 04 Sep 2009 16:11:53 -0400
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
Message-ID: <4AA17489.1080404@ldeo.columbia.edu>

Douglas Eadline wrote:
> It is not a real "book" (it is short), but some
> people may be interested in:
>
> HPC For Dummies
>
> http://www.sun.com/x64/ebooks/hpc.jsp
>
> You have to register for a copy.
> The author is some HPC hack.
>

I got interested, and I tried.
Unfortunately it requires Windows or Mac OS to be downloaded and read.
Weird requirement for an HPC audience.
Linux folks are out.
Why not a simple PDF file?

Gus Correa

From hahn at mcmaster.ca  Fri Sep  4 13:24:17 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 4 Sep 2009 16:24:17 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
	<4AA17489.1080404@ldeo.columbia.edu>
Message-ID: 

> Weird requirement for an HPC audience.

s/Weird/Asinine/

From deadline at eadline.org  Fri Sep  4 13:57:35 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Fri, 4 Sep 2009 16:57:35 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
	<4AA17489.1080404@ldeo.columbia.edu>
Message-ID: <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org>

I was not aware of that!

I wrote this back in February and have not seen
the final version. I will mention something
to the people at AMD and Sun. I suspect it
has more to do with Wiley than with Sun and AMD.
Maybe a printed/pdf version will show up.

--
Doug

> Douglas Eadline wrote:
>> It is not a real "book" (it is short), but some
>> people may be interested in:
>>
>> HPC For Dummies
>>
>> http://www.sun.com/x64/ebooks/hpc.jsp
>>
>> You have to register for a copy.
>> The author is some HPC hack.
>>
>
> I got interested, and I tried.
> Unfortunately it requires Windows or Mac OS to be downloaded and read.
> Weird requirement for an HPC audience.
> Linux folks are out.
> Why not a simple PDF file?
> > Gus Correa > -- Doug From gus at ldeo.columbia.edu Fri Sep 4 14:31:40 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 04 Sep 2009 17:31:40 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> Message-ID: <4AA1873C.3080603@ldeo.columbia.edu> Hi Douglas I didn't mean to spoil your chances for the Pulitzer Prize, and I really hope your book will become a bestseller! :) Considering what the technical and textbook market is today, where the same books get a new edition every year, with the same or reduced content, a new cover, a password protected web site, and a doubled price tag, your guess about Wiley making the access to the book sample hard may be correct. Thank you for the book sample anyway, which I hope will become available to all as pdf file at some point. Gus Correa Douglas Eadline wrote: > I was not aware of that! > > I wrote this back in February and have not seen > the final version. I will mention something > to the people at AMD and Sun. I suspect it > has more to with Wiley than with Sun and AMD. > Maybe a printed/pdf version will show up. > > -- > Doug > > >> Douglas Eadline wrote: >>> It is not a real "book" (it is short), but some >>> people may be interested in: >>> >>> HPC For Dummies >>> >>> http://www.sun.com/x64/ebooks/hpc.jsp >>> >>> You have to register for a copy. >>> The author is some HPC hack. >>> >> I got interested, and I tried. >> Unfortunately it requires Windows or Mac OS to be downloaded and read. >> Weird requirement for an HPC audience. >> Linux folks are out. >> Why not a simple PDF file? >> >> Gus Correa >> > > From deadline at eadline.org Fri Sep 4 17:51:30 2009 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 4 Sep 2009 20:51:30 -0400 (EDT) Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA1873C.3080603@ldeo.columbia.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> <4AA1873C.3080603@ldeo.columbia.edu> Message-ID: <50870.192.168.1.213.1252111890.squirrel@mail.eadline.org> > Hi Douglas > > I didn't mean to spoil your chances for the Pulitzer Prize, > and I really hope your book will become a bestseller! :) > > Considering what the technical and textbook market is today, > where the same books get a new edition every year, > with the same or reduced content, a new cover, > a password protected web site, and a doubled price tag, > your guess about Wiley making the access to the book sample > hard may be correct. > > Thank you for the book sample anyway, > which I hope will become available > to all as pdf file at some point. Thanks, but I should mention, it is not a "real" book. Or, not the book I would have written if I had more pages. It is a promotional "book" which I was given approximately 40 pages to explain HPC. I tried to cram as much as I could into it and make it interesting to non-HPC people (i.e. industrial HPC) It is similar to the "Virtualization For Dummies" book AMD produced a year or so ago. I should say that if there is enough interest in this topic Wiley may want me to do a "full book." As I never got a final copy, I to will have to figure out how to download and read it :) -- Doug > > Gus Correa > > > > Douglas Eadline wrote: >> I was not aware of that! 
>> >> I wrote this back in February and have not seen >> the final version. I will mention something >> to the people at AMD and Sun. I suspect it >> has more to with Wiley than with Sun and AMD. >> Maybe a printed/pdf version will show up. >> >> -- >> Doug >> >> >>> Douglas Eadline wrote: >>>> It is not a real "book" (it is short), but some >>>> people may be interested in: >>>> >>>> HPC For Dummies >>>> >>>> http://www.sun.com/x64/ebooks/hpc.jsp >>>> >>>> You have to register for a copy. >>>> The author is some HPC hack. >>>> >>> I got interested, and I tried. >>> Unfortunately it requires Windows or Mac OS to be downloaded and read. >>> Weird requirement for an HPC audience. >>> Linux folks are out. >>> Why not a simple PDF file? >>> >>> Gus Correa >>> >> >> > -- Doug From richard.walsh at comcast.net Sat Sep 5 11:30:32 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 5 Sep 2009 18:30:32 +0000 (UTC) Subject: [Beowulf] Re: Wake on LAN supported on both built-in interfaces ... ?? In-Reply-To: Message-ID: <126594175.7208511252175432775.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >----- Original Message ----- >From: "David Mathog" >Subject: Wake on LAN supported on both built-in interfaces ... ?? > >richard.walsh at comcast.net wrote: > >> I have a head node that am trying to get WOL set up on. >> >> It is a SuperMicro motherboard (X8DTi-F) with two built >> in interfaces (eth0, eth1). I am told by SuperMicro support >> that both interfaces support WOL fully, but when I probe them >> with ethtool only eth0 indicates that it supports WOL with: >> >> ..... > >That board has "Intel? 82576 Dual-Port Gigabit Ethernet" > >and Intel provides some information on that here: > >http://edc.intel.com/Link.aspx?id=2372 > >where it says: > > Wake-on-LAN support: > Packet recognition and wake-up for LAN on motherboard > applications without software configuration > >and nothing more. That is ambiguous, it requires that at least one >interface support WOL, but it does not say explicitly that both do. >Most likely the hardware does support on both ports but the driver is >confused somehow by the dual chip. > >Try contacting the author of the linux driver and/or Intel directly. David/All, Here is some follow up on this WOL question. I did contact the driver folks and did some further checking. Standby power has to be supplied to both ports, but no one could assure me that it was for my motherboard. The suggestion was that because there was nothing in the EEPROM/BIOS to activate or switch to the other port that is perhaps was not. Economy of design and logic suggest that it should not be needed anyway, and upon further thought I simply swapped interface names with the MAC addresses in the ifcfg-eth0 and ifcfg-eth1 files, and then swapped my cables and rebooted, which was easy to do. There are half a dozen was to set or fix your interface names with specific MAC addresses/ports. Here is a good reference for doing this: http://www.science.uva.nl/research/air/wiki/LogicalInterfaceNames?show_comments=1#comments So, I have WOL working ... ;-) ... as I want. Thanks, rbw -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prentice at ias.edu Tue Sep 8 14:18:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Sep 2009 17:18:00 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> Message-ID: <4AA6CA08.6090200@ias.edu> Douglas Eadline wrote: > It is not a real "book" (it is short), but some > people may be interested in: > > HPC For Dummies > > http://www.sun.com/x64/ebooks/hpc.jsp > > You have to register for a copy. > The author is some HPC hack. > Wasn't AMD giving this away for free at SC08? -- Prentice From prentice at ias.edu Tue Sep 8 14:26:33 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Sep 2009 17:26:33 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA6CA08.6090200@ias.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA6CA08.6090200@ias.edu> Message-ID: <4AA6CC09.3080909@ias.edu> Prentice Bisbal wrote: > Douglas Eadline wrote: >> It is not a real "book" (it is short), but some >> people may be interested in: >> >> HPC For Dummies >> >> http://www.sun.com/x64/ebooks/hpc.jsp >> >> You have to register for a copy. >> The author is some HPC hack. >> > > Wasn't AMD giving this away for free at SC08? Nevermind. It was different. Must have been the virtualization for dummies book. -- Prentice From michf at post.tau.ac.il Tue Sep 1 01:08:20 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:08:20 +0300 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <20090901110820.1533b3ca@vivalunalitshi.luna.local> On Mon, 31 Aug 2009 16:13:55 +0200 Jonathan Aquilina wrote: > >One thing that's not mentioned out loud by NVIDIA (I have read only in > >CUDA programming manual) is that if the video system needs more memory > >that's not available(say you change resolution, while you're waiting > >for your process to finish), it will crash your cuda app, so I advise > >you to use a second card to display (if you have a tesla solution, > >you certainly have a "second" display card). If you are running > >remotly, this i an non issue (framebuffers don't need much memory > >neither change resolution). > in this regard then why waste a pci-x slot when u can get one that has > graphics integrated onto the board leaving the slots free to use for data > processing. is there any difference in performance in a motherboard that has > the graphics card integrated and one that does not? ?The Problem is that I've never ran into an onboard card that's capable of doing real hpc work. Laptops can come with a relatively strong nvidia chip but not enough for hpc (can't handle the wattage and cooling among other things). Motherboards usually come with a cheap intel. And it's not like you can use the graphics slot for anything else. Possibly in a few years when the Intel vision for larabee will come along where larabee is integrated on the board (although I don't think that it will happen before pci-e 3 which is supposed to handle the same speed. By the way, nvidia do say that it's better not to use the main card for cuda if you intend to do real hpc (don't remember where I read it though) and if the second card is not a tesla the suggest a quadro as the main card. 
They claim that you get better performance (and if you intend to also do glsl, quadro support opengl gpu affinity and I think that it also supports opengl context not on the main card. Tesla apparantly also supports opengl but it's not official). By the way, if you ask nvidia, they suggest for deployment to use only tesla and quadro as they claim that they are designed to handle 24/7 work and that g200 will downclock when it gets hot. The quadros or ridiculously expensive though. From michf at post.tau.ac.il Tue Sep 1 01:20:01 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:20:01 +0300 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <20090901112001.23110413@vivalunalitshi.luna.local> On Sun, 30 Aug 2009 04:35:30 +0500 amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > Which one is more "natural" and "practical" way? > Does a program written for any one of the above will work fine on the other? > or we have to re-program for the other? > If you use mpi to have several processes, each controlling one gpu then the same code would work in both scenarios. The first scenario though will make communication easier and would allow you to avoid mpi. Cuda runs asynchronously so you can do work on the pc in parallel and control all gpus from one process. If the pcs are powerful enough, the communication is low enough and you want to combine cpu and gpu that both do work than the second option may work better. It will make cooling simpler and you can use smaller PSUs. Note that each of these monsters (g200, tesla, quadro) can take over 200W and solve your heating problem for the winter if you're in Alaska. Also take note that there are big differences in terms of communication (speed and latency). If you put all cards in one pc, communication between them is gpu via pci-e to memory. In several pcs you have gpu (pc a) -- pci-e --> memory (pc a) -- network --> memory (pc b) --> gpu (pc b) > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, wrote: > > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: > > > Hello All, > > > > > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > > > (e.g. Nvidia C1060) > > > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > > > 6) Cluster of PCs with each node having more than one CPUs and ONE > > GPU > > > > > > 7) Cluster of PCs with each node having ONE CPU and more than ONE > > GPUs > > > > > > 8) Cluster of PCs with each node having more than one CPUs and more > > > than ONE GPUs. > > > > > > > > > > > > Which of these are good/realistic/practical; which are not? Which are > > quite > > > ?natural? to use for CUDA based programs? > > > > > > > CUDA is kind of new technology, so I don't think there is a "natural > > use" yet, though I read that there people doing CUDA+MPI and there are > > papers on CPU+GPU algorithms. 
> > > > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > > > some/all of these setups or we need to write different CUDA based > > programs > > > for each of these setups to get good efficiency? > > > > > > > There is no "one size fits all" answer to your question. If you never > > developed with CUDA, buy one GPU an try it. If it fits your problems, > > scale it with the approach that makes you more comfortable (but > > remember that scaling means: making bigger problems or having more > > users). If you want a rule of thumb: your code must be > > _truly_parallel_. If you are buying for someone else, remember that > > this is a niche. The hole thing is starting, I don't thing there isn't > > many people that needs much more 1 or 2 GPUs. > > > > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > > > > put it on hold until OpenCL takes of (in the real sense, not in > > "standards papers" sense), otherwise you will have to learn another > > technology that even fewer people knows. > > > > > > Gil Brandao > > From michf at post.tau.ac.il Tue Sep 1 01:58:40 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:58:40 +0300 Subject: [Beowulf] GPU question In-Reply-To: <4A9BFA3B.5030809@ldeo.columbia.edu> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> <4A9BFA3B.5030809@ldeo.columbia.edu> Message-ID: <20090901115840.19c42248@vivalunalitshi.luna.local> On Mon, 31 Aug 2009 12:28:43 -0400 Gus Correa wrote: > Hi Amjad > > 1. Beware of hardware requirements, specially on your existing > computers, which may or may not fit a CUDA-ready GPU. > Otherwise you may end up with a useless lemon. > > A) Not all NVidia graphic cards are CUDA-ready. > NVidia has lists telling which GPUs are CUDA-ready, > which are not. > All newer cards are CUDA ready but with different levels of support. basically everything from geforce 8000 and up will support cuda (even the 40$ cards by the way). g80/g90 cards (geforce 8000 and 9000 series) will only do single precision, have relatively small memory (256-768 mb), have stricter coalescing requirements (working with main card memory efficiently is harder), have limited atomic operation support g200 cards (geforce 200 series) do double precision, can reach a lot more memory (1gb on the g285, 1800mb on the g295), have more cores and higher memory bandwidth and have much better atomic operation support. the g285 has 240 cores, the g295 has to GPUs on board giving 480 cores. These are rather nice for development if you want to work on your main card in your pc and have a low budget. They are not targeted for hpc by NVidia by the way. tesla, this is basically almost the same as the g285 (240 cores) but has 4gb memory and no graphics output. You can get 4 of these in a dedicated rack mount (s1070). They are designed to run cuda, although unofficially the support opengl according to the guy in charge of them at NVidia. They are much more expensive than the g285 though. quadro, these are designed for business graphics for render farms, CAD and high dynamic range work. They have support for high dynamic range imaging and anti aliasing in hardware and are the only cards by nvidia that support gpu affinity for opengl (for glsl). There are unofficial hacks to achieve this if I'm not mistaken under linux with the g200 but no way to do it under windows. 
There is also a rack mount to connect four of these to one PC (quadro plex) Another thing to note is that the teslas and quadros are manufactures and supported by nvidia and designed for 24/7 deployment. For the geforce series, nvidia makes the GPU chip but not the cards and they get very angry and offended if you suggest putting one of these into a deployment system. > B) Check all the GPU hardware requirements in detail: motherboard, > PCIe version and slot, power supply capacity and connectors, etc. > See the various GPU models on NVidia site, and > the product specs from the specific vendor you choose. > > C) You need a free PCIe slot, most likely 16x, IIRR. > I couldn't find any cuda supported card that works on anything less than pci-e x 16. I found reference to one of the cheap cards (don't remember which one, I thing geforce 8400) that has a version for something less, but could actually find one. > D) Most GPU card models are quite thick, and take up its > own PCIe slot and cover the neighbor slot, which cannot be used. > Hence, if your motherboard is already crowded, make sure > everything will fit. The high end ones take two slots, one for the cooling, I thing that the g295 actually takes 3 slots if memory serves. > > For rackmount a chassis you may need at least 2U height. > On a tower PC chassis this shouldn't be a problem. > You may need some type of riser card if you plan to mount the GPU > parallel to the motherboard. > You also need appropriate cooling to take care of the ridiculous amount of heat. > E) If I remember right, you need PCIe version 1.5 (?) > or version 2 on your motherboard. > Most cards are pci-e 2 > F) You also need a power supply with enough extra power to feed > the GPU beast. > The GPU model specs should tell you how much power you need. > Most likely a 600W PS or larger, specially if you have a dual socket > server motherboard with lots of memory, disks, etc to feed. > They also take their own power input from the psu (two to three inputs) for additional power unlike the cards that take all their power from the pci-e slot so you need a supported psu. The psu also needs to be strong enough (around 200w per cards, the g285 says you need at least 550w for a single card) > G) Depending on the CUDA-ready GPU card, > the low end ones require 6-pin PCIe power connectors > from the power supply. > The higher end models require 8-pin power supply PCIe connectors. > You may find and buy molex-to-PCIe connector adapters also, > so that you can use the molex (i.e. ATA disk power connectors) > if your PS doesn't have the PCIe connectors. > However, you need to have enough power to feed the GPU and the system, > no matter what. > > *** > > 2. Before buying a lot of hardware, I would experiment first with a > single GPU on a standalone PC or server (that fits the HW requirements), > to check how much programming it takes, > and what performance boost you can extract from CUDA/GPU. > > CUDA requires quite a bit of logistics of > shipping data between memory, GPU, CPU, > etc. > It is perhaps more challenging to program than, say, > parallelizing a serial program with MPI, for instance. > Codes that are heavy in FFTs or linear algebra operations are probably > good candidates, as there are CUDA libraries for both. > There is a steep learning curve as you need to understand the hardware to get the most out of your code. I find it easier to code than mpi but I guess that is personal. You code is usually good for CUDA if you need to do the same thing a lot of times. 
If you code has a lot of logic (a lot of if clauses), a lot of atomic operations or complex data structures that it probably won't transfer well. Also complex (non-ordered) memory reading/writing paradigms can have a significant performance hit (reading is easier to cope with that writing) > At some point only 32-bit floating point arrays would take advantage of > CUDA/GPU, but not 64-bit arrays. > The latter would > require additional programming to change between 64/32 bit > when going to and coming back from the GPU. > Not sure if this still holds true, > newer GPU models may have efficient 64-bit capability, > but it is worth checking this out, including if performance for > 64-bit is as good as for 32-bit. > g200 and up (including tesla and quadro) have 64bit floating point support but it is much less efficient than 32bit. If memory serves it's a ratio of about 1:5. What nvidia calls cores are actually FPUs. these FPUs are 32bit and it combines these somehow to get 64bit arithmetic. ATI are better at 64bit arithmeric performance but ati streams are much more limited, are much harder to code and the documentation is VERY scarce. If you use glsl and double precision it's better to go with ATI. Maybe once OpenCL is mature enough it will also be an option but the g300 will probably be on the market by then and they may change the balance. geforce 8000 and 9000 only do single precission > 3. PGI compilers version 9 came out with "GPU directives/pragmas" > that are akin to the OpenMPI directives/pragmas, > and may simplify the use of CUDA/GPU. > At least before the promised OpenCL comes out. > Check the PGI web site. > > Note that this will give you intra-node parallelism exploring the GPU, > just like OpenMP does using threads on the CPU/cores. > I saw one of these, don't remember if it was PGI. No experience with it though. >From CUDA exprience though, there would be a lot of things that it would be hard for such a compiler to achieve. BTW, there is also a matlab toolbox called jacket from accelereyes that allows you to do cuda from matlab. The numbers are not as good as they advertize (truth in advertizing, they also provide an OpenGL visualization functions and for their code they use opengl and for the matlab version they use surf, from testing the difference is huge) > 4. CUDA + MPI may be quite a challenge to program. > > I hope this helps, > Gus Correa > > amjad ali wrote: > > Hello all, specially Gil Brandao > > > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > > > Which one is more "natural" and "practical" way? > > Does a program written for any one of the above will work fine on the > > other? or we have to re-program for the other? > > > > Regards. > > > > On Sat, Aug 29, 2009 at 5:48 PM, > > wrote: > > > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali > > wrote: > > > Hello All, > > > > > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than > > ONE GPUs > > > (e.g. 
Nvidia C1060) > > > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > > > 6) Cluster of PCs with each node having more than one CPUs > > and ONE GPU > > > > > > 7) Cluster of PCs with each node having ONE CPU and more > > than ONE GPUs > > > > > > 8) Cluster of PCs with each node having more than one CPUs > > and more > > > than ONE GPUs. > > > > > > > > > > > > Which of these are good/realistic/practical; which are not? Which > > are quite > > > ?natural? to use for CUDA based programs? > > > > > > > CUDA is kind of new technology, so I don't think there is a "natural > > use" yet, though I read that there people doing CUDA+MPI and there are > > papers on CPU+GPU algorithms. > > > > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally > > good for > > > some/all of these setups or we need to write different CUDA based > > programs > > > for each of these setups to get good efficiency? > > > > > > > There is no "one size fits all" answer to your question. If you never > > developed with CUDA, buy one GPU an try it. If it fits your problems, > > scale it with the approach that makes you more comfortable (but > > remember that scaling means: making bigger problems or having more > > users). If you want a rule of thumb: your code must be > > _truly_parallel_. If you are buying for someone else, remember that > > this is a niche. The hole thing is starting, I don't thing there isn't > > many people that needs much more 1 or 2 GPUs. > > > > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > > > > put it on hold until OpenCL takes of (in the real sense, not in > > "standards papers" sense), otherwise you will have to learn another > > technology that even fewer people knows. > > > > > > Gil Brandao > > > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From brs at admin.usf.edu Thu Sep 3 10:40:52 2009 From: brs at admin.usf.edu (Smith, Brian) Date: Thu, 3 Sep 2009 13:40:52 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> References: <4A9F42ED.5050600@scalableinformatics.com><4A9FBD66.1060608@scalableinformatics.com><4A9FDE85.5020008@ldeo.columbia.edu> <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> Message-ID: Gilad, Where are you finding cables for $30? The lowest I've been able to find 1M is in the $60 price range. I have a project going right now that would benefit greatly from $30 cables. 
-Brian -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Gilad Shainer Sent: Thursday, September 03, 2009 12:51 PM To: Rahul Nabar; Gus Correa Cc: Bewoulf Subject: RE: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes > > See these small SDR switches: > > > > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct= 13 > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 > > > > And SDR HCA card: > > > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) You can find copper cables from around $30, so the $300 will include the cable too > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. You can always run your app on available users system and see the performance boost that you will be able to get. For example, you can use the center (free of charge) - http://www.hpcadvisorycouncil.com/cluster_center.php > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? It is the same as buying blades. If you get the switches fully populated, than it will be cost effective. There is a 324 port switch, which should be a good option too. > > For a 300-node cluster you need to consider > > optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? It all depends on the speed. If you are using IB SDR or DDR, copper cables will be enough. For QDR you can use passive copper up 7-8 meters, and active up to 12m, before you need to move to fiber. > > I don't know about your computational chemistry codes, > > but for climate/oceans/atmosphere (and probably for CFD) > > IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! On the same web site I have listed above, there are some best practices with apps performance, You can check them out and see if some of them are more relevant. 
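Once the HCAs and cables arrive, the standard OFED diagnostics are an easy way to confirm that every port actually trained at the expected rate before you start benchmarking. A quick sketch (tool names are from the usual OFED/infiniband-diags packages; exact output varies by stack version):

ibstat           # per-HCA port state and rate: 10 = SDR, 20 = DDR, 40 = QDR
ibv_devinfo      # firmware level and port attributes
ibhosts          # HCAs the subnet manager can see
ibswitches       # switches the subnet manager can see
ibchecknet       # basic sweep for link/port errors across the fabric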
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gmkurtzer at gmail.com Thu Sep 3 13:16:48 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 3 Sep 2009 13:16:48 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> On Wed, Sep 2, 2009 at 11:18 PM, Mark Hahn wrote: >> That brings me to another important question. Any hints on speccing >> the head-node? > > I think you imply a single, central admin/master/head node. ?this is a very > bad idea. ?first, it's generally a bad idea to have users on a fileserver. > ?next, it's best to keep cluster-infrastructure > (monitoring, management, pxe, scheduling) on a dedicated admin machine. > for 300 compute nodes, it might be a good idea to provide more than one > login node (for editing, compilation, etc). To expand on Mark's comment... I would SPEC >=2 systems for head/masters and either spread the load of the required services (e.g. management, monitoring and other sysadmin tasks and put scheduling on the other) OR put all of the services on a single master and then run a shadow master for redundancy. I would not put users on either of these systems. If you were using Perceus..... I would either create an interactive VNFS capsule (include compilers, additional libs, etc..) or make a large more bloated compute VNFS capsule and use that on all of the nodes. In this scenario, all nodes could run stateless *and* diskful so if you need to change the number of interactive nodes you can do it with a simple command sequence: # perceus vnfs import /path/to/interactive.vnfs # perceus node set vnfs interactive n000[0-4] and/or # perceus vnfs import /path/to/compute.vnfs # perceus node set vnfs compute n0[004-299] Have your cake and eat it too. :) The file system needs to be built to handle the load of the apps. 300 nodes means you can go from the low end (Linux RAID and NFS) to a higher end NFS solution, or upper end of a parallel file system or maybe even one of each (NFS and parallel) as they solve some different requirements. -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From gmkurtzer at gmail.com Thu Sep 3 17:19:33 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 3 Sep 2009 17:19:33 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: <571f1a060909031719l65a152d3q1fb79c8ea97b9c3d@mail.gmail.com> On Thu, Sep 3, 2009 at 3:56 PM, Rahul Nabar wrote: > On Thu, Sep 3, 2009 at 3:16 PM, Greg Kurtzer wrote: > > Thanks for the comments Greg! Sure thing. Glad to offer what I can. > >> If you were using Perceus..... > > No. I've never used Perceus before and although it sounds interesting > this seems like a bad time to try something new! Unless it makes your job easier as you scale up. ;) Feel free to check out: http://www.linux-mag.com/id/6386 http://www.linux-mag.com/id/7239 > >> >> The file system needs to be built to handle the load of the apps. 
300 >> nodes means you can go from the low end (Linux RAID and NFS) to a >> higher end NFS solution, or upper end of a parallel file system or >> maybe even one of each (NFS and parallel) as they solve some different >> requirements. > > What exactly do you mean by a "parallel" file system? Something like > GPFS? That's IBM proprietory though isn't it? On the other hand NFS > seems pretty archaic. I've seen quite a few installations use Lustre. > I am planning to play with that. Something in the OpenSource world to > keep costs down. Yes, GPFS is IBM's commercial file system and Lustre is a free solution. Both are very complicated components to the cluster that will take a large investment to do properly (either initial purchase cost and the hidden cost of administration or just a lot of hidden costs). If cost is really an issue, *and* if the applications don't require a parallel file system then why not make your job easier with the use of a quality Network Attached Storage solution (NAS) and use NFS? In either case if you look around you can find people that may even have premade Perceus VNFS capsules (the equivalent of an installer or preconfigured disk image) for Lustre servers, clients, and various other system roles. Best of luck! Greg -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From sabujp at gmail.com Fri Sep 4 16:43:04 2009 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 4 Sep 2009 18:43:04 -0500 Subject: [Beowulf] where do I set PBSACLUSEGROUPLIST before starting pbs_server from torque 2.4.0b1? Message-ID: Hi, I can't seem to figure out how/where to set the environment variable PBSACLUSEGROUPLIST. This page: http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml says about acl_groups: NOTE: If the PBSACLUSEGROUPLIST variable is set in the pbs_server environment, acl_groups will be check against all groups of which the job user is a member. I tried exporting that variable with: export PBSACLUSEGROUPLIST export PBSACLUSEGROUPLIST=1 export PBSACLUSEGROUPLIST=true before starting pbs_server and pbs_sched but I still can't submit to queues where I belong to the secondary group for which acl_groups is set. Otherwise, I've verified that acl_groups is working because I can only submit to a queue where my primary group is the same as that set in acl_groups for that queue. Anyone have any ideas on how to get pbs_server to look at all the groups that I'm a member of, supposedly using the environment variable mentioned above? Thanks, Sabuj Pattanayek From lukasz at mbi.ucla.edu Fri Sep 4 13:11:28 2009 From: lukasz at mbi.ucla.edu (Lukasz Salwinski) Date: Fri, 04 Sep 2009 13:11:28 -0700 Subject: [Beowulf] HPC for Dummies In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> Message-ID: <4AA17470.6070501@mbi.ucla.edu> Douglas Eadline wrote: > It is not a real "book" (it is short), but some > people may be interested in: > > HPC For Dummies > > http://www.sun.com/x64/ebooks/hpc.jsp > > You have to register for a copy. > The author is some HPC hack. > Here's a note I've just dropped them: >Sun Microsystems wrote: >> >> Thank you for your interest in our HPC for Dummies ebook. If you >>have any questions, please contact us at HPC_Questions at sun.com. >> >> >> Sun Microsystems, Inc., 18 Network Circle, M/S: UMPK18-124, Attn: >> Global eMarketing, Menlo Park, CA 94025 USA >> Copyright 2009 Sun Microsystems, Inc. All rights reserved. 
> >I'm sorry but I'm not interested any more as there's no way to get >to the ebook from linux. wish i knew it before registering. lukasz -- ------------------------------------------------------------------------- Lukasz Salwinski PHONE: 310-825-1402 UCLA-DOE Institute for Genomics & Proteomics FAX: 310-206-3914 UCLA, Los Angeles EMAIL: lukasz at mbi.ucla.edu ------------------------------------------------------------------------- From mdublin at genomeweb.com Fri Sep 4 13:18:12 2009 From: mdublin at genomeweb.com (mdublin) Date: Fri, 4 Sep 2009 16:18:12 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> Message-ID: <7279B23D-7E1E-40AE-A2EF-3925DE49E1B0@genomeweb.com> I'm on OSX, installed the Adobe Digital Additions app, and I still can't figure out how to download it, thanks Sun for making me feel like a child..... On Sep 4, 2009, at 4:11 PM, Gus Correa wrote: > Douglas Eadline wrote: >> It is not a real "book" (it is short), but some >> people may be interested in: >> HPC For Dummies >> http://www.sun.com/x64/ebooks/hpc.jsp >> You have to register for a copy. >> The author is some HPC hack. > > I got interested, and I tried. > Unfortunately it requires Windows or Mac OS to be downloaded and read. > Weird requirement for an HPC audience. > Linux folks are out. > Why not a simple PDF file? > > Gus Correa > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Matthew Dublin Senior Writer Genome Technology 1-212-651-5638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From trainor at presciencetrust.org Fri Sep 4 13:59:14 2009 From: trainor at presciencetrust.org (Douglas J. Trainor) Date: Fri, 4 Sep 2009 20:59:14 +0000 (GMT) Subject: [Beowulf] HPC for Dummies Message-ID: <1048939568.16294.1252097954740.JavaMail.mail@webmail09> An HTML attachment was scrubbed... URL: From margretthompson81 at yahoo.com Fri Sep 4 17:17:31 2009 From: margretthompson81 at yahoo.com (margret thompson) Date: Fri, 4 Sep 2009 17:17:31 -0700 (PDT) Subject: [Beowulf] HPC for Dummies Message-ID: <466061.58715.qm@web44703.mail.sp1.yahoo.com> The actual PDF is at http://acs.libredigital.com/books/ede6ba36-5cb7-4b71-8568-4a1d34451ece.pdf but unfortunately it's encumbered with some sort of encryption.? Who will figure out the key first?? Too bad none of us have any massive supercomputers to bruteforce it...? oh.? Wait. Other hints: http://acs.libredigital.com/prodfulfillment/URLLink.acsm?action=enterorder&ordersource=John%20Wiley%20%26%20Sons%2C%20Inc%2E&orderid=8772F613%2DCFB0%2D6C7F%2DC28711A50A6FEBE7&resid=urn%3Auuid%3Aede6ba36%2D5cb7%2D4b71%2D8568%2D4a1d34451ece&gbauthdate=Fri%2C%204%20Sep%20200918%3A45%3A41&dateval=1252086341&gblver=4&auth=6b3702ae1908fad5014043d98a44b4aaae5b0c73 http://lto.libredigital.com/js/ADEBadgeLauncher.js http://lto.libredigital.com/?SUN_AMD_HighPerformanceComputingForDummies -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdublin at genomeweb.com Tue Sep 8 14:31:43 2009 From: mdublin at genomeweb.com (mdublin) Date: Tue, 8 Sep 2009 17:31:43 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA6CC09.3080909@ias.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA6CA08.6090200@ias.edu> <4AA6CC09.3080909@ias.edu> Message-ID: Think it was the Idiot's Guide to Dummies.... On Sep 8, 2009, at 5:26 PM, Prentice Bisbal wrote: > > > > Prentice Bisbal wrote: >> Douglas Eadline wrote: >>> It is not a real "book" (it is short), but some >>> people may be interested in: >>> >>> HPC For Dummies >>> >>> http://www.sun.com/x64/ebooks/hpc.jsp >>> >>> You have to register for a copy. >>> The author is some HPC hack. >>> >> >> Wasn't AMD giving this away for free at SC08? > > Nevermind. It was different. Must have been the virtualization for > dummies book. > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Matthew Dublin Senior Writer Genome Technology 1-212-651-5638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Wed Sep 9 10:40:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 9 Sep 2009 12:40:23 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? Message-ID: Our new cluster aims to have around 300 compute nodes. I was wondering what is the largest setup people have tested NFS with? Any tips or comments? There seems no way for me to say if it will scale well or not. I have been warned of performance hits but how bad will they be? Infiniband is touted as a solution but the economics don't work out. My question is this: Assume each of my compute nodes have gigabit ethernet AND I specify the switch such that it can handle full line capacity on all ports. Will there still be performance hits as I start adding compute nodes? Why? Or is it unrealistic to configure a switching setup with full line capacities on 300 ports? If not NFS then Lustre etc options do exist. But the more I read about those the more I am convinced that those open another big can of worms. Besides, if NFS would work I do not want to switch. -- Rahul From landman at scalableinformatics.com Wed Sep 9 11:52:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 09 Sep 2009 14:52:18 -0400 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA7F962.500@scalableinformatics.com> Rahul Nabar wrote: > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. You might also want to look at GlusterFS. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From coutinho at dcc.ufmg.br Wed Sep 9 12:12:04 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 9 Sep 2009 16:12:04 -0300 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? 
In-Reply-To: References: Message-ID: My two cents: 2009/9/9 Rahul Nabar > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Yes. > Why? Because final NFS server bandwidth will be the bandwidth of the most limited device, be it disk, network interface or switch. Even if you have a switch capable of fill line capacity for all 300 nodes, you must put a insanely fast interface in your NFS server and a giant pool of disks to have a decent bandwidth if all nodes access NFS at the same time. But depending on the way people run applications in your cluster, only a small set of nodes will access NFS at the same time and a Ethernet 10Gb with tens of disks will be enough. > Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > This will be good for MPI but it will not help much your NFS server problem. > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzaletnev at yandex.ru Tue Sep 8 19:53:26 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed, 09 Sep 2009 06:53:26 +0400 Subject: [Beowulf] NFS server for a small cluster Message-ID: <393061252464806@webmail90.yandex.ru> Hello, list I'm going to prepare NFS server for my small PS3-cluster. So, I've a question: if I use 4 SATAII ports for 250 GB disks in software RAID0, is the performance of CPU critical? Witch one should I use Core Duo/1 MB cache or Core2Duo/6 MB cache and how much of DDR2-800 RAM, 2 GB or 8 GB? I'm going to use it for transitional/ unsteady state CFD simulations. Thank you for any suggestions. -- Dmitry Zaletnev From pscadmin at avalon.umaryland.edu Wed Sep 9 12:23:30 2009 From: pscadmin at avalon.umaryland.edu (psc) Date: Wed, 09 Sep 2009 15:23:30 -0400 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <200909091900.n89J07U5031683@bluewest.scyld.com> References: <200909091900.n89J07U5031683@bluewest.scyld.com> Message-ID: <4AA800B2.7060607@avalon.umaryland.edu> I wonder what would be the sensible biggest cluster possible based on 1GB Ethernet network . And especially how would you connect those 1GB switches together -- now we have (on one of our four clusters) Two 48 ports gigabit switches connected together with 6 patch cables and I just ran out of ports for expansion and wonder where to go from here as we already have four clusters and it would be great to stop adding cluster and start expending them beyond number of outlets on the switch/s .... 
NFS and 1GB Ethernet works great for us and we want to stick with it , but we would love to find a way how to overcome the current "switch limitation". ... I heard that there are some "stackable switches" .. in any case -- any idea , suggestion will be appreciated. thanks!! psc > From: Rahul Nabar > Subject: [Beowulf] how large of an installation have people used NFS > with? would 300 mounts kill performance? > To: Beowulf Mailing List > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. > > From gmkurtzer at gmail.com Wed Sep 9 12:32:03 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed, 9 Sep 2009 12:32:03 -0700 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com> On Wed, Sep 9, 2009 at 10:40 AM, Rahul Nabar wrote: > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. NFS itself doesn't have any hard limits and I have seen clusters well over a thousand nodes using it. With that said, you also need to consider your application and user requirements and the budget including administration costs as you architect your resource. As an aside note, generally the more specialized or non-standard the implementation, the more pressure you will put on administration costs. Keep in mind that the requirements of the system and budget need to define the architecture of the system. NFS is a good choice and can be suitable for systems much larger then 300 nodes. *BUT* that would depend on what you are doing with the cluster, application IO requirements, usage patterns, user needs, reliability/uptime goals, etc... Hope that helps. ;) -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. 
- http://www.infiscale.com From Greg at keller.net Wed Sep 9 13:38:45 2009 From: Greg at keller.net (Greg Keller) Date: Wed, 9 Sep 2009 15:38:45 -0500 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <200909091900.n89J07U6031683@bluewest.scyld.com> References: <200909091900.n89J07U6031683@bluewest.scyld.com> Message-ID: > Date: Wed, 9 Sep 2009 12:40:23 -0500 > From: Rahul Nabar > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > "It all depends" -- Anonymous Cluster expert I routinely run NFS with 300+ nodes, but "it all depends" on the applications' IO profiles. For example, Lot's of nodes reading and writing different files in a generically staggered fashion, may not be a big deal. 300 nodes writing to the same file at the same time... ouch! If you buy generic enough hardware you can hedge your bet, and convert to Gluster or Luster or eventually pNFS if things get ugly. But not all NFS servers are created equal, and a solid purpose built appliance may handle loads a general purpose linux NFS server won't. > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > The bottleneck is more likely the File-server's Nic and/or it's Back- end storage performance. If the file-server is 1GbE attached then having a strong network won't help NFS all that much. 10GbE attached will keep up with a fair number of raided disks on the back-end. Load the NFS server up with a lot of RAM and you could keep a lot of nodes happy if they are reading a common set of files in parallel. Until you get to parallel FS options, it's hard to imagine the switching infrastructure being the bottleneck so long as it supports the 1 or 10GbE performance from the IO node. If you expect heavy MPI usage on the Ethernet side, then non-blocking and low latency issues become relevant, but for IO it only needs to accommodate the slowest link... the IO node. Hope this helps, Greg From jmdavis1 at vcu.edu Wed Sep 9 14:10:11 2009 From: jmdavis1 at vcu.edu (Mike Davis) Date: Wed, 09 Sep 2009 17:10:11 -0400 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <4AA819B3.3050503@vcu.edu> psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. 
> in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > > When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible, We use 144 and 160 port Gig e switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well. We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches. -- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University "Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton From coutinho at dcc.ufmg.br Wed Sep 9 14:50:53 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 9 Sep 2009 18:50:53 -0300 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: 2009/9/9 psc > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > Stackable switches are small switches 16 to 48 ports that have proprietary high bandwidth uplinks to connect switches of the same type. Typically these connections are pairs of 10Gbps (as they are full-duplex, sometimes vendors say that they are 20Gbps) cables that connect all switches in ring configuration. This solution is cheaper than a modular switch, but has limited bandwidth. > thanks!! > psc > > > From: Rahul Nabar > > Subject: [Beowulf] how large of an installation have people used NFS > > with? would 300 mounts kill performance? > > To: Beowulf Mailing List > > Message-ID: > > > > Content-Type: text/plain; charset=ISO-8859-1 > > > > Our new cluster aims to have around 300 compute nodes. I was wondering > > what is the largest setup people have tested NFS with? Any tips or > > comments? There seems no way for me to say if it will scale well or > > not. > > > > I have been warned of performance hits but how bad will they be? > > Infiniband is touted as a solution but the economics don't work out. > > My question is this: > > > > Assume each of my compute nodes have gigabit ethernet AND I specify > > the switch such that it can handle full line capacity on all ports. > > Will there still be performance hits as I start adding compute nodes? > > Why? 
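As a rough illustration of that limit (numbers assumed for a typical 48-port edge switch with two 10 Gb stacking links, not any particular product):

# worst case: all 48 edge ports talking to nodes on other switches in the stack
echo "scale=1; (48*1) / (2*10)" | bc    # => 2.4, i.e. roughly 2.4:1 oversubscription on the stack links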
Or is it unrealistic to configure a switching setup with full > > line capacities on 300 ports? > > > > If not NFS then Lustre etc options do exist. But the more I read about > > those the more I am convinced that those open another big can of > > worms. Besides, if NFS would work I do not want to switch. > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Wed Sep 9 15:24:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 9 Sep 2009 15:24:57 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <20090909222457.GB11329@bx9.net> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network. People build multi-thousand node clusters like that. In the HPC world, oil-and-gas and some other industrial applications don't need more than 1 gbit. Of couse these days they use 10 gbit to connect 1gbit switches. In the non-HPC world, companies with huge datacenters often have thousands to many thousands of nodes in a single layer-2 network with 1gbit to the nodes. That's limited only by the size of the mac addr tables in your switches. When you have to split your cluster into layer-3 chunks, the bandwidth between chunks sucks, but most clusters in the non-HPC world are limited to 2000-4000 nodes anyway by various software limitations. -- greg From skylar at cs.earlham.edu Wed Sep 9 16:15:35 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed, 09 Sep 2009 16:15:35 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <20090909222457.GB11329@bx9.net> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090909222457.GB11329@bx9.net> Message-ID: <4AA83717.6090207@cs.earlham.edu> Greg Lindahl wrote: > On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > > >> I wonder what would be the sensible biggest cluster possible based on >> 1GB Ethernet network. >> > > People build multi-thousand node clusters like that. In the HPC world, > oil-and-gas and some other industrial applications don't need more > than 1 gbit. Of couse these days they use 10 gbit to connect 1gbit > switches. > > In the non-HPC world, companies with huge datacenters often have > thousands to many thousands of nodes in a single layer-2 network with > 1gbit to the nodes. That's limited only by the size of the mac addr > tables in your switches. When you have to split your cluster into > layer-3 chunks, the bandwidth between chunks sucks, but most clusters > in the non-HPC world are limited to 2000-4000 nodes anyway by various > software limitations. > > How do people with large layer-2 networks deal with broadcast storms? I've been in situations even on small networks where a node goes haywire and starts spewing broadcast traffic and slows everything in its broadcast domain down. 
The probability and impact of that goes up with larger networks, and it seems like even the baseline chatter from ARP could be significant. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From Jaime at servepath.com Wed Sep 9 15:12:14 2009 From: Jaime at servepath.com (Jaime Requinton) Date: Wed, 9 Sep 2009 15:12:14 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA819B3.3050503@vcu.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <4AA819B3.3050503@vcu.edu> Message-ID: Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents... -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mike Davis Sent: Wednesday, September 09, 2009 2:10 PM To: psc Cc: beowulf at beowulf.org Subject: Re: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > > When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible, We use 144 and 160 port Gig e switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well. We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches. -- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University "Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Sep 9 17:25:25 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 9 Sep 2009 17:25:25 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? 
In-Reply-To: <4AA83717.6090207@cs.earlham.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090909222457.GB11329@bx9.net> <4AA83717.6090207@cs.earlham.edu> Message-ID: <20090910002525.GB3121@bx9.net> On Wed, Sep 09, 2009 at 04:15:35PM -0700, Skylar Thompson wrote: > How do people with large layer-2 networks deal with broadcast storms? You don't cause them, and you monitor for them. I haven't seen a wild transmitter in a while. Monitoring means you'll at least know it's there; quickly dignosing which node is the bad one can be challenging. More common is pilot error. Trying to talk to nodes which are down, for example, causes an arp every time. So don't do that too often. Modern L3 switches can't discard broadcast packets very quickly, which makes the issue much more challenging than you might think. -- greg From hahn at mcmaster.ca Wed Sep 9 20:59:22 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 9 Sep 2009 23:59:22 -0400 (EDT) Subject: [Beowulf] NFS server for a small cluster In-Reply-To: <393061252464806@webmail90.yandex.ru> References: <393061252464806@webmail90.yandex.ru> Message-ID: > I'm going to prepare NFS server for my small PS3-cluster. So, I've a > question: if I use 4 SATAII ports for 250 GB disks in software RAID0, is > the performance of CPU critical? raid0 copies data around, but is not particularly cpu-intensive (raid5 is worse; r6 worser). but the network is just gigabit, right? which means that the bandwidth is only about 1 disk work anyway (a modern disk sustains > 130 MB/s on the outer half...) > Witch one should I use Core Duo/1 MB cache > or Core2Duo/6 MB cache and how much of DDR2-800 RAM, 2 GB or 8 GB? extra ram doesn't help a fileserver much except when you can read-cache effectively (either file contents or metadata.) I wouldn't sweat cache size, and probably not ram size either (unless your working set really is < 8GB.) From skylar at cs.earlham.edu Wed Sep 9 21:10:23 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed, 09 Sep 2009 21:10:23 -0700 Subject: [Beowulf] NFS server for a small cluster In-Reply-To: References: <393061252464806@webmail90.yandex.ru> Message-ID: <4AA87C2F.9090109@cs.earlham.edu> Mark Hahn wrote: > extra ram doesn't help a fileserver much except when you can read-cache > effectively (either file contents or metadata.) I wouldn't sweat cache > size, and probably not ram size either (unless your working set really > is < 8GB.) > _______________________________________________ One thing extra RAM can buy a file server is aggregating writes into contiguous blocks. I'm not sure what the sweet spot as far as RAM size is, and I suspect strongly that the effect is dependent on the number and size of your files, so your YMMV. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature URL: From hahn at mcmaster.ca Wed Sep 9 22:11:38 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 10 Sep 2009 01:11:38 -0400 (EDT) Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or well, 300 is no problem at all. 
though if you're talking to a single Gb-connected server, you can't home for much BW per node... > comments? There seems no way for me to say if it will scale well or > not. it's not to hard to figure out some order-of-magnitude bandwidth requirements. how many nodes need access to a single namespace at once? do jobs drop checkpoints of a known size periodically? faster/more ports on a single NFS server gets you fairly far (hundreds of MB/s), but you can also agregate across multiple NFS servers (if you don't need all the IO in a single directory...) > I have been warned of performance hits but how bad will they be? NFS is fine at hundreds of nodes. nodes can generate a fairly high load of, for instance, getattr calls, but that can be mitigated some with an acregmin setting. > Infiniband is touted as a solution but the economics don't work out. depends on how much bandwidth you need... > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. but why? your fileservers won't support saturating all nodes links at once, so why a full-bandwidth fabric? the fabric backbone only needs to match the capacity of the storage (I'd guess 10G would be reasonable, unless you really ramp up the number of fat fileservers.) or do you mean the fabric is full-bandwidth to optimally support MPI? > If not NFS then Lustre etc options do exist. But the more I read about yes - I wouldn't resort to Lustre until it was clear that NFS wouldn't do. Lustre does a great job of scaling content bandwidth and capacity all within a single namespace. but NFS, even several instances, is indeed a lot simpler... From hearnsj at googlemail.com Wed Sep 9 22:53:50 2009 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 10 Sep 2009 06:53:50 +0100 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <9f8092cc0909092253u2cee121fie3809899d06650ba@mail.gmail.com> As the guys say, it depends. NFS can be very successful on that size of cluster, if attention is paid to the performance of the NFS server and the bandwidth you have to it. SGI Enhanced NFS as a for instance. Please let me add to your list though - contact your Panasas salesman. I really think you could use a supported, commercial solution here. From hearnsj at googlemail.com Wed Sep 9 22:54:58 2009 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 10 Sep 2009 06:54:58 +0100 Subject: [Beowulf] Nehalem EX Message-ID: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> Does anyone on the list have a feeling for how close Nehalem EX systems are? A bit of heat and light generated earlier int he year, but nothing being unveiled yet. From bill at cse.ucdavis.edu Wed Sep 9 23:24:13 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 09 Sep 2009 23:24:13 -0700 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA89B8D.2020403@cse.ucdavis.edu> Mark Hahn wrote: >> Our new cluster aims to have around 300 compute nodes. I was wondering >> what is the largest setup people have tested NFS with? Any tips or > > well, 300 is no problem at all. though if you're talking to a single > Gb-connected server, you can't home for much BW per node... True. 
Then again in the days of 20+ GB/sec memory systems, 8GB/sec pci-e busses, fast raid controllers and quad ethernet pci-e/motherboards there's not much reason to connect a single GigE to a file server these days. Hell even many of the cheap 1U systems come with quad ethernet. Sun ships in on many (all?) of their systems. I was just looking at a single socket xeon lynnfield board with 4 GigE from supermicro. I suspect the supermicro quad port motherboard costs less than an intel quad port pci-e card ($422 at newegg). If you have 4 48 port GigE switches connected via the interconnect cable it would make sense to give each switch a subnet and direct connect one GigE per switch. Of course file servers themselves are cheap, at anything much more than 16 disks I usually build multiple servers. In the comparisons I've made so far 3 16 disk servers compare rather well against a single 48 disk server. Usually cheaper as well, granted it does take 9U instead of 3-4U. Granted without a fancy parallel file system you can't load balance file servers or network links. Having more file servers and multiple uplink per server can certainly substantially improve your average throughput. With that said 3 16 disk servers make a good building block for a parallel file system of your choice if you change your mind later on. Often we have different groups of users contributing to a cluster and politically it's nice to separate them off on their own server that by definition get 100% of the disk they paid for, and their use of their uplink or disk doesn't effect the other file file servers. This gives said group options that they wouldn't have with a larger shared server. >> comments? There seems no way for me to say if it will scale well or >> not. > > it's not to hard to figure out some order-of-magnitude bandwidth > requirements. how many nodes need access to a single namespace at > once? do jobs drop checkpoints of a known size periodically? > faster/more ports on a single NFS server gets you fairly far (hundreds > of MB/s), but you can also agregate across multiple NFS > servers (if you don't need all the IO in a single directory...) Indeed, this kind of thing works quite well for us with 180-230 node clusters. >> Assume each of my compute nodes have gigabit ethernet AND I specify >> the switch such that it can handle full line capacity on all ports. > > but why? your fileservers won't support saturating all nodes links at > once, so why a full-bandwidth fabric? the fabric backbone only needs to Agreed, it's a waste unless MPI or related needs it. > match the capacity of the storage (I'd guess 10G would be reasonable, I've been looking at 10G for NFS/file servers (we already use sdr and ddr for MPI), but so far the cost seems to favor not putting too many disks in a single box and using more than one uplink. So instead of 1 48 disk file server with a 10G uplink we end up with 3 16 disk servers with 4 GigE uplinks each. It also avoids each individual file server not being as mission critical. I'd be a bit nervous if an expensive cluster depended on a single piece equipment. My theory goes if you have just one you spent too much on that piece of equipment. That way we leave the switch <-> switch connections entirely for MPI and not for trying to spread the single 10G connection for all I/O across 4 switches. > unless you really ramp up the number of fat fileservers.) > or do you mean the fabric is full-bandwidth to optimally support MPI? Hopefully. >> If not NFS then Lustre etc options do exist. 
But the more I read about > > yes - I wouldn't resort to Lustre until it was clear that NFS wouldn't do. > Lustre does a great job of scaling content bandwidth and capacity all > within a single namespace. but NFS, even several instances, is indeed > a lot simpler... Agreed. NFS works well in the 200 node range if you aren't too I/O intensive. Sometimes it's a bit more complicated because you end up staging to a local disk for better performance. But it's stable and keeps our users happy. I'm watching the parallel file system space closely and would certainly design around it if we ended up with a significantly larger (or more I/O intensive) cluster. From kilian.cavalotti.work at gmail.com Thu Sep 10 00:25:09 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 10 Sep 2009 09:25:09 +0200 Subject: [Beowulf] Nehalem EX In-Reply-To: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> References: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> Message-ID: Hi John, On Thu, Sep 10, 2009 at 7:54 AM, John Hearns wrote: > Does anyone on the list have a feeling for how close Nehalem EX systems are? > A bit of heat and light generated earlier int he year, but nothing > being unveiled yet. The original release date was somewhere in 2nd half of 2009, but it seems to have been pushed a little bit since, to the beginning of 2010. Nothing official, though. Cheers, -- Kilian From henning.fehrmann at aei.mpg.de Thu Sep 10 00:39:57 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Thu, 10 Sep 2009 09:39:57 +0200 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <20090910073957.GA8487@gretchen.aei.mpg.de> Hi On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . Hmmm, may I cheat and use a 10Gb core switch? If you setup a cluster with few thousand nodes you have to ask yourself whether this network should be non-blocking or not. For a non blocking network you need the right core-switch technology. Unfortunately, there are not many vendors out there which provide non-blocking Ethernet based core switches but I am aware of at least two. One provides or will provide 144 10Gb Ethernet ports. Another one sells switches with more than 1000 1 GB ports. You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. It might be also possible to cross connect two of these core-switches with the help of some smaller switches so that one ends up with 288 10Gb ports and, in principle, one might connect 2880 nodes in a non-blocking way, but we did not have the possibility to test it successfully yet. One of problems is that the internal hash table can not store that many mac addresses. Anyway, one probably needs to change the mac addresses of the nodes to avoid an overflow of the hash tables. An overflow might cause arp storms. Once this works one runs into some smaller problems. One of them is the arp cache of the nodes. It should be adjusted to hold as many mac addresses as you have nodes in the cluster. 
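On Linux that adjustment is a couple of sysctls on the neighbour (ARP) table; a sketch with illustrative values for a few thousand hosts (pick thresholds comfortably above your node count):

# /etc/sysctl.conf on every node and file server, then apply with: sysctl -p
net.ipv4.neigh.default.gc_thresh1 = 4096     # below this many entries, no garbage collection at all
net.ipv4.neigh.default.gc_thresh2 = 8192     # soft limit
net.ipv4.neigh.default.gc_thresh3 = 16384    # hard limit on cached neighbour entries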
> And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". With NFS you can nicely test the setup. Use one NFS server and let all nodes write different files into it and look what happens. Cheers, Henning From rpnabar at gmail.com Thu Sep 10 08:19:50 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 10 Sep 2009 10:19:50 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <4AA7F962.500@scalableinformatics.com> References: <4AA7F962.500@scalableinformatics.com> Message-ID: On Wed, Sep 9, 2009 at 1:52 PM, Joe Landman wrote: > Rahul Nabar wrote: > >> If not NFS then Lustre etc options do exist. But the more I read about >> those the more I am convinced that those open another big can of >> worms. Besides, if NFS would work I do not want to switch. > > You might also want to look at GlusterFS. Thanks Joe! From all the reports and advice it seems I'll start with NFS and a good muscular storage device and then if NFS fails under load move to glusterfc, lustre etc. Per se there seems no way of knowing if NFS will keep up or not a priori. -- Rahul From rpnabar at gmail.com Thu Sep 10 08:32:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 10 Sep 2009 10:32:00 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: On Wed, Sep 9, 2009 at 2:12 PM, Bruno Coutinho wrote: > Because final NFS server bandwidth will be the bandwidth of the most limited > device, be it disk, network interface or switch. > Even if you have a switch capable of fill line capacity for all 300 nodes, > you must put a insanely fast interface in your NFS server and a giant pool > of disks to have a decent bandwidth if all nodes access NFS at the same > time. I'm thinking of having multiple 10GigE uplinks between the switch and the NFS server. The actual storage is planned to reside on a box of SAS disks. Approx 15 disks. THe NFS server is planned with at least two RAID cards with multiple SAS connections to the box. But that's just my planning. The question is do people have numbers. What I/O throughputs are your NFS devices giving? I want to get a feel for what my I/O performance envelope should be like. What kind of I/O gurrantees are available? Any vendors around want to comment? On the other hand just multiplying NFS clients by their peak bandwidth (300 x 1 GB) is an overkill. THat is a very unlikely situation. What are typical workloads like? Given x NFS mounts in a computational environment with a y GB uplink each what's the factor on the net loading of the central storage? Any back of the envelope numbers? > But depending on the way people run applications in your cluster, only a > small set of nodes will access NFS at the same time and a Ethernet 10Gb with > tens of disks will be enough. strace profiling shows that app1 has very little NFS I/O. App2 has about 10% runtime devoted to NFS I/O. Multiple seeks only. 
More reads than writes. (All this thanks to Jeff Layton's excellent strace analyzer and profiling help.)

-- Rahul

From rpnabar at gmail.com Thu Sep 10 08:36:25 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 10:36:25 -0500
Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com>
References: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com>
Message-ID:

On Wed, Sep 9, 2009 at 2:32 PM, Greg Kurtzer wrote: > NFS itself doesn't have any hard limits and I have seen clusters well > over a thousand nodes using it.

Thanks Greg! That is very reassuring to know! :) I myself had an installation with 256 NFS mounts but these were ancient clusters which were essentially "groups of single cpu PCs".

The "well over a 1000 node NFS clusters" that Greg refers to: any masters of such installations around on this list? If so I'd give an arm and a leg and more to be in touch and grab your tips and comments. Whenever I mention "300 nodes", "Gigabit Ethernet" and NFS in the same breath people look at me as if I was a madman. :)

> As an aside note, generally the more specialized or non-standard the > implementation, the more pressure you will put on administration > costs.

Exactly. Hence I want NFS to keep things simple, ergo cheap.

> Keep in mind that the requirements of the system and budget need to > define the architecture of the system. NFS is a good choice and can be > suitable for systems much larger then 300 nodes. *BUT* that would > depend on what you are doing with the cluster, application IO > requirements, usage patterns, user needs, reliability/uptime goals, > etc...

I see too many invocations of the "it depends" rule of HPC everywhere I go! :)

-- Rahul

From rpnabar at gmail.com Thu Sep 10 08:44:41 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 10:44:41 -0500
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID:

On Wed, Sep 9, 2009 at 3:38 PM, Greg Keller wrote: >> > "It all depends" -- Anonymous Cluster expert

Thanks Greg. And I hate that anonymous expert. He's the bane of my current existence. I even get nightmares with his ghastly face in them. :)

> I routinely run NFS with 300+ nodes, but "it all depends" on the > applications' IO profiles.

50% of projected runtime is with an application with negligible reads and writes (VASP). The other 50% goes to an app (DACAPO) which strace shows to be spending 10% of its runtime on I/O. Mostly seeks. More reads than writes. Multiple small reads and writes. All cores doing I/O, not a central master core.

>For example, Lot's of nodes reading and writing > different files in a generically staggered fashion,

How do you enforce the staggering? Do people write staggered I/O codes themselves? Or can one alleviate this problem by scheduler settings?

> Lustre or eventually pNFS if things get ugly. But not all NFS servers are > created equal, and a solid purpose built appliance may handle loads a > general purpose linux NFS server won't.

Disk array connected to generic Linux server? Or standalone Fileserver? Recommendations?

What exactly does a "solid purpose built appliance" offer that a Generic Linux server (well configured) connected to an array of disks does not offer?
> The bottleneck is more likely the File-server's Nic and/or it's Back-end > storage performance. If the file-server is 1GbE attached then having a > strong network won't help NFS all that much. 10GbE attached will keep up > with a fair number of raided disks on the back-end. Load the NFS server up > with a lot of RAM and you could keep a lot of nodes happy if they are > reading a common set of files in parallel.

Yup; I'm going for at least 24 GB RAM and twin 10 GigE cards connecting the file server to the switch.

-- Rahul

From landman at scalableinformatics.com Thu Sep 10 09:18:00 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 10 Sep 2009 12:18:00 -0400
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID: <4AA926B8.6060702@scalableinformatics.com>

Rahul Nabar wrote: >> Lustre or eventually pNFS if things get ugly. But not all NFS servers are >> created equal, and a solid purpose built appliance may handle loads a >> general purpose linux NFS server won't. > > Disk array connected to generic Linux server? Or standalone > Fileserver? Recommendations?

At least one company on this list sells some nice fast storage boxen. I am biased of course, as I work there ...

> What exactly does a "solid purpose built appliance" offer that a > Generic Linux server (well configured) connected to an array of disks > does not offer?

"It depends". Your off the shelf Linux servers aren't very well designed for high performance file service. You would either need to go to a special purpose-built server, or the pure purpose-built appliance boxen. The latter often have some additional features you may or may not find useful, at a price you may or may not be willing to pay. The former, depending upon whom you speak with, will provide excellent performance at reasonable prices for your use case.

>> The bottleneck is more likely the File-server's Nic and/or it's Back-end >> storage performance. If the file-server is 1GbE attached then having a >> strong network won't help NFS all that much. 10GbE attached will keep up >> with a fair number of raided disks on the back-end. Load the NFS server up >> with a lot of RAM and you could keep a lot of nodes happy if they are >> reading a common set of files in parallel. > > Yup; I'm going for at least 24 GB RAM and twin 10 GigE cards > connecting the file server to the switch.

FWIW: I didn't post it to this list when we did this, but we had a single client and server show 1 GB/s (954 MB/s really, I rounded up) over a single single-mode fibre running NFS.

"Who says you can't do Gigabyte per second NFS? I keep hearing this. It's not true though. See below.

NFS client: Scalable Informatics Delta-V (ΔV) 4 unit
NFS server: Scalable Informatics JackRabbit 4 unit.
(you can buy these units today from Scalable Informatics and its partners)
10GbE: single XFP fibre between two 10GbE NICs.

This is NOT a clustered NFS result.

root at dv4:~# mount | grep data2
10.1.3.1:/data on /data2 type nfs (rw,intr,rsize=262144,wsize=262144,tcp,addr=10.1.3.1)
root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d -v
N=32 gigabytes will be written in total each thread will output 8.000 gigabytes page size ... 4096 bytes number of elements per buffer ... 2097152 number of buffers per file ...
512 Thread=3: time = 33.665s IO bandwidth = 243.337 MB/s Thread=2: time = 33.910s IO bandwidth = 241.580 MB/s Thread=1: time = 34.262s IO bandwidth = 239.101 MB/s Thread=0: time = 34.244s IO bandwidth = 239.226 MB/s Naive linear bandwidth summation = 963.244 MB/s More precise calculation of Bandwidth = 956.404 MB/s " The machine running the code has 8GB of RAM, so writing 32 GB is far outside of its cache. The remote system, (the 10.1.3.1 unit) has a native local disk performance of about 1.6/2.0 GB/s read/write. So yes, with the right system, you can get a nice bit of performance out of it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Thu Sep 10 09:32:22 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 10 Sep 2009 12:32:22 -0400 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA92A16.3080401@scalableinformatics.com> Rahul Nabar wrote: > I'm thinking of having multiple 10GigE uplinks between the switch and > the NFS server. The actual storage is planned to reside on a box of > SAS disks. Approx 15 disks. THe NFS server is planned with at least > two RAID cards with multiple SAS connections to the box. ugh ... Why are you designing it ahead of time? Why not take your requirements and needs and use that to dictate the design? > But that's just my planning. The question is do people have numbers. > What I/O throughputs are your NFS devices giving? I want to get a Depending upon workload, you can get performance ranging from 100MB/s through GB+/s. > feel for what my I/O performance envelope should be like. What kind of > I/O gurrantees are available? Any vendors around want to comment? You want a guarantee of I/O performance? For an arbitrary I/O pattern and load? So if you suddenly start random seeking with 4kB reads, you still want to hit 1+GB/s with these 4kB random seek and reads? Not sure if anyone would be willing to guarantee a particular rate for any workload. We have found well known benchmark codes (bonnie++ 1.0x and some of 1.9x) doing not so good I/O (long OS based pauses) where other codes seem fine. We use our io-bm code, fio, and a few others to bang on our systems. fio lets us model per unit workloads fairly nicely, io-bm lets us create a system/cluster-wide I/O hammer. > On the other hand just multiplying NFS clients by their peak bandwidth > (300 x 1 GB) is an overkill. THat is a very unlikely situation. What Each 1Gb interface can move about 120MB/s best case. So 300x 120MB/s => 3.6E+4 MB/s . This is likely to be overkill, as you report your highest IO utilization is about 10% of CPU (need to get what that translates to in MB/s, I'd suggest installing iftop on that machine and measuring when it is doing its 10% time in IO). > are typical workloads like? Given x NFS mounts in a computational > environment with a y GB uplink each what's the factor on the net > loading of the central storage? Any back of the envelope numbers? In the distant past, we used 8 nodes per GbE port for a port on the NFS server. This allowed us to serve up to 32 nodes with 4GbE ports, and the NFS servers weren't badly loaded. This ratio is a function of utilization of the links, the I/O duty cycle, etc. 
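If you want a rough feel for your own envelope before committing to hardware, fio is an easy place to start. A trivial streaming-write job file looks something like this (the path and sizes are placeholders; make the total size several times the client's RAM so you aren't just measuring cache):

[nfs-stream-write]
directory=/mnt/nfs/test
rw=write
bs=1m
size=8g
numjobs=4
ioengine=psync
group_reporting

Run it with "fio jobfile", then flip rw=write to rw=randread and shrink bs to 4k to see the other end of the spectrum. Watching iftop on the server while that runs tells you what the network side is doing at the same time.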
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615

From rpnabar at gmail.com Thu Sep 10 09:56:02 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 11:56:02 -0500
Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To: <4AA92A16.3080401@scalableinformatics.com>
References: <4AA92A16.3080401@scalableinformatics.com>
Message-ID:

On Thu, Sep 10, 2009 at 11:32 AM, Joe Landman wrote: > Rahul Nabar wrote: > >> I'm thinking of having multiple 10GigE uplinks between the switch and >> the NFS server. The actual storage is planned to reside on a box of >> SAS disks. Approx 15 disks. THe NFS server is planned with at least >> two RAID cards with multiple SAS connections to the box. > > ugh ... Why are you designing it ahead of time? Why not take your > requirements and needs and use that to dictate the design?

Oh no! I wasn't meaning that I was designing this. Sorry. I gave my vendor the specs, and these were the options that came up. He means to design each component to balance loads. But realistically, for 300 compute-nodes a config close to what I wrote is to be expected, he said. Just giving an idea of where my current expectations are and where we will probably end up.

-- Rahul

From kus at free.net Thu Sep 10 10:52:08 2009
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 10 Sep 2009 21:52:08 +0400
Subject: [Beowulf] Re:recommendations for a good ethernet switch for connecting ~300 compute nodes
Message-ID:

Many quantum chemical programs like Gamess-US or Gaussian perform well with local I/O to local HDDs, in which case you'll have a small NFS load even on such a large cluster.

Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow

From Greg at keller.net Thu Sep 10 11:13:02 2009
From: Greg at keller.net (Greg Keller)
Date: Thu, 10 Sep 2009 13:13:02 -0500
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID: <38C4F654-C13D-41BC-B879-C94C3892E199@keller.net>

On Sep 10, 2009, at 10:44 AM, Rahul Nabar wrote: > On Wed, Sep 9, 2009 at 3:38 PM, Greg Keller wrote: >> For example, Lot's of nodes reading and writing >> different files in a generically staggered fashion, > > How do you enforce the staggering? Do people write staggered I/O codes > themselves? Or can one alleviate this problem by scheduler settings?

Although there's probably a way to enforce it at the app level, or scheduler, all of that would require specific knowledge of what jobs (and nodes) are accessing what files, how, at what time. I was thinking that if it's largely embarrassingly parallel jobs that start/stop independently and have somewhat randomized IO, then there is some natural staggering. If the app starts on all nodes simultaneously and then they all start reading/writing the same files nearly simultaneously, then staggering is probably impossible and a parallel FS is worth investigating.

> >> Lustre or eventually pNFS if things get ugly. But not all NFS >> servers are >> created equal, and a solid purpose built appliance may handle loads a >> general purpose linux NFS server won't. > > Disk array connected to generic Linux server?
Or standalone > Fileserver? Recommendations? > > What exactly does a "solid purpose built appliance" offer that a > Generic Linux server (well configured) connected to an array of disks > does not offer?

Joe's post is spot on here. Don't let legend and lore scare you off; NFS can do great things on current generic and special purpose servers with the right config and software. There's nothing in your configuration and usage summary that screams NFS killer to me. If you use generic or special purpose servers, you can repurpose them as part of a parallel FS if you need to.

Purpose built *appliances* generally give you:
Simple setup and admin GUI
Replication and other fancy features HPCC doesn't normally care about
Zero flexibility if you change course and head towards a parallel FS
A singular support channel to complain to if things go badly (YMMV)

None of those matter to me more than the money they cost, so I buy standard servers and run standard linux NFS on internal raid controllers with no HA, and have occasional crashes and issues I can't resolve cleanly. We are perpetually looking for a "next step" to get better support/stability, but it's good enough for our 300 and 600 node systems at the moment.

> > -- > Rahul

Cheers! Greg

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From Jaime at servepath.com Wed Sep 9 19:01:00 2009
From: Jaime at servepath.com (Jaime Requinton)
Date: Wed, 9 Sep 2009 19:01:00 -0700
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?
In-Reply-To:
References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <4AA819B3.3050503@vcu.edu>
Message-ID:

Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents... Forgot to paste the link: http://www.bestbuy.com/site/olspage.jsp?skuId=8891915&type=product&id=1212192931527&ref=06&loc=01&ci_src=14110944&ci_sku=8891915

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jaime Requinton
Sent: Wednesday, September 09, 2009 3:12 PM
To: Mike Davis; psc
Cc: beowulf at beowulf.org
Subject: RE: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?

Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents...

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mike Davis
Sent: Wednesday, September 09, 2009 2:10 PM
To: psc
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?

psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ...
> I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > >

When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible. We use 144 and 160 port Gig-E switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well.

We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we add additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches.

-- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University

"Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From john.hearns at mclaren.com Thu Sep 10 01:57:35 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Thu, 10 Sep 2009 09:57:35 +0100
Subject: [Beowulf] Forget station wagons loaded with tapes
Message-ID: <68A57CCFD4005646957BD2D18E60667B0D17AA6E@milexchmb1.mil.tagmclarengroup.com>

Forget the old chestnut of a station wagon loaded with tapes. In the 21st Century, solid state drives are king. Plus carrier pigeons. http://www.henriska.com/blog/?p=615

The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.

From pscadmin at avalon.umaryland.edu Thu Sep 10 05:28:55 2009
From: pscadmin at avalon.umaryland.edu (psc)
Date: Thu, 10 Sep 2009 08:28:55 -0400
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?
In-Reply-To: <20090910073957.GA8487@gretchen.aei.mpg.de>
References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090910073957.GA8487@gretchen.aei.mpg.de>
Message-ID: <4AA8F107.7010805@avalon.umaryland.edu>

Thank you all for the answers. Would you guys please share with me some good brands of those 200+ port 1GB Ethernet switches?

I think I'll leave our current clusters alone, but the new cluster I will design for about 500 to 1000 nodes --- I don't think that we will go much above that, since for big jobs our scientists use outside resources. We do all our calculations and analysis on the nodes and only send the final product to the frontend. Also, we don't run jobs across the nodes, so I don't need to get too creative with the network, besides being sure that I can expand the cluster without having the switches as a limitation (our current situation).

Thank you again!
Henning Fehrmann wrote: > Hi > > On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > >> I wonder what would be the sensible biggest cluster possible based on >> 1GB Ethernet network . >> > > Hmmm, may I cheat and use a 10Gb core switch? > > If you setup a cluster with few thousand nodes you have to ask yourself > whether this network should be non-blocking or not. > > For a non blocking network you need the right core-switch technology. > Unfortunately, there are not many vendors out there which provide > non-blocking Ethernet based core switches but I am aware of at least > two. One provides or will provide 144 10Gb Ethernet ports. Another one > sells switches with more than 1000 1 GB ports. > You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If > you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. > > It might be also possible to cross connect two of these core-switches > with the help of some smaller switches so that one ends up with 288 > 10Gb ports and, in principle, one might connect 2880 nodes in a > non-blocking way, but we did not have the possibility to test it > successfully yet. One of problems is that the internal hash table can > not store that many mac addresses. Anyway, one probably needs to change > the mac addresses of the nodes to avoid an overflow of the hash tables. > An overflow might cause arp storms. > > Once this works one runs into some smaller problems. One of them is the arp > cache of the nodes. It should be adjusted to hold as many mac addresses > as you have nodes in the cluster. > > > >> And especially how would you connect those 1GB >> switches together -- now we have (on one of our four clusters) Two 48 >> ports gigabit switches connected together with 6 patch cables and I just >> ran out of ports for expansion and wonder where to go from here as we >> already have four clusters and it would be great to stop adding cluster >> and start expending them beyond number of outlets on the switch/s .... >> NFS and 1GB Ethernet works great for us and we want to stick with it , >> but we would love to find a way how to overcome the current "switch >> limitation". >> > > With NFS you can nicely test the setup. Use one NFS server and let all > nodes write different files into it and look what happens. > > Cheers, > Henning > From orion at cora.nwra.com Thu Sep 10 15:44:32 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Thu, 10 Sep 2009 16:44:32 -0600 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo In-Reply-To: <20090904081722.GN4508@leitl.org> References: <20090904081722.GN4508@leitl.org> Message-ID: <4AA98150.3050505@cora.nwra.com> On 09/04/2009 02:17 AM, Eugen Leitl wrote: > > http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html > > Some perspective to this DIY storage server mentioned at Storagemojo > > Thursday, September 3. 2009 > > I've received yesterday some mails/tweets with hints to a "Thumper for poor" > DIY chassis. Those mails asked me for an opinion towards this piece of > hardware and if it's a competition to our X4500/X4540. I'm waiting for the day I can buy a X4540 without disks. Or maybe it will still be too expensive... 
-- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From eugen at leitl.org Fri Sep 11 02:30:08 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 11 Sep 2009 11:30:08 +0200 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo In-Reply-To: <4AA98150.3050505@cora.nwra.com> References: <20090904081722.GN4508@leitl.org> <4AA98150.3050505@cora.nwra.com> Message-ID: <20090911093008.GU9828@leitl.org> On Thu, Sep 10, 2009 at 04:44:32PM -0600, Orion Poplawski wrote: > I'm waiting for the day I can buy a X4540 without disks. Or maybe it > > will still be too expensive... I've also become sufficiently annoyed at Sun for their unwillingness or inability to ship hotplug drive carriers without (premium-priced) drives in them to switch to Supermicro. Below is maybe not Thumper, but you can put 2 TByte drives at 100 EUR/TByte costs into them yourself, giving you 50 TByte raw storage (or 48 TByte raw storage, and 4x 2.5" SSDs for hybrid storage ZIL/L2ARC cache with Opensolaris) in 4U of rack space. http://www.supermicro.com/products/nfo/chassis_storage.cfm [...] SC848A - 24 Hot-swap HDDs in 4U (Quad Motherboard Support) Optimized for enterprise-level heavy-capacity storage applications, Supermicro's SC848 Chassis supports 4-way serverboards that demand high volume I/O or computational usage and features 24 hot-swap 3.5" SAS/SATA hard drive trays and 2 fixed internal hard drive bays in a 4U space. The SC848 design offers maximum HDD per space ratio in a 4U form factor, high power efficiency (up to 88%) with (2+1) redundant 1800W power supply, optimized HDD signal trace routing and improved HDD tray design to dampen HDD vibrations and maximize performance. SC846A/TQ/E1/E2 - 24 Hot-swap HDDs in 4U Optimized for enterprise-level heavy-capacity storage applications, Supermicro's SC846 Chassis features 24 hot-swap 3.5" SAS/SATA hard drive trays and 2 fixed internal hard drive bays in a 4U space. The SC846 design offers maximum HDD per space ratio in a 4U form factor, high power efficiency, optimized HDD signal trace routing and improved HDD tray design to dampen HDD vibrations and maximize performance. Equipped with a 900W or Gold Level 1200W (93%+) high-efficiency redundant power supply and 5 hot-plug redundant cooling fans, the SC846 is a reliable and maintenance-free storage workhorse system. From rpnabar at gmail.com Sat Sep 12 08:10:43 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 10:10:43 -0500 Subject: [Beowulf] filesystem metadata mining tools Message-ID: As the number of total files on our server was exploding (~2.5 million / 1 Terabyte) I wrote a simple shell script that used find to tell me which users have how many. So far so good. But I want to drill down more: *Are there lots of duplicate files? I suspect so. Stuff like job submission scripts which users copy rather than link etc. (fdupes seems puny for a job of this scale) *What is the most common file (or filename) *A distribution of filetypes (executibles; netcdf; movies; text) and prevalence. *A distribution of file age and prevelance (to know how much of this material is archivable). Same for frequency of access; i.e. maybe the last access stamp. * A file size versus number plot. i.e. Is 20% of space occupied by 80% of files? etc. 
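The crudest thing I can think of is to extend that find into a one-pass metadata dump and then slice it with awk, something like this (untested sketch, paths made up):

find /home -type f -printf '%U %s %T@ %A@ %p\n' > /tmp/fsmeta.txt

# files and GB per user
awk '{n[$1]++; b[$1]+=$2} END {for (u in n) printf "%s: %d files, %.1f GB\n", u, n[u], b[u]/2^30}' /tmp/fsmeta.txt

# how much of the space sits in files over 100MB
awk '{t+=$2; if ($2 > 1e8) big+=$2} END {printf "%.0f%% of bytes in files >100MB\n", 100*big/t}' /tmp/fsmeta.txt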
I've used cushion plots in the past (sequiaview; pydirstat) but those seem more desktop oriented than suitable for a job like this. Essentially I want to data mine my file usage to strategize. Are there any tools for this? Writing a new find each time seems laborious. I suspect forensics might also help identify anomalies in usage across users which might be indicative of other maladies. e.g. a user who had a runaway job write a 500GB file etc. Essentially are there any "filesystem metadata mining tools"? -- Rahul From skylar at cs.earlham.edu Sat Sep 12 11:34:25 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 12 Sep 2009 11:34:25 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <4AABE9B1.9020502@cs.earlham.edu> Rahul Nabar wrote: > As the number of total files on our server was exploding (~2.5 million > / 1 Terabyte) I > wrote a simple shell script that used find to tell me which users have how > many. So far so good. > > But I want to drill down more: > > *Are there lots of duplicate files? I suspect so. Stuff like job submission > scripts which users copy rather than link etc. (fdupes seems puny for > a job of this scale) > > *What is the most common file (or filename) > > *A distribution of filetypes (executibles; netcdf; movies; text) and > prevalence. > > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > files? etc. > > I've used cushion plots in the past (sequiaview; pydirstat) but those > seem more desktop oriented than suitable for a job like this. > > Essentially I want to data mine my file usage to strategize. Are there any > tools for this? Writing a new find each time seems laborious. > > I suspect forensics might also help identify anomalies in usage across > users which might be indicative of other maladies. e.g. a user who had a > runaway job write a 500GB file etc. > > Essentially are there any "filesystem metadata mining tools"? > > What OS is this on? If you have dtrace available you can use that to at least gather data on new files coming in, which could reduce your search scope considerably. It obviously doesn't directly answer your question, but it might make it easier to use the existing tools. Depending on what filesystem you have you might be able to query the filesystem itself for this data. On GPFS, for instance, you can write a policy that would move all files older than, say, three months to a different storage pool. You can then run that policy in a preview mode to see what files would have been moved. The policy scan on GPFS is quite a bit faster than running a find against the entire filesystem, so it's a definite win. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature URL: From coutinho at dcc.ufmg.br Sat Sep 12 12:59:57 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Sat, 12 Sep 2009 16:59:57 -0300 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <4AABE9B1.9020502@cs.earlham.edu> References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: This tool do can do part of what you want: http://www.chiark.greenend.org.uk/~sgtatham/agedu/ This display files by size and color file by type. http://gdmap.sourceforge.net/ Perhaps agedu can handle large subsets of your files, but gdmap is desktop oriented. 2009/9/12 Skylar Thompson > Rahul Nabar wrote: > > As the number of total files on our server was exploding (~2.5 million > > / 1 Terabyte) I > > wrote a simple shell script that used find to tell me which users have > how > > many. So far so good. > > > > But I want to drill down more: > > > > *Are there lots of duplicate files? I suspect so. Stuff like job > submission > > scripts which users copy rather than link etc. (fdupes seems puny for > > a job of this scale) > > > > *What is the most common file (or filename) > > > > *A distribution of filetypes (executibles; netcdf; movies; text) and > > prevalence. > > > > *A distribution of file age and prevelance (to know how much of this > > material is archivable). Same for frequency of access; i.e. maybe the > last > > access stamp. > > > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > > files? etc. > > > > I've used cushion plots in the past (sequiaview; pydirstat) but those > > seem more desktop oriented than suitable for a job like this. > > > > Essentially I want to data mine my file usage to strategize. Are there > any > > tools for this? Writing a new find each time seems laborious. > > > > I suspect forensics might also help identify anomalies in usage across > > users which might be indicative of other maladies. e.g. a user who had a > > runaway job write a 500GB file etc. > > > > Essentially are there any "filesystem metadata mining tools"? > > > > > What OS is this on? If you have dtrace available you can use that to at > least gather data on new files coming in, which could reduce your search > scope considerably. It obviously doesn't directly answer your question, > but it might make it easier to use the existing tools. > > Depending on what filesystem you have you might be able to query the > filesystem itself for this data. On GPFS, for instance, you can write a > policy that would move all files older than, say, three months to a > different storage pool. You can then run that policy in a preview mode > to see what files would have been moved. The policy scan on GPFS is > quite a bit faster than running a find against the entire filesystem, so > it's a definite win. > > -- > -- Skylar Thompson (skylar at cs.earlham.edu) > -- http://www.cs.earlham.edu/~skylar/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From james.p.lux at jpl.nasa.gov Sat Sep 12 16:02:10 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sat, 12 Sep 2009 16:02:10 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: Message-ID: On 9/12/09 8:10 AM, "Rahul Nabar" wrote: > As the number of total files on our server was exploding (~2.5 million > / 1 Terabyte) I > wrote a simple shell script that used find to tell me which users have how > many. So far so good. > > But I want to drill down more: > > *Are there lots of duplicate files? I suspect so. Stuff like job submission > scripts which users copy rather than link etc. (fdupes seems puny for > a job of this scale) > > *What is the most common file (or filename) > > *A distribution of filetypes (executibles; netcdf; movies; text) and > prevalence. > > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > files? etc. > Another useful application for such a tool would be to get better KLOC counts of source code trees. I find that our trees have lots of duplication among branches (e.g. Everyone has a "test.c" for unit test in with their modules, and all of them are pretty similar) From rpnabar at gmail.com Sat Sep 12 18:19:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 20:19:14 -0500 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: On Sat, Sep 12, 2009 at 2:59 PM, Bruno Coutinho wrote: > This tool do can do part of what you want: > http://www.chiark.greenend.org.uk/~sgtatham/agedu/ > > This display files by size and color file by type. > http://gdmap.sourceforge.net/ Thanks Bruno! agedu sounds very promising. I just installed it. Just remains to be seen how well it scales for large filesystems. I was using gdmap before but it is very Desktop oriented, you are right. -- Rahul From rpnabar at gmail.com Sat Sep 12 18:22:10 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 20:22:10 -0500 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <4AABE9B1.9020502@cs.earlham.edu> References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: On Sat, Sep 12, 2009 at 1:34 PM, Skylar Thompson wrote: > What OS is this on? Thanks Skylar! Linux. RedHat. >If you have dtrace available you can use that to at I don't but let me try to install it. > Depending on what filesystem you have ext3 I need to see if this can be queried. -- Rahul From skylar at cs.earlham.edu Sat Sep 12 19:22:11 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 12 Sep 2009 19:22:11 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: <4AAC5753.9040301@cs.earlham.edu> Rahul Nabar wrote: > On Sat, Sep 12, 2009 at 1:34 PM, Skylar Thompson wrote: > >> What OS is this on? >> > > Thanks Skylar! Linux. RedHat. > >> If you have dtrace available you can use that to at >> > > I don't but let me try to install it. > I believe there's an alpha/beta-level release of both a dtrace kernel module and the user-space tools for Linux. There's also SystemTap which should give you similar data. I've only used dtrace on Solaris and haven't used SystemTap at all, so YMMV. >> Depending on what filesystem you have >> > > ext3 > > I need to see if this can be queried. 
> I don't think so, but you might be able to accomplish the same thing at the application level. If you have a limited set of applications, you could have them write the metadata you need into a database as they create and update files. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From stuartb at 4gh.net Fri Sep 11 12:39:48 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Fri, 11 Sep 2009 15:39:48 -0400 (EDT) Subject: [Beowulf] Intra-cluster security Message-ID: We are working with a couple small clusters (6-8 nodes) and will soon be working with some much larger cluster/supercomputer systems. We are currently using SGE 6.2 for job queuing. We use kerberos for authentication and ssh for system access. What are peoples thoughts about secure communications between the nodes of a cluster? I see a cluster as a single computational resource and would like to see flexibility of communications between the nodes of the cluster. There seem to be a couple of approaches: - Old style rsh/rlogin. Not acceptable for me. - Kerberos with ssh works fine for interactive users, but doesn't seem to translate well to a queuing environment. Or am I missing something? - Each user creates a password-less ssh private key, puts the public key in the authorized_hosts file and has relatively unfettered ssh access between nodes (nfs shared home directory helps a lot). This seems to be the most common approach. It is end-user setup/training intensive (I suppose it could be automated/audited). I consider it dangerous to encourage use of password-less ssh keys. - It looks like SGE has some new functionality for using certificates and its own certificate authority. I haven't looked closely at this yet. It looks like each user has a password-less private certificate and the authorization comes from not having the certificate revoked. This seems almost equivalent to the password-less ssh key solution. - It looks like I can configure the cluster systems to handle local ssh transparently. This would involve setting setuid/setgid on ssh, building cluster wide authorized_keys files and other things. I haven't studied this closely but there are a few references available (http://www.snailbook.com/faq/trusted-host-howto.auto.html among others). I favor this last solution as being the most user transparent. I find is surprising that none of the cluster distributions seem to use this method. I would like some feedback as to how well this works in practice and whether there are any obvious or non-obvious gotchas people might have already encountered. Thanks, Stuart Barkley -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From hearnsj at googlemail.com Sun Sep 13 01:07:56 2009 From: hearnsj at googlemail.com (John Hearns) Date: Sun, 13 Sep 2009 09:07:56 +0100 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <9f8092cc0909130107sf75f2c5n90cc6454d0e47078@mail.gmail.com> 2009/9/11 Stuart Barkley : > > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). ?This > seems to be the most common approach. ?It is end-user setup/training > intensive (I suppose it could be automated/audited). 
I consider it > dangerous to encourage use of password-less ssh keys. Yes, I would agree this is the most common approach. You can automate it by having a script which runs when you first login to the cluster (Oscar does this). You can also use shosts trusts. A script which loops through cluster nodes and runs an ssh-keyscan is useful. Re. security its the armadillo principle - hard on the outside, soft on the inside. From glykos at mbg.duth.gr Sun Sep 13 02:46:47 2009 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Sun, 13 Sep 2009 12:46:47 +0300 (EEST) Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: Hi Stuart, > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). This > seems to be the most common approach. It is end-user setup/training > intensive (I suppose it could be automated/audited). A quick note to say that in the case of the perceus/warewulf/slurm combination as distributed with CaosNSA, you not only get the automation you've mentioned, but you can also restrict user access to individual nodes (this is through a pam module for slurm that only allows ssh access to those nodes that a user has active jobs on). Nicholas -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ From nixon at nsc.liu.se Sun Sep 13 03:31:41 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Sun, 13 Sep 2009 12:31:41 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: (Stuart Barkley's message of "Fri\, 11 Sep 2009 15\:39\:48 -0400 \(EDT\)") References: Message-ID: Stuart Barkley writes: > - Kerberos with ssh works fine for interactive users, but doesn't seem > to translate well to a queuing environment. Or am I missing > something? It's quite possible to use, but you do get a ticket expiry problem. > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). This > seems to be the most common approach. Yes, this is common. And a really, really BAD IDEA. Do not do this. Bad, bad, BAD. > I consider it dangerous to encourage use of password-less ssh keys. Yes, very much so. And your users will discover that they can copy that passphrase-less private key to their personal workstation and get password-less access to the cluster. (Yes, they will.) And then the key will get stolen. (Yes, it will.) And then you get http://www.us-cert.gov/current/archive/2008/09/08/archive.html#ssh_key_based_attacks Of course, you can disallow ssh key authentication from external machines to mitigate the problem, but that's just a band-aid for a mis-engineered system. In case I didn't come across clearly, let me repeat that: DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! (There are some exceptions, of course, like when you want to run things in batch from cron, and similar. But then you must, must, must use proper limitations for that key in authorized_keys.) > - It looks like I can configure the cluster systems to handle local > ssh transparently. This would involve setting setuid/setgid on ssh, > building cluster wide authorized_keys files and other things. 
I > haven't studied this closely but there are a few references available > (http://www.snailbook.com/faq/trusted-host-howto.auto.html among > others). This is the way to go. All our systems are set up this way. Works just fine. You just need a mechanism for maintaining host keys and ssh_known_hosts. (And remember that this doesn't work for root - you need separately set up ~root/.shosts and ~root/.ssh/known_hosts if you want it.) Oh, and DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! Do the Internet a service and scan your users' home directories for passphrase-less private ssh keys. This is as easy as running # grep -L ENCRYPTED /home/*/.ssh/id_?sa Delete all such keys that don't have a good reason for existence. (Yes, we do so on all our systems.) -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From ashley at pittman.co.uk Sun Sep 13 04:00:33 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Sun, 13 Sep 2009 12:00:33 +0100 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <1252839633.3887.0.camel@alpha> On Sat, 2009-09-12 at 10:10 -0500, Rahul Nabar wrote: > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. I thought access stamps were a thing of the past and everyone ran with "noatime" these days? Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From reuti at staff.uni-marburg.de Sun Sep 13 04:18:02 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Sun, 13 Sep 2009 13:18:02 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: Hi, Am 11.09.2009 um 21:39 schrieb Stuart Barkley: > We are working with a couple small clusters (6-8 nodes) and will soon > be working with some much larger cluster/supercomputer systems. We > are currently using SGE 6.2 for job queuing. We use kerberos for > authentication and ssh for system access. > > What are peoples thoughts about secure communications between the > nodes of a cluster? I see a cluster as a single computational > resource and would like to see flexibility of communications between > the nodes of the cluster. > > There seem to be a couple of approaches: > > - Old style rsh/rlogin. Not acceptable for me. I wouldn't be concerned by this per se, but I also disabled it (in / etc/xinetd.d/rsh) because of another reason: - I setup ssh hostbased authentication, but limit it to admin staff with "AllowGroups admin". The ssh-keysign has to be set suid as you mention below, I can also send you a rough outline of the necessary steps. - Users can login to any node by using an interactive job in SGE. For this special interactive queue I set h_cpu=60, hence they can't abuse it. 
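The rough outline of the hostbased setup in the first point is something like this (file locations differ between distributions, so treat it as a sketch):

- in /etc/ssh/sshd_config on all nodes: HostbasedAuthentication yes (plus the AllowGroups restriction mentioned above)
- an shosts.equiv file (commonly /etc/shosts.equiv) on all nodes, listing every node, one hostname per line
- in /etc/ssh/ssh_config on all nodes: HostbasedAuthentication yes and EnableSSHKeysign yes
- ssh-keysign made suid root, e.g. chmod u+s /usr/lib/ssh/ssh-keysign (the path varies)
- all host keys collected into /etc/ssh/ssh_known_hosts, e.g. with ssh-keyscan over the node list

Back to the interactive jobs: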
SGE in turn will either use: a) traditional rsh, but on a random port selected by SGE (this I have right now) b) a "builtin" method in the newer versions of SGE c) plain ssh (but this will lack correct accounting), directed to a special sshd_config (because of the AllowGroups rule in the default one) d) ssh with a recompiled SGE with -tight-ssh flag for correct accounting If you have more than one queue in addition to this interactive queue on each system, you can't limit the maximum number of slots any longer in the exechost definition (as than the interactive queue to peek around would also be taken into account), but set it up in an RQS (resource quota set) like: limit queues !login.q hosts {@dualquad} to slots=8 -- Reuti PS: I saw on Debian, that their "su" will not only set the user, but also remove any imposed cpu time limit by making an su to oneself. For the SGE queue to impose the h_cpu limit this must be disabled then. I don't know, whether this is still the Debian default. From landman at scalableinformatics.com Sun Sep 13 07:06:40 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 13 Sep 2009 10:06:40 -0400 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <4AACFC70.6040205@scalableinformatics.com> I started writing a long response to this, decrying security theatre in the face of real issues, but thought better of it. Much shorter version with free advice. Leif Nixon wrote: > Stuart Barkley writes: > >> - Kerberos with ssh works fine for interactive users, but doesn't seem >> to translate well to a queuing environment. Or am I missing >> something? > > It's quite possible to use, but you do get a ticket expiry problem. > >> - Each user creates a password-less ssh private key, puts the public >> key in the authorized_hosts file and has relatively unfettered ssh >> access between nodes (nfs shared home directory helps a lot). This >> seems to be the most common approach. > > Yes, this is common. And a really, really BAD IDEA. Do not do this. Bad, > bad, BAD. > >> I consider it dangerous to encourage use of password-less ssh keys. > > Yes, very much so. And your users will discover that they can copy that > passphrase-less private key to their personal workstation and get > password-less access to the cluster. (Yes, they will.) And then the key > will get stolen. (Yes, it will.) And then you get > > http://www.us-cert.gov/current/archive/2008/09/08/archive.html#ssh_key_based_attacks I won't fisk this, other than to note most of the exploits we have cleaned up for our customers, have been windows based attack vectors. Contrary to the implication here, the ssh-key attack vector, while a risk, isn't nearly as dangerous as others, in active use, out there. http://www.darknet.org.uk/2008/08/puttyhijack-v10-hijack-sshputty-connections-on-windows/ Real security is security in depth. Its understanding real risks, and mitigating the same, or making the downside of the compromise as small as possible. Leif had a suggestion further down about careful management of keys, that is eminently reasonable. You don't leave your house keys under the door mat, if you care about security that is. Same principle applies here. Fake security, aka security theatre (c.f. http://en.wikipedia.org/wiki/Security_theater ) are things you get when people want to seem like they are doing something, even if the thing doesn't help, or worse, gives you a false sense of security. See every anti-virus/anti-phishing package out there for windows. 
If you think you are safe because you are running them, you are sadly mistaken. I'd argue that security theatre is more dangerous than the real threats. Threats can be mitigated. The danger is in using theatrics and pronouncements rather than practical measures. As John Hearns pointed out, hard on the outside soft on the inside. Doesn't help with clouds, though you can do IPsec to IPsec bridging of virtual private clusters (we do this for our customers). Assume multiple attack vectors, and that the bad guys and gals are going for your weak links. You need a realistic assessment of what your weak links are, they will be exploited. Most IT managers are fearful of this conversation, many are patently in denial about it. Regardless, the successful attacks we have seen and cleaned up after all came from *inside* organizations. Where they have been thwarted, has been due to other good practices. Where they have been successful, they have had success due to very very bad practices. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From reuti at Staff.Uni-Marburg.DE Sun Sep 13 08:03:37 2009 From: reuti at Staff.Uni-Marburg.DE (Reuti) Date: Sun, 13 Sep 2009 17:03:37 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <2194C7C3-B708-49B9-951D-644FE86839FF@staff.uni-marburg.de> Am 13.09.2009 um 12:31 schrieb Leif Nixon: > > This is the way to go. All our systems are set up this way. Works just > fine. You just need a mechanism for maintaining host keys and > ssh_known_hosts. (And remember that this doesn't work for root - you > need separately set up ~root/.shosts and ~root/.ssh/known_hosts if you > want it.) > > Oh, and DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! > > Do the Internet a service and scan your users' home directories for > passphrase-less private ssh keys. This is as easy as running > > # grep -L ENCRYPTED /home/*/.ssh/id_?sa > > Delete all such keys that don't have a good reason for existence. > (Yes, > we do so on all our systems.) I agree. And to have it still convenient between multiple clusters I guide my students to use just one passphrase protected key and an ssh- agent in additions. There is nice Howto about it: http://unixwiz.net/techtips/ssh-agent-forwarding.html But: even with a passphrase the ssh-key should be protected as much as possible. Once someone has the private key, any offline brute- force to get the passphrase won't take long I fear. They could just try to recreate the public part of the key with: ssh-keygen -y which is completely offline, as this will also need the passphrase to be entered. -- Reuti From skylar at cs.earlham.edu Sun Sep 13 09:48:23 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sun, 13 Sep 2009 09:48:23 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <1252839633.3887.0.camel@alpha> References: <1252839633.3887.0.camel@alpha> Message-ID: <4AAD2257.407@cs.earlham.edu> Ashley Pittman wrote: > On Sat, 2009-09-12 at 10:10 -0500, Rahul Nabar wrote: > >> *A distribution of file age and prevelance (to know how much of this >> material is archivable). Same for frequency of access; i.e. maybe the last >> access stamp. >> > > I thought access stamps were a thing of the past and everyone ran with > "noatime" these days? > > Ashley. 
> > Are there any studies showing the overhead of atime updates? I've heard anecdotal evidence saying that it makes a big difference, and the same going the other way. FWIW, you can disable atime updates on a per-file basis in Linux and FreeBSD using chattr. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From nixon at nsc.liu.se Sun Sep 13 10:58:23 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Sun, 13 Sep 2009 19:58:23 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: <4AACFC70.6040205@scalableinformatics.com> (Joe Landman's message of "Sun\, 13 Sep 2009 10\:06\:40 -0400") References: <4AACFC70.6040205@scalableinformatics.com> Message-ID: Joe Landman writes: > I won't fisk this, other than to note most of the exploits we have > cleaned up for our customers, have been windows based attack vectors. > Contrary to the implication here, the ssh-key attack vector, while a > risk, isn't nearly as dangerous as others, in active use, out there. I'm really hoping you aren't accusing me of security theatre. This may be a case of differences between user communitites - while I have seen one or maybe two cases where windows-related attacks were involved, I have seen dozens and dozens of cases where ssh key theft was involved. I have a blacklist of literally hundreds of stolen ssh keys from a very large number of sites, and I dearly miss a key revocation mechanism in ssh. We try to educate our users to use either a good strong password or to use ssh keys together with the ssh agent and agent forwarding, so that the private key never needs to leave the user's personal workstation. > Fake security, aka security theatre (c.f. > http://en.wikipedia.org/wiki/Security_theater ) are things you get > when people want to seem like they are doing something, even if the > thing doesn't help, or worse, gives you a false sense of security. See > every anti-virus/anti-phishing package out there for windows. If you > think you are safe because you are running them, you are sadly > mistaken. And on our side of the fence, we get things like Trusted IRIX, with a really elaborate, checkbox-compliant permissions system. Of course, since it was built on IRIX, any serious attacker would cut through it like a hot knife through molten butter, but there obviously wasn't a checkbox for that. -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From bill at cse.ucdavis.edu Sun Sep 13 11:34:13 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sun, 13 Sep 2009 11:34:13 -0700 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <4AAD3B25.8090900@cse.ucdavis.edu> Stuart Barkley wrote: > - Each user Very dangerous way to say it. Ideally you do everything possible to minimize the work of the user, that way they can't get it wrong. > creates a password-less ssh private key, puts the public I'm a fan of password-less private keys. Before the screaming begins, let me explain the wrong way to do it. Rocks creates a password-less key for each user, plops it in ~/.ssh. Unfortunately they seem very resistant to suggestions on fixing this. 
The main problem is that if someone leaves themselves logged in, someone else could slurp the private key and have access forever, even if the user tries to be security conscious and uses a large passphrase that they keep secure. You could however point the compute nodes to a different keystore which the head node does not look at. That way even if stolen it doesn't get you cluster access.
> key in the authorized_hosts file and has relatively unfettered ssh
> access between nodes (nfs shared home directory helps a lot). This
I don't recommend allowing users to populate .ssh; instead I suggest managing it yourself (the admin). Users tend to only add keys when they upgrade a laptop, buy a new one, lose a laptop, get compromised, etc. So there could be keys that end up lost, shared, or compromised. By forcing users to have just one key you reduce your exposure. The last thing you want to hear is "oh, that access is not from my current key..."
> seems to be the most common approach. It is end-user setup/training
> intensive (I suppose it could be automated/audited). I consider it
> dangerous to encourage use of password-less ssh keys.
An alternative is to use host-based ssh auth for access inside the cluster; this depends on either more labor-intensive management of keys, or automating the install/reinstall node process.
> - It looks like I can configure the cluster systems to handle local
> ssh transparently. This would involve setting setuid/setgid on ssh,
> building cluster wide authorized_keys files and other things. I
> haven't studied this closely but there are a few references available
> (http://www.snailbook.com/faq/trusted-host-howto.auto.html among
> others). Looks pretty straight forward to me.
> I favor this last solution as being the most user transparent. I find
> it surprising that none of the cluster distributions seem to use this
> method. I would like some feedback as to how well this works in
> practice and whether there are any obvious or non-obvious gotchas
> people might have already encountered.
Works for us; I share your surprise that it's not more popular.

From landman at scalableinformatics.com Sun Sep 13 12:13:19 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 13 Sep 2009 15:13:19 -0400 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: <4AACFC70.6040205@scalableinformatics.com> Message-ID: <4AAD444F.3030605@scalableinformatics.com> Leif Nixon wrote:
> Joe Landman writes:
>
>> I won't fisk this, other than to note most of the exploits we have
>> cleaned up for our customers, have been windows based attack vectors.
>> Contrary to the implication here, the ssh-key attack vector, while a
>> risk, isn't nearly as dangerous as others, in active use, out there.
>
> I'm really hoping you aren't accusing me of security theatre.
Nope. I thought I made it clear that I wasn't (and if not, then let me re-iterate that I am not accusing you of this). I am noting that there may be something of an overhyping of this vulnerability from where we sit. YMMV.
> This may be a case of differences between user communities - while I
> have seen one or maybe two cases where windows-related attacks were
Likely it is a difference. Most attacks we see are windows related, exploiting the inherent weakness of that platform, and its relative ease of compromise, in order to compromise harder-to-take-down systems. Why break through the heavily fortified door when the window (pun un-intended) is so easy to crack?
This is the nature (outside of incessant ssh probes) of all of the exploits we have seen be successful at our customers' sites.
> involved, I have seen dozens and dozens of cases where ssh key theft was
> involved. I have a blacklist of literally hundreds of stolen ssh keys
> from a very large number of sites, and I dearly miss a key revocation
> mechanism in ssh.
>
> We try to educate our users to use either a good strong password or to
> use ssh keys together with the ssh agent and agent forwarding, so that
> the private key never needs to leave the user's personal workstation.
We have started hearing about malware-infected USB dongles. If you have a password equivalent stored on your workstation ... it is at risk.
>
>> Fake security, aka security theatre (c.f.
>> http://en.wikipedia.org/wiki/Security_theater ) are things you get
>> when people want to seem like they are doing something, even if the
>> thing doesn't help, or worse, gives you a false sense of security. See
>> every anti-virus/anti-phishing package out there for windows. If you
>> think you are safe because you are running them, you are sadly
>> mistaken.
>
> And on our side of the fence, we get things like Trusted IRIX, with a
> really elaborate, checkbox-compliant permissions system. Of course,
> since it was built on IRIX, any serious attacker would cut through it
> like a hot knife through molten butter, but there obviously wasn't a
> checkbox for that.
Trusted computing, trusted Irix, etc. are examples of what I am talking about. You have a sense of security. Whether it's warranted or not is a completely separate question. Most of our users are companies, research universities, etc. We hear horror stories from admins on compromises. We do get an occasional call from a customer, wondering how a system behind a firewall could be compromised (remember that theatre and false sense of security?). Forensic examination showed us the path in, happily riding along the same connection that the user had, grabbing their keystrokes, and replaying them. Installing bits, and attempting rootkits. I have a nice little collection of rootkit detritus and dejecta, as well as logs of what the cracker attempted, all while getting in via the same compromised machine the legitimate user logged in to. It didn't really get bad ... until the user typed the root password in. No, it wasn't bad until then; most of the defenses held. Their cluster, they have root. We tried warning them that there was no conceivable scenario in which they ever needed to be root. We were ignored. Their IT staff was none too pleased. I wrote up a whole series of posts on it, detailing everything (apart from the victim's name/id/location/university) so that some others could learn and protect themselves. My descriptions managed to get me ... moderated ... by someone who claimed I was being alarmist ... for posting the gory details and making suggestions to the same community on how to avoid it. I am simply saying that what we see may be different, and that I hear far too many "one-size-fits-all" security prescriptions, which often fail to deter attacks, and provide what I think is a false sense of security if you follow them and ignore the other issues. I see too much of the "if we install a firewall, we will be secure" mindset running about. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc.
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hearnsj at googlemail.com Sun Sep 13 23:19:58 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 07:19:58 +0100 Subject: [Beowulf] Switch recommendations and 10G to the desktop? Message-ID: <9f8092cc0909132319v180f72bdndf95dd3bcc7b4403@mail.gmail.com> I'm looking for recommendations for 1 48 port, or two stacked 24 port, switches for desktop users. The aim is to bond 2xgigabit connections. I would have normally first thought Nortel for this job. Thoughts? Secondly, do folks here have much experience of 10gig to the desktop? Distance is a bit too far for copper CX4, and fibre is pricey. I note Extreme networks have a switch with 10G over cat6, I guess there are others. Thoughts again please! From nixon at nsc.liu.se Mon Sep 14 00:45:42 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Mon, 14 Sep 2009 09:45:42 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: <4AAD444F.3030605@scalableinformatics.com> (Joe Landman's message of "Sun\, 13 Sep 2009 15\:13\:19 -0400") References: <4AACFC70.6040205@scalableinformatics.com> <4AAD444F.3030605@scalableinformatics.com> Message-ID: Joe Landman writes: > Leif Nixon wrote: >> Joe Landman writes: >> >>> I won't fisk this, other than to note most of the exploits we have >>> cleaned up for our customers, have been windows based attack vectors. >>> Contrary to the implication here, the ssh-key attack vector, while a >>> risk, isn't nearly as dangerous as others, in active use, out there. >> >> I'm really hoping you aren't accusing me of security theatre. > > Nope. I thought I made it clear that I wasn't (and if not, then let > me re-iterate that I am not accusing you of this). Good. 8^) > I am noting that the there may be something of an overhyping of this > vulnerability from where we sit. YMMV. Well, it *is* being actively exploited on a big scale. It's not just a theoretical thing. > Likely it is a difference. Most attacks we see are windows related, > exploiting the inherent weakness of that platform, and is relative > ease of compromise in order to compromise harder to take down systems. > Why break through the heavily fortified door when the window (pun > un-intended) is so easy to crack? This is the nature (outside of > incessant ssh probes) of all of the exploits we have seen be > successful at our customers sites. That's interesting. I haven't seen many cross-OS attacks. My theory has always been that the mainstream windows evil-doer has lots and lots of easy targets, and there is no point for him to spend the energy to learn how to attack these weird Linux clusters. I can't say I'd love to be proven wrong. 8^) 8^/ > I wrote up a whole series of posts on it, detailing everything (apart > from the victims name/id/location/university) so that some others > could learn and protect themselves. My descriptions managed to get me > ... moderated ... by someone who claimed I was being alarmist ... for > posting the gory details and making suggestions to the same community > on how to avoid it. Too bad. The community needs more war stories. There is too much covering up. 
> I am simply saying that what we see may be different, and that I hear
> far too many "one-size-fits-all" security prescriptions, which often
> fail to deter attacks, and provide what I think is a false sense of
> security if you follow them and ignore the other issues. I see too
> much of the "if we install a firewall, we will be secure" mindset running
> about.
Exactly. Or, on the other hand, "firewalls are an inherently bad solution; all endpoints should be properly secured and should not have to rely on a firewall." Rigid dogma is always bad. (Except, of course, when it comes to DELETING ALL THOSE PASSPHRASE-LESS KEYS!) -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility

From smulcahy at atlanticlinux.ie Mon Sep 14 01:31:52 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Mon, 14 Sep 2009 09:31:52 +0100 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: <4AACFC70.6040205@scalableinformatics.com> <4AAD444F.3030605@scalableinformatics.com> Message-ID: <4AADFF78.5000006@atlanticlinux.ie> Leif Nixon wrote:
>> I wrote up a whole series of posts on it, detailing everything (apart
>> from the victim's name/id/location/university) so that some others
>> could learn and protect themselves. My descriptions managed to get me
>> ... moderated ... by someone who claimed I was being alarmist ... for
>> posting the gory details and making suggestions to the same community
>> on how to avoid it.
>
> Too bad. The community needs more war stories. There is too much
> covering up.
I strongly agree with this. Real war stories would give real admins a better chance of prioritising their efforts to address real problems (rather than whatever is currently being pushed by one vendor or another). In practice, I think a lot of security problems come from layer 8 - if you can deal with that, the rest is easy :) -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)

From hearnsj at googlemail.com Mon Sep 14 11:55:35 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 19:55:35 +0100 Subject: [Beowulf] Switch recommendations and 10G to the desktop? In-Reply-To: <4AAE4011.1080408@tamu.edu> References: <9f8092cc0909132319v180f72bdndf95dd3bcc7b4403@mail.gmail.com> <4AAE4011.1080408@tamu.edu> Message-ID: <9f8092cc0909141155q61a81875q1c3d18fe6da044f4@mail.gmail.com> 2009/9/14 Gerry Creager :
> Nortel still has switches available, and they'd be my first choice, as well,
> but I'm concerned about support... and delivery!
>
> Force10 S-series bears looking at, as does the Nexus line from (shudder)
> Cisco. I consider the Cisco offerings to have barely enough backplane to be
> non-blocking.
Gerry, thanks for that. Talking with our supplier today, it looks like I'll go for Force10 - but keeping the heid screwed I'll stay with 1Gbps.

From hearnsj at googlemail.com Mon Sep 14 11:58:38 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 19:58:38 +0100 Subject: [Beowulf] PCOIP graphics Message-ID: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> HPCwire today had an article on BOXX workstations which use something called PCOIP to transfer the graphics display over a slowish network connection. I'm quite interested in this, and it looks like you can buy a card plus a separate box to do this job from http://www.teradici.com/ Anyone out there used one of these boxes 'in the wild'?
I have looked at the software equivalents - ie VirtualGL but not had much success.

From coutinho at dcc.ufmg.br Mon Sep 14 15:12:30 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Mon, 14 Sep 2009 19:12:30 -0300 Subject: [Beowulf] PCOIP graphics In-Reply-To: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> References: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> Message-ID: 2009/9/14 John Hearns
> HPCwire today had an article on BOXX workstations which use something
> called PCOIP to transfer
> the graphics display over a slowish network connection.
Nomachine NX does this too. You can download a client and test on their servers. http://www.nomachine.com/
> I'm quite interested in this, and it looks like you can buy a card
> plus a separate box to do this job from http://www.teradici.com/
>
> Anyone out there used one of these boxes 'in the wild'?
>
> I have looked at the software equivalents - ie VirtualGL but not had
> much success.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

From mathog at caltech.edu Mon Sep 14 15:40:25 2009 From: mathog at caltech.edu (David Mathog) Date: Mon, 14 Sep 2009 15:40:25 -0700 Subject: [Beowulf] Re: Intra-cluster security Message-ID: Our NFS file server does double duty, serving both the compute nodes on the inside and some workstations on the outside. That is somewhat analogous to the intra-cluster situation. It turns out that updates some time in the last year introduced an issue where the lock manager stopped respecting the ports which were supposedly assigned to it. It took me a long time to notice this since there wasn't very much file locking going on between the workstations and the file server. However, anybody who used gnome (which I don't) would have seen it, since gnome does some file locking at startup, and this bug was causing it to start very slowly. Normally this port assignment issue wouldn't be a cluster issue, since one doesn't normally run a firewall between the file server and the compute nodes. However, it would be a problem for intra-cluster file sharing if you do firewall those connections. To see if your file server has been bitten run rpcinfo -p on it, and if nlockmgr isn't in the right place, welcome to the club. For more information, see for instance: https://bugzilla.redhat.com/show_bug.cgi?id=434795 https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/28706 Long story short, the only way around this that I know of at present is to put: options lockd nlm_udpport=4001 nlm_tcpport=4001 (or whatever port you want) in /etc/modprobe.conf or an equivalent location and restart the NFS server. OK, make that: umount all mounts on the clients, restart the server, and remount on all the clients. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
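A minimal sketch of that workaround plus a way to verify it took effect (the port number 4001 comes from the recipe above; the subnet and the iptables rules are illustrative assumptions for a firewalled intra-cluster setup, not part of the original recipe):

  # /etc/modprobe.conf (or an equivalent location), then restart the NFS server
  options lockd nlm_udpport=4001 nlm_tcpport=4001

  # after unmounting the clients, restarting the server and remounting, confirm the pin:
  rpcinfo -p | grep nlockmgr        # nlockmgr should now show port 4001 for tcp and udp

  # with the port fixed, an intra-cluster firewall can allow it explicitly, e.g.:
  iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 4001 -j ACCEPT
  iptables -A INPUT -s 10.0.0.0/24 -p udp --dport 4001 -j ACCEPT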
From rpnabar at gmail.com Mon Sep 14 15:43:50 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Sep 2009 17:43:50 -0500 Subject: [Beowulf] switching capacity terminology confusion Message-ID: I was totally confused by the spec sheet (one example link below) for switches. What is the difference between "Switching Fabric Capacity 288Gbps" and "User traffic capacity 176 Gbps"? The user traffic numbers seem to be a lot lower than the switching fabric numbers. Is this some sort of overhead? Or..... If this weren't enough there is another mysterious number: "Stacking Capacity 96 Gbps" which is even lower than the others. How does one interpret all of these? I started out with my simplistic view of "a gigabit eth switch that supports full line capacity on all ports". http://www.force10networks.com/products/pdf/Force10-S25N-S50N-FTOS.pdf -- Rahul

From rpnabar at gmail.com Mon Sep 14 15:52:07 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Sep 2009 17:52:07 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: On Mon, Sep 14, 2009 at 5:43 PM, Rahul Nabar wrote:
> I was totally confused by the spec sheet (one example link below) for
> switches. What is the difference between "Switching Fabric Capacity
> 288Gbps" and "User traffic capacity 176 Gbps"? The user traffic
> numbers seem to be a lot lower than the switching fabric numbers. Is
> this some sort of overhead? Or.....
As if to make life even more difficult, a comparable Dell switch here has an additional characteristic too: "Forwarding Rate 131 Mpps" How does that tie in to the big picture? http://www.dell.com/downloads/global/products/pwcnt/en/PC_6200Series_proof1.pdf -- Rahul

From hearnsj at googlemail.com Mon Sep 14 23:36:44 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 15 Sep 2009 07:36:44 +0100 Subject: [Beowulf] PCOIP graphics In-Reply-To: References: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> Message-ID: <9f8092cc0909142336k77710f16n2852f8dab0644a4d@mail.gmail.com> 2009/9/14 Bruno Coutinho :
>
> 2009/9/14 John Hearns
>>
>> HPCwire today had an article on BOXX workstations which use something
>> called PCOIP to transfer
>> the graphics display over a slowish network connection.
>
> Nomachine NX does this too.
> You can download a client and test on their servers.
> http://www.nomachine.com/
>
Bruno, I am a big fan of Nomachine, and I use it on my personal workstation. However, in my experience Nomachine does not handle OpenGL accelerated graphics - there is a recipe on their site for doing that, but it did not work for me. Also, and I hesitate to do a company down who I think are generally marvellous, they were not interested in doing an IA64 version for our Prism visualization system. If anyone is using Nomachine NX successfully to run a remote display of a decently specced visualization box running Linux and OpenGL then I'd really like to hear about it.

From cap at nsc.liu.se Tue Sep 15 02:58:21 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Tue, 15 Sep 2009 11:58:21 +0200 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: <200909151158.21995.cap@nsc.liu.se> On Tuesday 15 September 2009, Rahul Nabar wrote: ...
> As if to make life even more difficult, a comparable Dell switch here
> has an additional characteristic too:
>
> "Forwarding Rate 131 Mpps" How does that tie in to the big picture?
This is packet rate (packets per second), not bandwidth (bytes per second). /Peter
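For a sense of scale, a back-of-the-envelope check assuming worst-case, minimum-size 64-byte Ethernet frames (each occupies 64 + 20 bytes on the wire once preamble and inter-frame gap are counted):

  10^9 bit/s / (84 bytes x 8 bit/byte) = 10^9 / 672 ~= 1.488 Mpps per line-rate gigabit port
  48 ports x 1.488 Mpps ~= 71.4 Mpps
  one 10 GbE port at line rate ~= 14.88 Mpps

so the quoted 131 Mpps comfortably covers 48 gigabit ports even with the smallest possible packets; the forwarding-rate figure starts to matter once 10 GbE uplinks or larger port counts enter the picture.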
From prentice at ias.edu Tue Sep 15 11:26:56 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 15 Sep 2009 14:26:56 -0400 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <4AAFDC70.4010805@ias.edu> I have used perl in the past to gather summaries of file usage like this. The details are fuzzy (it was a couple of years ago), but I think I did a 'find -ls' to a text file and then used perl to parse the file and add up the various statistics. I wasn't gathering as many statistics as you, but it was pretty easy to write for a novice perl programmer like me. Prentice Rahul Nabar wrote:
> As the number of total files on our server was exploding (~2.5 million
> / 1 Terabyte) I
> wrote a simple shell script that used find to tell me which users have how
> many. So far so good.
>
> But I want to drill down more:
>
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
>
> *What is the most common file (or filename)
>
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
>
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
>
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
>
> I've used cushion plots in the past (sequoiaview; pydirstat) but those
> seem more desktop oriented than suitable for a job like this.
>
> Essentially I want to data mine my file usage to strategize. Are there any
> tools for this? Writing a new find each time seems laborious.
>
> I suspect forensics might also help identify anomalies in usage across
> users which might be indicative of other maladies. e.g. a user who had a
> runaway job write a 500GB file etc.
>
> Essentially are there any "filesystem metadata mining tools"?
>
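A minimal sketch along those lines, using GNU find and awk instead of perl (the field layout, paths and cutoffs are arbitrary choices; note that the access-time column is meaningless if the filesystems are mounted noatime, as discussed earlier in this thread):

  # dump the metadata once: owner, size, atime, mtime, path
  find /home -type f -printf '%u %s %A@ %T@ %p\n' > /tmp/filemeta.txt

  # per-user file count and total bytes
  awk '{n[$1]++; b[$1]+=$2}
       END {for (u in n) printf "%-12s %10d files %18.0f bytes\n", u, n[u], b[u]}' /tmp/filemeta.txt

  # crude age cut: how much has not been read in over a year
  awk -v now=$(date +%s) '(now - $3) > 365*24*3600 {old++; oldb+=$2}
       END {printf "%d files, %.0f bytes not accessed in a year\n", old, oldb}' /tmp/filemeta.txt

The same dump can be re-mined for size distributions, extensions, duplicates by name, and so on without re-walking the filesystem each time.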
From davidramirezmolina at gmail.com Mon Sep 14 11:04:51 2009 From: davidramirezmolina at gmail.com (David Ramirez) Date: Mon, 14 Sep 2009 13:04:51 -0500 Subject: [Beowulf] Virtualization in head node ? Message-ID: Still a newbie in HPC, in the first stages of building a Beowulf cluster (8 nodes). I wonder if anybody out there has used Linux virtual machines in the head node, just to be able to experiment with different configurations & deployments and jump back without much effort if things go bad. Considering (out of experience in desktop) VMWare or Sun Virtualbox. Any hints, comments ? David Ramirez Grad Student CIS Prairie View A&M University, Texas -- | David Ramirez Molina | davidramirezmolina at gmail.com | Houston, Texas - USA Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años

From reuti at Staff.Uni-Marburg.DE Tue Sep 15 16:05:44 2009 From: reuti at Staff.Uni-Marburg.DE (Reuti) Date: Wed, 16 Sep 2009 01:05:44 +0200 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> On 14.09.2009 at 20:04, David Ramirez wrote:
> Still a newbie in HPC, in the first stages of building a Beowulf
> cluster (8 nodes).
>
> I wonder if anybody out there has used Linux virtual machines in
> the head node, just to be able to experiment with different
> configurations & deployments and jump back without much effort if
> things go bad. Considering (out of experience in desktop) VMWare or
> Sun Virtualbox. Any hints, comments ?
I was also just thinking about it in the last couple of days. In the past, often a dedicated login server and a dedicated file server were used to split the load. Putting the login node in a virtual machine will still keep the users away from the file server, but allows you to combine them in one piece of metal when it's powerful enough. Or maybe even one virtual login node per user, if you don't want anyone to see the actions of others. To operate Sun VirtualBox w/o the graphical interface is possible, and you can also direct the virtual console to any remote machine using "rdesktop" as client on any platform you like. -- Reuti
> David Ramirez
> Grad Student CIS
> Prairie View A&M University, Texas
>
> --
> | David Ramirez Molina
> | davidramirezmolina at gmail.com
> | Houston, Texas - USA
>
> Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From lindahl at pbm.com Tue Sep 15 16:21:11 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 15 Sep 2009 16:21:11 -0700 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: <20090915232111.GB16891@bx9.net> On Mon, Sep 14, 2009 at 05:52:07PM -0500, Rahul Nabar wrote:
> "Forwarding Rate 131 Mpps" How does that tie in to the big picture?
Most layer 3 devices are not capable of forwarding full line-rate traffic of tiny packets. You should go hunt down a lab report on switch testing to see these kinds of details discussed. -- g

From hearnsj at googlemail.com Tue Sep 15 22:44:54 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 16 Sep 2009 06:44:54 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> Message-ID: <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> 2009/9/16 Reuti :
> To operate Sun VirtualBox w/o the graphical interface is possible, and you
> can also direct the virtual console to any remote machine using "rdesktop"
> as client on any platform you like.
I agree re. Virtualbox - I'm evaluating it for desktop use, not for the purpose suggested here, and it's quite impressive. The latest version offers direct access to accelerated 3D graphics on the host. Also, I put a virtual machine to sleep yesterday when running a CAD-style application, and started it up later; the application picked itself up mid-stride and continued on. Well, maybe I'm easily impressed.

From award at uda.ad Wed Sep 16 00:23:07 2009 From: award at uda.ad (Alan Ward) Date: Wed, 16 Sep 2009 09:23:07 +0200 Subject: RS: [Beowulf] Virtualization in head node ? References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> Message-ID: I have been working quite a lot with VBox, mostly for server stuff. I agree it can be quite impressive, and has some nice features (e.g. do not stop a machine, sleep it - and wake up pretty fast). On the other hand, we found that anything that has to do with disk access is pretty slow, especially when working with a local disk image file.
Cheers, -Alan

-----Original Message----- From: beowulf-bounces at beowulf.org on behalf of John Hearns Sent: Wed 16/09/2009 07:44 To: Beowulf Mailing List Subject: Re: [Beowulf] Virtualization in head node ? 2009/9/16 Reuti :
> To operate Sun VirtualBox w/o the graphical interface is possible, and you
> can also direct the virtual console to any remote machine using "rdesktop"
> as client on any platform you like.
I agree re. Virtualbox - I'm evaluating it for desktop use, not for the purpose suggested here, and it's quite impressive. The latest version offers direct access to accelerated 3D graphics on the host. Also, I put a virtual machine to sleep yesterday when running a CAD-style application, and started it up later; the application picked itself up mid-stride and continued on. Well, maybe I'm easily impressed. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From tjrc at sanger.ac.uk Wed Sep 16 02:34:48 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 16 Sep 2009 10:34:48 +0100 Subject: RS: [Beowulf] Virtualization in head node ? In-Reply-To: References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> Message-ID: <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> On 16 Sep 2009, at 8:23 am, Alan Ward wrote:
>
> I have been working quite a lot with VBox, mostly for server stuff.
> I agree it can be quite impressive, and has some nice features (e.g.
> do not stop a machine, sleep it - and wake up pretty fast).
>
> On the other hand, we found that anything that has to do with disk
> access is pretty slow, especially when working with a local disk
> image file.
I think that's pretty standard for most virtualisation, whichever vendor it comes from. The I/O is fairly sub-optimal. I've had a fair bit of experience now of various VMware flavours. The I/O performance of the desktop versions is fairly shocking; this is presumably largely down to the fact that desktops and laptops tend to have fairly slow I/O to start with, and the virtualisation penalty is very noticeable. Our production virtualisation system uses dual-fabric SAN-attached storage (EVA5000), ESX 4.0 as the hypervisor, and we're running about 20 virtual machines per physical host. Most of these applications are not I/O heavy, but really trivial benchmarking using hdparm indicates I/O bandwidth within the VM of about half what it would be if the machine were physical. Very unscientific test, though. I should do some proper testing with bonnie++... Virtual disk performance in ESX 4.0 definitely feels better than ESX 3.5, but that's largely because they've got rid of some fairly serious brokenness in memory handling in the hypervisor which was leading to unnecessary swapping of the VMs. ESX 4.0 also has a new guest paravirtual SCSI driver which is supposed to improve virtual disk performance by about 20%, but I have yet to test that. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
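For anyone repeating that kind of rough host-versus-guest comparison, a minimal sketch (run the same commands on the physical host and inside a VM; the device, directory and user below are placeholders, and the directory must be writable by the test user):

  # crude buffered sequential read; average a few runs
  hdparm -t /dev/sda

  # more thorough: bonnie++ run as a non-root user, default dataset is about twice RAM
  bonnie++ -d /scratch/bench -u nobody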
From eugen at leitl.org Wed Sep 16 03:27:30 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 16 Sep 2009 12:27:30 +0200 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 Message-ID: <20090916102730.GU9828@leitl.org> Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA drives (WD RE4, 2 TByte), right?

[oracle at localhost data]$ bonnie++ -d /data/blah
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
localhost.lo 47320M 82406  99 466201  46 124266  17 73482  96 541644  33 415.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5488   4 +++++ +++  4292   0  3883   3 +++++ +++  2580   1
localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1

-- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From smulcahy at atlanticlinux.ie Wed Sep 16 04:18:27 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Wed, 16 Sep 2009 12:18:27 +0100 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916102730.GU9828@leitl.org> References: <20090916102730.GU9828@leitl.org> Message-ID: <4AB0C983.1070509@atlanticlinux.ie> Hi, I'm not sure what else people read from bonnie++ results but I normally focus on the sequential output block (which I think of as "block write speed") and the sequential input block (which I think of as "block read speed"). Smarter folk on this list may be able to provide a more scientific analysis of your results than that though. In your case, I'm reading the results below as a block write speed of 455 MB/sec and a block read speed of 528 MB/sec, which seems pretty good to me (unless my math has failed me). What kind of performance are you expecting? -stephen Eugen Leitl wrote:
> Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA
> drives (WD RE4, 2 TByte), right?
>
> [oracle at localhost data]$ bonnie++ -d /data/blah
> Writing with putc()...done
> Writing intelligently...done
> Rewriting...done
> Reading with getc()...done
> Reading intelligently...done
> start 'em...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > localhost.lo 47320M 82406 99 466201 46 124266 17 73482 96 541644 33 415.4 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5488 4 +++++ +++ 4292 0 3883 3 +++++ +++ 2580 1 > localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1 > > -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From eugen at leitl.org Wed Sep 16 04:41:15 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 16 Sep 2009 13:41:15 +0200 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <4AB0C983.1070509@atlanticlinux.ie> References: <20090916102730.GU9828@leitl.org> <4AB0C983.1070509@atlanticlinux.ie> Message-ID: <20090916114115.GK9828@leitl.org> On Wed, Sep 16, 2009 at 12:18:27PM +0100, stephen mulcahy wrote: > I'm not sure what else people read from bonnie++ results but I normally I realize people mostly prefer IOZone, or similar. > focus on the sequential output block (which I think of as "block write > speed" and sequential input block (which I think of as "block read > speed"). Smarter folk on this list may be able to provide a more > scientific analysis of your results than that though. > > In your case, I'm reading the results below as a block write speed of > 455 MB/sec and a block read speed of 528 MB/sec which seems pretty good Thanks, this was the kind of comment I was looking for. The raw aggregate disk speed is roughly 770 MB/sec, so this result is not all that bad. This is Linux md RAID 10, CentOS 5.3 [oracle at localhost data]$ cat /proc/mdstat Personalities : [raid10] md4 : active raid10 sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0] 13674601472 blocks 64K chunks 2 near-copies [14/14] [UUUUUUUUUUUUUU] unused devices: [oracle at localhost data]$ uname -a Linux localhost.localdomain 2.6.18-128.el5xen #1 SMP Wed Jan 21 11:12:42 EST 2009 x86_64 x86_64 x86_64 GNU/Linux Dual-socket Nehalem E5520 @ 2.27GHz, 24 GByte RAM. > to me (unless my math has failed me). What kind of performance are you > expecting? I expected slightly more, but this is adequate. Thanks again! -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From landman at scalableinformatics.com Wed Sep 16 05:32:02 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 16 Sep 2009 08:32:02 -0400 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916102730.GU9828@leitl.org> References: <20090916102730.GU9828@leitl.org> Message-ID: <4AB0DAC2.2020600@scalableinformatics.com> Eugen Leitl wrote: > Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA > drives (WD RE4, 2 TByte), right? Well, we'd suggest using fio rather than bonnie++, but I'll save that for a post somewhere else. > > [oracle at localhost data]$ bonnie++ -d /data/blah [...] 
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > localhost.lo 47320M 82406 99 466201 46 124266 17 73482 96 541644 33 415.4 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5488 4 +++++ +++ 4292 0 3883 3 +++++ +++ 2580 1 > localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1 Your block output is 466MB/s, for something like 7 disks (14 disk RAID10). So This means you are getting 66.6 MB/s per drive for write, and 77.4 MB/s per drive for read. The 2TB RE4 drives can read at about 110 MB/s and write at about 105 MB/s (from our tests in the lab). So you are in the range of 60 to 70% efficiency. Not bad. RAID10 tends to be more efficient than RAID6 or RAID5 (at least the software versions of these). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From smulcahy at atlanticlinux.ie Wed Sep 16 05:40:19 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Wed, 16 Sep 2009 13:40:19 +0100 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916114115.GK9828@leitl.org> References: <20090916102730.GU9828@leitl.org> <4AB0C983.1070509@atlanticlinux.ie> <20090916114115.GK9828@leitl.org> Message-ID: <4AB0DCB3.9020502@atlanticlinux.ie> Eugen Leitl wrote: > On Wed, Sep 16, 2009 at 12:18:27PM +0100, stephen mulcahy wrote: > >> I'm not sure what else people read from bonnie++ results but I normally > > I realize people mostly prefer IOZone, or similar. Personally, I like bonnie++ for the general overview it gives. It's a shame there isn't a public collection of good bonnie++ date people can use to compare their results with others (for a general, is my hardware performing in the right ballpark, type check) >> In your case, I'm reading the results below as a block write speed of >> 455 MB/sec and a block read speed of 528 MB/sec which seems pretty good > > Thanks, this was the kind of comment I was looking for. The raw > aggregate disk speed is roughly 770 MB/sec, so this result is not all > that bad. Like I said, maybe wait for some of the storage experts to wake up and respond and you may get a more accurate analysis of your results. -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From rockwell at pa.msu.edu Tue Sep 15 12:45:29 2009 From: rockwell at pa.msu.edu (Tom Rockwell) Date: Tue, 15 Sep 2009 15:45:29 -0400 Subject: [Beowulf] XEON power variations Message-ID: <4AAFEED9.6040302@pa.msu.edu> Hi, Intel assigns the same power consumption to different clockspeeds of L, E, X series XEON. All L series have the same rating, all E series etc. So, taking their numbers, the fastest of each type will always have the best performance per watt. And there is no power consumption penalty for buying the fastest clockspeed of each type. Vendor's power calculators reflect this (dell.com/calc for example). To me this seems like marketing simplification... Anybody know any different, e.g. 
have you seen other numbers from vendors or tested systems yourself? Is the power consumption of a system with an E5502 CPU really the same as one with an E5540? Thanks, Tom Rockwell

From dzaletnev at yandex.ru Tue Sep 15 15:55:15 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed, 16 Sep 2009 02:55:15 +0400 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: <229971253055315@webmail34.yandex.ru> When you install CentOS 5.3, you get a Xen virtual machine for free, with a nice interface, and in it, internal-network and NAT-to-the-outside-world modes work simultaneously, which is not the case with Sun xVM VirtualBox. I have never used VMWare because of its $189 price; people say it's a good VM. But what for, if there is CentOS 5.3 with Xen, the industry's best emulator/VM? Sincerely, Dmitry Zaletnev, Saint-Petersburg, Russia
> Still a newbie in HPC, in the first stages of building a Beowulf cluster (8 nodes).
> I wonder if anybody out there has used Linux virtual machines in the head node, just to be able to experiment with different configurations & deployments and jump back without much effort if things go bad. Considering (out of experience in desktop) VMWare or Sun Virtualbox. Any hints, comments ?
> David Ramirez
> Grad Student CIS
> Prairie View A&M University, Texas
> --
> | David Ramirez Molina
> | davidramirezmolina at gmail.com
> | Houston, Texas - USA
> Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años
>

From david.ritch.lists at gmail.com Wed Sep 16 05:25:36 2009 From: david.ritch.lists at gmail.com (David B. Ritch) Date: Wed, 16 Sep 2009 08:25:36 -0400 Subject: RS: [Beowulf] Virtualization in head node ? In-Reply-To: <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> Message-ID: <4AB0D940.1050908@gmail.com> At the RedHat Summit a couple of weeks ago, RH said that with a switch from Xen to KVM and lots of tuning, they were able to get the I/O overhead down to 5%. I thought that was pretty impressive. They also introduced a new product, RedHat Enterprise Virtualization, which is supposed to support process migration and all the other niceties that we've come to expect from virtualization. I haven't played with it yet, but it sounds quite interesting. I'd be interested to hear of anyone else's experiences with these. David On 9/16/2009 5:34 AM, Tim Cutts wrote:
>
> On 16 Sep 2009, at 8:23 am, Alan Ward wrote:
>
>> I have been working quite a lot with VBox, mostly for server stuff. I
>> agree it can be quite impressive, and has some nice features (e.g. do
>> not stop a machine, sleep it - and wake up pretty fast).
>>
>> On the other hand, we found that anything that has to do with disk
>> access is pretty slow, especially when working with a local disk image
>> file.
>
> I think that's pretty standard for most virtualisation, whichever
> vendor it comes from. The I/O is fairly sub-optimal. I've had a fair
> bit of experience now of various VMware flavours. The I/O performance
> of the desktop versions is fairly shocking; this is presumably largely
> down to the fact that desktops and laptops tend to have fairly slow
> I/O to start with, and the virtualisation penalty is very noticeable.
> > Our production virtualisation system uses dual-fabric SAN-attached > storage (EVA5000), ESX 4.0 as the hypervisor, and we're running about > 20 virtual machines per physical host. Most of these applications are > not I/O heavy, but really trivial benchmarking using hdparm indicates > I/O bandwidth within the VM of about half that if the machine were > physical. Very unscientific test, though. I should do some proper > testing with bonnie++... > > Virtual disk performance in ESX 4.0 definitely feels better than ESX > 3.5, but that's largely because they've got rid of some fairly serious > brokenness in memory handling in the hypervisor which was leading to > unnecessary swapping of the VMs. > > ESX 4.0 also has a new guest paravirtual SCSI driver which is supposed > to improve virtual disk performance by about 20% but I have yet to > test that. > > Tim > > From stuartb at 4gh.net Wed Sep 16 05:35:50 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 08:35:50 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: On Mon, 14 Sep 2009 at 14:04 -0000, David Ramirez wrote: > Still a newbie in HPC, in the first stages of building a Beowulf > cluster (8 nodes). Also a newbie to HPC, but now accumulating systems very quickly. > I wonder if anybody out there has used Linux virtual machines in the > head node, just to be able to experiment with different > configurations & deployments and jump back without much effort if > things go bad. Considering (out of experience in desktop) VMWare or > Sun Virtualbox. Any hints, comments I simulated a complete rocks system early on with VMs. I had 4 compute nodes and one head node. It was very useful for understanding some of the configuration basics. I never actually ran a problem on this system. I used Ubuntu server on a quad core system with kvm for this purpose and found it quite acceptable. I experimented with running our sge master as a VM on a login node but decided I preferred running this VM one one of our Xen hosts. We are using CentOS and Xen for production VMs. Once I figured out a prototype Xen configuration file it became pretty easy to run VMs for infrastructure purposes. Our system will consists of several compute clusters which need to share a common infrastructure. Rocks and other out of the box clustering systems are useful to study, but none of them look sufficient to handle our needs. Having an experimental VM structure is a good way to study the various systems. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From tjrc at sanger.ac.uk Wed Sep 16 07:37:23 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 16 Sep 2009 15:37:23 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <229971253055315@webmail34.yandex.ru> References: <229971253055315@webmail34.yandex.ru> Message-ID: On 15 Sep 2009, at 11:55 pm, Dmitry Zaletnev wrote: > When install CentOS 5.3, you get Xen virtual machine for free, with > a nice interface, and in it, modes with internal network and NAT to > outside world work simultaneously, witch is not the case of Sun xVM > VirtualBox. Never used VMWare because of its value of $189, people > say it's a good VM. But whatfor, if there is CentOS 5.3 with Xen, > the industry best emulator/VM? VMware has some free versions; the pay-for versions have a number of extra features which are generally missing from the competitors, and some of which are quite shiny. 
The automated hot migration of VMs to load balance, for example. Last time I looked, you could manually migrate Xen VMs from one host to another, but it wouldn't do it automatically. vSphere also has high availability and fault tolerance features; I use the HA but not the FT yet (FT is like Marathon for Windows - it runs two copies of the VM in lock-step on two hosts, so that if one of the physical servers dies, the VM doesn't even need to reboot. Obviously there's a significant performance penalty in this). The other thing I find useful in vSphere that isn't yet present in Xen (at least last time I looked) was the ability to give particular users fine-grained access to their VM. For example, I used to have to give some users sudo access to their machines, and generally I can get around that now by allowing them to reboot their virtual machine instead, and they no longer need sudo at all. I consider this an improvement. More contentious is the memory deduplication trick. I can see arguments both for and against this. VMware's workstations products, and Xen, and presumably other hypervisors, give the VM as much RAM as you configure it with, regardless of whether it's going to use it or not. ESX can be configured to do this too, but by default it doesn't, and allows you to overcommit memory. It pays for this partly by deduplicating memory pages. Here's the output from the esxtop monitor program on one of our VMware servers (an HP BL490 blade server with 72GB of RAM): 3:29:06pm up 6 days 23:40, 161 worlds; MEM overcommit avg: 0.00, 0.00, 0.00 PMEM /MB: 73718 total: 618 cos, 967 vmk, 30880 other, 41252 free VMKMEM/MB: 72164 managed: 4329 minfree, 5980 rsvd, 65700 ursvd, high state COSMEM/MB: 69 free: 1239 swap_t, 1239 swap_f: 0.00 r/s, 0.00 w/s NUMA /MB: 36582 (11703), 36034 (29548) PSHARE/MB: 3953 shared, 121 common: 3832 saving SWAP /MB: 13 curr, 0 target: 0.00 r/s, 0.00 w/s MEMCTL/MB: 0 curr, 0 target, 25748 max The PSHARE row is the key one here; it's identified 3953 MB of memory pages which are the same on various machines, and is using only 121 MB to store them, saving 3832 MB of RAM. Our VMs are very heterogeneous; there are CentOS, Scientific Linux, Debian 4, Debian 5, SLES 10 SP2, Windows XP (both 32 and 64-bit), Windows Server 2003 (both 32 and 64- bit), Solaris... if they were more homogeneous, I'm sure the PSHARE saving would be much higher. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rgb at phy.duke.edu Wed Sep 16 09:01:22 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 16 Sep 2009 12:01:22 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009, Tim Cutts wrote: > > On 15 Sep 2009, at 11:55 pm, Dmitry Zaletnev wrote: > >> When install CentOS 5.3, you get Xen virtual machine for free, with a nice >> interface, and in it, modes with internal network and NAT to outside world >> work simultaneously, witch is not the case of Sun xVM VirtualBox. Never >> used VMWare because of its value of $189, people say it's a good VM. But >> whatfor, if there is CentOS 5.3 with Xen, the industry best emulator/VM? 
> > VMware has some free versions; the pay-for versions have a number of extra > features which are generally missing from the competitors, and some of which > are quite shiny. The automated hot migration of VMs to load balance, for I'll also speak out in favor of VMware, as I use it pretty extensively (and had negative experiences the first few times I tried Xen). I haven't tried what is it, KVM, only because of a lack of time and energy. The primary advantage of VMware at the server level is probably its management interface, which is quite powerful and intuitive. In the latest server edition, it is web-based which gives you extremely easy ways of performing remote server management. I think it is an ideal way of running Windows servers where you can't live without them -- put e.g. Centos in a rock-solid, conservative, stripped, firewalled configuration on a multicore multiprocessor big memory server, create as many Windows Server VMs as you need and/or the machine supports, and you get the ability to do pretty much anything you want remotely (such as "hard reboot" a hung VM, checkpoint the VMs, stop the VMs and back up the then-hard VM images, "freeze" the VM so that on a reboot it goes back to the last pristine saved VM image, forgetting all the data and viruses and so on that might have accumulated in the meantime) on a very stable and secure base. It also gives you some interesting ways of accomplishing failover, as you can imagine, as the backed up VM images are quite portable and can be moved around in a VMware server farm. It's good for linux VMs too, don't get me wrong -- in fact, you can download a whole bunch of "canned" preconfigured VMs for e.g. mail server, web server etc. that only require you to boot them, adjust the configuration to fit your local requirements, and you're done. These prebuilt VMs can easily be made into a sandbox, or put outside your security boundary to do various chores with something even stronger than a chroot relative to "inside" servers and clients, on a single piece of hardware. I just don't think it is possible for one to propagate from inside a VM to the host OS or into other VMs, as everything is private unless you very deliberately set up sharing; they are really separate systems (although I'm guessing that somebody with root on the toplevel system would still own the world, but who knows?). But that's not where I use VMware the most. It's "Workstation" product is just awesome. It gives you what I would argue is a SLIGHTLY BETTER control interface to any or all VMs you might want to run on your personal desktop or laptop. I used to install Win/Lin dual boot against the not-rare-enough times I absolutely had to use Win for something, and of course this forces you to lose Lin. It also made it difficult and cumbersome to e.g. play Win games through an emulator or after a reboot. It was wasteful of resources, and of course BOTH Win AND Lin want to control the boot process, and getting a dual boot to actually work was a pain then and remains an even bigger pain today, since Vista-of-Evil doesn't want to play AT ALL nice in a dual boot environment -- I have yet to get it to work although I haven't tried infinitely hard I admit because I hate it anyway. Now it is trivial. Pop up the VMware console, boot up my XPPro VM and who needs Vista? XPPro will run forever on the virtualized hardware interface as long as I can get linux to boot and run devices on the toplevel system. 
If I change machines, my XPPro VM can go with me without all of the tedious crap from Windows Update and phone calls to Windows service people that don't know what you're talking about or what to do about it once they do. Diablo II expansion, a click away, snapshot/suspendable and resumable INDEFINITELY. Various window apps I have to have to work, sometimes, rarely. I can also run Fedora and Debian on the same machine in case I want to develop on both. I can set up a sandbox personal webserver on which to do web development without exposing my personal data to cracking and theft. Now for the bad news. VMware has its own share of "problems" with e.g. the rapidly varying linux kernel, and they are not always rapidly resolved. 6.5.2 would only run under e.g. Fedora 11 with a tedious patch and some flakes for most of its existence as a revision. 6.5.3 runs fine on Fedora 11, but the install RPM is broken and only a real linux geek can ease it through an install (by hand-making the target of the hung build script and then gently killing the hung steps of the make until the script frees up and concludes). I've heard of larger problems with bleeding edge kernels. For a server these usually aren't a problem, but for workstations and laptops (far more likely to run one of the dynamic, less debugged but more niftily tricked out distros) it can be. And yeah, it costs money, although Duke has a site license (finally) so it won't cost ME money any more, or at least not much. But I've paid for a Workstation license out of pocket. It's worth it. Unless/until Xen or KVM or something else comes out with a similarly powerful and tricked out console and ease of use and (still, overall) reliability, VMware will be on my personal laptops for the rest of time. It's just too useful a tool to live without, if you are a serious computer geek who develops software, webware, does consulting, plays games, needs multiple OS's but only want to carry one box and don't want to have to reconfigure reboot to get to them. "This message was paid for by the VMware corporation..." -- (not, just kidding, kidding:-) which now owes me at LEAST a couple of free copies of Workstation for the unsolicited testimonial that I would guess I will receive when hell freezes over... rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From gerry.creager at tamu.edu Wed Sep 16 09:28:17 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 16 Sep 2009 11:28:17 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <20090915232111.GB16891@bx9.net> References: <20090915232111.GB16891@bx9.net> Message-ID: <4AB11221.4060902@tamu.edu> Greg Lindahl wrote: > On Mon, Sep 14, 2009 at 05:52:07PM -0500, Rahul Nabar wrote: > >> "Forwarding Rate 131 Mpps" How does that tie in to the big picture? > > Most layer 3 devices are not capable of forwarding full line-rate > traffic of tiny packets. You should go hunt down a lab report on > switch testing to see these kinds of details discussed. A couple of vendors went after the Really Tiny Packet market some time back, among them Anritsu. Fujitsu resorbed Anritsu's switch-making capabilities, so it you want a real good idea of good small packet performance peruse the specs on the Fujitsu switches. The guiding document on packet switching is (or was) RFC2544. 
In looking at the Force10 S50 vs the Dell switch, I'd go with the increased performance specs in the Force10, even though the S50 isn't really Force10 silicon, if I recall correctly. I've several S50s in my data center, hammer the fool out of them, and am happy. Prior to them, we used Foundry EdgeIron1G switches for our gigabit-connected clusters. They worked well. For our newer gigabit-connected cluster we went with the HP 5412zl, and have been happy.

I'd not recommend cheap switches: they can bite you if you go too cheap, and the result is poor MPI and I/O performance.

gerry

From jlb17 at duke.edu Wed Sep 16 10:12:22 2009
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Wed, 16 Sep 2009 13:12:22 -0400 (EDT)
Subject: [Beowulf] Virtualization in head node ?
In-Reply-To:
References: <229971253055315@webmail34.yandex.ru>
Message-ID:

On Wed, 16 Sep 2009 at 12:01pm, Robert G. Brown wrote

> Unless/until Xen or KVM or something else comes out with a similarly
> powerful and tricked out console and ease of use and (still, overall)
> reliability, VMware will be on my personal laptops for the rest of time.
> It's just too useful a tool to live without, if you are a serious
> computer geek who develops software, webware, does consulting, plays
> games, needs multiple OS's but only want to carry one box and don't want
> to have to reconfigure reboot to get to them.

I was a dyed-in-the-wool VMware user until quite recently, too, but the pain of keeping it running on "current" distros (read: Fedora) finally forced me to look elsewhere. I think you'll be pleasantly surprised by VirtualBox if you give it a shot.

Then again, who knows what Oracle will do with it...

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF

From ashley at pittman.co.uk Wed Sep 16 10:39:27 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 16 Sep 2009 18:39:27 +0100
Subject: [Beowulf] Virtualization in head node ?
In-Reply-To:
References:
Message-ID: <1253122767.3677.14.camel@alpha>

On Mon, 2009-09-14 at 13:04 -0500, David Ramirez wrote:
> Still a newbie in HPC, in the first stages of building a Beowulf
> cluster (8 nodes).
>
> I wonder if anybody out there has used Linux virtual machines in the
> head node, just to be able to experiment with different configurations
> & deployments and jump back without much effort if things go bad.

I do this all the time. I tend to run either a virtual frontend and N virtual compute nodes locally on a server here, or more commonly I just start a number of Amazon EC2 instances as compute nodes and run the frontend functionality on node zero. For parallel jobs the compute performance is dreadful (I over-commit the virtual CPUs), but for experimenting and testing different setups it's ideal; I typically run 128-process jobs at a cost of $0.40 per hour. The software configuration and deployment is exactly the same on virtual machines as it is on physical machines unless you have any non-ethernet devices.

As for running production systems on virtual machines, I'd be happy to do that for the front-end as well; compute nodes should not be considered for this, however. On medium-sized clusters there is typically a "head node" which runs the management software, the resource manager and such, and a second front-end machine known as a "login node" that users log in to in order to compile code, submit jobs and perform cluster I/O. On smaller machines these two roles are more often rolled into one machine.
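[A minimal sketch of what that split can look like when both roles are guests on one physical box, using the libvirt Python bindings; the domain names, memory size, disk image paths and XML below are invented, and assume a KVM/qemu host with pre-built guest images.]

# Sketch only: define and start a "management" and a "login" guest
# with the libvirt Python bindings (python-libvirt).  Domain names,
# sizes and image paths are invented for illustration.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>%(name)s</name>
  <memory>2097152</memory>
  <vcpu>2</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/%(name)s.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'><source bridge='br0'/></interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")
for name in ("cluster-mgmt", "cluster-login"):
    dom = conn.defineXML(DOMAIN_XML % {"name": name})
    dom.create()    # boot the guest
    print("%s running: %s" % (name, dom.isActive()))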
Virtualisation allows you to separate out these two roles again onto separate VMs. If budget allows, getting two physical machines and running one management VM and two login VMs across them strikes me as a good solution for providing resilience. I'm sure a case could be made for running ten login instances here, but I'm not sure of the benefits myself.

Ashley Pittman.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From sbyna at nec-labs.com Wed Sep 16 03:30:54 2009
From: sbyna at nec-labs.com (Suren Byna)
Date: Wed, 16 Sep 2009 06:30:54 -0400
Subject: [Beowulf] [hpc-announce] CfP: Special Issue of JPDC on "Data Intensive Computing", Submission: Jan 15th 2010
Message-ID:

Call for Papers: Special Issue of Journal of Parallel and Distributed Computing on "Data Intensive Computing"
---------------------------------------------------------------------------

Data intensive computing is posing many challenges in exploiting the parallelism of current and upcoming computer architectures. Data volumes of applications in the fields of sciences and engineering, finance, media, online information resources, etc. are expected to double every two years over the next decade and beyond. With this continuing data explosion, it is necessary to store and process data efficiently by utilizing the enormous computing power that is available in the form of multi/manycore platforms. There is no doubt in the industry and research community that the importance of data intensive computing has been rising and that it will remain among the foremost fields of research. This rise brings up many research issues, in the form of capturing and accessing data effectively and quickly, processing it while still achieving high performance and high throughput, and storing it efficiently for future use. Programming for high-performance data intensive computing is an important and challenging issue. Expressing the data access requirements of applications and designing programming language abstractions to exploit parallelism are immediate needs. Application- and domain-specific optimizations are also part of a viable solution in data intensive computing. While these are only a few examples of the issues, research in data intensive computing has become quite intense during the last few years, yielding strong results.

This special issue of the Journal of Parallel and Distributed Computing (JPDC) is seeking original, unpublished research articles that describe recent advances and efforts in the design and development of data intensive computing functionalities and capabilities that will benefit many applications.
Topics of interest include (but are not limited to): * Data-intensive applications and their challenges * Storage and file systems * High performance data access toolkits * Fault tolerance, reliability, and availability * Meta-data management * Remote data access * Programming models, abstractions for data intensive computing * Compiler and runtime support * Data capturing, management, and scheduling techniques * Future research challenges of data intensive computing * Performance optimization techniques * Replication, archiving, preservation strategies * Real-time data intensive computing * Network support for data intensive computing * Challenges and solutions in the era of multi/many-core platforms * Stream computing * Green (Power efficient) data intensive computing * Security and protection of sensitive data in collaborative environments Guide for Authors Papers need not be solely abstract or conceptual in nature: proofs and experimental results can be included as appropriate. Authors should follow the JPDC manuscript format as described in the "Information for Authors" at the end of each issue of JPDC or at http://ees.elsevier.com/jpdc/ . The journal version will be reviewed as per JPDC review process for special issues. Important Dates: Paper Submission : January 15, 2010 Notification of Acceptance/Rejection : May 31, 2010 Final Version of the Paper : September 15, 2010 Submission Guidelines All manuscripts and any supplementary material should be submitted through Elsevier Editorial System (EES) at http://ees.elsevier.com/ jpdc . Authors must select "Special Issue: Data Intensive Computing" when they reach the "Article Type" step in the submission process. First time users must register themselves as Author. For the latest details of the JPDC special issue see http://www.cs.iit.edu/~suren/jpdc Guest Editors: Dr. Surendra Byna NEC Labs America E-mail: sbyna at nec-labs.com Prof. Xian-He Sun Illinois Institute of Technology E-mail: sun at cs.iit.edu From stuartb at 4gh.net Wed Sep 16 11:33:06 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 14:33:06 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009 at 12:01 -0000, Robert G. Brown wrote: > XPPro will run forever on the virtualized hardware interface as long > as I can get linux to boot and run devices on the toplevel system. > If I change machines, my XPPro VM can go with me without all of the > tedious crap from Windows Update and phone calls to Windows service > people that don't know what you're talking about or what to do about > it once they do. Are you sure about this? We have some interest in being able to archive complete installations for future use (e.g. +5-10 years). I'm skeptical of the existing trend of virtualization to handle all of the needs for product activation or other software licensing schemes. The following is speculation not facts. If you move an existing VM within the same virtualization and cpu technology you may be able to get away without reactivation or obtaining a new license key. MAC addresses can be set in several virtual environment which can help in some cases. However, a lot of other things can impact product activation and licensing checks. Different virtual environments provide different emulated devices. 
Emulated disk serial numbers, BIOS versions, cpu family, cpu stepping, processor flags and other unknown things may be included in the product activation or licensing checks. It may be that some of the processor emulation technologies can provide this functionality. qemu can emulate a number of hardware systems but again only in specific configurations which may differ from a real world licensed configuration. Stuart From bill at cse.ucdavis.edu Wed Sep 16 13:36:40 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 16 Sep 2009 13:36:40 -0700 Subject: [Beowulf] XEON power variations In-Reply-To: <4AAFEED9.6040302@pa.msu.edu> References: <4AAFEED9.6040302@pa.msu.edu> Message-ID: <4AB14C58.4080303@cse.ucdavis.edu> Tom Rockwell wrote: > Hi, > > Intel assigns the same power consumption to different clockspeeds of L, > E, X series XEON. All L series have the same rating, all E series etc. > So, taking their numbers, the fastest of each type will always have the > best performance per watt. Wrong, well they might. But not because the power use is the same. > And there is no power consumption penalty > for buying the fastest clockspeed of each type. No marketing visible penalty. Intel doesn't want you buying their low profit cheap chips instead of the high profit expensive chips because of the power you save. > Vendor's power > calculators reflect this (dell.com/calc for example). To me this seems > like marketing simplification... Anybody know any different, e.g. have > you seen other numbers from vendors or tested systems yourself? Try silicon mechanics, their configurator shows the system power for different configs. > Is the power consumption of a system with an E5502 CPU really the same > as one with an E5540? A random dual socket at SI shows 197 watts for a 5502 and 259 watts for a 5540. Intel/AMD often just bin them by power for marketing reasons. It's just funny business. So for instance the 3 new lynnfields are <= 95 watts, yet in testing the slowest is more power efficient they their "efficient" line which includes the 9550S which is rated as under <= 65 watts. So while intel isn't lying, it seems clear they are avoiding posting real numbers to make the faster clocked cpus more attractive. From rgb at phy.duke.edu Wed Sep 16 15:12:52 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 16 Sep 2009 18:12:52 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009, Stuart Barkley wrote: > On Wed, 16 Sep 2009 at 12:01 -0000, Robert G. Brown wrote: > >> XPPro will run forever on the virtualized hardware interface as long >> as I can get linux to boot and run devices on the toplevel system. >> If I change machines, my XPPro VM can go with me without all of the >> tedious crap from Windows Update and phone calls to Windows service >> people that don't know what you're talking about or what to do about >> it once they do. > > Are you sure about this? We have some interest in being able to > archive complete installations for future use (e.g. +5-10 years). > I'm skeptical of the existing trend of virtualization to handle all of > the needs for product activation or other software licensing schemes. Obviously, I have no idea. How could I? But it is plausible that it would work over a 10 year frame. 
There is really very, very little reason to change the virtualized hardware interface; it is usually chosen to be a wrapper that simulates something with boring, common, simple drivers OR it is a passthru. Passthru's obviously won't usually work -- if you had an OS from a decade ago that didn't know about USB, I'd guess that the USB drivers would either not work or would actively break it. OTOH, you can deconfigure USB passthru. I'd say that there is an excellent chance of running VMs for a 5+ year time frame, decreasing out to 10 but still quite possible at 10, especially for "vanilla" VMs that e.g. just use VGA, a basic network, and nothing else. You're obviously at serious risk of the CPU itself going away at the ten year mark. Who knows if even 32 bit emulation will exist in a decade, or if even 64 bit emulation will still exist? Who knows if the instruction set will have changed? Getting something that downshift emulates a 32 bit intel CPU on a 128 bit 16 core 2020 CPU with an entirely new instruction set... well, that might be a problem. > The following is speculation not facts. > > If you move an existing VM within the same virtualization and cpu > technology you may be able to get away without reactivation or > obtaining a new license key. MAC addresses can be set in several > virtual environment which can help in some cases. Usually this is the case, and often it is even legal, if you don't run two copies of one VM at once -- VMs are a good way to get failover and generally are viewed as a reinstall of a system from backup on new hardware, which is usually permitted or at least tolerated by any company in the server business. MS gets pissy IIRC about virtualizing Vista or better versions of Windows -- I think you have to have at least a business or professional class Vista license before it is legal to virtualize it. Of course this is shooting themselves in the foot, because it means that people trash Vista Home and install XPPro as a VM instead, stretching out that support lifetime still more. In a VM nobody cares -- if you boot a frozen VM image and mount space from elsewhere from data, even if they REMOVE all support and update streams you can't credibly get a virus on it. > However, a lot of other things can impact product activation and > licensing checks. Different virtual environments provide different > emulated devices. Emulated disk serial numbers, BIOS versions, cpu > family, cpu stepping, processor flags and other unknown things may be > included in the product activation or licensing checks. > > It may be that some of the processor emulation technologies can > provide this functionality. qemu can emulate a number of hardware > systems but again only in specific configurations which may differ > from a real world licensed configuration. Yeah, I don't really view it as a way of bypassing licensing, and in my case since Duke has a site license for Windows as well cloning an XPPro VM is totally legal for me. I did it yesterday -- it was awesome. A straight copy of the VM over, boot it on the new system AND on the old system, there it is right down to my (cough cough) copy of DII expansion. Which, um, I don't plan to play on more than one VM at a time...;-) rgb > > Stuart > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From stuartb at 4gh.net Wed Sep 16 16:51:30 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 19:51:30 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: This is drifting off topic but I want to clarify two points: - I'm not advocating violating any licensing agreement. I am interested in aspects of environments which interact with license management code. - I suspect attempting to move a Windows VM between two different VM implementations is troublesome. For example, moving between VMWare and VirtualBox or between major versions of a single emulation environment. I suspect these systems present substantially different environments which would trigger the Microsoft Product Activation code. I am curious about people's experience in moving virtual machines between VM implementations. I'm also interested in archival aspects of virtual machines and expected future operational capabilities. I'm glad to see relatively successful emulations of early systems I've used (IBM 1130, DEC-10). In addition to archiving source code and data files, does anyone worry about archiving hardware or other items to ensure reproducibility of results in the future? Stuart Barkley From rpnabar at gmail.com Thu Sep 17 06:08:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 08:08:39 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <4AB11221.4060902@tamu.edu> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: On Wed, Sep 16, 2009 at 11:28 AM, Gerry Creager wrote: > silicon, if I recall correctly. ?I've several S50s in my data center, hammer > the fool out of them, and am happy. Thanks Gerry! I have been getting many great reviews on Force10. Maybe I will seriously consider them. >Prior to them, we used Foundry > EdgeIron1G switches for our gigabit-connected clusters. ?They worked well. > ?For our newer gigabit-connected cluster we went with the HP 5412zl, and > have been happy. > > I'd not recommend cheap switches: They can bite you if you go too cheap and > result in poor MPI and I/O performance. On the other end of the spectrum is Cisco. Their gear seems at such a huge $$ premium with respect to the other vendors and when I ask why the best answer I get is "Cisco is the market leader in switches". They won't show me which of their parameters make a Cisco switch better than the rest. -- Rahul From hahn at mcmaster.ca Thu Sep 17 07:47:56 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 17 Sep 2009 10:47:56 -0400 (EDT) Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: > the best answer I get is "Cisco is the market leader in switches". > They won't show me which of their parameters make a Cisco switch > better than the rest. well, one parameter is "market share". another is "brand". cisco owners will also often advocate benchmarking based on the "comfort-level" parameter, but IMO this result overlaps two others. it's remarkably hard to find comparisons on old-fashioned, boring metrics like latency, backplane bandwidth/pps or mtbf. every for-cisco decision I've ever seen has been based on the first category of parameters. 
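[For the latency column at least, a crude but apples-to-apples number is easy to collect yourself: bounce small UDP packets between two nodes on the switch under test and look at the median round trip. A minimal sketch follows; the port number and packet count are arbitrary, and this is a rough probe, not an RFC 2544 tester.]

#!/usr/bin/env python
# Crude switch/NIC round-trip latency probe.  Run "latency.py server"
# on one node and "latency.py client <host>" on another attached to
# the same switch.  Port and count are arbitrary choices.
import socket, sys, time

PORT = 5001
COUNT = 1000

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    while True:                      # echo everything back
        data, addr = s.recvfrom(64)
        s.sendto(data, addr)

def client(host):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(1.0)
    rtts = []
    for i in range(COUNT):
        t0 = time.time()
        s.sendto(b"x" * 32, (host, PORT))
        s.recvfrom(64)
        rtts.append(time.time() - t0)
    rtts.sort()
    print("median RTT: %.1f us" % (rtts[COUNT // 2] * 1e6))

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])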
From rpnabar at gmail.com Thu Sep 17 14:37:26 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 16:37:26 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: On Thu, Sep 17, 2009 at 9:47 AM, Mark Hahn wrote: >> the best answer I get is "Cisco is the market leader in switches". >> They won't show me which of their parameters make a Cisco switch >> better than the rest. > > well, one parameter is "market share". ?another is "brand". > cisco owners will also often advocate benchmarking based on the > "comfort-level" parameter, but IMO this result overlaps two others. > > it's remarkably hard to find comparisons on old-fashioned, boring > metrics like latency, backplane bandwidth/pps or mtbf. > It is! Nobody will tell me those. And each switch manufacturer seems to use different metrics making apples-for-apples comparisons hard. I have the feeling that buying the backbone is going to be the most in-exact part of the venture. Greg previously mentioned looking at 3rd party lab reports. But I haven't found anything useful+ current yet. Any pointers? Lots of home-hardware lab reports but nothing on the HPC / enterprise switching side. -- Rahul From rpnabar at gmail.com Thu Sep 17 15:21:21 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 17:21:21 -0500 Subject: [Beowulf] storage server hardware considerations In-Reply-To: References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: On Wed, Jul 15, 2009 at 8:57 PM, Mark Hahn wrote: > if you can't saturate gigabit with very modest raids, > you're doing something wrong. ?a single ultra-cheap disk > these days (seagate 7200.12 500G, $60 or so) will hit close to 135 MB/s on > outer tracks and average > 105 over > the whole disk. ?I'm guessing that you're losing performance > due to either bad controllers (avoid HW raid on anything that's not fairly > recent) or a combination of raid6 and a write-heavy workload... Mark in a recent discussion recommends avoiding HW RAID. I am not exactly sure what the caveat is about. Does this mean that software RIAD under Linux outperform a dedicated RAID controllers? I was speccing out a DAS box with a storage server. The storage server had external RAID cards supposed to work with the SAS disks in the box. Is this a config to be frowned upon? Maybe I am misreading Mark's caveat. Any comments? -- Rahul From rpnabar at gmail.com Thu Sep 17 17:13:54 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 19:13:54 -0500 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <20090716063658.GM23524@leitl.org> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> <20090716063658.GM23524@leitl.org> Message-ID: On Thu, Jul 16, 2009 at 1:36 AM, Eugen Leitl wrote: > If it has to be cheap I'd take e.g. a Sun with 8x 2.5" SATA drives, > using 300 GByte WD VelociRaptors as a stripe over mirrors, or 15 > krpm SAS drives. > > Right now you can populate a SuperMicro chassis with 16x SATA 3.5" > achieving e.g. a 24 GByte dual-socket Nehalem with 32 TByte raw > storage (WD RE4; 4.8 TByte with WD VelociRaptor) for about > 6.2 kEUR sans VAT. Eugen, your setup seems very close to what I might go for. The cost seems attractive too. Do you know what kind of IOPS and MB/sec you are getting on this setup? 
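[For rough MB/sec and IOPS numbers on a box like that, a quick probe along the following lines is enough for a first comparison; the scratch path and sizes are arbitrary, and because there is no O_DIRECT here the page cache will flatter anything that fits in RAM.]

#!/usr/bin/env python
# Quick-and-dirty storage probe: streaming-write MB/s and random-read
# IOPS.  The test file path and sizes are arbitrary; results are only
# indicative (no O_DIRECT, so cache effects are not excluded).
import os, random, time

PATH = "/data/scratch/testfile"     # made-up path on the array
FILE_MB = 2048                      # write 2 GB
BLOCK = 1024 * 1024

# --- sequential write ---
t0 = time.time()
f = open(PATH, "wb")
buf = b"\0" * BLOCK
for i in range(FILE_MB):
    f.write(buf)
f.flush()
os.fsync(f.fileno())
f.close()
print("seq write: %.1f MB/s" % (FILE_MB / (time.time() - t0)))

# --- random 4k reads ---
fd = os.open(PATH, os.O_RDONLY)
t0 = time.time()
n = 10000
for i in range(n):
    os.lseek(fd, random.randrange(FILE_MB * BLOCK // 4096) * 4096, 0)
    os.read(fd, 4096)
os.close(fd)
print("random 4k read: %.0f IOPS" % (n / (time.time() - t0)))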
-- Rahul From gerry.creager at tamu.edu Thu Sep 17 19:18:06 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu, 17 Sep 2009 21:18:06 -0500 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <4AB2EDDE.1080005@tamu.edu> Joshua Baker-LePain wrote: > On Wed, 16 Sep 2009 at 12:01pm, Robert G. Brown wrote > >> Unless/until Xen or KVM or something else comes out with a similarly >> powerful and tricked out console and ease of use and (still, overall) >> reliability, VMware will be on my personal laptops for the rest of time. >> It's just too useful a tool to live without, if you are a serious >> computer geek who develops software, webware, does consulting, plays >> games, needs multiple OS's but only want to carry one box and don't want >> to have to reconfigure reboot to get to them. > > I was a dyed-in-the-wool vmware user until quite recently, too, but the > pain of keeping it running on "current" distros (read: Fedora) finally > forced me to look elsewhere. I think you'll be pleasantly surprised by > VirtualBox if you give it a shot. > > Then again, who knows what Oracle will do with it... I'm not sure I'd TRY to keep it running on Fedora. Too bleeding edge for my clusters! -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Thu Sep 17 20:58:52 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu, 17 Sep 2009 22:58:52 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: <4AB3057C.3060607@tamu.edu> Rahul Nabar wrote: > On Wed, Sep 16, 2009 at 11:28 AM, Gerry Creager wrote: >> silicon, if I recall correctly. I've several S50s in my data center, hammer >> the fool out of them, and am happy. > > Thanks Gerry! I have been getting many great reviews on Force10. Maybe > I will seriously consider them. > >> Prior to them, we used Foundry >> EdgeIron1G switches for our gigabit-connected clusters. They worked well. >> For our newer gigabit-connected cluster we went with the HP 5412zl, and >> have been happy. >> >> I'd not recommend cheap switches: They can bite you if you go too cheap and >> result in poor MPI and I/O performance. > > On the other end of the spectrum is Cisco. Their gear seems at such a > huge $$ premium with respect to the other vendors and when I ask why > the best answer I get is "Cisco is the market leader in switches". > They won't show me which of their parameters make a Cisco switch > better than the rest. With the POSSIBLE exception of the newer Nexus line from Cisco, I can't think of a reason I'd put a Cisco-labeled switch in my data center... except for a Linksys for non-critical applications. 
-- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From eugen at leitl.org Fri Sep 18 01:15:25 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 18 Sep 2009 10:15:25 +0200 Subject: [Beowulf] storage server hardware considerations In-Reply-To: References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> <20090716063658.GM23524@leitl.org> Message-ID: <20090918081525.GO9828@leitl.org> On Thu, Sep 17, 2009 at 07:13:54PM -0500, Rahul Nabar wrote: > Eugen, your setup seems very close to what I might go for. The cost > seems attractive too. Do you know what kind of IOPS and MB/sec you are > getting on this setup? I've posted the bonnie++ data the other day (14x RE4 as RAID 10, 2x RE4 as RAID 1 root). I've used two LSI SAS3081E-R. No idea about IOPS, which benchmark should I run? An interesting variation on this is to use a 24-drive chassis (with 2x3.5" internally, populated with 4x Intel SSD for ZIL and L2ARC), use 3x LSI SAS3081E-R and to use Opensolaris with zfs on top of that. There's a thread on zfs-discuss@ ongoing right now: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-September/thread.html Some relevant links: http://blogs.sun.com/relling/tags/mttdl http://blogs.sun.com/relling/entry/a_story_of_two_mttdl http://blogs.sun.com/brendan/entry/test http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From rpnabar at gmail.com Fri Sep 18 05:11:41 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 18 Sep 2009 07:11:41 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <4AB3057C.3060607@tamu.edu> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager wrote: > > With the POSSIBLE exception of the newer Nexus line from Cisco, I can't > think of a reason I'd put a Cisco-labeled switch in my data center... except > for a Linksys for non-critical applications. Nexus is expensive though! They tell me that it approaches latencies of Infiniband and is close to the operations of a lossless switch. That's what I got from their sales guy. But again, we weren't ready to pay the premium. -- Rahul From rgb at phy.duke.edu Fri Sep 18 05:15:35 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 18 Sep 2009 08:15:35 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <4AB2EDDE.1080005@tamu.edu> References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: On Thu, 17 Sep 2009, Gerry Creager wrote: >> I was a dyed-in-the-wool vmware user until quite recently, too, but the >> pain of keeping it running on "current" distros (read: Fedora) finally >> forced me to look elsewhere. I think you'll be pleasantly surprised by >> VirtualBox if you give it a shot. >> >> Then again, who knows what Oracle will do with it... > > I'm not sure I'd TRY to keep it running on Fedora. Too bleeding edge for my > clusters! I don't use Fedora on clusters, I use it on laptops, where bleeding edge is often necessary. 
I just got and reinstalled a Studio 17 Dell (which came with VoEvil, of course) and it wouldn't even boot the F10 install image (at least not without a lot more energy than I had to put into it). F11 it booted, and installed, flawlessly. From what Google turned up, Ubuntu will work too. The VMware hassle on F11 (and Ubuntu -- actually on current-gen kernels in general) has been the exception rather than the rule and seems to be due to a surprising lag between recent major changes in some of the kernel sources, plus the shift in Fedora from OSS to ALSA-only with OSS emulation a deprecated, difficult to restore option. But I will try VBox at my next reasonable opportunity. On servers I run Centos or RHEL (licenses and all) as the vendor of the software requires. Generally Centos on top, then VMware, then RHEL VMs. Works fine. The only bad thing I've seen about Centos in the past is the dark side of a long term freeze -- some very useful tools and libraries have been in rapid development (notably the GSL and Yum). RHEL 4 just sucked in this regard, with up2date instead of yum, and an early, broken version of the GSL. Fedora is too fast, RHEL too slow. What can you do? rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From tjrc at sanger.ac.uk Fri Sep 18 06:13:36 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Fri, 18 Sep 2009 14:13:36 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <8F1ABB96-9E6A-4595-9565-A38A41D339CB@sanger.ac.uk> On 18 Sep 2009, at 1:15 pm, Robert G. Brown wrote: > On Thu, 17 Sep 2009, Gerry Creager wrote: > >>> I was a dyed-in-the-wool vmware user until quite recently, too, >>> but the pain of keeping it running on "current" distros (read: >>> Fedora) finally forced me to look elsewhere. I think you'll be >>> pleasantly surprised by VirtualBox if you give it a shot. >>> Then again, who knows what Oracle will do with it... >> >> I'm not sure I'd TRY to keep it running on Fedora. Too bleeding >> edge for my clusters! > > I don't use Fedora on clusters, I use it on laptops, where bleeding > edge > is often necessary. I just got and reinstalled a Studio 17 Dell > (which > came with VoEvil, of course) and it wouldn't even boot the F10 install > image (at least not without a lot more energy than I had to put into > it). F11 it booted, and installed, flawlessly. From what Google > turned > up, Ubuntu will work too. Ah, OK, so I can understand the VMware pain from that side. But the pain we were talking about was maintaining old OS services for a long time, and of course that's hopefully less difficult; as long as VMware don't change the virtual hardware too much, we should be fine (and so far they've been very good at maintaining backward compatibility). I still take the point (that someone made, sorry I don't remember who) that there may still be licensing issues for services built on proprietary operating systems and such, but in my view that's a good argument for building such services on open source software in the first place. "Doctor, it hurts when I poke this sharp stick in my eye"... 
:-) > The VMware hassle on F11 (and Ubuntu -- actually on current-gen > kernels > in general) has been the exception rather than the rule and seems to > be > due to a surprising lag between recent major changes in some of the > kernel sources, plus the shift in Fedora from OSS to ALSA-only with > OSS > emulation a deprecated, difficult to restore option. But I will try > VBox at my next reasonable opportunity. > On servers I run Centos or RHEL (licenses and all) as the vendor of > the > software requires. Generally Centos on top, then VMware, then RHEL > VMs. > Works fine. The only bad thing I've seen about Centos in the past is > the dark side of a long term freeze -- some very useful tools and > libraries have been in rapid development (notably the GSL and Yum). > RHEL 4 just sucked in this regard, with up2date instead of yum, and an > early, broken version of the GSL. Fedora is too fast, RHEL too slow. > What can you do? I'm not sure there's any perfect answer to that one. The Debian family of distros have a similar problem. Debian stable changes too slowly, testing is too fast. Ubuntu seem to have a reasonable compromise; two updates a year if you want bleeding edge, and LTS releases every so often for those for whom stability is everything. The only problem with the debian family, of course, is struggles with ISV support, although that is coming, slowly. VMware now fully support Debian as well as Ubuntu as ESX guests, which has made my life much easier. They don't seem to support CentOS, but I just lie to VMware and tell it the machine is running Red Hat, and it seems to behave fine. Our solution at Sanger to the stable vs uptodate argument has basically been to go with Debian stable, and maintain our own repository of backported packages for when we need something more recent. Fortunately the number of packages we've had to backport or patch has been fairly small. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From h-bugge at online.no Fri Sep 18 06:31:56 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Fri, 18 Sep 2009 15:31:56 +0200 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> On Sep 18, 2009, at 14:11 , Rahul Nabar wrote: > On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager > wrote: > > > Nexus is expensive though! They tell me that it approaches latencies > of Infiniband and is close to the operations of a lossless switch. > That's what I got from their sales guy. But again, we weren't ready to > pay the premium. You might check out the reports from the Tolly Group (www.tolly.com), they used to evaluate different eth switches. Not sure how un-biased they are though. H?kon From prentice at ias.edu Fri Sep 18 10:37:43 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 18 Sep 2009 13:37:43 -0400 Subject: [Beowulf] Ubuntu/laptops In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <4AB3C567.5040400@ias.edu> Robert G. Brown wrote: > On Thu, 17 Sep 2009, Gerry Creager wrote: > > I don't use Fedora on clusters, I use it on laptops, where bleeding edge > is often necessary. 
I just got and reinstalled a Studio 17 Dell (which > came with VoEvil, of course) and it wouldn't even boot the F10 install > image (at least not without a lot more energy than I had to put into > it). F11 it booted, and installed, flawlessly. From what Google turned > up, Ubuntu will work too. Off-topic: From my experience, Ubuntu works fantastic on laptops, better than anything else, except *maybe* OS X. -- Prentice From john.hearns at mclaren.com Thu Sep 17 01:24:20 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 17 Sep 2009 09:24:20 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D2C9AAA@milexchmb1.mil.tagmclarengroup.com> If you move an existing VM within the same virtualization and cpu technology you may be able to get away without reactivation or obtaining a new license key. MAC addresses can be set in several virtual environment which can help in some cases. I get your point here. I deal a lot with licensed ISV type software - each has its own foibles, though most use some flavour of FlexLM. Maybe we should look at this the other way round - software licensing should be changed to cope with virtual machines being commonplace. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Thu Sep 17 02:13:56 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 17 Sep 2009 10:13:56 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D2C9B4B@milexchmb1.mil.tagmclarengroup.com> Passthru's obviously won't usually work -- if you had an OS from a decade ago that didn't know about USB, I'd guess that the USB drivers would either not work or would actively break it. OTOH, you can deconfigure USB passthru. For what its worth, my current desktop is a 'recycled' XP workstation - actually quite nice as it has SCSI drives. I had the idea of using the existing XP partition under Virtualbox (you can do this - you can point Virtualbox to a real disk partition and boot it up). I got the OS to boot up - but it fails as the 'real' install of XP never had the drivers for USB keyboards. You would have to boot it up for real with a PS2 keyboard and mouse, then load up the correct drivers. I didn't take it much further than that. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gmkurtzer at gmail.com Fri Sep 18 08:22:29 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Fri, 18 Sep 2009 08:22:29 -0700 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <571f1a060909180822u4bfd4493w786c133f6bd1fcf7@mail.gmail.com> On Fri, Sep 18, 2009 at 5:15 AM, Robert G. Brown wrote: [snip] > On servers I run Centos or RHEL (licenses and all) as the vendor of the > software requires. ?Generally Centos on top, then VMware, then RHEL VMs. > Works fine. 
?The only bad thing I've seen about Centos in the past is > the dark side of a long term freeze -- some very useful tools and > libraries have been in rapid development (notably the GSL and Yum). > RHEL 4 just sucked in this regard, with up2date instead of yum, and an > early, broken version of the GSL. ?Fedora is too fast, RHEL too slow. > What can you do? > We have tried maintaining that balance with Caos Linux, and focusing now on Caos NSA (Node Server Appliance) which makes it a decent HPC solution (at least that is what we are going for) for many of the reasons that we have stated. Warning, it is not intended to be anything aside from a cluster/server type of solution so don't try it on your laptop. ;) http://www.caoslinux.org/ Greg -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From lindahl at pbm.com Fri Sep 18 16:31:37 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 18 Sep 2009 16:31:37 -0700 Subject: [Beowulf] memory for sale Message-ID: <20090918233137.GB17778@bx9.net> If you were thinking, Hey, my DDR2 FBDimm cluster could use a memory upgrade, I have 2800+ 2GB modules I'm about to Ebay. Apacer 78.AKGAB.425 2GB FBD PC2-5300 CL5 Our accountant tells us it's probably a bad idea to show revenue before our launch... -- greg From niftyompi at niftyegg.com Sat Sep 19 16:58:28 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Sat, 19 Sep 2009 16:58:28 -0700 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: <20090919235828.GA3449@compegg> On Fri, Sep 18, 2009 at 07:11:41AM -0500, Rahul Nabar wrote: > On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager wrote: > > > > With the POSSIBLE exception of the newer Nexus line from Cisco, I can't > > think of a reason I'd put a Cisco-labeled switch in my data center... except > > for a Linksys for non-critical applications. > > > Nexus is expensive though! They tell me that it approaches latencies > of Infiniband and is close to the operations of a lossless switch. > That's what I got from their sales guy. But again, we weren't ready to > pay the premium. I suspect one important point is the difference between data center needs and some types of Beowulf cluster needs. Today a data center needs lots of management bells and whistles while a cluster locally needs flat out packet switching and minimum management overhead. A modern business data closet is full of a tangled mix of services and regulated activities that requires more than just data switching. In this regard Cisco (and others too) has some very valuable product offerings. A small but full feature switch can act as the gate keeper filtering packets interfacing to the Internet, campus and routing services. In some cases it can also provide audit that net Nannie policy may require. i.e. it is part of the campus IT service foundation. The cluster itself needs very little in the way of special services and can be setup and managed as a homogeneous soft gooey center with a hard crusty outside. A "simple" but fast switch with enough ports seems sufficient. NFS traffic (fast cascading funnel tree) can be different than say MPI traffic with all hosts communicating at the same time with all the neighbors (one big cross bar). Your cluster design may well shape your switch benchmark testing. A quick look at Nexus data sheets tells me that you are paying for future expansion. 
The chassis behind the cards is fast, very fast. Heck call Cisco and ask for an evaluation sample, or evaluation discount. ;-) -- T o m M i t c h e l l Found me a new hat, now what? From jellogum at gmail.com Tue Sep 22 12:01:53 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 22 Sep 2009 12:01:53 -0700 Subject: [Beowulf] How do I work around this patent? Message-ID: I joined this community many years ago to learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study. This problem is related to the boring world of business and IP patents. I like patents, but lately I wonder... RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. District Court for the Eastern District of Texas, Tyler Division, for violating patent 7181690. The patent is described as a method for enabling users to interact in a virtual space through avatars." Online read on patents: http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 Can someone help me to better understand how these patents interact with the open source bazaar method of programing, Linux, the law, GIS systems with meta data that is essentially 3-D access for a user's avatar, etc? I am having flow chart issues that are not flowing... and I am now back to the world of research (patents) when I would rather be writing and compiling software. -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Tue Sep 22 13:10:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 22 Sep 2009 13:10:57 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090922201057.GA27344@bx9.net> > This problem is related to > the boring world of business and IP patents. I like patents, but lately I > wonder... This is fairly off-topic for this list, but: It's basically impossible to write any significant program these days without infringing on dozens or hundreds of patents. The standard legal advice to software startups is to not read any patents, in order to avoid willful infringement. *If* you get sued, then it's worth looking at the patent in question to see if you can work around it. You can see this process in action in Linux with the argument over workarounds for the "long names in FAT filesystems" patent. The area you're apparenetly interested in, virtual worlds, likely has a zillion patents with a lot of overlap. The situation is the same for things like distributed filesystems, compilers, and perhaps MPI. -- greg From james.p.lux at jpl.nasa.gov Tue Sep 22 13:55:11 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 22 Sep 2009 13:55:11 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jeremy Baker Sent: Tuesday, September 22, 2009 12:02 PM To: beowulf at beowulf.org Subject: [Beowulf] How do I work around this patent? I joined this community many years ago to learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study.? 
This problem is related to the boring world of business and IP patents. I like patents, but lately I wonder... RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. District Court for the Eastern District of Texas, Tyler Division, for violating patent 7181690. The patent is described as a method for enabling users to interact in a virtual space through avatars." Online read on patents: http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 Can someone help me to better understand how these patents interact with the open source bazaar method of programing, Linux, the law, GIS systems with meta data that is essentially 3-D access for a user's avatar, etc? I am having flow chart issues that are not flowing... and I am now back to the world of research (patents) when I would rather be writing and compiling software. --- First off, you should know that only the claims determine what the patent covers. The rest of the patent is just useful information. You need to decide if your application "reads on" the claims. Hiring a patent attorney used to doing this kind of analysis is useful.. a few hundred bucks well spent. Note there's a hierarchy of claims here.. Claim 1 is a big claim, and then, 2,3,4,and 5 hang on Claim 1. Glancing through the claims, it looks like they are patenting a scheme very much like described in Michael Crichton's "Disclosure", especially with avatars for other users. Or any of a number of other multi user schemes. Maybe your implementation doesn't read on the claims. I note that most of the claims specifically reference "less than all the other users'" etc. If your implementation has your local client receiving ALL the other user info, then this patent doesn't apply. (Claims 1,6, 9, 10, 11, 15, and 18 all have the "less than all" wording, the rest are subordinate claims) In any case, if what you are doing happens to match the claims, you can always try to break the patent (i.e. find a prior disclosure of what's being patented.. a description in a novel might actually be good enough.. consult your patent attorney). Or, you can patent something yourself, and then offer to cross license with the holder here. Maybe worlds.com would be willing? Whether you are doing open source or not doesn't really have any influence on whether something is infringing. If you're doing something described by the claims, you're infringing. Note that this patent was originally applied for in the mid 90s.. going to be dicey on the prior art. From niftyompi at niftyegg.com Tue Sep 22 16:07:41 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 22 Sep 2009 16:07:41 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> On Tue, Sep 22, 2009 at 01:55:11PM -0700, Lux, Jim (337C) wrote: > Subject: [Beowulf] How do I work around this patent? > > I joined this community many years ago to learn about GRID computing... > > "...Worlds.com filed a lawsuit .... for violating patent 7181690. This sounds like this is a patent for implementing the dictionary definition of an avatar. The dictionary definition may provide prior art ;-) and narrow their applicability. 
Reading through it the implementation includes bits I know or suspect to be in well known programs like Microsoft Flight simulator, the SGI "dog" multiuser flight simulator an SGI paper airplane demo that prunes the 3D space to render and interact with all combined with bits of centralized "Go" and "Chess" game servers that have been out there almost as long as the internet. And Big Bertha networked progressive slot machines too.. Greg and Jim's comments are spot on. Greg has his name on some clever patents, I do not know abut Jim. One of the critical points in a patent is that it not be obvious. So the point that you should not look at patents is spot on. If you reinvent the idea with trivial effort - one point to you. So.. Unplug your development stations from the Internet and go back to work in isolation on a private internet. Document your design and go see a patent attorney with your design. Update the design and send him/her updates on a regular basis. In some cases he does not need to read them, just date and file them. In your design document comment on all the moving parts, trivial, clever, novel, critical to the product etc. A good one may also see value in things you might dismiss. Keep the inventor list up to date too. One IMPORTANT point is the moment (date time stamp) that your code is seen live outside of the lab. Alpha and Beta testers can start the clock for you on some critical bits. Same for investor disclosure without NDA etc... demos for the kids etc. Good legal advice can help on all these bits. -- T o m M i t c h e l l Found me a new hat, now what? From james.p.lux at jpl.nasa.gov Tue Sep 22 16:12:44 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 22 Sep 2009 16:12:44 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> Message-ID: > Greg and Jim's comments are spot on. > Greg has his name on some clever patents, I do not know abut Jim. > I don't know that it's necessarily clever, but US 5,971,765 is mine... It's certainly unique... (and has been litigated, too..) From rpnabar at gmail.com Tue Sep 22 20:51:40 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 22 Sep 2009 23:51:40 -0400 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <20090919235828.GA3449@compegg> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> <20090919235828.GA3449@compegg> Message-ID: On Sat, Sep 19, 2009 at 7:58 PM, Nifty Tom Mitchell wrote: > > The cluster itself needs very little in the way of special services and can be > setup and managed as a homogeneous soft gooey center with a hard crusty > outside. ? A "simple" but fast switch with enough ports seems sufficient. > NFS traffic (fast cascading funnel tree) can be different than say MPI traffic > with all hosts communicating at the same time with all the neighbors (one big cross bar). > Your cluster design may well shape your switch benchmark testing. Yup, I guess we need to wait for a "simple computing" switch model for HPC just like siilar offerings on the compute server side recently. I've had not much success getting any significant discounts or evaluation switches off my local Cisco Vendors. Now if only there are any Cisco powers-that-be on this mailing list............ 
:-) -- Rahul From rpnabar at gmail.com Tue Sep 22 20:53:55 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 22 Sep 2009 23:53:55 -0400 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> Message-ID: On Fri, Sep 18, 2009 at 9:31 AM, H?kon Bugge wrote: > You might check out the reports from the Tolly Group (www.tolly.com), they > used to evaluate different eth switches. Not sure how un-biased they are > though. Thanks Hakon! The Tolly group is definately a good lead. i found some Dell reviews on there. But Cisco switches seem non existant. The three reviews that I could find were from back in the 90's! -- Rahul From eugen at leitl.org Wed Sep 23 06:26:05 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 23 Sep 2009 15:26:05 +0200 Subject: [Beowulf] Microsoft acquires the technology assets of Interactive Supercomputing (ISC) Message-ID: <20090923132605.GZ27331@leitl.org> FYI http://blogs.technet.com/windowsserver/archive/2009/09/21/microsoft-has-acquired-the-technology-assets-of-interactive-supercomputing-isc.aspx Microsoft acquires the technology assets of Interactive Supercomputing (ISC) Hello everyone, Today, I?m very excited to announce that Microsoft has acquired the technology assets of Interactive Supercomputing (ISC), a company that specializes in bringing the power of parallel computing to the desktop and making high performance computing more accessible to end users. This move represents our ongoing commitment to parallel computing and high performance computing (HPC) and will bring together complementary technologies that will help simplify the complexity and difficulty of expressing problems that can be parallelized. ISC?s products and technology enable faster prototyping, iteration, and deployment of large-scale parallel solutions, which is well aligned with our vision of making high performance computing and parallel computing easier, both on the desktop and in the cluster. Bill Blake, CEO of ISC, is bringing over a team of industry leading experts on parallel and high performance computing that will join the Microsoft team at the New England Research & Development Center in Cambridge, MA. He and I are both excited to start working together on the next generation of technology for researchers, analysts, and engineers, as well as those who have yet to be exposed to the benefits of parallel computing and HPC technologies or may have thought they were out of reach. We have recently begun plans to integrate ISC technologies into future versions of Microsoft products and will provide more information over the coming months on where and how that integration will occur. Beginning immediately, Microsoft will provide support for ISC?s current Star-P customers and we are committed to continually listening to customer needs as we develop the next generation of HPC and parallel computing technologies. I?m looking forward to the opportunities our two combined groups have to greatly improve the capability, performance, and accessibility of parallel computing and HPC technologies. You can find more information on HPC and parallel computing at Microsoft in these links and stay up to date on integration news and updates at Microsoft Pathways, our acquisition information site. 
Kyril Faenov General Manager, High Performance & Parallel Computing Technologies Filed under: HPC, High Performance Computing, windows hpc server 2008, Parallel Computing From hahn at mcmaster.ca Wed Sep 23 13:28:03 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 23 Sep 2009 16:28:03 -0400 (EDT) Subject: [Beowulf] Microsoft acquires the technology assets of Interactive Supercomputing (ISC) In-Reply-To: <20090923132605.GZ27331@leitl.org> References: <20090923132605.GZ27331@leitl.org> Message-ID: > Microsoft acquires the technology assets of Interactive Supercomputing (ISC) ... > Filed under: HPC, High Performance Computing, windows hpc server 2008, > Parallel Computing filed under: yikes!, resistance-is-futile, borgasm, sigh. From trainor at presciencetrust.org Tue Sep 22 20:52:06 2009 From: trainor at presciencetrust.org (Douglas J. Trainor) Date: Wed, 23 Sep 2009 03:52:06 +0000 (GMT) Subject: [Beowulf] How do I work around this patent? Message-ID: <168146350.113478.1253677926162.JavaMail.mail@webmail11> An HTML attachment was scrubbed... URL: From m at pavis.biodec.com Wed Sep 23 01:18:10 2009 From: m at pavis.biodec.com (m at pavis.biodec.com) Date: Wed, 23 Sep 2009 10:18:10 +0200 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090923081810.GB28861@pavis.biodec.com> * Jeremy Baker (jellogum at gmail.com) [090922 21:17]: > > Can someone help me to better understand how these patents interact with > the open source bazaar method of programing, Linux, the law, GIS systems > with meta data that is essentially 3-D access for a user's avatar, etc? I > am having flow chart issues that are not flowing... and I am now back to > the world of research (patents) when I would rather be writing and > compiling software. > you just ignore it, as everybody should, then move to a country where software patents simply do not exist (like Europe, for example: yes, there is a thing called the European Patent Office; no, those pieces of paper that they issue have no value whatsoever) -- .*. finelli /V\ (/ \) -------------------------------------------------------------- ( ) Linux: Friends don't let friends use Piccolosoffice ^^-^^ -------------------------------------------------------------- From kc0hwa at gmail.com Wed Sep 23 09:19:31 2009 From: kc0hwa at gmail.com (J Lee Hughes) Date: Wed, 23 Sep 2009 11:19:31 -0500 Subject: [Beowulf] Re: Beowulf Digest, Vol 67, Issue 31 In-Reply-To: <200909231329.n8NDTM3W013688@bluewest.scyld.com> References: <200909231329.n8NDTM3W013688@bluewest.scyld.com> Message-ID: 1. With a cluster, can the controller power up nodes when they are needed, and power them down when they are not needed? 2. What is the difference between a computer cluster, a computer ray, and a computer grid? ============================== J Lee Hughes K C 0 H W A 73 ============================= Do what you can every day! Learn what you can every day! Life is good! ============================= Mike Ditka - "If God had wanted man to play soccer, he wouldn't have given us arms."
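On point 1, powering nodes up and down in practice usually means scripting their BMCs (or smart PDUs). A minimal sketch with ipmitool, assuming the nodes carry IPMI management cards reachable over the network; the hostname, username and password below are only placeholders:

  # power a node on or off through its BMC (placeholder host and credentials)
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power on
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power soft
  # query the current state
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power status

Schedulers that can park idle nodes generally just call commands along these lines, or the equivalent on an intelligent power distribution unit, from their power-saving hooks.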
On Wed, Sep 23, 2009 at 8:29 AM, wrote: > Send Beowulf mailing list submissions to > beowulf at beowulf.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.beowulf.org/mailman/listinfo/beowulf > or, via email, send a message with subject or body 'help' to > beowulf-request at beowulf.org > > You can reach the person managing the list at > beowulf-owner at beowulf.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Beowulf digest..." > > > Today's Topics: > > 1. How do I work around this patent? (Jeremy Baker) > 2. Re: How do I work around this patent? (Greg Lindahl) > 3. RE: How do I work around this patent? (Lux, Jim (337C)) > 4. Re: How do I work around this patent? (Nifty Tom Mitchell) > 5. RE: How do I work around this patent? (Lux, Jim (337C)) > 6. Re: Re: switching capacity terminology confusion (Rahul Nabar) > 7. Re: Re: switching capacity terminology confusion (Rahul Nabar) > 8. Microsoft acquires the technology assets of Interactive > Supercomputing (ISC) (Eugen Leitl) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 22 Sep 2009 12:01:53 -0700 > From: Jeremy Baker > Subject: [Beowulf] How do I work around this patent? > To: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > I joined this community many years ago to learn about GRID computing when I > was studying biology and the Linux file system, with future goals to write > interesting open source programs. It's the future and I just hit a wall in > the design process of writing code for my study. This problem is related > to > the boring world of business and IP patents. I like patents, but lately I > wonder... > > RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. > District Court for the Eastern District of Texas, Tyler Division, for > violating patent 7181690. The patent is described as a method for enabling > users to interact in a virtual space through avatars." > > Online read on patents: > http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 > > http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 > > Can someone help me to better understand how these patents interact with > the > open source bazaar method of programing, Linux, the law, GIS systems with > meta data that is essentially 3-D access for a user's avatar, etc? I am > having flow chart issues that are not flowing... and I am now back to the > world of research (patents) when I would rather be writing and compiling > software. > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://www.scyld.com/pipermail/beowulf/attachments/20090922/645d9a8b/attachment-0001.html > > ------------------------------ > > Message: 2 > Date: Tue, 22 Sep 2009 13:10:57 -0700 > From: Greg Lindahl > Subject: Re: [Beowulf] How do I work around this patent? > To: beowulf at beowulf.org > Message-ID: <20090922201057.GA27344 at bx9.net> > Content-Type: text/plain; charset=us-ascii > > > This problem is related to > > the boring world of business and IP patents. I like patents, but lately I > > wonder... > > This is fairly off-topic for this list, but: > > It's basically impossible to write any significant program these days > without infringing on dozens or hundreds of patents. The standard > legal advice to software startups is to not read any patents, in order > to avoid willful infringement. 
*If* you get sued, then it's worth > looking at the patent in question to see if you can work around it. > You can see this process in action in Linux with the argument over > workarounds for the "long names in FAT filesystems" patent. > > The area you're apparenetly interested in, virtual worlds, likely has > a zillion patents with a lot of overlap. The situation is the same for > things like distributed filesystems, compilers, and perhaps MPI. > > -- greg > > > > > ------------------------------ > > Message: 3 > Date: Tue, 22 Sep 2009 13:55:11 -0700 > From: "Lux, Jim (337C)" > Subject: RE: [Beowulf] How do I work around this patent? > To: Jeremy Baker , "beowulf at beowulf.org" > > Message-ID: > > > > Content-Type: text/plain; charset="iso-8859-1" > > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On > Behalf Of Jeremy Baker > Sent: Tuesday, September 22, 2009 12:02 PM > To: beowulf at beowulf.org > Subject: [Beowulf] How do I work around this patent? > > I joined this community many years ago to learn about GRID computing when I > was studying biology and the Linux file system, with future goals to write > interesting open source programs. It's the future and I just hit a wall in > the design process of writing code for my study. This problem is related to > the boring world of business and IP patents. I like patents, but lately I > wonder... > > RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. > District Court for the Eastern District of Texas, Tyler Division, for > violating patent 7181690. The patent is described as a method for enabling > users to interact in a virtual space through avatars." > > Online read on patents: > http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 > > http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 > > Can someone help me to better understand how these patents interact with > the open source bazaar method of programing, Linux, the law, GIS systems > with meta data that is essentially 3-D access for a user's avatar, etc? I am > having flow chart issues that are not flowing... and I am now back to the > world of research (patents) when I would rather be writing and compiling > software. > > > > --- > > First off, you should know that only the claims determine what the patent > covers. The rest of the patent is just useful information. You need to > decide if your application "reads on" the claims. Hiring a patent attorney > used to doing this kind of analysis is useful.. a few hundred bucks well > spent. Note there's a hierarchy of claims here.. Claim 1 is a big claim, > and then, 2,3,4,and 5 hang on Claim 1. > > Glancing through the claims, it looks like they are patenting a scheme very > much like described in Michael Crichton's "Disclosure", especially with > avatars for other users. Or any of a number of other multi user schemes. > > Maybe your implementation doesn't read on the claims. I note that most of > the claims specifically reference "less than all the other users'" etc. If > your implementation has your local client receiving ALL the other user info, > then this patent doesn't apply. (Claims 1,6, 9, 10, 11, 15, and 18 all have > the "less than all" wording, the rest are subordinate claims) > > > In any case, if what you are doing happens to match the claims, you can > always try to break the patent (i.e. find a prior disclosure of what's being > patented.. a description in a novel might actually be good enough.. consult > your patent attorney). 
Or, you can patent something yourself, and then > offer to cross license with the holder here. Maybe worlds.com would be > willing? > > Whether you are doing open source or not doesn't really have any influence > on whether something is infringing. If you're doing something described by > the claims, you're infringing. > > Note that this patent was originally applied for in the mid 90s.. going to > be dicey on the prior art. > > > > > > ------------------------------ > > Message: 4 > Date: Tue, 22 Sep 2009 16:07:41 -0700 > From: Nifty Tom Mitchell > Subject: Re: [Beowulf] How do I work around this patent? > To: "Lux, Jim (337C)" > Cc: "beowulf at beowulf.org" > Message-ID: <20090922230741.GA9189 at tosh2egg.ca.sanfran.comcast.net> > Content-Type: text/plain; charset=us-ascii > > On Tue, Sep 22, 2009 at 01:55:11PM -0700, Lux, Jim (337C) wrote: > > Subject: [Beowulf] How do I work around this patent? > > > > I joined this community many years ago to learn about GRID computing... > > > > > "...Worlds.com filed a lawsuit .... for violating patent 7181690. > > This sounds like this is a patent for implementing the dictionary > definition of an avatar. The dictionary definition may provide prior art > ;-) > and narrow their applicability. > > Reading through it the implementation includes bits I know or suspect to > be in well known programs like Microsoft Flight simulator, the SGI "dog" > multiuser flight simulator an SGI paper airplane demo that prunes the 3D > space to render and interact with all combined with bits of centralized > "Go" and "Chess" game servers that have been out there almost as long > as the internet. And Big Bertha networked progressive slot machines too.. > > Greg and Jim's comments are spot on. > Greg has his name on some clever patents, I do not know abut Jim. > > One of the critical points in a patent is that it not be obvious. > So the point that you should not look at patents is spot on. If > you reinvent the idea with trivial effort - one point to you. > > So.. Unplug your development stations from the Internet and go back > to work in isolation on a private internet. Document your design and > go see a patent attorney with your design. Update the design and send > him/her updates on a regular basis. In some cases he does not need to > read them, just date and file them. In your design document comment on > all the moving parts, trivial, clever, novel, critical to the product etc. > A good one may also see value in things you might dismiss. > Keep the inventor list up to date too. > > One IMPORTANT point is the moment (date time stamp) that your code is > seen live outside of the lab. Alpha and Beta testers can start the clock > for you on some critical bits. Same for investor disclosure without NDA > etc... > demos for the kids etc. > > Good legal advice can help on all these bits. > > > > -- > T o m M i t c h e l l > Found me a new hat, now what? > > > > ------------------------------ > > Message: 5 > Date: Tue, 22 Sep 2009 16:12:44 -0700 > From: "Lux, Jim (337C)" > Subject: RE: [Beowulf] How do I work around this patent? > To: Nifty Tom Mitchell > Cc: "beowulf at beowulf.org" > Message-ID: > > > > Content-Type: text/plain; charset="us-ascii" > > > Greg and Jim's comments are spot on. > > Greg has his name on some clever patents, I do not know abut Jim. > > > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > It's certainly unique... (and has been litigated, too..) 
> > > > ------------------------------ > > Message: 6 > Date: Tue, 22 Sep 2009 23:51:40 -0400 > From: Rahul Nabar > Subject: Re: [Beowulf] Re: switching capacity terminology confusion > To: Nifty Tom Mitchell > Cc: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Sat, Sep 19, 2009 at 7:58 PM, Nifty Tom Mitchell > wrote: > > > > > The cluster itself needs very little in the way of special services and > can be > > setup and managed as a homogeneous soft gooey center with a hard crusty > > outside. A "simple" but fast switch with enough ports seems sufficient. > > NFS traffic (fast cascading funnel tree) can be different than say MPI > traffic > > with all hosts communicating at the same time with all the neighbors (one > big cross bar). > > Your cluster design may well shape your switch benchmark testing. > > Yup, I guess we need to wait for a "simple computing" switch model for > HPC just like siilar offerings on the compute server side recently. > I've had not much success getting any significant discounts or > evaluation switches off my local Cisco Vendors. Now if only there are > any Cisco powers-that-be on this mailing list............ :-) > > -- > Rahul > > > > ------------------------------ > > Message: 7 > Date: Tue, 22 Sep 2009 23:53:55 -0400 > From: Rahul Nabar > Subject: Re: [Beowulf] Re: switching capacity terminology confusion > To: H?kon Bugge > Cc: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Fri, Sep 18, 2009 at 9:31 AM, H?kon Bugge wrote: > > > You might check out the reports from the Tolly Group (www.tolly.com), > they > > used to evaluate different eth switches. Not sure how un-biased they are > > though. > > Thanks Hakon! The Tolly group is definately a good lead. i found some > Dell reviews on there. But Cisco switches seem non existant. The three > reviews that I could find were from back in the 90's! > > -- > Rahul > > > > ------------------------------ > > Message: 8 > Date: Wed, 23 Sep 2009 15:26:05 +0200 > From: Eugen Leitl > Subject: [Beowulf] Microsoft acquires the technology assets of > Interactive Supercomputing (ISC) > To: Beowulf at beowulf.org > Message-ID: <20090923132605.GZ27331 at leitl.org> > Content-Type: text/plain; charset=utf-8 > > > FYI > > > http://blogs.technet.com/windowsserver/archive/2009/09/21/microsoft-has-acquired-the-technology-assets-of-interactive-supercomputing-isc.aspx > > Microsoft acquires the technology assets of Interactive Supercomputing > (ISC) > > Hello everyone, > > Today, I???m very excited to announce that Microsoft has acquired the > technology assets of Interactive Supercomputing (ISC), a company that > specializes in bringing the power of parallel computing to the desktop and > making high performance computing more accessible to end users. This move > represents our ongoing commitment to parallel computing and high > performance > computing (HPC) and will bring together complementary technologies that > will > help simplify the complexity and difficulty of expressing problems that can > be parallelized. ISC???s products and technology enable faster > prototyping, > iteration, and deployment of large-scale parallel solutions, which is well > aligned with our vision of making high performance computing and parallel > computing easier, both on the desktop and in the cluster. 
> > Bill Blake, CEO of ISC, is bringing over a team of industry leading experts > on parallel and high performance computing that will join the Microsoft > team > at the New England Research & Development Center in Cambridge, MA. He and > I > are both excited to start working together on the next generation of > technology for researchers, analysts, and engineers, as well as those who > have yet to be exposed to the benefits of parallel computing and HPC > technologies or may have thought they were out of reach. > > We have recently begun plans to integrate ISC technologies into future > versions of Microsoft products and will provide more information over the > coming months on where and how that integration will occur. Beginning > immediately, Microsoft will provide support for ISC???s current Star-P > customers and we are committed to continually listening to customer needs > as > we develop the next generation of HPC and parallel computing technologies. > I???m looking forward to the opportunities our two combined groups have to > greatly improve the capability, performance, and accessibility of parallel > computing and HPC technologies. > > You can find more information on HPC and parallel computing at Microsoft in > these links and stay up to date on integration news and updates at > Microsoft > Pathways, our acquisition information site. > > Kyril Faenov > > General Manager, High Performance & Parallel Computing Technologies > > Filed under: HPC, High Performance Computing, windows hpc server 2008, > Parallel Computing > > > ------------------------------ > > _______________________________________________ > Beowulf mailing list > Beowulf at beowulf.org > http://www.beowulf.org/mailman/listinfo/beowulf > > > End of Beowulf Digest, Vol 67, Issue 31 > *************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken at kschuster.org Wed Sep 23 09:52:42 2009 From: ken at kschuster.org (ken at kschuster.org) Date: Wed, 23 Sep 2009 09:52:42 -0700 (PDT) Subject: [Beowulf] Re: How do I work around this patent? Message-ID: <659673.73527.qm@web56405.mail.re3.yahoo.com> "learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study." ? Jeremy, For me the first question to be addressed is motive.? Are you planning to?distribute your work in such a way as to profit from it in a monetary sense?? If you are?not then the issue of patent is basically moot.? The primary purpose of a patent is to protect the creator from intellectual and financial harm.? If you are going to distribute the material give create to those that you know contributed, which you should do whether it is patented or not.? If you are not going to have a financial benefit then you are not financially harming the other party.? The caveat here is if the patent holder has produced the material in such a manner that they are being financially rewarded and you start to distribute something similar to it for free then you are hurting them financially.? This is a very rough overview and I am sure any bad lawyer would add 50 pages to clarify what I have said. ? I do agree with Greg's proceed in ignorance concept.? If you are not trying to infringe and do not intentional infringe you will mental proceed better and have a lesser case if one were to be brought.? 
I may know that the speed limit is 35 but I will still try to go 45 (;>) ? FYI - I am not a practicing attorney and this was not my primary field of legal studies.? It is JMHO. Ken// -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at googlemail.com Wed Sep 23 14:11:13 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 23 Sep 2009 22:11:13 +0100 Subject: [Beowulf] Re: Beowulf Digest, Vol 67, Issue 31 In-Reply-To: References: <200909231329.n8NDTM3W013688@bluewest.scyld.com> Message-ID: <9f8092cc0909231411l444abdd8ye4d4a8dd807e4c8a@mail.gmail.com> 2009/9/23 J Lee Hughes : > 1. > with a cluster > can the controller powered up node's' when it is need > and > can the controller? powered down node's' when it is not need That depends. The answer is yes if a) the nodes (servers) in the cluster are equipped with IPMI management cards or b) the nodes (servers) are powered from intelligent mains power distribution units The question I think you are asking is that if the cluster load is low then will nodes (servers) be powered down. The answer is yes, some schedulers (load levellers) offer this facility. > 2. > that is the different between a computer cluster, computer ray, and computer > grid! A compute cluster would be a set of nodes (servers) in one physical location, linked by a LAN or a cluster interconnect. It is presented to the user as one set of resources. A grid is a geographical diverse set of resources - a users task may be moved to the most appropriate part of the grid to be executed. A ray - search me. Never heard the term. From jellogum at gmail.com Wed Sep 23 15:09:57 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed, 23 Sep 2009 15:09:57 -0700 Subject: [Beowulf] Re: How do I work around this patent? In-Reply-To: <659673.73527.qm@web56405.mail.re3.yahoo.com> References: <659673.73527.qm@web56405.mail.re3.yahoo.com> Message-ID: Yes, open source and free to access, but other elements to the study related to the project may result in fees for service that are both lawful and expensive to manage. [Still training the horse to pull its first cart...] Hypothetical use of free program could create a micro economy dependent upon unknown user activity, so it's fair to say that a lawsuit will appear if people use it. The main study (program) is based upon science ideas to explore aspects not associated with the game. For example, the MMORPG is a desired device to "harvest" gamers to advance BOINC packets, for social problem solving, to present educational material within the engine to help mature these young and growing gamers who will be adults someday, etc. all dressed in a "game." I study programming for the love of it, so I model problems for reasons other than profit. I wonder if/how this patent will impact robots (user) in a work environment (server managing coordinate system), systems modeling real world phenomena in virtual worlds (weather, DoD simulations, GIS innovations, car apps for GPS... in traffic, real estate virtual tours, etc.)? MMORPG's will likely be just the beginning... and the case may swing in my favor if major industry gets hit for fees to conduct routine day to day business using robots in a coordinate system, or when airports get nailed for a scalable virtual 3-D environment to handle multiusers (planes) employing a server. Who knows? 
I anticipate that the patents will result in Worlds.com discovering that virtual worlds and server side management is an obvious function dependent upon hardware configurations, that their roylaties will appear only when someone uses specific proprietary hardware and its related software to facilite a desired mission statement employing said hardware; in other words, the hardware of the server will dictate their claim's success, that the court will see through the attempt to patent a concept. I may be wrong, and perhaps someone will patent frequencies used over phones that correlate to selling a commodity vs just saying hello... (same multiple users use same computers and servers to post blogs or use Skype, but when using same stuff to play 3-D environment and/or chat it is patentable?). Observing the SCO vs IBM lawsuit, from its very beginning, and researching to understand its progression, I know this issue will go on for a few years... that the money involved will manipulte the court to its advantage... Apple IIe, 1987, I got in trouble at home, in high school, for not doing my homework because I was trying to build a 3-D engine to make a game I had start in '84 based upon D&D. Virtual multiuser environment was/is obvious to me... I trust prior art is out there because my code has long been lost. Maybe a simple work around might be a server for 11-D world ;) Thank you folks, for the input. I am settled to pause work and if I continue, and get to the bazaar level, I will release any finished work in a country where it is lawful. I'm not interested in breaking the law. On Wed, Sep 23, 2009 at 9:52 AM, wrote: > "learn about GRID computing when I was studying biology and the Linux file > system, with future goals to write interesting open source programs. It's > the future and I just hit a wall in > the design process of writing code for my study." > > Jeremy, > For me the first question to be addressed is motive. Are you planning > to distribute your work in such a way as to profit from it in a monetary > sense? If you are not then the issue of patent is basically moot. The > primary purpose of a patent is to protect the creator from intellectual and > financial harm. If you are going to distribute the material give create to > those that you know contributed, which you should do whether it is patented > or not. If you are not going to have a financial benefit then you are not > financially harming the other party. The caveat here is if the patent > holder has produced the material in such a manner that they are being > financially rewarded and you start to distribute something similar to it for > free then you are hurting them financially. This is a very rough overview > and I am sure any bad lawyer would add 50 pages to clarify what I have said. > > I do agree with Greg's proceed in ignorance concept. If you are not trying > to infringe and do not intentional infringe you will mental proceed better > and have a lesser case if one were to be brought. I may know that the speed > limit is 35 but I will still try to go 45 (;>) > > FYI - I am not a practicing attorney and this was not my primary field of > legal studies. It is JMHO. 
Ken// > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Wed Sep 23 17:19:53 2009 From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell) Date: Wed, 23 Sep 2009 17:19:53 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> Message-ID: <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> Well we already knew that Jim was a "whirl wind" outstanding in his field. On Tue, Sep 22, 2009 at 4:12 PM, Lux, Jim (337C) wrote: >> Greg and Jim's comments are spot on. >> Greg has his name on some clever patents, I do not know abut Jim. >> > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > It's certainly unique... (and has been litigated, too..) > -- NiftyOMPI T o m M i t c h e l l From jellogum at gmail.com Wed Sep 23 20:05:21 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed, 23 Sep 2009 23:05:21 -0400 Subject: [Beowulf] How do I work around this patent? In-Reply-To: <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> Message-ID: Virtual world = coordinate system indexing objects; avatar = human or computer/machine that may seek selective real-time coordinate data about objects; server = any computer process or device; object = memory handling a script process or 3-D collection of points or both; such is how I read the patent. If this is a correct interpretation of the patent, it sounds like the Linux file system to me, facilitating multiple users/processes for a single box or a cluster or a GRID. Please correct me if I am wrong. I thought it was the expert farmer who was outstanding in his field? On Wed, Sep 23, 2009 at 8:19 PM, NiftyOMPI Tom Mitchell < niftyompi at niftyegg.com> wrote: > Well we already knew that Jim was a "whirl wind" outstanding in his field. > > On Tue, Sep 22, 2009 at 4:12 PM, Lux, Jim (337C) > wrote: > >> Greg and Jim's comments are spot on. > >> Greg has his name on some clever patents, I do not know abut Jim. > >> > > > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > > It's certainly unique... (and has been litigated, too..) > > > > > > -- > NiftyOMPI > T o m M i t c h e l l > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From eugen at leitl.org Thu Sep 24 00:01:54 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 24 Sep 2009 09:01:54 +0200 Subject: [Beowulf] Facebook: Yes, We Need 100-GigE Message-ID: <20090924070154.GG27331@leitl.org> http://www.lightreading.com/document.asp?doc_id=181899 Facebook: Yes, We Need 100-GigE September 16, 2009 | Craig Matsumoto | Comments (13) It's become clich? to say that companies like Facebook would use 100-Gbit/s Ethernet right now if they had it. But it helps when someone from Facebook actually shows up and hammers on that point. 
Facebook network engineer Donn Lee did that yesterday, pleading his case at a technology seminar on 40- and 100-Gbit/s Ethernet, hosted in Santa Clara, Calif., by The Ethernet Alliance . Representatives from Google (Nasdaq: GOOG) and the Amsterdam Internet Exchange B.V. (AMS-IX) gave similar pleas, but Lee's presentation included some particularly sobering numbers. He said it's reasonable to think Facebook will need its data center backbone fabric to grow to 64 Tbit/s total capacity by the end of next year. How to build such a thing? Lee said his ideal Ethernet box would have 16-Tbit/s switching capacity and 80 100-Gbit/s Ethernet ports or 800 10-Gbit/s Ethernet ports. No such box exists commercially, and Lee is reluctant to go build his own. That leaves him with an unpleasant alternative. Lee drew up a diagram of what Facebook's future data center fabric -- that is, the interconnection of its switch/routers -- would look like if he had to use today's equipment and 10-Gbit/s Ethernet. Instead of the familiar criss-crossing mesh diagram, he got a solid wall of black, signifying just how many connections he'd need. "I would say anybody in the top 25 Websites easily has this problem," he said later. (Lee didn't say anything about how long it would take just to plug in all those fibers. Maybe that job could be created by funds from the U.S. Recovery Act.) Lee also showed charts showing the disconnect between Facebook's wish list and the market. Facebook needed 512 10-Gbit/s Ethernet ports per chassis in 2007 and is likely to need 1,500 in 2010. No chassis offers more than 200 ports, he said. Even though Lee is a veteran of Google and Cisco Systems Inc. (Nasdaq: CSCO), you might wonder if he's just one renegade engineer who doesn't represent the Facebook norm. Not really. It turns out Facebook has only five network engineers -- although Lee said that's a 20 percent increase from the spring of 2008 [math note: which means they had approximately 4.165 engineers at that time]. Even though 100-Gbit/s development started four years ago, Lee thinks it came too late, and that's got him worried about the next generation. He's pulling for 400-Gbit/s Ethernet discussions to start right away. "Let's start the work that doesn't require money, now," he said. "If we have the standard, we can build the product later. I don't mind using an old standard." He might get his wish. The Optoelectronics Industry Development Association (OIDA) is already organizing meetings with an aim toward getting federal money for terabit Ethernet research, said John D'Ambrosia, a Force10 Networks Inc. scientist who helped organize yesterday's event. Of course, money is a major obstacle to the next wave of Ethernet. During an open commenting and Q&A session, multiple audience members pointed out that optical components margins are too thin to support advanced research at many companies, and that carriers are seeing their big, expensive networks getting used to make money for over-the-top services. "There's no revenue in all that bandwidth increase," one audience member commented, citing the carrier case in particular. 
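For what it's worth, the wish-list numbers hang together if "switching capacity" is counted the usual full-duplex way (an assumption about the vendor arithmetic, not something Lee spells out): 80 ports x 100 Gbit/s is 8 Tbit/s in each direction, i.e. 16 Tbit/s counting both directions, and 800 ports x 10 Gbit/s comes to the same 16 Tbit/s. That is the same both-directions bookkeeping that drove the "switching capacity terminology confusion" thread earlier.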
From john.hearns at mclaren.com Thu Sep 24 01:54:26 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 09:54:26 +0100 Subject: [Beowulf] Facebook: Yes, We Need 100-GigE In-Reply-To: <20090924070154.GG27331@leitl.org> References: <20090924070154.GG27331@leitl.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A314@milexchmb1.mil.tagmclarengroup.com> http://www.lightreading.com/document.asp?doc_id=181899 Facebook: Yes, We Need 100-GigE September 16, 2009 | Craig Matsumoto | Comments (13) It's become clich? to say that companies like Facebook would use 100-Gbit/s Ethernet right now if they had it. But it helps when someone from Facebook actually shows up and hammers on that point. "Hold on there Bald Eagle" Isn't this a case of every problem being a nail if you only have a hammer? What is all this bandwidth being USED for? Surely on the Beowulf list we should be looking at this from the angle of parallel processing - if your huge central pipe is maxed out, then maybe distributing the problem among many smaller, cheaper pipes (and processing units) is the way to go. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Thu Sep 24 03:40:09 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 11:40:09 +0100 Subject: [Beowulf] Nhalem EX Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Found some details on the Nehalem EX: http://www.semiaccurate.com/2009/08/25/intel-details-becton-8-cores-and- all/ Internal ring buses? How long till you lot are benchmarking them and claiming your code is taking too long because the data is moving round the bus in the wrong direction :-) I thought understanding L1, 2 and 3 caches was hard enough, without having to think about rings. Ah well. Toroids on chip next? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From kilian.cavalotti.work at gmail.com Thu Sep 24 04:56:23 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 24 Sep 2009 13:56:23 +0200 Subject: [Beowulf] Nhalem EX In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 24, 2009 at 12:40 PM, Hearns, John wrote: > Internal ring buses? How long till you lot are benchmarking them and > claiming your > code is taking too long because the data is moving round the bus in the > wrong direction :-) Well, if you add cache coloring (http://en.wikipedia.org/wiki/Cache_coloring) to the mix, you can pretty much have the whole DC metro running in you cores. :) > I thought understanding L1, 2 and 3 caches was hard enough, without > having to think about rings. Since creating a monolithic 24MB L3 cache would have make it slow as a slug, they basically added a second level of L2 cache, local to each core, and connected them together with a bidirectionnal bus, so that "if any core needs a byte from any other cache, it is no more than 4 ring hops to the right cache slice." 
It looks a bit like HT Assist (http://www.bit-tech.net/news/hardware/2009/06/01/amd-launches-6-core-istanbul-opteron-proces/1)? Except it's in-chip rather than inter-CPUs. And it's supposed to behave like a large shared L3. > Ah well. Toroids on chip next? Further down, in the posted article: "The transistor count of 2.3 billion backs that up. To make it all work, the center of the chip has a block called the router. It is a crossbar switch that connects all internal and external channels, up to eight at a time." The chip itself is becoming a NUMA-like system, with its own internal network, a crossbar switch and its own internal topology. At some time, if the number of cores continues to grow, it wouldn't be that surprising to see some locality emerge, in the form of local clusters of cores, tightly coupled on a bus ring, and interconnected to other cluster of cores through QPI links (intra- or inter-chips). Network architectures as we see them today at the Infiniband interconnect level could very well make their way into the chips. So yes, toroids, why not? :) "With 4 QPI links, 8 memory channels, 8 cores, 8 cache slices, 2 memory controllers, 2 cache agents, 2 home agents and a pony, this chip is getting quite complex." You bet! Cheers, -- Kilian From john.hearns at mclaren.com Thu Sep 24 05:13:21 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 13:13:21 +0100 Subject: [Beowulf] Nhalem EX In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> The chip itself is becoming a NUMA-like system, with its own internal network, a crossbar switch and its own internal topology. At some time, if the number of cores continues to grow, it wouldn't be that surprising to see some locality emerge, in the form of local clusters of cores, tightly coupled on a bus ring, Kilian, I agree. I was just being a bit jocular about the complexity of this thing. Yes, it?s a little NUMA system on a chip - which is quite exciting really. Take a four-socket one of these, and a bucketload of RAM, and you have a pretty nice personal system. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From eugen at leitl.org Thu Sep 24 05:50:21 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 24 Sep 2009 14:50:21 +0200 Subject: [Beowulf] Intel Light Peak: 10-100 GBit/s optical consumer LAN Message-ID: <20090924125021.GV27331@leitl.org> (But will it be a packet-switchable protocol over that physical link, or yet another USB brain damage?) http://news.cnet.com/8301-30685_3-10360047-264.html September 23, 2009 12:54 PM PDT Intel's Light Peak: One PC cable to rule them all by Stephen Shankland The Light Peak technology sends signals with infrared light over optical fibers. (Credit: Intel) SAN FRANCISCO--Intel unveiled technology called Light Peak that it hopes ultimately will replace the profusion of different cables sprouting from today's PCs with a single type of fiber-optic link. Dadi Perlmutter, the newly promoted co-general manager of Intel's Architecture Group, demonstrated Light Peak at the Intel Developer Forum here and said components for the technology, though not Light Peak-enabled PCs, will be ready in 2010. 
"We hope to see one single cable," Perlmutter said, adding that one thing getting in the way of smaller laptops is the profusion of cable ports around the systems' edges. This prototype PC has the Light Peak controller and optical connector that sends signals down a single white optical cable. This prototype PC has the Light Peak controller and optical connector that sends signals down a single white optical cable. (Credit: Stephen Shankland/CNET) In a demonstration, Perlmutter showed a PC connected to a monitor across the stage showing high-definition video sent over a Light Peak optical cable. The cable can be as long as 100 meters and can carry data at 10 gigabits per second in both directions simultaneously, though Intel expects it will reach 100 gigabits per second in the next decade, said Jason Ziller, Intel's director of optical input-ouput program office, in an interview. The company envisions Light Peak as a replacement for the cables that currently lead to monitors, external drives, scanners, and just about anything else that plugs in to a computer. A PC could have a number of Light Peak ports for different devices, or a connection could lead to a hub--perhaps an external monitor--with multiple connections of its own, Ziller said. It's not clear how much the technology will cost or how many years it will take to become mainstream. And wireless communication technology--Intel itself has promoted Ultra-Wideband (UWB) for years--offers the attraction of getting rid of some cables altogether. The Light Peak technology handles multiple communication protocols at the same time, with quality-of-service provisions to ensure high-priority traffic such as video get preferred treatment, he said. Intel's Dadi Perlmutter traces the Light Peak cable from a PC to a monitor on the other side of the stage. Light Peak can traverse distances up to 100 meters. Intel's Dadi Perlmutter traces the Light Peak cable from a PC to a monitor on the other side of the stage. Light Peak can traverse distances up to 100 meters. (Credit: Stephen Shankland/CNET) In addition, Intel said it's working on bundling the optical fiber with copper wire so Light Peak can be used to power devices plugged into the PC, he said. The cables themselves are durable, Ziller said: "You can tie a knot in it and it'll still work." Intel has a lot of clout in the computing marketplace, but building support for a radical new connection that could replace DVI, DisplayPort, USB, Firewire, HDMI, and any number of other connections would require broad industry support. Intel's taking the usual approach to tackling that problem: "We're working with the industry to standardize it," Ziller said. Intel has been briefing other companies for "the last few months," and now is trying to get the standards process started in earnest with partners including companies in the computing, consumer electronics, and telephone handset markets, he said. Ziller wouldn't say who else is participating in the effort, but Intel published a statement of support from Sony, which has a lot of clout of its own in many markets. "Sony is excited about the potential for Light Peak technology that Intel has been developing, and believe it could enable a new generation of high-speed device connectivity," said Ryosuke Akahane, vice president of Sony's Vaio Business Group. So will Light Peak become a universal port? "Intel's long-term vision is you could get to that," Ziller said. 
From coutinho at dcc.ufmg.br Thu Sep 24 11:05:26 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Thu, 24 Sep 2009 15:05:26 -0300 Subject: [Beowulf] Nhalem EX In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> Message-ID: 2009/9/24 Hearns, John > > > The chip itself is becoming a NUMA-like system, with its own internal > network, a crossbar switch and its own internal topology. At some > time, if the number of cores continues to grow, it wouldn't be that > surprising to see some locality emerge, in the form of local clusters > of cores, tightly coupled on a bus ring, > > > Kilian, I agree. I was just being a bit jocular about the complexity of > this thing. > Yes, it?s a little NUMA system on a chip - which is quite exciting really. > Take a four-socket one of these, and a bucketload of RAM, and you have a > pretty nice > personal system. > > As the number of cores increase, NUMA multicores will inevitably appear, but I never though that it would happen that soon. > > The contents of this email are confidential and for the exclusive use of > the intended recipient. If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return it > to the sender immediately and delete your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Thu Sep 24 14:54:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 16:54:39 -0500 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <4AA926B8.6060702@scalableinformatics.com> References: <200909091900.n89J07U6031683@bluewest.scyld.com> <4AA926B8.6060702@scalableinformatics.com> Message-ID: On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman wrote: > > root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d ?-v In order to see how good (or bad) my current operating point is I was trying to replicate your test. But what is "io-bm.exe"? Is that some proprietary code or could I have it to run a similar test? -- Rahul From landman at scalableinformatics.com Thu Sep 24 14:55:11 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 24 Sep 2009 17:55:11 -0400 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: <200909091900.n89J07U6031683@bluewest.scyld.com> <4AA926B8.6060702@scalableinformatics.com> Message-ID: <4ABBEABF.2080103@scalableinformatics.com> Rahul Nabar wrote: > On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman > wrote: > >> root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d -v > > In order to see how good (or bad) my current operating point is I was > trying to replicate your test. But what is "io-bm.exe"? Is that some > proprietary code or could I have it to run a similar test? > its a tool we wrote for testing. We are releasing it soon. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 24 18:16:20 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 20:16:20 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? Message-ID: I now ran bonnie++ but have trouble figuring out if my perf. stats are up to the mark or not. My original plan was to only estimate the IOPS capabilities of my existing storage setup. But then again I am quite ignorant about the finer nuances. Hence I thought maybe I should post the stats. here and if anyone has comments I'd very much appreciate hearing them. In any case, maybe my stats help someone else sometime! I/O stats on live HPC systems seem hard to find. Data posted below. Since this is an NFS store I ran bonnie++ from both a NFS client compute node and the server. (head node) Server side bonnie++ http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html Client side bonnie++ http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html Caveat: The cluster was in production so there is a chance of externalities affecting my data. (am trying it hard to explain why some stats seem better on the client run than the server run) Subsidary Goal: This setup had 23 clients for NFS. In a new cluster that I am setting up we want to scale this up about 250 clients. Hence want to estimate what sort of performance I'll be looking for in the Storage. (I've found most conversations with vendors pretty non-productive with them weaving vague terms and staying as far away from quantitative estimates as is possible.) (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB RAM; Dell 6248 switches. Port bonding on client servers) -- Rahul From landman at scalableinformatics.com Thu Sep 24 19:06:33 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 24 Sep 2009 22:06:33 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: Message-ID: <4ABC25A9.6060907@scalableinformatics.com> Rahul Nabar wrote: > I now ran bonnie++ but have trouble figuring out if my perf. stats are > up to the mark or not. My original plan was to only estimate the IOPS > capabilities of my existing storage setup. But then again I am quite Best way to get IOPs data in a "standard" manner is to run the type of test that generates 8k random reads. I'd suggest not using bonnie++. It is, honestly, not that good for HPC IO performance measurement. I have lots of caveats on it, having used it for a while as a test, while looking ever more deeply at it. I've found fio (http://freshmeat.net/projects/fio/) to be an excellent testing tool for disk systems. To use it, compile it (requires libaio), and then run it as fio input.fio For a nice simple IOP test, try this: [random] rw=randread size=4g directory=/data iodepth=32 blocksize=8k numjobs=16 nrfiles=1 group_reporting ioengine=sync loops=1 This file will do 4GB of IO into a directory named /data, using an IO depth of 32, a block size of 8k (the IOP measurement standard) with random reads as the major operation, using standard unix IO. We have 16 simultaneous jobs doing IO, each job using 1 file. 
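Every parameter in that job file can also be passed straight on the fio command line, for anyone who would rather not write the file out first; a minimal sketch, assuming a reasonably current fio build and the same /data directory as above:

  # command-line form of the [random] job file
  fio --name=random --rw=randread --size=4g --directory=/data \
      --iodepth=32 --bs=8k --numjobs=16 --nrfiles=1 \
      --group_reporting --ioengine=sync --loops=1
  # group_reporting folds the 16 jobs into one summary; the bandwidth and
  # completion-latency (clat) figures there, plus the iops figure on builds
  # that print one, are the numbers to compare between runs.

Running it once against the local RAID volume on the server and once against the NFS mount from a client node gives a direct random-read comparison of the two paths.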
It will aggregate all the information from each job and report it, and it will run once. We use this to model bonnie++ and other types of workloads. It provides a great deal of useful information. > ignorant about the finer nuances. Hence I thought maybe I should post > the stats. here and if anyone has comments I'd very much appreciate > hearing them. In any case, maybe my stats help someone else sometime! > I/O stats on live HPC systems seem hard to find. It looks like channel bonding isn't helping you much. Is your server channel bonded? Clients? Both? > > Data posted below. Since this is an NFS store I ran bonnie++ from both > a NFS client compute node and the server. (head node) > > Server side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html > > Client side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html > > > Caveat: The cluster was in production so there is a chance of > externalities affecting my data. (am trying it hard to explain why > some stats seem better on the client run than the server run) > > Subsidary Goal: This setup had 23 clients for NFS. In a new cluster > that I am setting up we want to scale this up about 250 clients. Hence > want to estimate what sort of performance I'll be looking for in the > Storage. (I've found most conversations with vendors pretty > non-productive with them weaving vague terms and staying as far away > from quantitative estimates as is possible.) Heh ... depends on the vendor. We are pretty open and free with our numbers (to our current/prospective customers), and our test cases. Shortly we are releasing the io-bm code for people to test single and parallel IO, and publishing our results as we obtain them. > (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM > disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB > RAM; Dell 6248 switches. Port bonding on client servers) What RAID adapter and drives? I am assuming some sort of Dell unit. What is the connection from the server to the network ... single gigabit (ala Rocks clusters), or 10 GbE, or channel bonded gigabit? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 24 19:34:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 21:34:14 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABC25A9.6060907@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:06 PM, Joe Landman wrote: > Best way to get IOPs data in a "standard" manner is to run the type of test > that generates 8k random reads. THanks again Joe! I'll run that one. Got to figure out the exact command line. Bonnie is complicated. > I've found fio (http://freshmeat.net/projects/fio/) to be an excellent > testing tool for disk systems. Ok, I have fio. Actually downloaded that after reading some of your comments on your blog. :) Unfortunately the things a beast. Couldn't figure out how to use it. And I really didn't want to do a PhD on disk I/O. Thanks much for your recipie. I am going to try that now. > It looks like channel bonding isn't helping you much. ?Is your server No? From which numbers? > channel bonded? ?Clients? ?Both? Both. >> > Heh ... 
depends on the vendor. ?We are pretty open and free with our numbers > (to our current/prospective customers), and our test cases. True. I shouldn't generalize. But most vendors still. I wish they'd scrap all their "whitepapers". Vendor whitepapers (to me) seem the kind of document that strives to attain the minimum information density. Probably a good reading for the clueless non-technical guys sitting in top-management. Give me quantitative benchmarks or spec sheets any day! (Ok, ok, I am probably venting here; but talking knowledgably with vendors has been hard!) > > What RAID adapter and drives? ?I am assuming some sort of Dell unit. Correct. A Dell Power Connect with an internal RAID card and drives. >What is > the connection from the server to the network Three channel bonded eth connections. >... single gigabit (ala Rocks > clusters), or 10 GbE, or channel bonded gigabit? Nope. No 10 Gig E on this cluster. -- Rahul From rpnabar at gmail.com Thu Sep 24 19:35:12 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 21:35:12 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:34 PM, Rahul Nabar wrote: > Correct. A Dell Power Connect with an internal RAID card and drives. > Sorry. PowerEdge, I meant to type, -- Rahul From laytonjb at att.net Fri Sep 25 06:52:19 2009 From: laytonjb at att.net (Jeff Layton) Date: Fri, 25 Sep 2009 09:52:19 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: <4ABCCB13.3070807@att.net> OK, I've had enough of Rahul's posting to this list so I thought I would publically respond to his comments since they directly affect me, my integrity, and the company I work for during the day. >> Heh ... depends on the vendor. We are pretty open and free with our numbers >> (to our current/prospective customers), and our test cases. >> > > True. I shouldn't generalize. But most vendors still. I wish they'd > scrap all their "whitepapers". Vendor whitepapers (to me) seem the > kind of document that strives to attain the minimum information > density. Probably a good reading for the clueless non-technical guys > sitting in top-management. Give me quantitative benchmarks or spec > sheets any day! (Ok, ok, I am probably venting here; but talking > knowledgably with vendors has been hard!) > I have sent Rahul several whitepapers about products that we are proposing. They discuss several aspects of what you _could_ achieve for various configurations. The problem is that we are proposing a specific configuration and have given him estimates of the performance. But without actually building it and testing it we don't know the exact performance. However, this is the case for any customer and any configuration. Plus Rahul keeps asking about what benchmarks to run on his current cluster to tell us something about the proposed cluster. The problem is that he doesn't want to run the benchmarks I have suggested. 
Rahul has not been able to grasp that storage performance depends upon a great number of things:
- The number and type of disks
- The RAID configuration
- How the drives are connected to the host node
- The exact benchmarks
I have tried to talk to him about what he is likely to see in performance but he has not, to this point, been able to understand that it takes a lot of _details_ to specify exact performance. I don't want to take a public list to pick a fight but I don't want Rahul to bad mouth me, my reputation, and the company I work for without some sort of rebuttal. Jeff From rpnabar at gmail.com Fri Sep 25 09:12:08 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 11:12:08 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABCCB13.3070807@att.net> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> Message-ID: On Fri, Sep 25, 2009 at 8:52 AM, Jeff Layton wrote: >I don't want to take a public list to pick a fight but I don't want Rahul to >bad mouth me, my reputation, and the company I work for without some >sort of rebuttal. Jeff: I'm sorry that you take this as bad mouthing you. I sincerely apologize as this wasn't the intention at all. I am speaking with several vendors and my comments weren't directed at you. It is true though that we have been a major purchaser of Dell systems. Sometimes we are happy and sometimes we are not. And this goes to all vendors. Now there is one thing (in hindsight) that I think *I am* guilty of: I've been posting a lot of questions on the list. Many of you guys are very well paid professionals and I ought not to be abusing your time for free. I apologize if I have. On the other hand let me make my position clearer: Buying a good system for my institution is a goal, sure. But a lot of this benchmarking that I do here is from a purely scientific curiosity I have in understanding what I am buying or working with. Building, selling, or maintaining Beowulf systems is *not* my primary occupation. But all through my graduate student stint I've worked on these and I cannot help that I have a curiosity for this stuff. Clusters, linux and Beowulf and all the related stuff is fascinating. I just want to know more about the systems I am using. I am willing to do my homework but sometimes post to get opinions from other experts that know better. Of course, I do realize that this is free advice from peers and I have no expectation of more. Again, I apologize if I hurt any feelings. That wasn't my intention at all. -- Rahul From mwill at penguincomputing.com Fri Sep 25 11:27:29 2009 From: mwill at penguincomputing.com (Michael Will) Date: Fri, 25 Sep 2009 11:27:29 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? References: <4ABC25A9.6060907@scalableinformatics.com><4ABCCB13.3070807@att.net> Message-ID: <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Rahul, From what you posted to the list there is no need to apologize. If there was something going on behind the curtains personally between you and Jeff then that is not really a concern to the list. Keep asking questions and keep curious, and keep posting back results that you find. In fact I find it very interesting, especially since technology progresses so fast, and estimates based on experience expire rather quickly.
Maybe I should mention some pieces of advice on how to deal with vendors that I base on my experience of being on the other side of the fence as a sales engineer for many years: Engage multiple vendors and let them propose technical solutions to your use cases / performance / application goals. Some will have quite technically skilled sales engineering resources, and Jeff Layton for sure is somebody with deep technical experience. Treat public whitepapers as marketing material that will show the solution in the best light for the best use case and not as a hard technical information. They are still interesting to skim over as they describe specific solutions and their intended properties. Then engage in a personal technical dialog with vendors software engineers and try to get to a more technically sound solution plus estimate of what it can really do, what its potential drawbacks and advantages are. I know vendors don't like this (I work for one) but it is in your best interest to talk through the alternative designs you got from other vendors with the sales engineer and have him shoot it down to the best of his ability. Keep in mind he is trying to sell you his solution, so he won't necessarily highlight the good in the alternative design, but he will highlight the weaknesses that the other designer did not choose to focus on. Bring up those points with the alternative designer to see if they can rebut it quickly, and also do your own research about it. Then consolidate the designs and base your choice of vendor not just on the cheapest price of components but on how qualified the design was, because a lot of value is in getting the right solution and ongoing support, not just the right parts. Let all vendors know why you choose design / vendor X so they can adjust their offerings in the future to fit your needs. Best of luck - Michael Will, HPC Software Engineer at Penguin Computing -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Rahul Nabar Sent: Fri 9/25/2009 9:12 AM To: Jeff Layton Cc: Beowulf Mailing List Subject: Re: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? On Fri, Sep 25, 2009 at 8:52 AM, Jeff Layton wrote: >I don't want to take a public list to pick a fight but I don't want Rahul to >bad mouth me, my reputation, and the company I work for without some >sort or rebuttal. Jeff: I'm sorry that you take this as bad mouthing you. I sincerely apologize as this wasn't the intention at all. I am speaking with several vendors and my comments weren't directed at you. It is true though that we have been a major purchaser of Dell systems. Sometimes we are happy and sometimes we are not . And this goes to all vendors. Now there is one thing (in hindsight) that I think *I am* guilty of: I've been posting a lot of questions on the list. Many of you guys are very well paid professionals and I ought not to be abusing your time for free. I apologize if I have. On the other hand let me make my position clearer: Buying a good system for my institution is a goal, sure. But a lot of this benchmarking that I do here is from a purely scientific curiosity I have in understanding what I am buying or working with. Building, selling, or maintaining Beowulf systems is *not* my primary occupation. But all through my graduate student stint I've worked on these and I cannot help that I have a curiosity for this stuff. Clusters, linux and Beowulf and all the related stuff is fascinating. 
I just want to know more about the systems I am using. I am willing to do my homework but sometimes post to get opinions from other experts that know better. Of course, I do realize that this is free advice from peers and I have no expectation of more. Again, I apologize if I hurt any feelings. That wasn't my intention at all. -- Rahul _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Fri Sep 25 11:34:42 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 13:34:42 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Message-ID: On Fri, Sep 25, 2009 at 1:27 PM, Michael Will wrote: > Some will have quite technically skilled sales engineering resources, and > Jeff Layton for sure is somebody with deep > technical experience. Absolutely. Jeff has been a great help so far. Even before he responded to my questions here I've been an active follower of his articles on the Linux Magazine. Great stuff! > > Then consolidate the designs and base your choice of vendor not just on the > cheapest price of components but on how qualified > the design was, because a lot of value is in getting the right solution and > ongoing support, not just the right parts. Let all vendors > know why you choose design / vendor X so they can adjust their offerings in > the future to fit your needs. > Great advice Michael! It is very much appreciated. I've been getting such good advice on this list that I wouldn't know what to do without it. Advice from guys like yourself, Jeff Layton and Joe Landman have helped me a lot and I am myself amazed at how much more (I think) I know now than just a few months ago! Thanks again! -- Rahul From mwill at penguincomputing.com Fri Sep 25 11:35:45 2009 From: mwill at penguincomputing.com (Michael Will) Date: Fri, 25 Sep 2009 11:35:45 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? References: <4ABC25A9.6060907@scalableinformatics.com><4ABCCB13.3070807@att.net><433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Message-ID: <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> By the way, if you have to work through a purchasing departement that only looks at the cheapest price, but you want to go with a specific vendor because of the superior solution and technical expertise, then you can engage the sales engineer / sales person in defining a uniqueness about the solution that cannot be met by the other vendors, it's called writing a 'sole source justification' letter. That way you can make sure to get the best value in terms of performance / technical ability and support. even if your institutions purchasing departments metrics are not taking that into consideration. 
Michaeld -----Original Message----- From: Rahul Nabar [mailto:rpnabar at gmail.com] Sent: Fri 9/25/2009 11:34 AM To: Michael Will Cc: Jeff Layton; Beowulf Mailing List Subject: Re: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? On Fri, Sep 25, 2009 at 1:27 PM, Michael Will wrote: > Some will have quite technically skilled sales engineering resources, and > Jeff Layton for sure is somebody with deep > technical experience. Absolutely. Jeff has been a great help so far. Even before he responded to my questions here I've been an active follower of his articles on the Linux Magazine. Great stuff! > > Then consolidate the designs and base your choice of vendor not just on the > cheapest price of components but on how qualified > the design was, because a lot of value is in getting the right solution and > ongoing support, not just the right parts. Let all vendors > know why you choose design / vendor X so they can adjust their offerings in > the future to fit your needs. > Great advice Michael! It is very much appreciated. I've been getting such good advice on this list that I wouldn't know what to do without it. Advice from guys like yourself, Jeff Layton and Joe Landman have helped me a lot and I am myself amazed at how much more (I think) I know now than just a few months ago! Thanks again! -- Rahul -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Fri Sep 25 11:47:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 13:47:59 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> Message-ID: On Fri, Sep 25, 2009 at 1:35 PM, Michael Will wrote: > By the way, if you have to work through a purchasing departement that only > looks at the cheapest price, but > you want to go with a specific vendor because of the superior solution and > technical expertise, then you can Thanks! I guess I have not faced the problem that you describe (yet). My problem is still easier. It is convincing *myself* which is the right solution. Or more of " how to compare various vendor solutions?" It was easy for the compute nodes. But the storage and backbone are still not clear (to me). One of the reasons is that it is easy to buy several compute nodes piecewise and test the actual codes. But how does one simulate scaleup for estimating performance of switches and backbone. Of course, over the past weeks several useful suggestions have been offered but I am still not sure. Perhaps I am just not doing my homework! :) -- Rahul From rpnabar at gmail.com Fri Sep 25 12:38:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 14:38:51 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABC25A9.6060907@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:06 PM, Joe Landman wrote: > I've found fio (http://freshmeat.net/projects/fio/) to be an excellent > testing tool for disk systems. ?To use it, compile it (requires libaio), and > then run it as > > ? ? ? 
?fio input.fio > > For a nice simple IOP test, try this: > > [random] > rw=randread > size=4g > directory=/data > iodepth=32 > blocksize=8k > numjobs=16 > nrfiles=1 > group_reporting > ioengine=sync > loops=1 > > > We use this to model bonnie++ and other types of workloads. ?It provides a > great deal of useful information. Thanks for this "fio" lead Joe. The test seems to run great. I tried this exact same input suite from one of my NFS clients. Detailed results below but I think it reports a IOPS of around 600 (if I read it correctly!) Not bad huh? This is a 5 disk RAID5. 10k RPM 300 GB SAS drives. I'm about to kill the test prematurely , though, since it seems to indicate that it wants to run for 6 more hours! :) I hope this still gave me a good benchmark, though, since I have already run it for about 20 minutes. One mysterios aspect: Is it really showing a 4.6 Gbits/sec throughput. That'd be way too high, I'd have thought! I'm going to do another similar run from the NFS server to isolate out the effect of the disk versus network. I'll post those soon in case it interests any other users. #################################################################### random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=32 ... random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=32 Starting 16 processes random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [1.4% done] [4,647K/0K /s] [567/0 iops] [eta 06h:23m:08s] ########################################################################### -- Rahul From rpnabar at gmail.com Fri Sep 25 13:54:32 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 15:54:32 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Fri, Sep 25, 2009 at 2:38 PM, Rahul Nabar wrote: >> We use this to model bonnie++ and other types of workloads. ?It provides a >> great deal of useful information. > > More details from the fio benchmark..... 
################################################################################################## fio: terminating on signal 2 random: (groupid=0, jobs=16): err= 0: pid=26509 read : io=12,391MiB, bw=3,732KiB/s, iops=466, runt=3399938msec clat (msec): min=3, max=1,122, avg=23.06, stdev= 4.38 bw (KiB/s) : min= 7, max= 621, per=9.34%, avg=348.67, stdev=29.11 cpu : usr=0.04%, sys=0.17%, ctx=3736755, majf=0, minf=33036 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=1586090/0, short=0/0 lat (msec): 4=0.01%, 10=0.38%, 20=17.11%, 50=69.74%, 100=12.01% lat (msec): 250=0.73%, 500=0.02%, 750=0.01%, 1000=0.01%, 2000=0.01% Run status group 0 (all jobs): READ: io=12,391MiB, aggrb=3,732KiB/s, minb=3,732KiB/s, maxb=3,732KiB/s, mint=3399938msec, maxt=3399938msec ################################################################################################## -- Rahul From landman at scalableinformatics.com Fri Sep 25 13:59:37 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 25 Sep 2009 16:59:37 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: <4ABD2F39.1050005@scalableinformatics.com> Rahul Nabar wrote: > On Fri, Sep 25, 2009 at 2:38 PM, Rahul Nabar wrote: > >>> We use this to model bonnie++ and other types of workloads. It provides a >>> great deal of useful information. >> > > More details from the fio benchmark..... > > ################################################################################################## > fio: terminating on signal 2 > > random: (groupid=0, jobs=16): err= 0: pid=26509 > read : io=12,391MiB, bw=3,732KiB/s, iops=466, runt=3399938msec > clat (msec): min=3, max=1,122, avg=23.06, stdev= 4.38 > bw (KiB/s) : min= 7, max= 621, per=9.34%, avg=348.67, stdev=29.11 > cpu : usr=0.04%, sys=0.17%, ctx=3736755, majf=0, minf=33036 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w: total=1586090/0, short=0/0 > > lat (msec): 4=0.01%, 10=0.38%, 20=17.11%, 50=69.74%, 100=12.01% > lat (msec): 250=0.73%, 500=0.02%, 750=0.01%, 1000=0.01%, 2000=0.01% Looks like your IOP latency is around 50ms. If you think about this, it seems a little high, even for a RAID5. I'll do some measurements here on our big units and we can compare. 466 IOP is ok ... basically 4 drives will give you this (5 RAID5 drives -> 4 drives of data). If you are IOP bound, you can do better, if this matters. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Fri Sep 25 14:08:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 16:08:00 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? 
In-Reply-To: <4ABD2F39.1050005@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABD2F39.1050005@scalableinformatics.com> Message-ID: On Fri, Sep 25, 2009 at 3:59 PM, Joe Landman wrote: > Looks like your IOP latency is around 50ms. If you think about this, it > seems a little high, even for a RAID5. I'll do some measurements here on > our big units and we can compare. Thanks again Joe! About the latency: It could have something to do with the fact that the overall latency on our network has been pretty crappy (I believe). Round trip ping-pong times in the 150 microsec range. Not sure. > 466 IOP is ok ... basically 4 drives will give you this (5 RAID5 drives -> 4 > drives of data). If you are IOP bound, you can do better, if this matters. Sure, I'd have to figure out *if* I (or rather all my apps.) am IOP driven. Not sure, I don't have a good answer for you. Of course, maybe I am just missing a crucial simple step in figuring that out. Do let me know. On the other hand regarding our expansion plans: Our code is working OK on this client that has the 466 IOPS. Ergo any larger storage+network solution that can provide *at least* 466 IOPS ought to work for us. Of course, assuming IOP is a good metric in the first place. But that's back to the same point. -- Rahul From hahn at mcmaster.ca Fri Sep 25 15:09:11 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 25 Sep 2009 18:09:11 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? Message-ID: Hi all, I'm sure you've noticed that disks are incredibly cheap, obscenely large and remarkably fast (at least in bandwidth). the "cheap" part is the only one of these that's really an issue, since the question becomes: how to keep storage infrastructure cost (overhead) from dominating the system cost? the backblaze people took a great swing at this - their solution is really centered on the 5-disk port-multiplier backplanes. (I would love to hear from anyone who has experience with PM's, btw.) but since 1U nodes are still the most common HPC building block, and most of them support 4 LFF SATA disks with very little added cost (esp using the chipset's integrated controller), is there a way to integrate them into a whole-cluster filesystem?
- obviously want to minimize the interference of remote IO to a node's jobs. for serial jobs, this is almost moot. for loosely-coupled parallel jobs (whether threaded or cross-node), this is probably non-critical. even for tight-coupled jobs, perhaps it would be enough to reserve a core for admin/filesystem overhead.
- iscsi/ataoe approach: export the local disks via a low-level block protocol and raid them together on dedicated fileserving node(s). not only does this address the probability of node failure, but a block protocol might be simple enough to avoid deadlock (ie, job does IO, allocating memory for pagecache then network packets, which may by chance wind up triggering network activity back to the same node, and more allocations for the underlying disk IO.)
- distributed filesystem (ceph? gluster? please post any experience!) I know it's possible to run oss+ost services on a lustre client, but not recommended because of the deadlock issue.
- this is certainly related to more focused systems like google/mapreduce. but I'm mainly looking for more general-purpose clusters - the space would be used for normal files, and definitely mixed read/write with something close to normal POSIX semantics...
thanks, mark hahn.
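To make the iscsi/ataoe idea above a bit more concrete: the following is only a rough sketch of the AoE variant, not something tested for this thread, and the interface, device names, shelf/slot numbering and RAID level are all assumptions (vblade/vbladed and aoetools are the usual userspace pieces).

  # on each compute node: export the spare local disk as an AoE target
  # ($NODE_ID must be unique per node; eth1 and /dev/sdb are assumptions)
  vbladed $NODE_ID 1 eth1 /dev/sdb

  # on the dedicated fileserving node: discover the targets and RAID them
  modprobe aoe
  aoe-discover
  mdadm --create /dev/md0 --level=6 --raid-devices=8 \
        /dev/etherd/e0.1 /dev/etherd/e1.1 /dev/etherd/e2.1 /dev/etherd/e3.1 \
        /dev/etherd/e4.1 /dev/etherd/e5.1 /dev/etherd/e6.1 /dev/etherd/e7.1
  mkfs.xfs /dev/md0
  mount /dev/md0 /export/scratch    # then re-export over NFS or similar

RAID6 across the exported disks is one way to ride out a node (or disk) dropping away, at the cost of two disks' worth of capacity; jumbo frames on the GigE fabric are commonly recommended for AoE.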
From jlb17 at duke.edu Fri Sep 25 15:32:36 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Fri, 25 Sep 2009 18:32:36 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: On Fri, 25 Sep 2009 at 6:09pm, Mark Hahn wrote > but since 1U nodes are still the most common HPC building block, and most of > them support 4 LFF SATA disks with very little added cost (esp using the > chipset's integrated controller), is there a way to integrate them into a > whole-cluster filesystem? This is something I've considered/toyed-with/lusted after for a long while. I haven't pursued it as much as I could have because the clusters I've run to this point have generally run embarrassingly parallel jobs, and I train the users to cache data-in-progress to scratch space on the nodes. But there's a definite draw to a single global scratch space that scales automatically with the cluster itself. > - obviously want to minimize the interference of remote IO to a node's jobs. > for serial jobs, this is almost moot. for loosely-coupled parallel jobs > (whether threaded or cross-node), this is probably non-critical. even for > tight-coupled jobs, perhaps it would be enough to reserve a core for > admin/filesystem overhead. I'd also strongly consider a separate network for filesystem I/O. > - distributed filesystem (ceph? gluster? please post any experience!) I > know it's possible to run oss+ost services on a lustre client, but not > recommended because of the deadlock issue. I played with PVFS1 a bit back in the day. My impression at the time was they they were focused on MPI-IO, and the POSIX layer was a bit of an afterthought -- access with "regular" tools (tar, cp, etc) was pretty slow. I don't know what the situation is with PVFS2. Anyone? > - this is certainly related to more focused systems like google/mapreduce. > but I'm mainly looking for more general-purpose clusters - the space would > be used for normal files, and definitely mixed read/write with something > close to normal POSIX semantics... It seems we're after the same thing. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From cousins at umit.maine.edu Fri Sep 25 15:47:22 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Fri, 25 Sep 2009 18:47:22 -0400 (EDT) Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <200909251614.n8PGE62V011185@bluewest.scyld.com> References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: Hi Rahul, I went through a fair amount of work with this sort of thing (specifying performance and then getting the vendor to bring it up to expectations when performance didn't come close) and I was happiest with Bonnie++ in terms of simplicity of use and the range of stats you get. I haven't kept up with benchmark tools over the last year though. They are benchmarks so as you often hear here: "it all depends". As in, it depends on what sort of applications you are running and whether you want to tune for IOPS or throughput. Sequential or Random. Etc. First thing is that I'd concentrate on the local (server side) performance and then once that is where you expect it work on the NFS side. One thing to try with bonnie++ is to run multiple instances at the same time. For our tests, one single instance of bonnie showed 560 MB/sec writes and 524 MB/sec reads. Going to 4 instances at the same time brought it up to an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. 
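To make the multiple-instance experiment concrete, here is a minimal sketch of one way to launch concurrent bonnie++ runs; the directories, size and user below are placeholders rather than Steve's actual scripts.

  # four bonnie++ instances against the same filesystem, then wait for all of them;
  # -s is the file size in MB (use ~2x RAM so the page cache doesn't flatter the numbers),
  # -n 0 skips the small-file tests, -u nobody is only needed when running as root
  for i in 1 2 3 4; do
      mkdir -p /data/bonnie.$i
      bonnie++ -d /data/bonnie.$i -s 32768 -n 0 -m run$i -u nobody \
          > bonnie.$i.out 2>&1 &
  done
  wait

The aggregate figure is then just the sum of the per-instance block read/write numbers.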
One note about bonding/trunking, check it closely to see that it is working the way you expect. We have a cluster with 14 racks of 20 nodes each rack with a 24 port switch at the top. Each of these switches has four ports trunked together back to the core switch. All nodes have two GbE ports but only eth0 was being used. It turns out that all eth0 MAC addresses in this cluster are even. The hashing algorithm on these switches (HP) only uses the last two bits of the MAC address for a total of four paths. Since all MAC's were even it went from four choices to two so we were only getting half the bandwidth. Once the server has the performance you want, I'd use Netcat from a number of clients at the same time to see if your network is doing what you want. Use netcat and bypass any disks (writing to /dev/null on the server and reading from /dev/zero on the client and vica versa) in order to test that bonding is working. You should be able to fill up the network pipes with aggregate tests from multiple nodes using netcat. Then, test out NFS. You can do this with netcat or with bonnie++ but again I'd recommend running it on multiple nodes at the same time. Good luck. It can be quite a process sorting through it all. I really just meant to comment on your use of only one instance of Bonnie++ on the server. Sorry to go beyond the scope of your question. You probably have already done these other things in a different way. Steve Rahul Nabar wrote: > I now ran bonnie++ but have trouble figuring out if my perf. stats are > up to the mark or not. My original plan was to only estimate the IOPS > capabilities of my existing storage setup. But then again I am quite > ignorant about the finer nuances. Hence I thought maybe I should post > the stats. here and if anyone has comments I'd very much appreciate > hearing them. In any case, maybe my stats help someone else sometime! > I/O stats on live HPC systems seem hard to find. > > Data posted below. Since this is an NFS store I ran bonnie++ from both > a NFS client compute node and the server. (head node) > > Server side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html > > Client side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html > > > Caveat: The cluster was in production so there is a chance of > externalities affecting my data. (am trying it hard to explain why > some stats seem better on the client run than the server run) > > Subsidary Goal: This setup had 23 clients for NFS. In a new cluster > that I am setting up we want to scale this up about 250 clients. Hence > want to estimate what sort of performance I'll be looking for in the > Storage. (I've found most conversations with vendors pretty > non-productive with them weaving vague terms and staying as far away > from quantitative estimates as is possible.) > > (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM > disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB > RAM; Dell 6248 switches. Port bonding on client servers) > > -- > Rahul From hahn at mcmaster.ca Fri Sep 25 15:59:49 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 25 Sep 2009 18:59:49 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: > users to cache data-in-progress to scratch space on the nodes. But there's a > definite draw to a single global scratch space that scales automatically with > the cluster itself. 
using node-local storage is fine, but really an orthogonal issue. if people are willing to do it, it's great and scales nicely. it doesn't really address the question of how to make use of 3-8 TB per node. we suggest that people use node-local /tmp, and like that name because it emphasizes the nature of the space. currently we don't sweat the cleanup of /tmp (in fact we merely have the distro-default 10-day tmpwatch). >> - obviously want to minimize the interference of remote IO to a node's >> jobs. >> for serial jobs, this is almost moot. for loosely-coupled parallel jobs >> (whether threaded or cross-node), this is probably non-critical. even for >> tight-coupled jobs, perhaps it would be enough to reserve a core for >> admin/filesystem overhead. > > I'd also strongly consider a separate network for filesystem I/O. why? I'd like to see some solid numbers on how often jobs are really bottlenecked on the interconnect (assuming something reasonable like DDR IB). I can certainly imagine it could be so, but how often does it happen? is it only for specific kinds of designs (all-to-all users?) >> - distributed filesystem (ceph? gluster? please post any experience!) I >> know it's possible to run oss+ost services on a lustre client, but not >> recommended because of the deadlock issue. > > I played with PVFS1 a bit back in the day. My impression at the time was yeah, I played with it too, but forgot to mention it because it is afaik still dependent on all nodes being up. admittedly, most of the alternatives also assume all servers are up... From alscheinine at tuffmail.us Fri Sep 25 17:05:36 2009 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Fri, 25 Sep 2009 19:05:36 -0500 Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: <4ABD5AD0.4050107@tuffmail.us> I have done only a few experiments with parallel file systems but I've run some benchmarks on each one I've encountered. With regard to Joshua Baker-LePain's comment > I played with PVFS1 a bit back in the day. My impression at the time was > they they were focused on MPI-IO, and the POSIX layer was a bit of an > afterthought -- access with "regular" tools (tar, cp, etc) was pretty slow. > I don't know what the situation is with PVFS2. Of the file systems I tested, PVFS2 with Myrinet, but just 8 nodes, was one of the best. I have the impression that all file systems have bugs; so when using a parallel file system that has not had a decade of development, you should only use it for scratch space. I was on the PVFS developer's mailing list for many years, the unending reports of bugs are scary. My guess is that other file systems have similar problems. Filesystems have subtle complexity. From what little I read, you cannot have both POSIX and an efficient parallel file system. If you plan on using the cluster for jobs that are not embarrassingly parallel, but really need parallelism, then it would be a good idea to not have the filesystem on the compute nodes, in order to avoid unbalanced computation -- for domain decomposition, just one laggard subdomain can slow down the entire calculation. > But there's a definite draw to a single global scratch space that > scales automatically with the cluster itself. Using a parallel filesystem efficiently is difficult, for example, avoiding hotspots. I've read that for large parallel jobs the "hits" on each storage node can be effectively random with collisions resulting in inefficient use of the HDDs.
So for any parallel filesystem the development of the program needs to use MPI-IO in a way that is flexible enough to deal with the specifics of the filesystem: block size, number of stripes and interconnection topology. Alan -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 From jellogum at gmail.com Sat Sep 26 10:32:38 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sat, 26 Sep 2009 10:32:38 -0700 Subject: [Beowulf] What is best OS for GRID or clusters? Message-ID: Any significant reason to use SOLARIS over a Linux distro for development of software? Does the same C/C++ file compile well on both systems? If there are differences where can I find suggested reading material on the topic? Advice would be appreciated. Baker -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Sat Sep 26 11:09:04 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 26 Sep 2009 14:09:04 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <4ABE58C0.3080309@scalableinformatics.com> Jeremy Baker wrote: > Any significant reason to use SOLARIS over a Linux distro for > development of software? Most of our customers, if they are still using Solaris and haven't retired it, have marked it as a legacy platform. No new deployments, and a gradual phase out of existing ones. There are a few point solutions (Nexentastor) which are self contained appliances that don't factor into many of these discussions, but for the most part, Solaris use is on the decline. We wouldn't recommend it for clusters/grids/clouds, unless you have a hard dependency upon it, and no other choice. > Does the same C/C++ file compile well on both systems? If there are Well ... as long as it is well written C/C++, not using OS specific hacks, and the compilers are doing the right thing ... it should compile well on both. This said, we've seen (with Fortran anyway) some pretty gnarly things compile (which should have thrown errors) on different platforms. YMMV. > differences where can I find suggested reading material on the topic? > Advice would be appreciated. > Baker > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jellogum at gmail.com Sat Sep 26 20:34:29 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sat, 26 Sep 2009 20:34:29 -0700 Subject: [Beowulf] Re: How do I work around this patent? In-Reply-To: References: Message-ID: Linux patent friendly group "Open Invention NetworkSM is an intellectual property company that was formed to promote Linux ..." http://www.openinventionnetwork.com/about.php -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:09:33 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:09:33 -0700 Subject: [Beowulf] Bonsai cluster list Message-ID: Share your story about building a small cluster built with simple parts using modern hardware and/or software, recycled legacy technology, or unusual hacks*. *[The term "hacks" (aka hacker) is used in the traditional sense of lawful innovation, in contrast to the term "cracker" which refers to unlawful innovation.] 
-- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:55:08 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:55:08 -0700 Subject: Fwd: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Jeremy Baker Date: Sun, Sep 27, 2009 at 9:54 AM Subject: Re: [Beowulf] What is best OS for GRID or clusters? To: Mark Hahn Sun Solaris Express Developer Edition, w/ Sun Studio compilers, Netbeans ... is a 4G DVD I obtained in the pulp Linux User & Developer (mag issue 76). For legacy hardware, it would work. Time is finite for me, I must pick a system, and presently I have plans to focus on GCC and BSD Linux, that is until a reasonable argument is made to switch, if there is a reason. Baker On Sat, Sep 26, 2009 at 1:05 PM, Mark Hahn wrote: > Any significant reason to use SOLARIS over a Linux distro for development >> of >> software? >> > > in the absence of any other info, it would be absurd to choose solaris. > solaris is merely yet another proprietary version of unix, used by a > vanishingly small fraction of people. generally only those with a fetish > for "commercial support". experienced users know that "support" is always > far short of the PhB dream of a warm cocoon of "everything just > works or someone else fixes it instantly". > > Does the same C/C++ file compile well on both systems? If there are >> > > it depends on the code. any code can be non-portable, but any portable > code will (by definition) compile and run on either OS. > > differences where can I find suggested reading material on the topic? >> > > well, the beowulf list isn't a good place to start. linux is the defacto > standard unix; solaris is merely something you can get from sun. there are > some bits of software which sun still provides only on solaris - the only > possible rational reason to use solaris is to get those (ZFS is really the > only one considered significant.) > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:57:05 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:57:05 -0700 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: EROS (Extremely Reliable Operating System) > http://www.eros-os.org/eros.html > > > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From gerry.creager at tamu.edu Sun Sep 27 16:21:00 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sun, 27 Sep 2009 18:21:00 -0500 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <4ABFF35C.7000707@tamu.edu> Jeremy, I think you'll discover that the Beowulf list tends to comprise a number of folks who are engaged in high performance, or high throughput, computing already, or are coming into the fray, now interested in learning what is composed of the art of the possible. We've a nice assortment of knowledgable folk here, who offer their expertise freely, and whose knowledge is often complementary and extensible, in that one person's experiences and knowledge are often building blocks for another's explanation. 
We tend to run Linux, as a core OS of choice, for a variety of reasons. These include familiarity, experience and comfort levels, and in a number of cases, a systematic determination that it's the best choice for what we're doing. In this post, while apparently asking for opinions about the best OS for grid or cluster computing, you point out "yet another academic OS project" (which is not to dismiss it, but simply to categorize it). EROS, from an academic perspective, looks interesting, but currently impractical. You see, like you, I've a finite temporal resource, and am limited in my current job to a 168 hour work week (and by my wife and family to an even shorter one). I have invested a lot of time in *nix over the years, and have decided to my satisfaction that Linux is the best fit for my scientific efforts. Further (or better|worse, depending on outlook), I prefer CentOS these days for stability. You see, I've isolated clusters that have been running without updates for half a decade, because they're up and stable. I tend to create cluster environments that meet a particular need for performance or throughput, and which can then be administered as efficiently as possible... preferably meaning that neither I, nor my other administrators, have to spend much time with 'em. My real job isn't to play with clusters, OS's or administration, it's to obtain funding and do research using computational models. Please don't take this as a slight. Instead, I'm trying to give you a flavor of *some* of the folks here, and a basis for several of the replies. We're interested, and there are almost certainly folks on this list who've investigated all aspects of what you are asking about. I trust these to answer your queries much better than I can. And don't stop asking. But do realize that we tend to spend a lot of our time trying to get the work out the door rather than searching for the next great tool that could consume all our time learning whether it's practical. Finally, getting back to the query that started all of this, I suspect Linux, and NOT Solaris, would prove easier, by some margin. I recommend you spend a little time investigating NPACI Rocks (yes, I do use them for some clusters) as they have implementations using either Linux or Solaris, and someone's developing a Rocks Roll for grid use, or so I'm told. That could give you a fairly simple implementation path if that's what you're looking for. At first glance, EROS does not look like it's ready for prime time, so I'd not be looking that way. Of course, SOMEONE needs to try it in the cluster world, someday, but I don't have the time to be that person. Good luck in your studies, and welcome to the group! 
gerry Jeremy Baker wrote: > EROS (Extremely Reliable Operating System) > > > http://www.eros-os.org/eros.html > > > > > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From stuartb at 4gh.net Mon Sep 28 08:28:56 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Mon, 28 Sep 2009 11:28:56 -0400 (EDT) Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: > EROS (Extremely Reliable Operating System) > > http://www.eros-os.org/eros.html Looks like abandonware to me. There appears to have been no activity for well over 5 years. If I don't see any activity on a status page for over 2 years then I assume no one is caring for the code. Sometimes it is just mature code, but I expect that there should be some occasional updates. On the other hand, frequent updates (or never ending beta cycles) indicates immature code that I'm hesitant to put into production use. Stuart Barkley -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From tjrc at sanger.ac.uk Mon Sep 28 10:16:42 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon, 28 Sep 2009 18:16:42 +0100 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: On 28 Sep 2009, at 4:28 pm, Stuart Barkley wrote: >> EROS (Extremely Reliable Operating System) >> >> http://www.eros-os.org/eros.html > > Looks like abandonware to me. There appears to have been no activity > for well over 5 years. If I don't see any activity on a status page > for over 2 years then I assume no one is caring for the code. > > Sometimes it is just mature code, but I expect that there should be > some occasional updates. > > On the other hand, frequent updates (or never ending beta cycles) > indicates immature code that I'm hesitant to put into production use. It sounds like "I've completed my PhD now" code maintenance, to me. :-) I wrote a piece of software like that. In my case, it wasn't the basis of my thesis, rather it was a distraction from my PhD studies. But all the same, once my PhD was completed, I never really worked on that piece of software much again... Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From atp at piskorski.com Mon Sep 28 14:03:27 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Mon, 28 Sep 2009 17:03:27 -0400 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <20090928210327.GA11172@piskorski.com> On Sun, Sep 27, 2009 at 09:57:05AM -0700, Jeremy Baker wrote: > EROS (Extremely Reliable Operating System) > > > http://www.eros-os.org/eros.html Is that a joke? EROS and object-capability operating systems in general are indeed interesting and potentially very useful, but what does it have to do with Beowulf clusters? 
I haven't heard of anyone using any capability-secure OS whatsoever on a Beowulf cluster. Any counter-examples would be interesting. Also, Jonathan Shapiro, the head of the EROS project, long ago switched to its successor Coyotos and BitC projects, and then earlier this year, left both Coyotos and academia entirely to work for Microsoft on their Midori project: http://www.coyotos.org/pipermail/bitc-dev/2009-April/001784.html http://www.coyotos.org/pipermail/coyotos-dev/2009-April/001867.html http://www.coyotos.org/pipermail/coyotos-dev/2009-July/001872.html I suppose CapROS (another EROS successor) might still be a live project: http://www.capros.org/ -- Andrew Piskorski http://www.piskorski.com/ From bmcnally at u.washington.edu Fri Sep 25 17:02:56 2009 From: bmcnally at u.washington.edu (Brian McNally) Date: Fri, 25 Sep 2009 17:02:56 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: <4ABD5A30.1080908@u.washington.edu> > One note about bonding/trunking, check it closely to see that it is > working the way you expect. We have a cluster with 14 racks of 20 nodes > each rack with a 24 port switch at the top. Each of these switches has > four ports trunked together back to the core switch. All nodes have two > GbE ports but only eth0 was being used. It turns out that all eth0 MAC > addresses in this cluster are even. The hashing algorithm on these > switches (HP) only uses the last two bits of the MAC address for a total > of four paths. Since all MAC's were even it went from four choices to > two so we were only getting half the bandwidth. I'd second testing to make sure bonding/trunking is working before you base other performance numbers on it. You may also want to consider different bonding modes if you have problems with balancing the traffic out. See: /usr/share/doc/kernel-doc-/Documentation/networking/bonding.txt Just getting bonding working in an optimal way can take some time. Use the port counters on your switches in conjunction with counters on your hosts to make sure traffic is going where you'd expect it to. > Once the server has the performance you want, I'd use Netcat from a > number of clients at the same time to see if your network is doing what > you want. Use netcat and bypass any disks (writing to /dev/null on the > server and reading from /dev/zero on the client and vica versa) in order > to test that bonding is working. You should be able to fill up the > network pipes with aggregate tests from multiple nodes using netcat. You may also consider using iperf for network testing. I used to do raw network tests like this but discovered that iperf is often easier to set up and use. -- Brian McNally From dzaletnev at yandex.ru Fri Sep 25 17:32:28 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Sat, 26 Sep 2009 04:32:28 +0400 Subject: [Beowulf] integrating node disks into a cluster filesystem? Message-ID: <232261253925148@webmail48.yandex.ru> Mark, I use to make experiments with my toy cluster of PS3. and I'm interested in your ideas. PS3 has two network interfaces - GLAN NIC and Wi-fi. Available for running with my firmware 2.70 distros are: Yellow Dog Linux 6.1 NEW, PSUBUNTU (Ubuntu 9.04), Fedora 11. There're Allied Telesyn AT-GS900/8E switch and D-Link DIR-320 Wi-fi router with USB. Except this systems there're Core2Duo E8400/ 8GB RAM/ 1.5 TB HDD and Celeron 1.8/ 1 GB RAM/ 80 GB HDD. 
While I'm waiting when my partner-programmer realize ILP64-scheme in his CFD-package, PS3 stay without any work and they are ready for any experiments with their HDD's of 80 GB. I would prefer not to load their GLAN NICs with something except MPI, but may be it's possible to use wi-fi? There're two PS3's, but it's suffice for an experiment. Dmitry Zaletnev > > users to cache data-in-progress to scratch space on the nodes. But there's a > > definite draw to a single global scratch space that scales automatically with > > the cluster itself. > using node-local storage is fine, but really an orthogonal issue. > if people are willing to do it, it's great and scales nicely. > it doesn't really address the question of how to make use of > 3-8 TB per node. we suggest that people use node-local /tmp, > and like that name because it emphasizes the nature of the space. > currently we don't sweat the cleanup of /tmp (in fact we merely > have the distro-default 10-day tmpwatch). > > > - obviously want to minimize the interference of remote IO to a node's > > > jobs. > > > for serial jobs, this is almost moot. for loosely-coupled parallel jobs > > > (whether threaded or cross-node), this is probably non-critical. even for > > > tight-coupled jobs, perhaps it would be enough to reserve a core for > > > admin/filesystem overhead. > > I'd also strongly consider a separate network for filesystem I/O. > why? I'd like to see some solid numbers on how often jobs are really > bottlenecked on the interconnect (assuming something reasonable like DDR IB). > I can certainly imagine it could be so, but how often does it happen? > is it only for specific kinds of designs (all-to-all users?) > > > - distributed filesystem (ceph? gluster? please post any experience!) I > > > know it's possible to run oss+ost services on a lustre client, but not > > > recommended because of the deadlock issue. > > I played with PVFS1 a bit back in the day. My impression at the time was > yeah, I played with it too, but forgot to mention it because it is afaik > still dependent on all nodes being up. admittedly, most of the alternatives > also assume all servers are up... > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From worringen at googlemail.com Sat Sep 26 11:55:13 2009 From: worringen at googlemail.com (Joachim Worringen) Date: Sat, 26 Sep 2009 20:55:13 +0200 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker wrote: > Any significant reason to use SOLARIS over a Linux distro for development > of software? > Yes, dtrace, documentation, interface stability, quality of community and many things more. I've worked on both platforms a lot, and dtrace alone can save you days when debugging and analysing software, no matter if kernel or userspace. > Does the same C/C++ file compile well on both systems? If there are > differences where can I find suggested reading material on the topic? > Which topic exactly? "Software development" is pretty wide a topic. But just to give you something: I recently bought "The Developer's Edge" and enjoyed it very much (see http://my.safaribooksonline.com/0595352510). Joachim -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rpnabar at gmail.com Mon Sep 28 15:00:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:00:23 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: On Fri, Sep 25, 2009 at 5:47 PM, Steve Cousins wrote: > > Hi Rahul, Thanks for all those comments Steve! > > One thing to try with bonnie++ is to run multiple instances at the same > time. For our tests, one single instance of bonnie showed 560 MB/sec writes > and 524 MB/sec reads. Going to 4 instances at the same time brought it up to > an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. That's interesting. Multiple bonnie++ instances boost the aggregate performance? Why is that? Just curious. > > Once the server has the performance you want, I'd use Netcat from a number Thanks! I've never tried using netcat. That's a good lead for a tool to try. > > Good luck. It can be quite a process sorting through it all. I really just > meant to comment on your use of only one instance of Bonnie++ on the server. > Sorry to go beyond the scope of your question. You probably have already > done these other things in a different way. Not at all. What you suggest is very much within the scope of my current investigation. And no, I don't think I've tried the ideas you mention. So this is new and helpful. Thanks again! -- Rahul From rpnabar at gmail.com Mon Sep 28 15:02:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:02:14 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABD5A30.1080908@u.washington.edu> References: <200909251614.n8PGE62V011185@bluewest.scyld.com> <4ABD5A30.1080908@u.washington.edu> Message-ID: On Fri, Sep 25, 2009 at 7:02 PM, Brian McNally wrote: > I'd second testing to make sure bonding/trunking is working before you base > other performance numbers on it. You may also want to consider different > bonding modes if you have problems with balancing the traffic out. See: > > /usr/share/doc/kernel-doc-/Documentation/networking/bonding.txt Thanks Brian! I'll double check but I have tested in the past to make sure bonding is multiplying my traffic B/W. I am using bonding in the Adaptive Load Balancing (ALB) mode. I forget what tool I had used but I believe it was indeed netperf. Of course the advantage only comes when talking to two+ peers at the same time. -- Rahul From rpnabar at gmail.com Mon Sep 28 15:42:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:42:51 -0500 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: <4ABFF35C.7000707@tamu.edu> References: <4ABFF35C.7000707@tamu.edu> Message-ID: > shorter one). I have invested a lot of time in *nix over the years, and have > decided to my satisfaction that Linux is the best fit for my scientific > efforts. Linux has worked great for us for 10+ years in various flavors. My apps are mainly computational Chemistry. >?Further (or better|worse, depending on outlook), I prefer CentOS > these days for stability. ?You see, I've isolated clusters that have been We've used CentOS. Great. No problems. Also toyed with Red Hat Enterprise in the past. Good too but the packages tend to be older. Some would argue that's the point so that one gets a stable system but sometimes it is irritating. Also you need to pay for the Licenses. I've had an OK experience with Fedora too. 
But some say this is too "unstable" for mainline HPC usage. Just my 2 cents. -- Rahul From cousins at umit.maine.edu Mon Sep 28 15:43:30 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Mon, 28 Sep 2009 18:43:30 -0400 (EDT) Subject: [Beowulf] ISO-8859-1?Q? about_?= my I/O In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: On Mon, 28 Sep 2009, Rahul Nabar wrote: > On Fri, Sep 25, 2009 at 5:47 PM, Steve Cousins wrote: >>> One thing to try with bonnie++ is to run multiple instances at the same >>> time. For our tests, one single instance of bonnie showed 560 MB/sec writes >>> and 524 MB/sec reads. Going to 4 instances at the same time brought it up to >>> an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. >> > That's interesting. Multiple bonnie++ instances boost the aggregate > performance? Why is that? Just curious. In our case we have a number of LUNS (RAID5 volumes) that are striped together to make a single large volume (99 TB). By doing multiple tests at the same time it can take advantage of hitting more LUNS at the same time. So, this may not help you on your current server but it is worth a try, especially if you get a new storage server. >>> Once the server has the performance you want, I'd use Netcat from a >>> number >> > Thanks! I've never tried using netcat. That's a good lead for a tool to try. If you want, I can supply you with the scripts that I used to set up the ports on the clients and the server and run the tests. Steve From jorg.sassmannshausen at strath.ac.uk Tue Sep 29 07:09:13 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 29 Sep 2009 15:09:13 +0100 Subject: [Beowulf] large scratch space on cluster Message-ID: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Dear all, I was wondering if somebody could help me here a bit. For some of the calculations we are running on our cluster we need a significant amount of disc space. The last calculation crashed as the ~700 GB which I made available were not enough. So, I want to set up a RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. However, I was wondering whether it does make any sense to somehow 'export' that scratch space to other nodes (4 cores only). So, the idea behind that is, if I need a vast amount of scratch space, I could use the one in the 8 core node (the one I mentioned above). I could do that with nfs but I got the feeling it will be too slow. Also, I only got GB ethernet at hand, so I cannot use some other networks here. Is there a good way of doing that? Some words like i-scsi and cluster-FS come to mind but to be honest, up to now I never really worked with them. Any ideas? All the best J?rg -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. 
See http://www.gnu.org/philosophy/no-word-attachments.html From john.hearns at mclaren.com Tue Sep 29 09:29:22 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 29 Sep 2009 17:29:22 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> I was wondering if somebody could help me here a bit. For some of the calculations we are running on our cluster we need a significant amount of disc space. The last calculation crashed as the ~700 GB which I made available were not enough. So, I want to set up a RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. Sounds like a cluster I might have had something to do with in a past life... 700 gbytes! My advice - look closely at your software and see why it needs this scratch space, and what you can do to cut down on this. Also, let us know what code this is please. You're right about network transfer of scratch files like that - if at all possible, you should aim to use local scratch space on the nodes. $VENDOR (I think in Warwick!) should be very happy to help you there! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Tue Sep 29 10:08:45 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 29 Sep 2009 13:08:45 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC23F1D.6000903@scalableinformatics.com> Hearns, John wrote: > I was wondering if somebody could help me here a bit. > For some of the calculations we are running on our cluster we need a > significant amount of disc space. The last calculation crashed as the ~700 GB > which I made available were not enough. So, I want to set up a RAID0 on one 8 > core node with 2 1.5 TB discs. So far, so good. > > > Sounds like a cluster I might have had something to do with in a past life... > > > 700 gbytes! My advice - look closely at your software and see why it needs this scratch space, > and what you can do to cut down on this. Heh... some of the coupled cluster GAMESS tests we have seen/run have used this much or more in scratch space. Single threaded readers/writers ... you either need a very fast IO device, or like John suggested, you need to examine what is getting read/written. 700GB @ 1GB/s takes 700 seconds, roughly 11m40s +/- some. 700GB @ 0.1GB/s takes 7000 seconds, roughly 116m40s +/- some (~2 hours). A RAID0 stripe of two drives off the motherboard will be closer to the second than the first ... > > Also, let us know what code this is please. > You're right about network transfer of scratch files like that - if at all possible, > you should aim to use local scratch space on the nodes. > $VENDOR (I think in Warwick!) should be very happy to help you there! I know those guys! (and they are good). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From atchley at myri.com Tue Sep 29 10:13:37 2009 From: atchley at myri.com (Scott Atchley) Date: Tue, 29 Sep 2009 13:13:37 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote: > However, I was wondering whether it does make any sense to somehow > 'export' > that scratch space to other nodes (4 cores only). So, the idea > behind that > is, if I need a vast amount of scratch space, I could use the one in > the 8 > core node (the one I mentioned above). I could do that with nfs but > I got the > feeling it will be too slow. Also, I only got GB ethernet at hand, > so I > cannot use some other networks here. Is there a good way of doing > that? Some > words like i-scsi and cluster-FS come to mind but to be honest, up > to now I > never really worked with them. > > Any ideas? > > All the best > > Jörg I am under the impression that NFS can saturate a gigabit link. If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org ) over Open-MX (http://www.open-mx.org). Scott From atchley at myri.com Tue Sep 29 10:39:05 2009 From: atchley at myri.com (Scott Atchley) Date: Tue, 29 Sep 2009 13:39:05 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> Message-ID: <92933B44-8A80-49E0-86E3-4B1A19CD08FA@myri.com> On Sep 29, 2009, at 1:13 PM, Scott Atchley wrote: > On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote: > >> However, I was wondering whether it does make any sense to somehow >> 'export' >> that scratch space to other nodes (4 cores only). So, the idea >> behind that >> is, if I need a vast amount of scratch space, I could use the one >> in the 8 >> core node (the one I mentioned above). I could do that with nfs but >> I got the >> feeling it will be too slow. Also, I only got GB ethernet at hand, >> so I >> cannot use some other networks here. Is there a good way of doing >> that? Some >> words like i-scsi and cluster-FS come to mind but to be honest, up >> to now I >> never really worked with them. >> >> Any ideas? >> >> All the best >> >> Jörg > > I am under the impression that NFS can saturate a gigabit link. > > If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org > ) over Open-MX (http://www.open-mx.org). I should add that PVFS2 is meant to separate the metadata from IO and have multiple IO servers. You can run it on a single server with both metadata and IO, but it may not be much different than NFS. Scott From Craig.Tierney at noaa.gov Tue Sep 29 11:37:50 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 29 Sep 2009 12:37:50 -0600 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4AC253FE.4080300@noaa.gov> Jörg Saßmannshausen wrote: > Dear all, > > I was wondering if somebody could help me here a bit. 
> For some of the calculations we are running on our cluster we need a > significant amount of disc space. The last calculation crashed as the ~700 GB > which I made available were not enough. So, I want to set up a RAID0 on one 8 > core node with 2 1.5 TB discs. So far, so good. > > However, I was wondering whether it does make any sense to somehow 'export' > that scratch space to other nodes (4 cores only). So, the idea behind that > is, if I need a vast amount of scratch space, I could use the one in the 8 > core node (the one I mentioned above). I could do that with nfs but I got the > feeling it will be too slow. Also, I only got GB ethernet at hand, so I > cannot use some other networks here. Is there a good way of doing that? Some > words like i-scsi and cluster-FS come to mind but to be honest, up to now I > never really worked with them. > You could do something crazy like dynamically create distributed filesystems using GlusterFS (or other Open Source FS) using the local storage of each node that the job is using. This way it is dedicated to your job, share it in your job, and not impact other jobs. Each node needs a disk, but that isn't too expensive. Also, you can skip the RAID part (unless it is for performance) because if the disk dies, it only affects that one node. We tried this for awhile. It worked ok (with GlusterFS), but then we got a good Lustre setup and the performance of the dynamic version didn't justify the effort and maintenance. However, on a smaller system where I don't have that many resources, I might try this again. Craig > Any ideas? > > All the best > > J?rg > -- Craig Tierney (craig.tierney at noaa.gov) From jellogum at gmail.com Tue Sep 29 13:56:50 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 29 Sep 2009 16:56:50 -0400 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: <4ABFF35C.7000707@tamu.edu> References: <4ABFF35C.7000707@tamu.edu> Message-ID: The quality of writing and thoughtful insight presented on this board has kept me coming back over the years as a reader, as I have observed and learned a great deal by reading various posts... I really appreciate the feed back. Thank you. Is this forum an appropriate place to discuss software concepts, issues, and questions related to modeling a problem to be implemented by a cluster, or is it mostly a place for shop talk to address hardware specs...? My goal is to spend less time with the hardware and more time modeling problems with software, but I am directing some effort to understand the mechanics of the involved hardware components so as to write good code... by trying to understand the nature of the machine, how it works, and how it fits together. This endeavor has spread my time thin, sometimes yielding information that I can not use nor understand, so I welcome criticism to help me focus. I'll try to keep my fuzzy CS questions limited. A little about me: For most of my twenties I lived out of a backpack hitching the country working various terrestrial and maritime jobs from coast to coast, and recently completed my college degree. Currently, I work odd jobs to make ends meet, and with my freetime I enjoy the topics of CS and science, illustrate art, and practise classical guitar. During my undergraduate studies at JSC VT, a college professor, Martin A. Walker now teaching chemistry at SUNY Potsdam, influenced me to install Linux on a system and to work on chemistry problems. 
I decided to pick an interesting problem that I could spend a long time developing, and choose a focus related to molecular biology and hard sciences with a smattering of math classes. I love the rain forest jungle of Linux, but I have become lost in it too... My first cluster will be constructed using recycled legacy x86 pcs, classical cluster w/ a Linux kernel, and perhaps I will start with a failover cluster before parallel high-throughput... Have a nice day, Jeremy On Sun, Sep 27, 2009 at 7:21 PM, Gerry Creager wrote: > Jeremy, > > I think you'll discover that the Beowulf list tends to comprise a number of > folks who are engaged in high performance, or high throughput, computing > already, or are coming into the fray, now interested in learning what is > composed of the art of the possible. > > We've a nice assortment of knowledgable folk here, who offer their > expertise freely, and whose knowledge is often complementary and extensible, > in that one person's experiences and knowledge are often building blocks for > another's explanation. > > We tend to run Linux, as a core OS of choice, for a variety of reasons. > These include familiarity, experience and comfort levels, and in a number of > cases, a systematic determination that it's the best choice for what we're > doing. > > In this post, while apparently asking for opinions about the best OS for > grid or cluster computing, you point out "yet another academic OS project" > (which is not to dismiss it, but simply to categorize it). EROS, from an > academic perspective, looks interesting, but currently impractical. > > You see, like you, I've a finite temporal resource, and am limited in my > current job to a 168 hour work week (and by my wife and family to an even > shorter one). I have invested a lot of time in *nix over the years, and have > decided to my satisfaction that Linux is the best fit for my scientific > efforts. Further (or better|worse, depending on outlook), I prefer CentOS > these days for stability. You see, I've isolated clusters that have been > running without updates for half a decade, because they're up and stable. I > tend to create cluster environments that meet a particular need for > performance or throughput, and which can then be administered as efficiently > as possible... preferably meaning that neither I, nor my other > administrators, have to spend much time with 'em. My real job isn't to play > with clusters, OS's or administration, it's to obtain funding and do > research using computational models. > > Please don't take this as a slight. Instead, I'm trying to give you a > flavor of *some* of the folks here, and a basis for several of the replies. > We're interested, and there are almost certainly folks on this list who've > investigated all aspects of what you are asking about. I trust these to > answer your queries much better than I can. And don't stop asking. But do > realize that we tend to spend a lot of our time trying to get the work out > the door rather than searching for the next great tool that could consume > all our time learning whether it's practical. > > Finally, getting back to the query that started all of this, I suspect > Linux, and NOT Solaris, would prove easier, by some margin. I recommend you > spend a little time investigating NPACI Rocks (yes, I do use them for some > clusters) as they have implementations using either Linux or Solaris, and > someone's developing a Rocks Roll for grid use, or so I'm told. 
That could > give you a fairly simple implementation path if that's what you're looking > for. At first glance, EROS does not look like it's ready for prime time, so > I'd not be looking that way. Of course, SOMEONE needs to try it in the > cluster world, someday, but I don't have the time to be that person. > > Good luck in your studies, and welcome to the group! > gerry > > Jeremy Baker wrote: > >> EROS (Extremely Reliable Operating System) >> >> >> http://www.eros-os.org/eros.html >> >> >> >> >> >> -- >> Jeremy Baker >> PO 297 >> Johnson, VT >> 05656 >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- > Gerry Creager -- gerry.creager at tamu.edu > Texas Mesonet -- AATLT, Texas A&M University > Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Tue Sep 29 17:56:08 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 29 Sep 2009 20:56:08 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> References: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> Message-ID: Re: dtrace, used for dynamic tracing of Solaris real time kernel and software behavior, as Wikipedia states it has been ported to other unix-like systems, I wonder if would one of these systems be a Linux kernel? Dtrace looks very useful. I'll look for it and will check my local library for the suggested reading material. Jeremy On Sat, Sep 26, 2009 at 2:55 PM, Joachim Worringen wrote: > On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker wrote: > >> Any significant reason to use SOLARIS over a Linux distro for development >> of software? >> > > Yes, dtrace, documentation, interface stability, quality of community and > many things more. I've worked on both platforms a lot, and dtrace alone can > save you days when debugging and analysing software, no matter if kernel or > userspace. > > >> Does the same C/C++ file compile well on both systems? If there are >> differences where can I find suggested reading material on the topic? >> > > Which topic exactly? "Software development" is pretty wide a topic. But > just to give you something: I recently bought "The Developer's Edge" and > enjoyed it very much (see http://my.safaribooksonline.com/0595352510). > > Joachim > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorg.sassmannshausen at strath.ac.uk Tue Sep 29 12:15:19 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 29 Sep 2009 20:15:19 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <4AC23F1D.6000903@scalableinformatics.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> <4AC23F1D.6000903@scalableinformatics.com> Message-ID: <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> Hi Joe, thanks for the prompt reply. Actually, it is not GAMESS which is causing the problem but Molpro. The reason why it needs that much space is simply the size of the molecule and the functional ( CCSD(T) ). Both are not in favour of a fast job with little scratch space. I don't think there is much I can do in terms of the program side. If I want to run the job, I need more scratch space. I always thought that a RAID0 stripe is the best solution for fast and large scratch space? That is the reason why I thought of that. Besides, these are a large number of small files, so you don't read 700 GB at once. Else it would be an impossible task. ;-) I did that once before, and that was over a NFS share, and it acutally was working not too bad... until somebody triggered the power-switch and did not put it back quick enough so the UPS was running out of battery power :-( Besides, I have already contacted the $VENDORs ;-) All the best J?rg On Dienstag 29 September 2009 Joe Landman wrote: > Hearns, John wrote: > > I was wondering if somebody could help me here a bit. > > For some of the calculations we are running on our cluster we need a > > significant amount of disc space. The last calculation crashed as the > > ~700 GB which I made available were not enough. So, I want to set up a > > RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. > > > > > > Sounds like a cluster I might have had something to do with in a past > > life... > > > > > > 700 gbytes! My advice - look closely at your software and see why it > > needs this scratch space, and what you can do to cut down on this. > > Heh... some of the coupled cluster GAMESS tests we have seen/run have > used this much or more in scratch space. > > Single threaded readers/writers ... you either need a very fast IO > device, or like John suggested, you need to examine what is getting > read/written. > > 700GB @ 1GB/s takes 700 seconds, roughly 11m40s +/- some. > 700GB @ 0.1GB/s takes 7000 seconds, roughly 116m40s +/- some (~2 hours). > > A RAID0 stripe of two drives off the motherboard will be closer to the > second than the first ... > > > Also, let us know what code this is please. > > You're right about network transfer of scratch files like that - if at > > all possible, you should aim to use local scratch space on the nodes. > > $VENDOR (I think in Warwick!) should be very happy to help you there! > > I know those guys! (and they are good). > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. 
Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From brs at usf.edu Tue Sep 29 19:00:26 2009 From: brs at usf.edu (Brian Smith) Date: Tue, 29 Sep 2009 22:00:26 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> Message-ID: <750CC6F4-AE53-46D4-9878-B72B9C83417C@usf.edu> Not to my knowledge. Linux has systemtap instead: http://sourceware.org/systemtap/ Also, see here for a comparison: http://sourceware.org/systemtap/wiki/SystemtapDtraceComparison -Brian On Sep 29, 2009, at 8:56 PM, Jeremy Baker wrote: > Re: dtrace, used for dynamic tracing of Solaris real time kernel and > software behavior, as Wikipedia states it has been ported to other > unix-like systems, I wonder if would one of these systems be a Linux > kernel? Dtrace looks very useful. I'll look for it and will check my > local library for the suggested reading material. > > Jeremy > > > > On Sat, Sep 26, 2009 at 2:55 PM, Joachim Worringen > wrote: > On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker > wrote: > Any significant reason to use SOLARIS over a Linux distro for > development of software? > > Yes, dtrace, documentation, interface stability, quality of > community and many things more. I've worked on both platforms a lot, > and dtrace alone can save you days when debugging and analysing > software, no matter if kernel or userspace. > > Does the same C/C++ file compile well on both systems? If there are > differences where can I find suggested reading material on the topic? > > Which topic exactly? "Software development" is pretty wide a topic. > But just to give you something: I recently bought "The Developer's > Edge" and enjoyed it very much (see http://my.safaribooksonline.com/0595352510 > ). > > Joachim > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Tue Sep 29 22:23:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 00:23:14 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? Message-ID: Any good recommendation on a crash cart for a cluster room? My last cluster was small and we had the luxury of having a KVM + SIP connecting to each compute node. I doubt that will be feasible this time around , now that I have 200+ nodes. How do other sys admins handle this? Just a simple crash cart? Or there any other options that make life easier in the long run. Note that for my head nodes etc. I do plan on having a small (4 port) KVM in the main rack with its console and rackmount keyboard. I guess the crash cart will cause a duplication of this keyboard + monitor but that can't be avoided. Or can it? 
I'm not trying to be cheap here but just splurge a bit and get a solution that will make late hours of debugging a little more palatable for the sys-admins! -- Rahul From landman at scalableinformatics.com Tue Sep 29 22:38:14 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Sep 2009 01:38:14 -0400 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: <4AC2EEC6.4000902@scalableinformatics.com> Rahul Nabar wrote: > I'm not trying to be cheap here but just splurge a bit and get a > solution that will make late hours of debugging a little more > palatable for the sys-admins! Most decent nodes will have IPMI and kvm over IP built in. That and a reasonable serial concentrator will make your admins lives *much* easier. We help manage clusters/storage hundreds, thousands, and often continents away from us. A crash cart makes it infeasible to consider this. And if the admin gets paged by someone at 3am, I am sure they won't want to drive in and use the crash cart to find/fix the problem. Keep the crash cart, but don't spend more than $300 on it (monitor + keyboard/mouse and a cart that rolls). It isn't for management, its for last resort. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Tue Sep 29 23:26:55 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Sep 2009 02:26:55 -0400 (EDT) Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: > Any good recommendation on a crash cart for a cluster room? My last I've got a rolling cart, 15" vga and keyboard, not a big deal. it's technically an AV cart - provides two other shelves, and enough space for a pad or some tools, even a node sometimes. > nodes. How do other sys admins handle this? Just a simple crash cart? > Or there any other options that make life easier in the long run. thank goodness for lan IPMI (bios, serial console redirection) - it keeps me out of the machineroom 99% of the time. heck, it keeps me out of the dozen other machinerooms that our ~30 clusters are in. > Note that for my head nodes etc. I do plan on having a small (4 port) > KVM in the main rack with its console and rackmount keyboard. I guess not worth it imo - I'd use the crash cart, since the need is so rare. > I'm not trying to be cheap here but just splurge a bit and get a > solution that will make late hours of debugging a little more > palatable for the sys-admins! ipmi ipmi ipmi. try to avoid the obnoxious nonstandard proprietary POS's that vendors push. you want remote power on/off/reset, then serial redirect and hopefully bios redirect too. then temperature monitoring, and SEL access. this should be in every entry-level server IMO. you know the vendor has jumped the shark if/when they provide xml-based scripting and offer licensed extended features rather than standard ipmi support... From beat at 0x1b.ch Tue Sep 29 23:30:09 2009 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 30 Sep 2009 08:30:09 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: Hello! 
Quoting (30.09.09 07:38): > Most decent nodes will have IPMI and kvm over IP built in. That and a > reasonable serial concentrator will make your admins lives *much* easier. +1 vote from me. > Keep the crash cart, but don't spend more than $300 on it (monitor + > keyboard/mouse and a cart that rolls). It isn't for management, its for > last resort. One of the best solutions I saw so far was a KVM drawer in one rack with a small KVM switch and a single, long cable per rack. Open the rack, attach the cable and you're done. Moving a cart through a typical datacenter with cables, boxes and other crap on the floor is usually a mess. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From john.hearns at mclaren.com Wed Sep 30 01:24:31 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 09:24:31 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk><68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com><4AC23F1D.6000903@scalableinformatics.com> <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D525194@milexchmb1.mil.tagmclarengroup.com> Actually, it is not GAMESS which is causing the problem but Molpro. The reason why it needs that much space is simply the size of the molecule and the functional ( CCSD(T) ). Both are not in favour of a fast job with little scratch space. I don't think there is much I can do in terms of the program side. If I want to run the job, I need more scratch space. Yes, but do these scratch files have to be on a central disk server, or is it possible to have them local to the nodes? If local, you have a much easier problem. I can say that $VENDOR put in one very whizzy cluster which had four fast SCSI drives striped together for local scratch storage, for running a finite element code. It went like gangbusters. In your case, you could either put pairs of striped drives in each compute node. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Wed Sep 30 01:27:27 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 09:27:27 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <4AC2EEC6.4000902@scalableinformatics.com> References: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> Most decent nodes will have IPMI and kvm over IP built in. That and a reasonable serial concentrator will make your admins lives *much* easier. I agree wholeheartedly with what Joe says. And Rahul, are you not talking to vendors who are telling you about their remote management and node imaging capabilities? By vendors, I do not mean your local Tier 1 salesman, who sells servers to normal businesses and corporations. I mean either a specialised cluster vendor, such as those on this list, or the HPC specialist team within (Sun, Dell, IBM, HP...) 
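Setting up such a pair is only a handful of commands. A minimal sketch with mdadm, assuming the two scratch disks show up as /dev/sdb and /dev/sdc and that /scratch is where you want it mounted (chunk size and filesystem are a matter of taste, and this of course wipes both disks):

    # build a two-disk RAID0 and put a filesystem on it
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mkdir -p /scratch
    mount /dev/md0 /scratch
    # record the array so it reassembles at boot
    mdadm --detail --scan >> /etc/mdadm.conf

RAID0 here is purely for speed and capacity - if one disk dies you lose the scratch contents, which is usually an acceptable trade for per-job temporary files.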
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From lynesh at Cardiff.ac.uk Wed Sep 30 03:34:46 2009 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Wed, 30 Sep 2009 11:34:46 +0100 Subject: [Beowulf] cables all over the floor In-Reply-To: References: Message-ID: <1254306886.2268.9.camel@w609.insrv.cf.ac.uk> On Wed, 2009-09-30 at 08:30 +0200, Beat Rubischon wrote: > Hello! > > One of the best solutions I saw so far was a KVM drawer in one rack with a > small KVM switch and a single, long cable per rack. Open the rack, attach > the cable and you're done. Moving a cart through a typical datacenter with > cables, boxes and other crap on the floor is usually a mess. ...counts to 10.... It sounds like these "typical" datacentres might benefit from an investment in cable management. Really expensive items like cable ties and labels cost less than KVMs. And then maybe some shelving to store all that "crap all over the floor". Thanks, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From rpnabar at gmail.com Wed Sep 30 05:43:54 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:43:54 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 3:27 AM, Hearns, John wrote: > > And Rahul, are you not talking to vendors who are telling you about > their remote management and > node imaging capabilities? By vendors, I do not mean your local Tier 1 > salesman, who sells servers to normal businesses > and corporations. Thanks for all the suggestions guys! I am aware of IPMI and my hardware does support it (I think). Its just that I've never had much use for it all my past clusters being very small. -- Rahul From rpnabar at gmail.com Wed Sep 30 05:48:28 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:48:28 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: On Wed, Sep 30, 2009 at 1:30 AM, Beat Rubischon wrote: > One of the best solutions I saw so far was a KVM drawer in one rack with a > small KVM switch and a single, long cable per rack. Open the rack, attach > the cable and you're done. Moving a cart through a typical datacenter with > cables, boxes and other crap on the floor is usually a mess. > Thanks! This is exactly what I will shop for then. I used the term "crash cart" in a more generic sense. I've seen crash-carts parked in cluster rooms before and they do look unwieldy. What you suggest seems a better option. -- Rahul From rpnabar at gmail.com Wed Sep 30 05:58:38 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:58:38 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? 
In-Reply-To: References: Message-ID: On Wed, Sep 30, 2009 at 1:26 AM, Mark Hahn wrote: > thank goodness for lan IPMI (bios, serial console redirection) - it keeps me > out of the machineroom 99% of the time. ?heck, it keeps me out of the dozen > other machinerooms that our ~30 clusters are in. I am still a bit confused about the exact config. Trying to clarify the picture! Question: Do you have: (a) a separate eth cable coming out of each server that takes the IPMI packets (b) or are the packets pushed out over the primary eth cable already consolidated at the eth card? In case my option-(a)-picture is correct, then it means a doubling of switch ports needed which wouldn't be so nice. Sorry, I probably sound a total luddite but no point pretending I know about the typical setup. [The stuff that I *have* used on servers in the past looked like this: a dongle that connects over the monitor-out and USB port; traps signals; converts them to I/P--> plugs into a separate switch--> console; But I suspect the solution you guys are all so happy about is not this but a better version! Yes, I know, I am a caveman! ] --- Rahul From rpnabar at gmail.com Wed Sep 30 06:13:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:13:52 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: On Wed, Sep 30, 2009 at 1:26 AM, Mark Hahn wrote: > >> Note that for my head nodes etc. I do plan on having a small (4 port) >> KVM in the main rack with its console and rackmount keyboard. I guess > > not worth it imo - I'd use the crash cart, since the need is so rare. Good point! If "IPMI over LAN" works so well I might as well get rid of this mini KVM. > > ipmi ipmi ipmi. ?try to avoid the obnoxious nonstandard proprietary POS's > that vendors push. Thanks for the comments Mark. I'll try and not offend any vendors this time around! :) I had stayed away from "remote manage" precisely because most of what I had heard seemed vendor specific, proprietary systems. I wasn't aware of this public implementation of remote management! >you want remote power on/off/reset, then serial > redirect and hopefully bios redirect too. ?then temperature monitoring, > and SEL access. ?this should be in every entry-level server IMO. > > you know the vendor has jumped the shark if/when they provide xml-based > scripting and offer licensed extended features rather than standard ipmi > support... There are indeed a large number of vendor specific GUI solutions out there. I had played with one of those and it sort of put me off "remote manage" [Again, no offense vendors.] But I'll definitely toy with getting "IPMI over LAN" especially since all of you are so unanimously happy about it!! -- Rahul From john.hearns at mclaren.com Wed Sep 30 06:19:48 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 14:19:48 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> > thank goodness for lan IPMI (bios, serial console redirection) - it keeps me > out of the machineroom 99% of the time. ?heck, it keeps me out of the dozen > other machinerooms that our ~30 clusters are in. I am still a bit confused about the exact config. Trying to clarify the picture! 
Question: Do you have: (a) a separate eth cable coming out of each server that takes the IPMI packets (b) or are the packets pushed out over the primary eth cable already consolidated at the eth card? It depends. Supermicro use the shared-socket approach (actually it is a bridge somewhere on the motherboard), or with Supermicro you can have a separate socket using a little cable with a minu-USB connector onto the IPMI card. Other manufacturers use (a) or (b). On a blade setup the IPMI is carried over the backplane Ethernet links. In case my option-(a)-picture is correct, then it means a doubling of switch ports needed which wouldn't be so nice. If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you do not need the same type of switches. What you need is some cheap 10/100 switches, one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. Then you just connect the switches together in a loop. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From rpnabar at gmail.com Wed Sep 30 06:23:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:23:52 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 8:19 AM, Hearns, John wrote: > It depends. Supermicro use the shared-socket approach (actually it is a bridge > somewhere on the motherboard), or with Supermicro you can have a separate > socket using a little cable with a minu-USB connector onto the IPMI card. > Other manufacturers use (a) or (b). > On a blade setup the IPMI is carried over the backplane Ethernet links. > > > If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you > do not need the same type of switches. What you need is some cheap 10/100 switches, > one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. > Then you just connect the switches together in a loop. > I like the shared socket approach. Building a separate IPMI network seems a lot of extra wiring to me. Admittedly the IPMI switches can be configured to be dirt cheap but it still feels like building a extra tiny road for one car a day when a huge highway with spare capacity exists right next door carrying thousands of cars. (Ok, cheesy analogy!) -- Rahul From john.hearns at mclaren.com Wed Sep 30 06:30:50 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 14:30:50 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> I like the shared socket approach. Building a separate IPMI network seems a lot of extra wiring to me. Admittedly the IPMI switches can be configured to be dirt cheap but it still feels like building a extra tiny road for one car a day when a huge highway with spare capacity exists right next door carrying thousands of cars. (Ok, cheesy analogy!) Errrr.... 
you missed all my Beowulf posts about the clashes with the IPMI ports and the ports used for 'rsh' connections on a cluster then? And all the shenanigans with setting sunrpc.min_resvport etc.? Having a separate, simple IPMI network which comes up when you power the racks up has a lot of advantages. 10/100 Netgear switches cost almost nothing, and getting another loom of Cat5 cables configured when the racks are being built is relatively easy. By the way, which hardware do you use? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gerry.creager at tamu.edu Wed Sep 30 06:53:01 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 30 Sep 2009 08:53:01 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC362BD.4020800@tamu.edu> Hearns, John wrote: > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be > configured to be dirt cheap but it still feels like building a extra > tiny road for one car a day when a huge highway with spare capacity > exists right next door carrying thousands of cars. (Ok, cheesy > analogy!) > > > Errrr.... you missed all my Beowulf posts about the clashes with the > IPMI ports > and the ports used for 'rsh' connections on a cluster then? And all the > shenanigans > with setting sunrpc.min_resvport etc.? > > Having a separate, simple IPMI network which comes up when you power the > racks up > has a lot of advantages. 10/100 Netgear switches cost almost nothing, > and getting > another loom of Cat5 cables configured when the racks are being built is > relatively easy. > > By the way, which hardware do you use? We've been down both paths. On our recent acquisition, we ended up with separate, dedicated IPMI ports, despite our spec stating we wanted shared socked ports. I bought 4 Netgear switches and added infrastructure cabling. Having been down both paths, now, in the last year (nothing is too old to have the memory clear in my mind) I definitely have decided the completely separate IPMI network plan is superior overall. I wish I could retrofit the Dell cluster to accomplish this, but it ain't gonna happen. It's a much cleaner (from a cluster management view) approach, IMNSHO. gerry From rpnabar at gmail.com Wed Sep 30 06:58:22 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:58:22 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 8:30 AM, Hearns, John wrote: > > By the way, which hardware do you use? Dell in the past. This time around still comparing vendors. 
-- Rahul From john.hearns at mclaren.com Wed Sep 30 07:07:02 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 15:07:02 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE1D2@milexchmb1.mil.tagmclarengroup.com> Dell in the past. This time around still comparing vendors. Dell refer to IPMI management as 'DRAC', not to be confused with http://www.drac.org.uk/ BTW, if you really don't like extra network cabling and choosing Ethernet switches, why not go for a blade solution? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Wed Sep 30 07:09:23 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Sep 2009 10:09:23 -0400 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC36693.3040801@scalableinformatics.com> Rahul Nabar wrote: > On Wed, Sep 30, 2009 at 8:19 AM, Hearns, John wrote: > >> It depends. Supermicro use the shared-socket approach (actually it is a bridge >> somewhere on the motherboard), or with Supermicro you can have a separate >> socket using a little cable with a minu-USB connector onto the IPMI card. >> Other manufacturers use (a) or (b). >> On a blade setup the IPMI is carried over the backplane Ethernet links. >> >> >> If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you >> do not need the same type of switches. What you need is some cheap 10/100 switches, >> one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. >> Then you just connect the switches together in a loop. >> > > > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be Allow me to point out the contrary view. After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach. The extra cost/hassle of the extra cheap switch and wires is well worth the money. Why do we take this view? Many reasons, but some of the bigger ones are a) when the OS takes the port down, your IPMI no longer responds to arp requests. Which means ping, and any other service (IPMI) will fail without a continuous updating of the arp tables, or a forced hardwire of those ips to those mac addresses. b) IPMI stack bugs (what ... you haven't seen any? you must not be using IPMI ...). My favorite in recent memory (over the last year) was one where IPMI did some a DHCP and got itself wedged into a strange state. To unwedge it, we had to disconnect the IPMI network port, issue an mc reset cold, wait, and the plug it back in. Hard to do when the eth0 and IPMI share the same port. 
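For anyone who hasn't lived through this, the day-to-day commands are nothing exotic. A rough sketch with ipmitool (the hostname, user and password here are made up, and exact behaviour varies between BMC firmwares):

    # out-of-band, from the admin node, against the BMC's own address
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret chassis power status
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret sel list
    # cold-reset a wedged BMC; in-band on the node itself this is just
    # "ipmitool mc reset cold"
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret mc reset cold

The point is that every one of these has to keep working when the host OS, or the shared port, is the thing that is broken.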
Of course I could also talk about the SOL (serial over lan) which didn't (grrrrrrrrrr) Short version, we advise everyone, including some on this list, to always use a second independent IPMI network. We make sure that anyone insisting upon one really truly understands what they are in for. I want to emphasize this. It is, in my opinion, one of the many false savings you can make in cluster design, to pull out the extra switch and wires for IPMI. Its false savings, in that you will likely eat up the cost/effort difference between the two variants in terms of excess labor, self-hair removal, ... Really ... its not worth the pain. Go with two nets. FWIW: most of the server class Supermicro boards (the Nehalems) now come with IPMI and kvm over IP built in, on a separate NIC. Some do share the NIC, we simply avoid using those boards in most cases. Note also: for real lights out capability, we configure alternative management paths. Again, it saves you time/effort/resources down the road for a modest/minimal investment up front. Switched PDUs and a serial port concentrator (or our management node with lots of serial ports ...). It makes life *sooo* much better when "b" strikes, and you need to de-wedgify a node or three, and you are too far to drive in. There is lots to be said for real lights out capability. Park one crash cart in a corner, and hope you will never have to use it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From tjrc at sanger.ac.uk Wed Sep 30 07:19:17 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 30 Sep 2009 15:19:17 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: On 30 Sep 2009, at 2:23 pm, Rahul Nabar wrote: > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be > configured to be dirt cheap but it still feels like building a extra > tiny road for one car a day when a huge highway with spare capacity > exists right next door carrying thousands of cars. (Ok, cheesy > analogy!) Yes, but the tiny road is still useable by the emergency services when there's been a pileup on the main cariageway, and there are wreckage and bodies everywhere! Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From eugen at leitl.org Wed Sep 30 07:20:18 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 30 Sep 2009 16:20:18 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20090930142018.GA27331@leitl.org> On Wed, Sep 30, 2009 at 08:58:22AM -0500, Rahul Nabar wrote: > > By the way, which hardware do you use? > > Dell in the past. This time around still comparing vendors. Are you happy with the IPMI on Supermicro? 
I've been unable to personally test a KVM + remote media IPMI from them, especially the one integrated into their newer motherboards. I was quite happy with Sun's eLOM. Less happy with HP and Dell (and Fujitsu-Siemens), which require extra licenses to unlock anything above baseline IPMI capability (oh, and don't get me started on them not selling empty drive caddies). -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From john.hearns at mclaren.com Wed Sep 30 07:31:15 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 15:31:15 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <20090930142018.GA27331@leitl.org> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com><68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> <20090930142018.GA27331@leitl.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com> Are you happy with the IPMI on Supermicro? I've been unable to personally test a KVM + remote media IPMI from them, Eugen, I think you're asking that one of me? I'm very, very happy with IPMI on SGI equipment. Rock solid reliability, and you can power cycle blades/IRUs/entire racks when you're sitting in your pyjamas. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From beat at 0x1b.ch Wed Sep 30 07:39:14 2009 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 30 Sep 2009 16:39:14 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: Message-ID: Hello! Quoting (30.09.09 16:19): > Yes, but the tiny road is still useable by the emergency services when > there's been a pileup on the main cariageway, and there are wreckage > and bodies everywhere! I have already seen a lot of Beowulfs where the management network is so rarely used that nobody detects errors. When an emergency arrives, the operators stumble over screwed-up switches and unplugged cables. So one vote for the shared NIC from me. On Intel boards this has worked well since summer 2006 (Woodcrest, S5000 chipsets); other platforms started to be usable during 2008/2009. Older boards usually have BMCs which are too old to be really stable. But as always: YMMV. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From hahn at mcmaster.ca Wed Sep 30 07:54:54 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Sep 2009 10:54:54 -0400 (EDT) Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: > (a) a separate eth cable coming out of each server that takes the IPMI packets yes. > (b) or are the packets pushed out over the primary eth cable already > consolidated at the eth card? I don't have any of these shared configs, but they exist. I'm not clear on how well they work. 
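either way, the setup side is small. a minimal sketch with ipmitool, in the spirit of the serial/bios redirection I mentioned (channel number, addresses, user and password are all placeholders, and you still need something like console=ttyS1,115200 on the kernel command line for the serial redirect to be worth having):

    # in-band, on the node: see how the BMC's LAN channel is set up
    ipmitool lan print 1
    # give the BMC a static address on the management subnet
    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 10.1.0.42
    ipmitool lan set 1 netmask 255.255.0.0
    # then, from the admin node, serial-over-lan and power control
    ipmitool -I lanplus -H 10.1.0.42 -U admin -P secret sol activate
    ipmitool -I lanplus -H 10.1.0.42 -U admin -P secret chassis power cycle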
Obviously IPMI is low-traffic, so I would think a shared config could work well.

> In case my option-(a)-picture is correct, then it means a doubling of
> switch ports needed, which wouldn't be so nice.

Switch ports are cheap; if yours are not, you're doing something wrong. Especially since IPMI is always 100bT (as far as I've seen), so we're talking about any old commodity switch: isn't $2-3/port worth it?

> Sorry, I probably sound a total luddite but no point pretending I know
> about the typical setup. [The stuff that I *have* used on servers in
> the past looked like this: a dongle that connects over the monitor-out
> and USB port; traps signals; converts them to I/P--> plugs into a
> separate switch--> console; But I suspect the solution you guys are

I've never used/seen a real KVM-over-IP (I have some KVM-over-cat5, though).

From gerry.creager at tamu.edu Wed Sep 30 08:26:43 2009
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Wed, 30 Sep 2009 10:26:43 -0500
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> <20090930142018.GA27331@leitl.org> <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4AC378B3.4090206@tamu.edu>

Hearns, John wrote:

> > Are you happy with the IPMI on Supermicro? I've been unable to
> > personally test KVM + remote media over IPMI from them,
>
> Eugen, I think you're asking that one of me?
> I'm very, very happy with IPMI on SGI equipment. Rock-solid reliability,
> and you can power cycle blades/IRUs/entire racks while you're sitting in
> your pyjamas.

I've really been happy with the SuperMicro IPMI functionality. I'm still learning a few things about it, but it's been working fine.

gc

From landman at scalableinformatics.com Wed Sep 30 08:34:16 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 30 Sep 2009 11:34:16 -0400
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4AC37A78.9070604@scalableinformatics.com>

Tim Cutts wrote:

> On 30 Sep 2009, at 2:23 pm, Rahul Nabar wrote:
>
>> I like the shared socket approach. Building a separate IPMI network
>> seems a lot of extra wiring to me. Admittedly the IPMI switches can be
>> configured to be dirt cheap, but it still feels like building an extra
>> tiny road for one car a day when a huge highway with spare capacity
>> exists right next door carrying thousands of cars. (OK, cheesy analogy!)
>
> Yes, but the tiny road is still usable by the emergency services when
> there's been a pileup on the main carriageway, and there's wreckage and
> bodies everywhere!

If this is happening in your computer room, you have bigger issues to worry about than crash carts ... just saying ... :)

> Tim

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From hahn at mcmaster.ca Wed Sep 30 12:13:29 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 30 Sep 2009 15:13:29 -0400 (EDT)
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References:
Message-ID:

>> Yes, but the tiny road is still usable by the emergency services when
>> there's been a pileup on the main carriageway, and there's wreckage
>> and bodies everywhere!
>
> I've already seen a lot of Beowulfs where the management network is so
> rarely used that nobody detects errors. When an emergency arrives, the
> operators struggle over the screwed-up switches and unplugged cables.

I think that's insane (no offense intended), and the same goes for the comment about crash carts being hard to maneuver. The management network should be constantly used for monitoring; never testing or using it is a serious admin failure, IMO. Having enough obstructions in your machine room that you can't move a small cart around is also a serious failure: such a room would necessarily also have disastrous airflow, etc.

From lindahl at pbm.com Wed Sep 30 14:53:56 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 30 Sep 2009 14:53:56 -0700
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
References: <20090930142018.GA27331@leitl.org> <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <20090930215356.GC9237@bx9.net>

On Wed, Sep 30, 2009 at 03:31:15PM +0100, Hearns, John wrote:

> Are you happy with the IPMI on Supermicro? I've been unable to
> personally test KVM + remote media over IPMI from them,

I have 200+ nodes of Supermicro IPMI, and its suckage is lower than the average I've seen over the years. I've had a couple of units replaced because they were misbehaving (some weird configuration problem), I've had to update the firmware once (which was easy) to fix some bugs, and I've only had one mysteriously lock up. That's over more than a year.

I'm also a fan of a dedicated Ethernet for IPMI.

-- greg

From beat at 0x1b.ch Wed Sep 30 23:32:29 2009
From: beat at 0x1b.ch (Beat Rubischon)
Date: Thu, 01 Oct 2009 08:32:29 +0200
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
Message-ID:

Hi Mark!

Quoting (30.09.09 21:13):

>> I've already seen a lot of Beowulfs where the management network is so
>> rarely used that nobody detects errors. When an emergency arrives, the
>> operators struggle over the screwed-up switches and unplugged cables.
>
> I think that's insane (no offense intended), and the same goes for the
> comment about crash carts being hard to maneuver.

You're absolutely right. And the fact that you regularly post to this list is probably a sign that you keep your server rooms clean :-)

Over the last few years I have seen more than 100 datacenters, and I remember fewer than 10 that were a good place to work. Empty boxes, rails, and cables on the floor; PS/2 keyboards paired with USB-only servers, or even no monitor at all; broken raised-floor tiles; leaking cooling pipes; daisy-chained power strips without a free socket. You remember that cables cut to length always end up too short?
Once installed, they are rarely replaced with a longer one that wouldn't have to be strung through the air. Ah yes, and the clips on RJ-45 plugs can break, even on the uplinks of large servers.

As long as a single person or a motivated team runs a datacenter, everything looks good. But when spare-time operators, or even worse several groups, share a spare room in the basement, bad things are common. A coworker summed up the situation nicely: entropy is the natural state, and it takes effort to impose order. Many people out there are not willing or able to invest that effort.

Beat

--
\|/      Beat Rubischon
( 0^0 )  http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/

From andrew.robbie at gmail.com Wed Sep 30 06:22:14 2009
From: andrew.robbie at gmail.com (Andrew Robbie (Gmail))
Date: Wed, 30 Sep 2009 23:22:14 +1000
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References:
Message-ID: <9F2D83EA-0681-46E8-9DDE-BD48ACE9B7A8@gmail.com>

On 30/09/2009, at 3:23 PM, Rahul Nabar wrote:

> Any good recommendation on a crash cart for a cluster room? My last
> cluster was small and we had the luxury of having a KVM + SIP
> connecting to each compute node.
>
> I doubt that will be feasible this time around, now that I have 200+
> nodes. How do other sys admins handle this? Just a simple crash cart?
> Or are there any other options that make life easier in the long run?

I suggest getting some KVM-over-IP boxes. You plug them into the computers you need console access to, then go somewhere quieter. If you have IPMI that works (rare, in my experience), great.

For when you absolutely have to have local access (i.e. everything is broken), make sure you have it at the right height. There is nothing worse than trying to type standing up at a KVM drawer at the wrong height (and unless all your administrators are the same height, it will always be wrong), so make sure there is always a rolly chair. I actually have a UPS on my cart to power the monitor, which saves draping power across the floor; your brain filters out the UPS beeping pretty quickly.

Andrew
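P.S. Whichever way you go, it pays to walk the management interfaces on a schedule so the dead ones surface long before you actually need them. A rough, untested sketch of the idea, in Python wrapping ipmitool; the BMC hostnames (node001-ipmi and so on) and the shared ADMIN login are made up, so adjust for your own site:

import subprocess

# hypothetical naming convention: node001-ipmi ... node200-ipmi
HOSTS = ["node%03d-ipmi" % n for n in range(1, 201)]

def bmc_ok(host, user="ADMIN", password="ADMIN"):
    # a single cheap query; a non-zero exit code means the BMC, its NIC,
    # or the management network path to it is in trouble
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, "chassis", "power", "status"]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    return p.returncode == 0

dead = [h for h in HOSTS if not bmc_ok(h)]
if dead:
    # mail it, page it, or feed it to your monitoring system; the point is
    # that the management net gets exercised daily, not only in emergencies
    print("unresponsive BMCs: " + ", ".join(dead))

Run it from cron once a day and the "so rarely used that nobody notices it is broken" failure mode discussed earlier in the thread mostly goes away.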