From lindahl at pbm.com Tue Sep 1 08:35:26 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 1 Sep 2009 08:35:26 -0700 Subject: [Beowulf] petabyte for $117k Message-ID: <20090901153526.GC4682@bx9.net> http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ Kinda neat -- how does the price compare to the various 48-drive systems available? -- g From laytonjb at att.net Tue Sep 1 08:52:58 2009 From: laytonjb at att.net (Jeff Layton) Date: Tue, 1 Sep 2009 08:52:58 -0700 (PDT) Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <63149.9536.qm@web80708.mail.mud.yahoo.com> I saw that as well (storagemojo blog). Looks interesting, but I need to read the pdf since there are some pieces I'm missing. Cool concept. There are some others like it as well - low-performance storage but lots of capacity ("cheap and deep"). I think it can make a lot of sense in many situations (just my 2 cents). Jeff From rpnabar at gmail.com Tue Sep 1 09:03:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 1 Sep 2009 11:03:52 -0500 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract Message-ID: We are planning a cluster expansion that is quite a bit larger than the ones I've handled previously. I was wondering about the vendor terms and whether there is anything specific that would be good to request from them. In the past we've stuck to standard vendor contracts; something like: "1 year warranty; 2 year extended warranty. Next Business Day on site." But, in view of the larger size of the current order, I wanted to be sure I covered all my bases. Are there any other terms that people try to add or negotiate into the contract? We assemble our own cluster in-house, so there is no "performance guarantee" etc. from the vendor side on the OS, packages, etc. In the past I've heard rumors of other arrangements: e.g. having the vendor stock some spares on-site, delayed incremental payment terms subject to conditions on performance, or pre-certifying a person on our side to bypass the routine helpdesk procedures for warranties. Are such non-standard modifications common? Anybody want to jog my memory about items that I might want to add? Of course, the vendor may not actually agree to them all, but I'm just looking for items to put on the discussion table.
-- Rahul From eugen at leitl.org Tue Sep 1 09:10:29 2009 From: eugen at leitl.org (Eugen Leitl) Date: Tue, 1 Sep 2009 18:10:29 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <20090901161029.GM4508@leitl.org> On Tue, Sep 01, 2009 at 08:35:26AM -0700, Greg Lindahl wrote: > http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ > > Kinda neat -- how does the price compare to the various 48-drive > systems available? "Seagate ST31500341AS 1.5TB Barracuda 7200.11 SATA 3Gb/s 3.5″" Aargh! Those should definitely be replaced with 2 TByte WD RE4 drives. Today I've built a 32 TByte raw storage Supermicro box with X8DDAi (dual-socket Nehalem, 24 GByte RAM, IPMI), two LSI SAS3081E-R, and OpenSolaris sees all (WD2002FYPS) drives so far (the board refuses to boot from DVD when more than 12 drives are in, though, probably due to some BIOS brain damage, so you have to manually build a raidz-2 with all 16 drives in it once Solaris has booted up). The drives are about 3170 EUR sans VAT total for all 16, the box itself around 3000 EUR sans VAT. I presume Linux with RAID 6 would work too (haven't checked yet), and if you need more you can use a cluster FS. Maybe not as cheap as a Backblaze, but off-the-shelf (BTO) and you get what you pay for. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From landman at scalableinformatics.com Tue Sep 1 09:23:56 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 01 Sep 2009 12:23:56 -0400 Subject: [Beowulf] petabyte for $117k In-Reply-To: <63149.9536.qm@web80708.mail.mud.yahoo.com> References: <20090901153526.GC4682@bx9.net> <63149.9536.qm@web80708.mail.mud.yahoo.com> Message-ID: <4A9D4A9C.6010300@scalableinformatics.com> Jeff Layton wrote: > I saw that as well (storagemojo blog). Looks interesting, but I need to > read the pdf since there are some pieces I'm missing. > > Cool concept. There are some others like it as well - low-performance > storage but lots of capacity ("cheap and deep"). I think it can make > a lot of sense in many situations (just my 2 cents). Cool! We've been doing something like this (concept) for a while with Delta-V (http://scalableinformatics.com/delta-v), though we add higher performance layers atop this, and a number of automation bits, not to mention iSCSI, SRP, AoE, iSER, ..., NFS/SMB/... They go the minimal-price route on everything: desktop drives, non-ECC memory, desktop motherboard, etc. Low performance, but probably good for cold data of large size. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From bill at cse.ucdavis.edu Tue Sep 1 16:28:10 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue, 01 Sep 2009 16:28:10 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090901153526.GC4682@bx9.net> References: <20090901153526.GC4682@bx9.net> Message-ID: <4A9DAE0A.1010500@cse.ucdavis.edu> Greg Lindahl wrote: > http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/ > > Kinda neat -- how does the price compare to the various 48-drive > systems available?
I'm very curious to hear how they are in production. I've had vibration of large sets of drives basically render the consumer drives useless. Timeouts, highly variable performance, drives constantly dropping out of raids. It became especially fun when the heavy I/O of a rebuild knocks additional drives out of the array. I'd also worry that running the consumer drives well out of spec (both in duty cycle and vibration) might significantly shorten their lives. Are the 7200.11 1.5TB seagate's particularly vibration resistant? Maybe those $0.23 nylon standoffs work better than I'd expect. From eugen at leitl.org Wed Sep 2 01:23:46 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 2 Sep 2009 10:23:46 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9DAE0A.1010500@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> Message-ID: <20090902082346.GN4508@leitl.org> On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > I'm very curious to hear how they are in production. I've had vibration of My thoughts exactly. > large sets of drives basically render the consumer drives useless. Timeouts, > highly variable performance, drives constantly dropping out of raids. It > became especially fun when the heavy I/O of a rebuild knocks additional drives > out of the array. Also my experience down to a T. > I'd also worry that running the consumer drives well out of spec (both in duty > cycle and vibration) might significantly shorten their lives. > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? No. They're awful, as the entire 7200.11 line (I've had failures in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably small drive populations). > Maybe those $0.23 nylon standoffs work better than I'd expect. -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From bill at cse.ucdavis.edu Wed Sep 2 02:10:26 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 02 Sep 2009 02:10:26 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <4A9E3682.4050006@cse.ucdavis.edu> Eugen Leitl wrote: > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > >> I'm very curious to hear how they are in production. I've had vibration of > > My thoughts exactly. The lid screws down to apply pressure to a piece of foam. Foam presses down on 45 drives. 5 drives (1.4 lb each) sit on each port multipliers. 6 nylon mounts support each multiplier supporting 7 pounds of drives. Seems like the damping from the nylon mounts would be minimal under that much pressure. So on the bottom of the case you have 63 pounds of drives, a significant fraction of which is rotating mass. I wonder if the port multipliers are really designed to actually support the drives. Doesn't seem like the SATA power/data connects I've seen are designed for that. Not to mention it's hard to imagine the typical flexible thin sheet metal lid applying even pressure to 45 drives through the foam. Seems like most of the pressure would be on the drives on the outside edge, leaving the inside drives relatively undamped. My experience is that nylon mounts don't help much. 
Sure poor manufacturing tolerances, and near zero load/tension often prevent things like fans from tightly coupling with a chassis. But to decouple drive vibration from a chassis seems to require something much more aggressive. Something like a very soft/gooey rubber with a fair bit of travel and give under minimal pressure (4-6 ounces). >> large sets of drives basically render the consumer drives useless. Timeouts, >> highly variable performance, drives constantly dropping out of raids. It >> became especially fun when the heavy I/O of a rebuild knocks additional drives >> out of the array. > > Also my experience down to a T. Strangely their design works out to 0.11 per GB. I tweaked their design to my liking. I upgraded to the $200 baracuda 2TB drive, 6GB of DDR3-1333 ECC memory (from 4GB ddr2), ECC capable motherboard (with dual gigE), and a Nehalem based xeon (ECC capable). The result was $0.13 per GB. Seems like a rather worthwhile investment from a reliability perspective, let alone performance. Anyone familiar with what the sun thumper does to minimize vibration? >> I'd also worry that running the consumer drives well out of spec (both in duty >> cycle and vibration) might significantly shorten their lives. >> >> Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). A rather scary 26% of over 1500 reviews on newegg give that seagate 1 star out of 5. The consumer 1TB WD drive has 1100 reviews and 74% are 5 stars and only 8% are 1 star. From mm at yuhu.biz Wed Sep 2 02:44:12 2009 From: mm at yuhu.biz (Marian Marinov) Date: Wed, 2 Sep 2009 12:44:12 +0300 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <200909021244.12736.mm@yuhu.biz> On Wednesday 02 September 2009 11:23:46 Eugen Leitl wrote: > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > > I'm very curious to hear how they are in production. I've had vibration > > of > > My thoughts exactly. > > > large sets of drives basically render the consumer drives useless. > > Timeouts, highly variable performance, drives constantly dropping out of > > raids. It became especially fun when the heavy I/O of a rebuild knocks > > additional drives out of the array. > > Also my experience down to a T. > > > I'd also worry that running the consumer drives well out of spec (both in > > duty cycle and vibration) might significantly shorten their lives. > > > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). I have the same experience with those drives. Not really reliable. > > > Maybe those $0.23 nylon standoffs work better than I'd expect. 
-- Best regards, Marian Marinov From carsten.aulbert at aei.mpg.de Wed Sep 2 02:50:56 2009 From: carsten.aulbert at aei.mpg.de (Carsten Aulbert) Date: Wed, 2 Sep 2009 11:50:56 +0200 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9E3682.4050006@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <20090902082346.GN4508@leitl.org> <4A9E3682.4050006@cse.ucdavis.edu> Message-ID: <200909021150.56529.carsten.aulbert@aei.mpg.de> On Wednesday 02 September 2009 11:10:26 Bill Broadley wrote: > > Anyone familiar with what the sun thumper does to minimize vibration? Each disk is contained in a cage and this one is secured per slot. Pretty standard layout, but then I've never really checked if there were vibrational "hot" spots in these boxes. But there is not much rubber inside the box IIRC. Carsten From landman at scalableinformatics.com Wed Sep 2 04:28:24 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 02 Sep 2009 07:28:24 -0400 Subject: [Beowulf] petabyte for $117k In-Reply-To: <4A9E3682.4050006@cse.ucdavis.edu> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> <4A9E3682.4050006@cse.ucdavis.edu> Message-ID: <4A9E56D8.8070103@scalableinformatics.com> Bill Broadley wrote: > The lid screws down to apply pressure to a piece of foam. Foam presses down > on 45 drives. 5 drives (1.4 lb each) sit on each port multipliers. 6 nylon > mounts support each multiplier supporting 7 pounds of drives. Seems like the > damping from the nylon mounts would be minimal under that much pressure. So > on the bottom of the case you have 63 pounds of drives, a significant fraction > of which is rotating mass. ... Which is being driven at multiple points by a 120Hz signal (7200RPM). It seems like a basic physics calculation to get the resulting eigen-modes and eigen-frequencies. Looking at the design, I was concerned about the nylon standoffs (high frequency coupling, including octaves of 120Hz) coupling enough vibration into the units. > I wonder if the port multipliers are really designed to actually support the > drives. Doesn't seem like the SATA power/data connects I've seen are designed > for that. Technically, they should not be used for structural support. For supporting cables? Sure. > Not to mention it's hard to imagine the typical flexible thin sheet metal lid > applying even pressure to 45 drives through the foam. Seems like most of the > pressure would be on the drives on the outside edge, leaving the inside drives > relatively undamped. > > My experience is that nylon mounts don't help much. Sure poor manufacturing > tolerances, and near zero load/tension often prevent things like fans from > tightly coupling with a chassis. But to decouple drive vibration from a > chassis seems to require something much more aggressive. Something like a > very soft/gooey rubber with a fair bit of travel and give under minimal > pressure (4-6 ounces). Yeah. Looking at this unit, the big issue looks to be vibration. In which case, you probably want better (more vibration tolerant) drives than the ones spec'ed. And a better mounting design. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From amjad11 at gmail.com Wed Sep 2 05:14:15 2009 From: amjad11 at gmail.com (amjad ali) Date: Wed, 2 Sep 2009 17:14:15 +0500 Subject: [Beowulf] CPU shifts?? and time problems Message-ID: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Hi All, I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on ROCKS-5 with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel on it. Please reply following: 1) When I run my serial code on the dual-core head node (or parallel code with -np 1); it gives results in about 2 minutes. What I observe is that "System Monitor" application show that some times CPU1 become busy 80+% and CPU2 around 10% busy. After some time CPU1 gets share around 10% busy while the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue till end. Why this is so? Does this busy-ness shifts/swaping harms performance/speed? 2) When I run my parallel code with -np 2 on the dual-core headnode only; it gives results in about 1 minute. What I observe is that "System Monitor" application show that all the time CPU1 and CPU2 remain busy 100%. 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core headnode only; it gives results in about 2 and 3.20 minutes respectively. What I observe is that "System Monitor" application show that all the time CPU1 and CPU2 remain busy 100%. 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I observe is that "System Monitor" application show CPU usage fluctuations somewhat as in point number 1 above (CPU1 remains dominant busy most of the time), in case of -np 4. Does this means that an MPI-process is shifting to different cores/cpus/nodes? Does these shiftings harm performance/speed? 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare to -np 2 on the headnode? Obviously its due to communication overhead! but how to get better performance--lesser run time? My code is not too complicated only 2 values are sent and 2 values are received by each process after each stage. Regards, Amjad Ali. -------------- next part -------------- An HTML attachment was scrubbed... URL: From hahn at mcmaster.ca Wed Sep 2 06:57:38 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 2 Sep 2009 09:57:38 -0400 (EDT) Subject: [Beowulf] CPU shifts?? and time problems In-Reply-To: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> References: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Message-ID: On Wed, 2 Sep 2009, amjad ali wrote: > Hi All, > I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on ROCKS-5 > with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel > on it. > Please reply following: > > 1) When I run my serial code on the dual-core head node (or parallel code > with -np 1); it gives results in about 2 minutes. What I observe is that > "System Monitor" application show that some times CPU1 become busy 80+% and > CPU2 around 10% busy. After some time CPU1 gets share around 10% busy while > the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue > till end. Why this is so? Does this busy-ness shifts/swaping harms > performance/speed? the kernel decides where to run processes based on demand. 
if the machine were otherwise idle, your process would stay on the same CPU. depending on the particular kernel release, the kernel uses various heuristics to decide how much to "resist" moving the process among cpus. the cost of moving among cpus depends entirely on how much your code depends on the resources tied to one cpu or the other. for instance, if your code has a very small memory footprint, moving will have only trivial cost. if your process has a larger working set size, but fits in onchip cache, it may be relatively expensive to move to a different processor in the system that doesn't share cache. consider a 6M L3 in a 2-socket system, for instance: the inter-socket bandwidth will be approximately memory speed, which on a core2 system is something like 6 GB/s. so migration will incur about a 1ms overhead (possibly somewhat hidden by concurrency.) in your case (if I have the processor spec right), you have 2 cores sharing a single 4M L2. L1 cache is unshared, but trivial in size, so migration cost should be considered near-zero. the numactl command lets you bind a cpu to a processor if you wish. this is normally valuable on systems with more complex topologies, such as combinations of shared and unshared caches, especially when divided over multiple sockets, and with NUMA memory (such as opterons and nehalems.) > 2) When I run my parallel code with -np 2 on the dual-core headnode only; > it gives results in about 1 minute. What I observe is that "System Monitor" > application show that all the time CPU1 and CPU2 remain busy 100%. no problem there. normally, though, it's best to _not_ run extraneous processes, and instead only look at the elapsed time that the job takes to run. that is the metric that you should care about. > 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core > headnode only; it gives results in about 2 and 3.20 minutes respectively. > What I observe is that "System Monitor" application show that all the time > CPU1 and CPU2 remain busy 100%. sure. with 4 cpus, you're overloading the cpus, but they timeslice fairly efficiently, so you don't lose. once you get to 8 cpus, you lose because the overcommitted processes start interfering (probably their working set is blowing the L2 cache.) > 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 > cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I well, then I think it's a bit hyperbolic to call it a parallel code ;) seriously, all you've learned here is that your interconnect is causing your code to not scale. the problem could be your code or the interconnect. > observe is that "System Monitor" application show CPU usage fluctuations > somewhat as in point number 1 above (CPU1 remains dominant busy most of the > time), in case of -np 4. Does this means that an MPI-process is shifting to > different cores/cpus/nodes? Does these shiftings harm performance/speed? MPI does not shift anything. the kernel may rebalance runnable processes within a single node, but not across nodes. it's difficult to tell how much your monitoring is harming the calculation or perturbing the load-balance. > 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare to > -np 2 on the headnode? Obviously its due to communication overhead! but how > to get better performance--lesser run time? My code is not too complicated > only 2 values are sent and 2 values are received by each process after each > stage. then do more work between sends and receives. 
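for a 1-D halo exchange the usual shape of that overlap is roughly the sketch below. this is illustrative only -- the array name u, the extent n, the 3-point stencil and the neighbour ranks are assumptions for the example, not your actual code:

   ! sketch: overlap interior work with the halo exchange.
   ! u(0:n+1) carries one ghost cell per side; left/right are the
   ! neighbour ranks (MPI_PROC_NULL at the ends of the domain).
   SUBROUTINE halo_step(u, n, left, right)
     USE mpi
     IMPLICIT NONE
     INTEGER, INTENT(IN) :: n, left, right
     REAL(KIND=8), INTENT(INOUT) :: u(0:n+1)
     INTEGER :: req(4), ierr

     ! post all four transfers up front
     CALL MPI_IRECV(u(0),   1, MPI_REAL8, left,  10, MPI_COMM_WORLD, req(1), ierr)
     CALL MPI_IRECV(u(n+1), 1, MPI_REAL8, right, 20, MPI_COMM_WORLD, req(2), ierr)
     CALL MPI_ISEND(u(1),   1, MPI_REAL8, left,  20, MPI_COMM_WORLD, req(3), ierr)
     CALL MPI_ISEND(u(n),   1, MPI_REAL8, right, 10, MPI_COMM_WORLD, req(4), ierr)

     ! ... update the interior points 2..n-1 here: they need no ghost
     ! data, so this arithmetic runs while the messages are in flight ...

     CALL MPI_WAITALL(4, req, MPI_STATUSES_IGNORE, ierr)

     ! ... only now touch points 1 and n, which use u(0) and u(n+1) ...
   END SUBROUTINE halo_step

the more arithmetic you can put between the posts and the waitall, the less the gigabit latency shows up in the elapsed time.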
hard to say without knowing exactly what the communication pattern is. I think you should first validate your cluster to see that the Gb is running as fast as expected. actually, that everything is running right. that said, Gb is almost not a cluster interconnect at all, since it's so much slower than the main competitors (IB mostly, to some extent 10GE). fatter nodes (dual-socket quad-core, for instance) would at least decrease the effect of slow interconnect. you might also try instaling openMX, which is an ethernet protocol optimized for MPI (rather than your current MPI which is presumably layered on top of the usual TCP stack, which is optimized for wide-area streaming transfers.) heck, you can probably obtain some speedup by tweaking your coalesce settings via ethtool. From kilian.cavalotti.work at gmail.com Wed Sep 2 08:07:14 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Wed, 2 Sep 2009 17:07:14 +0200 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract In-Reply-To: References: Message-ID: Hi Rahul, On Tue, Sep 1, 2009 at 6:03 PM, Rahul Nabar wrote: > In the past we've stuck to standard vendor contracts; something like: > "1 year warranty; 2 year extended warranty. Next Business Day on > site." You could also consider H+4 on-site intervention for critical parts, like switches, master nodes, or whatever piece of hardware your whole cluster operation depends on. > In the past I've heard rumors of other arrangements: e.g. having the > vendor stock some spares on-site, This one is pretty common, I think. And a very good idea generally speaking. You can ask for a quote including a couple hard drives of each type, a few NICs/HBAs and memory DIMMs. You'll probably pay for those spares anyway, but having them handy will be a nice time saver in case of an emergency, rather than having to wait for the delivery of a replacement part. As long as you make sure to resplenish your spares stock as it is being used: you call support to have them ship a replacement, as you would normally do, although you already have the part. Cheers, -- Kilian From coutinho at dcc.ufmg.br Wed Sep 2 10:43:46 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 2 Sep 2009 14:43:46 -0300 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902082346.GN4508@leitl.org> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: 2009/9/2 Eugen Leitl > On Tue, Sep 01, 2009 at 04:28:10PM -0700, Bill Broadley wrote: > > > I'm very curious to hear how they are in production. I've had vibration > of > > My thoughts exactly. > > > large sets of drives basically render the consumer drives useless. > Timeouts, > > highly variable performance, drives constantly dropping out of raids. It > > became especially fun when the heavy I/O of a rebuild knocks additional > drives > > out of the array. > > Also my experience down to a T. > > > I'd also worry that running the consumer drives well out of spec (both in > duty > > cycle and vibration) might significantly shorten their lives. > > > > Are the 7200.11 1.5TB seagate's particularly vibration resistant? > > No. They're awful, as the entire 7200.11 line (I've had failures > in 750 GByte, 1 TByte, 1.5 TByte, everywhere, even with reasonably > small drive populations). > > According to this site, the main difference between Seagate desktop and ES series is that the latter are more vibration resistant. 
http://techreport.com/articles.x/10748 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Wed Sep 2 13:02:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 2 Sep 2009 13:02:57 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> Message-ID: <20090902200257.GF7504@bx9.net> On Wed, Sep 02, 2009 at 02:43:46PM -0300, Bruno Coutinho wrote: > According to this site, the main difference between Seagate desktop and ES > series is that the latter are more vibration resistant. > http://techreport.com/articles.x/10748 This is interesting -- a non-firmware difference between normal and "enterprise" disks. The Barracuda.ES2 datasheet confirms the 12.5 rad/sec^2 number, but the 5.5 rad/sec^2 number is much harder to find; this doc has it: http://www.seagate.com/docs/pdf/whitepaper/mb578_7200.pdf but you'll have to look at it cached. As for people's vibrations comments: they own a bunch of them and they work... but that is only a single point of evidence and not a history of working with a variety of disks models over time. The guy said he could write a whole post about vibration; I think it would be very interesting. -- greg From bill at cse.ucdavis.edu Wed Sep 2 13:28:18 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 02 Sep 2009 13:28:18 -0700 Subject: [Beowulf] petabyte for $117k In-Reply-To: <20090902200257.GF7504@bx9.net> References: <20090901153526.GC4682@bx9.net> <4A9DAE0A.1010500@cse.ucdavis.edu> <20090902082346.GN4508@leitl.org> <20090902200257.GF7504@bx9.net> Message-ID: <4A9ED562.8080609@cse.ucdavis.edu> Greg Lindahl wrote: > As for people's vibrations comments: they own a bunch of them and they > work... For now, I've seen similar setups last 6-12 months before a drive drops, then a rebuild triggers drop #2. > but that is only a single point of evidence and not a history > of working with a variety of disks models over time. The guy said he > could write a whole post about vibration; I think it would be very > interesting. Indeed, very. If they were significantly cheaper than a better design I could see the justification. But for $0.11 vs $0.13 per GB I don't see it. Certainly as a potential customer for N copies of my data I'd certainly rather pay $0.13 + overhead for reliable (ecc + raid edition drives) storage then $0.11 + overhead for unreliable storage (no ecc and consumer drives) for my precious bits. It's especially scary since they don't seem to have any replication, or at least that replication is incompatible with their statement "In rough terms, every time one of our customers buys a hard drive, Backblaze needs another hard drive." From rpnabar at gmail.com Wed Sep 2 14:17:36 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 16:17:36 -0500 Subject: [Beowulf] Vendor terms and conditions for a typical Beouwulf expansion contract In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 10:07 AM, Kilian CAVALOTTI wrote: > > You could also consider H+4 on-site intervention for critical parts, > like switches, master nodes, or whatever piece of hardware your whole > cluster operation depends on. Good idea! I will do that for some of the critical, non-replicated items. > This one is pretty common, I think. And a very good idea generally > speaking. You can ask for a quote including a couple hard drives of > each type, a few NICs/HBAs and memory DIMMs. 
You'll probably pay for > those spares anyway, but having them handy will be a nice time saver > in case of an emergency, rather than having to wait for the delivery > of a replacement part. As long as you make sure to resplenish your > spares stock as it is being used: you call support to have them ship a > replacement, as you would normally do, although you already have the > part. Thanks for those comments. Will help me get a better setup in the contract. -- Rahul From rpnabar at gmail.com Wed Sep 2 14:25:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 16:25:00 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes Message-ID: What are good choices for a switch in a Beouwulf setup currently? The last time we went in for a Dell PowerConnect and later realized that this was pretty basic. I only have gigabit on the compute nodes. So no Infiniband / Myrinet etc. issues. The point is that I will have about 300 compute nodes. Should I go for one large switch or several stacked ones? In the past I had resorted to just interconnecting two or more 48 port switches with multiple ethernet cables but this is quite crude I believe. Of course, the smaller switches tend to be cheaper so in the past it was making more sense to hook them up together even at the price of taking a performance hit. The main traffic sources are MPI and NFS. (NFS is quite inefficient so this time around I might play with another FS but still something that allows global cross mounts from ~300 compute nodes) There is a variety of codes we run; some latency sensitive and others bandwidth sensitive. Finally, what are the switch parameters I ought to be comparing. If 300 eth ports are chattering at once do I look at the max rated switching capacity or something similar? -- Rahul From hahn at mcmaster.ca Wed Sep 2 15:41:07 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 2 Sep 2009 18:41:07 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: > allows global cross mounts from ~300 compute nodes) There is a variety > of codes we run; some latency sensitive and others bandwidth > sensitive. if you're sensitive either way, you're going to be unhappy with Gb. IMO, you'd be best to configure your scheduler to never spread an MPI job across switches, and then just match the backbone to the aggregate IO bandwidth your NFS storage can support. something like 10G uplinks from 48pt switches would probably work well. From rpnabar at gmail.com Wed Sep 2 20:29:07 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 2 Sep 2009 22:29:07 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 5:41 PM, Mark Hahn wrote: >> allows global cross mounts from ~300 compute nodes) There is a variety >> of codes we run; some latency sensitive and others bandwidth >> sensitive. > > if you're sensitive either way, you're going to be unhappy with Gb. I am still testing sensitivity but I suspect I am sensitive either way. > IMO, you'd be best to configure your scheduler to never spread an MPI > job across switches, Good idea. I was thinking about it. Might need to tweak my PBS scheduler. >and then just match the backbone to the aggregate IO > bandwidth your NFS storage can support. That brings me to another important question. Any hints on speccing the head-node? 
Especially the kind of storage I put in on the head node. I need around 1 Terabyte of storage. In the past I've uses RAID5+SAS in the server. Mostly for running jobs that access their I/O via files stored centrally. For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I boost the RAM up? Or any other comments. It is tricky to spec the central node. Or is it more advisable to go for storage-box external to the server for NFS-stores and then figure out a fast way of connecting it to the server. Fiber perhaps? -- Rahul From amjad11 at gmail.com Wed Sep 2 20:33:38 2009 From: amjad11 at gmail.com (amjad ali) Date: Thu, 3 Sep 2009 08:33:38 +0500 Subject: [Beowulf] CPU shifts?? and time problems In-Reply-To: References: <428810f20909020514k66fe7905i80a0d17867b2d715@mail.gmail.com> Message-ID: <428810f20909022033t24c3c421na25d13ec1b2c0697@mail.gmail.com> Hi, please see below On Wed, Sep 2, 2009 at 6:57 PM, Mark Hahn wrote: > On Wed, 2 Sep 2009, amjad ali wrote: > > Hi All, >> I have 4-Nodes ( 4 CPUs Xeon3085, total 8 cores) Beowulf cluster on >> ROCKS-5 >> with GiG-Ethernet. I tested runs of a 1D CFD code both serial and parallel >> on it. >> Please reply following: >> >> 1) When I run my serial code on the dual-core head node (or parallel code >> with -np 1); it gives results in about 2 minutes. What I observe is that >> "System Monitor" application show that some times CPU1 become busy 80+% >> and >> CPU2 around 10% busy. After some time CPU1 gets share around 10% busy >> while >> the CPU2 becomes 80+% busy. Such fluctuations/swap-of-busy-ness continue >> till end. Why this is so? Does this busy-ness shifts/swaping harms >> performance/speed? >> > > the kernel decides where to run processes based on demand. if the machine > were otherwise idle, your process would stay on the same CPU. depending on > the particular kernel release, the kernel uses various heuristics to decide > how much to "resist" moving the process among cpus. > > the cost of moving among cpus depends entirely on how much your code > depends > on the resources tied to one cpu or the other. for instance, if your code > has a very small memory footprint, moving will have only trivial cost. > if your process has a larger working set size, but fits in onchip cache, > it may be relatively expensive to move to a different processor in the > system that doesn't share cache. consider a 6M L3 in a 2-socket system, > for instance: the inter-socket bandwidth will be approximately memory > speed, > which on a core2 system is something like 6 GB/s. so migration will incur > about a 1ms overhead (possibly somewhat hidden by concurrency.) > > in your case (if I have the processor spec right), you have 2 cores > sharing a single 4M L2. L1 cache is unshared, but trivial in size, > so migration cost should be considered near-zero. > > the numactl command lets you bind a cpu to a processor if you wish. > this is normally valuable on systems with more complex topologies, > such as combinations of shared and unshared caches, especially when divided > over multiple sockets, and with NUMA memory (such as opterons and nehalems.) > > 2) When I run my parallel code with -np 2 on the dual-core headnode only; >> it gives results in about 1 minute. What I observe is that "System >> Monitor" >> application show that all the time CPU1 and CPU2 remain busy 100%. >> > > no problem there. normally, though, it's best to _not_ run extraneous > processes, and instead only look at the elapsed time that the job takes to > run. 
that is the metric that you should care about. > > 3) When I run my parallel code with "-np 4" and "-np 8" on the dual-core >> headnode only; it gives results in about 2 and 3.20 minutes respectively. >> What I observe is that "System Monitor" application show that all the time >> CPU1 and CPU2 remain busy 100%. >> > > sure. with 4 cpus, you're overloading the cpus, but they timeslice fairly > efficiently, so you don't lose. once you get to 8 cpus, you lose because > the overcommitted processes start interfering (probably their working set > is blowing the L2 cache.) > > 4) When I run my parallel code with "-np 4" and "-np 8" on the 4-node (8 >> cores) cluster; it gives results in about 9 (NINE) and 12 minutes. What I >> > > well, then I think it's a bit hyperbolic to call it a parallel code ;) > seriously, all you've learned here is that your interconnect is causing > your code to not scale. the problem could be your code or the > interconnect. > > observe is that "System Monitor" application show CPU usage fluctuations >> somewhat as in point number 1 above (CPU1 remains dominant busy most of >> the >> time), in case of -np 4. Does this means that an MPI-process is shifting >> to >> different cores/cpus/nodes? Does these shiftings harm performance/speed? >> > > MPI does not shift anything. the kernel may rebalance runnable processes > within a single node, but not across nodes. it's difficult to tell how much > your monitoring is harming the calculation or perturbing the load-balance. > > 5) Why "-np 4" and "-np 8" on cluster is taking too much time as compare >> to >> -np 2 on the headnode? Obviously its due to communication overhead! but >> how >> to get better performance--lesser run time? My code is not too complicated >> only 2 values are sent and 2 values are received by each process after >> each >> stage. >> > > then do more work between sends and receives. hard to say without knowing > exactly what the communication pattern is. 
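One cheap way I can check where the time goes is to accumulate MPI_WTIME() separately around the communication calls and around the local work, and reduce the two totals at the end of the run. The helper below is only a sketch -- t_comm and t_comp are per-rank accumulators that would have to be added, not variables from the existing code:

   ! sketch: report the worst-case split between communication and
   ! computation, given two accumulators filled with MPI_WTIME() deltas
   SUBROUTINE report_times(t_comm, t_comp)
     USE mpi
     IMPLICIT NONE
     REAL(KIND=8), INTENT(IN) :: t_comm, t_comp
     REAL(KIND=8) :: t_loc(2), t_max(2)
     INTEGER :: myrank, ierr

     t_loc(1) = t_comm
     t_loc(2) = t_comp
     CALL MPI_REDUCE(t_loc, t_max, 2, MPI_REAL8, MPI_MAX, 0, MPI_COMM_WORLD, ierr)
     CALL MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
     IF (myrank == 0) PRINT *, 'max comm time =', t_max(1), ' max compute time =', t_max(2)
   END SUBROUTINE report_times

If the communication total dominates on the cluster but not on the single node, that would confirm the interconnect as the bottleneck.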
Here is my subroutine:

IF (myrank /= p-1) CALL MPI_ISEND(u_local(Np,E),1,MPI_REAL8,myrank+1,55,MPI_COMM_WORLD,right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(u_local(1,B),1,MPI_REAL8,myrank-1,66,MPI_COMM_WORLD,left(1), ierr)
IF (myrank /= 0)   CALL MPI_IRECV(u_left_exterior,1, MPI_REAL8, myrank-1, 55, MPI_COMM_WORLD,right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(u_right_exterior,1, MPI_REAL8, myrank+1, 66, MPI_COMM_WORLD,left(2), ierr)

u0_local=RESHAPE(u_local,(/Np*K_local/))
du0_local=RESHAPE(du_local,(/Nfp*Nfaces*K_local/))
q_local = rx_local*MATMUL(Dr,u_local)

DO I = shift*Nfp*Nfaces+1+1 , shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1
IF (myrank == 0) du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2), status, ierr)

IF (myrank /= 0) THEN
   du0_local(I) = (u0_local(vmapM_local(I))-u_left_exterior)/2.0_8
ENDIF
I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank == p-1) du0_local(I) = (u0_local(vmapM_local(I))-u0_local(vmapP_local(I)))/2.0_8
IF (myrank /= p-1) THEN
   du0_local(I) = (u0_local(vmapM_local(I))-u_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   du0_local(mapI) = 0.0_8
IF (myrank == p-1) du0_local(mapO) = 0.0_8

du_local = RESHAPE(du0_local,(/Nfp*Nfaces,K_local/))
q_local = q_local-MATMUL(LIFT,Fscale_local*(nx_local*du_local))

IF (myrank /= p-1) CALL MPI_ISEND(q_local(Np,E),1,MPI_REAL8,myrank+1,551,MPI_COMM_WORLD,right(1), ierr)
IF (myrank /= 0)   CALL MPI_ISEND(q_local(1,B),1,MPI_REAL8,myrank-1,661,MPI_COMM_WORLD,left(1), ierr)
IF (myrank /= 0)   CALL MPI_IRECV(q_left_exterior,1, MPI_REAL8, myrank-1, 551, MPI_COMM_WORLD,right(2), ierr)
IF (myrank /= p-1) CALL MPI_IRECV(q_right_exterior,1, MPI_REAL8, myrank+1, 661, MPI_COMM_WORLD,left(2), ierr)

q0_local=RESHAPE(q_local,(/Np*K_local/))
dq0_local=RESHAPE(dq_local,(/Nfp*Nfaces*K_local/))
rhsu_local= rx_local*MATMUL(Dr,q_local) + u_local*(u_local-a1)*(1.0_8-u_local) - v_local

DO I = shift*Nfp*Nfaces+1+1 , shift*Nfp*Nfaces+Nfp*Nfaces*K_local-1
   dq0_local(I) = (q0_local(vmapM_local(I))-q0_local(vmapP_local(I)))/2.0_8
ENDDO

I = shift*Nfp*Nfaces+1

IF (myrank /= p-1) CALL MPI_WAIT(right(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(left(1), status, ierr)
IF (myrank /= 0)   CALL MPI_WAIT(right(2), status, ierr)
IF (myrank /= p-1) CALL MPI_WAIT(left(2), status, ierr)

IF (myrank /= 0) THEN
   dq0_local(I) = (q0_local(vmapM_local(I))-q_left_exterior)/2.0_8
ENDIF
I = shift*Nfp*Nfaces+Nfp*Nfaces*K_local
IF (myrank /= p-1) THEN
   dq0_local(I) = (q0_local(vmapM_local(I))-q_right_exterior)/2.0_8
ENDIF

IF (myrank == 0)   dq0_local(mapI) = q0_local(vmapI)+0.225_8
IF (myrank == p-1) dq0_local(mapO) = q0_local(vmapO)

dq_local = RESHAPE(dq0_local,(/Nfp*Nfaces,K_local/))
rhsu_local= rhsu_local-MATMUL(LIFT,(Fscale_local*(nx_local*dq_local)))

END SUBROUTINE
=====================================================================
Is it sufficiently good, or are there some serious problems with the communication pattern? The arrays here are not very big, because Nfp*Nfaces = 2 and K_local <= 30.

> I think you should first validate your cluster to see that the Gb is running as fast as expected. actually, that everything is running right.
> that said, Gb is almost not a cluster interconnect at all, since it's so > much slower than the main competitors (IB mostly, to some extent 10GE). > fatter nodes (dual-socket quad-core, for instance) would at least decrease > the effect of slow interconnect. > > you might also try instaling openMX, which is an ethernet protocol > optimized for MPI (rather than your current MPI which is presumably layered > on top of the usual TCP stack, which is optimized for wide-area > streaming transfers.) heck, you can probably obtain some speedup by > tweaking your coalesce settings via ethtool. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jlb17 at duke.edu Wed Sep 2 20:54:17 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Wed, 2 Sep 2009 23:54:17 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, 2 Sep 2009 at 10:29pm, Rahul Nabar wrote > That brings me to another important question. Any hints on speccing > the head-node? Especially the kind of storage I put in on the head > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. Mostly for running jobs that access their I/O > via files stored centrally. > > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I > boost the RAM up? Or any other comments. It is tricky to spec the > central node. > > Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? Speccing storage for a 300 node cluster is a non-trivial task and is heavily dependent on your expected access patterns. Unless you anticipate vanishingly little concurrent access, you'll be very hard pressed to service a cluster that large with a basic Linux NFS server. About a year ago I had ~300 nodes pointed at a NetApp FAS3020 with 84 spindles of 10K RPM FC-AL disks. A single user could *easily* flatten the NetApp (read: 100% CPU and multi-second/minute latencies for everybody else) without even using the whole cluster. Whatever you end up with for storage, you'll need to be vigilant regarding user education. Jobs should store as much in-process data as they can on the nodes (assuming you're not running diskless nodes) and large jobs should stagger their access to the central storage as best they can. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From landman at scalableinformatics.com Wed Sep 2 21:15:41 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 03 Sep 2009 00:15:41 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: <4A9F42ED.5050600@scalableinformatics.com> Rahul Nabar wrote: > That brings me to another important question. Any hints on speccing > the head-node? Especially the kind of storage I put in on the head For a cluster of this size, divide and conquer. Head node to handle cluster admin. Create login nodes for users to access to handle builds, job submission, etc. > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. Mostly for running jobs that access their I/O > via files stored centrally. Hmmm... We don't recommend burdening the head node with storage apart for very small clusters, where it is a bit more cost effective. 
Depending upon how your nodes do IO for your jobs, this will dictate how you need your IO designed. If all nodes will do IO, then you need something that can handle *huge* transients from time to time. If one node does IO, you need just a good fast connection. Is GbE enough? How much IO are we talking about? Bad storage design can make a nice new 300 node cluster seem very slow. > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I > boost the RAM up? Or any other comments. It is tricky to spec the > central node. Head node: from a management perspective (name service, dhcp/tftp/pxe, authentication/gateway, status monitor, etc) can be relatively light weight. Login node(s): should have sufficient RAM/CPU for builds. Storage node(s): should be built with thought towards the IO patterns expected. > Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? Start with your IO patterns, your IO volume, and how many are running at once. Once you have this, move on to figuring out capacity needs, availability needs (replication, fast home vs fast scratch + slow home) Avoid worrying about the technologies you should consider until you have a better handle on how it will be used. The use cases will suggest the technologies you should consider. We are biased (given what we build, sell and support) of course. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Wed Sep 2 23:18:20 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 3 Sep 2009 02:18:20 -0400 (EDT) Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: > That brings me to another important question. Any hints on speccing > the head-node? I think you imply a single, central admin/master/head node. this is a very bad idea. first, it's generally a bad idea to have users on a fileserver. next, it's best to keep cluster-infrastructure (monitoring, management, pxe, scheduling) on a dedicated admin machine. for 300 compute nodes, it might be a good idea to provide more than one login node (for editing, compilation, etc). > Especially the kind of storage I put in on the head > node. I need around 1 Terabyte of storage. In the past I've uses > RAID5+SAS in the server. 1 TB is, I assume you know, half a disk these days (ie, trivial). for a 300-node cluster, I'd configure at least 10x and probably 100x that much. (my user community is pretty diverse, though, and with a wide range of IO habits.) > Mostly for running jobs that access their I/O > via files stored centrally. it would be wise to get some sort of estimates of the actual numbers - even the total size of all files accessed by a job and its average runtime would let you figure an average data rate. > For muscle I was thinking of a Nehalem E5520 with 16 GB RAM. Should I I don't think I'd use such a nice machine for any of fileserver, admin or login nodes. for admin, it's not needed. for login it'll be unused a lot of the time. for fileservers, you want to sweat the IO system, not the CPU or memory. > boost the RAM up? Or any other comments. It is tricky to spec the > central node. spec'ing a single one may be, but a single one is a bad idea... 
> Or is it more advisable to go for storage-box external to the server > for NFS-stores and then figure out a fast way of connecting it to the > server. Fiber perhaps? 10G (Cu or SiO2, doesn't matter) is the right choice for an otherwise-gigabit cluster. From rpnabar at gmail.com Thu Sep 3 03:59:37 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 05:59:37 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Thu, Sep 3, 2009 at 1:18 AM, Mark Hahn wrote: Thanks a lot for all the great comments, guys! > I think you imply a single, central admin/master/head node. this is a very > bad idea. first, it's generally a bad idea to have users on a fileserver. > next, it's best to keep cluster-infrastructure > (monitoring, management, pxe, scheduling) on a dedicated admin machine. > for 300 compute nodes, it might be a good idea to provide more than one > login node (for editing, compilation, etc). Absolutely. I ought to use the term "head node(s)". What I want to spec and figure out is how many central machines are warranted and how I should configure each of them differently. > > 1 TB is, I assume you know, half a disk these days (ie, trivial). > for a 300-node cluster, I'd configure at least 10x and probably 100x that > much. (my user community is pretty diverse, though, > and with a wide range of IO habits.) We have a different long-term store, so this machine is only holding running, staging and other jobs. Users are warned that data is not backed up and is subject to periodic flushing. Yet, you are right: I was being overly stingy. Bad estimate. I have a similar smaller cluster and I double-checked usage on that one just now. If I scale that up to 300 nodes I should probably be shooting for 4.5 to 5 Terabytes of storage. > > I don't think I'd use such a nice machine for any of fileserver, admin or > login nodes. for admin, it's not needed. for login it'll be unused a lot of > the time. for fileservers, you want to sweat the IO system, not the CPU or > memory. Yes, I used it for lack of knowledge of a more suitable but punier candidate. Any suggestions on a more puny machine? Besides, overspeccing the processor on the central node doesn't change my cost much relative to the entire cluster. > > 10G (Cu or SiO2, doesn't matter) is the right choice for an > otherwise-gigabit cluster. > 10G storage-node-to-switch, and alternatively 10G storage-box-to-switch, correct? -- Rahul From rpnabar at gmail.com Thu Sep 3 04:06:05 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 06:06:05 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: On Wed, Sep 2, 2009 at 10:54 PM, Joshua Baker-LePain wrote: > On Wed, 2 Sep 2009 at 10:29pm, Rahul Nabar wrote >> > Speccing storage for a 300 node cluster is a non-trivial task and is heavily > dependent on your expected access patterns. Unless you anticipate > vanishingly little concurrent access, you'll be very hard pressed to service > a cluster that large with a basic Linux NFS server. Thanks Joshua! The question is, what are my alternatives? Software: change from NFS to xxx? Hardware: go for an external NetApp storage box? Others.......? > > Whatever you end up with for storage, you'll need to be vigilant regarding > user education.
Jobs should store as much in-process data as they can on > the nodes (assuming you're not running diskless nodes) and large jobs should > stagger their access to the central storage as best they can. Nope, not diskless nodes. Nodes have a local OS and /scratch space, but user files and executable installations reside on a central NFS store. Luckily my usage patterns are such that there is no new code development on this particular cluster. I have tight control over which exact codes are running (DACAPO, VASP, GPAW). Thus, so long as I compile, wrapper-script and optimize each of the codes, the users can do no harm. -- Rahul From rpnabar at gmail.com Thu Sep 3 04:14:10 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 06:14:10 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <4A9F42ED.5050600@scalableinformatics.com> References: <4A9F42ED.5050600@scalableinformatics.com> Message-ID: On Wed, Sep 2, 2009 at 11:15 PM, Joe Landman wrote: > Rahul Nabar wrote: > For a cluster of this size, divide and conquer. Head node to handle cluster > admin. Create login nodes for users to access to handle builds, job > submission, etc. > Hmmm... We don't recommend burdening the head node with storage apart for > very small clusters, where it is a bit more cost effective. Thanks Joe! My total number of users is relatively small: ~50, with rarely more than 20 concurrently logged-in users. Of course, each user might have multiple shell sessions. So the experts would recommend three separate central nodes? Login node Management node (dhcp / schedulers etc.) Storage node Or more? > Depending upon how your nodes do IO for your jobs, this will dictate how you > need your IO designed. If all nodes will do IO, then you need something > that can handle *huge* transients from time to time. If one node does IO, > you need just a good fast connection. Is GbE enough? How much IO are we > talking about? I did my economics, and on the compute nodes I am stuck with GbE, nothing more. If this becomes a totally unworkable proposition I'll be forced to split into smaller clusters. 10GbE, Myrinet and Infiniband just do not make economic sense for us. On the central nodes, though, I can afford to have better interconnects. Should I? Of what type? -- Rahul From landman at scalableinformatics.com Thu Sep 3 05:58:14 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 03 Sep 2009 08:58:14 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> Message-ID: <4A9FBD66.1060608@scalableinformatics.com> Rahul Nabar wrote: > On Wed, Sep 2, 2009 at 11:15 PM, Joe > Landman wrote: >> Rahul Nabar wrote: > >> For a cluster of this size, divide and conquer. Head node to handle cluster >> admin. Create login nodes for users to access to handle builds, job >> submission, etc. > >> Hmmm... We don't recommend burdening the head node with storage apart for >> very small clusters, where it is a bit more cost effective. > > Thanks Joe! My total number of users is relatively small: ~50, with > rarely more than 20 concurrently logged-in users. Of course, each user > might have multiple shell sessions. > > So the experts would recommend three separate central nodes? > > Login node > Management node (dhcp / schedulers etc.) > Storage node You can add more login nodes as you need. Management nodes for the cluster stack (if any) can be fairly simple.
The storage node is a function of your IO patterns. For really large clusters, you'd separate out the scheduler and some of the other functions as well. Mark Hahn and some of the other folks on the list run some of the really large clusters out there. They have some good advice for those scaling up. > Or more? > >> Depending upon how your nodes do IO for your jobs, this will dictate how you >> need your IO designed. If all nodes will do IO, then you need something >> that can handle *huge* transients from time to time. If one node does IO, >> you need just a good fast connection. Is GbE enough? How much IO are we >> talking about? > > I did my economics and on the compute nodes I am stuck to GbE nothing > more. If this becomes a totally unworkable proposition I'll be forced > to split into smaller clusters. 10GbE, Myrinet, Infiniband just do not > make economic sense for us. On the central nodes, though, I can afford > to have better interconnects. Should I? Of what type? It might be worth asking what your targeted per node budget is. 24 port SDR IB switches are available, and relatively inexpensive. 24 port SDR PCIe cards are available and relatively inexpensive. Jeff Layton (a great resource BTW) wrote about them last year. We've used them in a number of designs. Not the rock bottom in latency, but we have customers using our storage over NFS over RDMA at 500+ MB/s with them, so its not too bad. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 3 06:14:04 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 08:14:04 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D056731@milexchmb1.mil.tagmclarengroup.com> References: <4A9F42ED.5050600@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D056731@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 3, 2009 at 6:27 AM, Hearns, John wrote: > In that case, consider that most motherboards have dual gig Ethernet > onboard. Yes. I've made sure mine do. Twin gigabit sockets. > You could at least specify that the cabling plant is put in for the > second Ethernet network I've always used both but in the past via bonding. > and aim to use that for the storage traffic (or the MPI traffic). A Would it be better to use one exclusively for MPI? I'm not sure how one goes about this yet! The separation is maintained at switches too? An MPI switch separate from the rest-of-traffic switch? > second stack of > Ethernet switches should not stretch your budget too much. True. I am very tempted to go that route. > Then on the main storage node you could put in a 10gig interface - or > indeed several 10gig > interfaces and spread the load Yes. That is exactly the sort of thing I am wanting to do. That's why I am asking around. e.g. how many could I put in a reasonable config.? It is easier to go fancy on the main node since the cost does not get a multiplier. 
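From what I've read so far, the host-level separation would be something
like this (a sketch only, assuming Open MPI and that the two onboard ports
come up as eth0 for NFS/admin and eth1 for MPI; addresses made up --
corrections welcome):

  # /etc/fstab on a compute node: reach the NFS store via the storage subnet
  10.1.1.250:/home   /home   nfs   rw,hard,intr   0 0

  # keep Open MPI's TCP point-to-point traffic on the MPI interface
  mpirun --mca btl_tcp_if_include eth1 -np 64 ./my_mpi_app

Bonding the two ports instead would buy aggregate bandwidth but no
isolation; two separate networks (and the second switch stack) would keep
a heavy NFS burst out of the same queues as the MPI traffic.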
-- Rahul From rpnabar at gmail.com Thu Sep 3 06:23:36 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 08:23:36 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <4A9FBD66.1060608@scalableinformatics.com> References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> Message-ID: On Thu, Sep 3, 2009 at 7:58 AM, Joe Landman wrote: > For really large clusters, you'd separate out the scheduler and some of the > other functions as well. ?Mark Hahn and some of the other folks on the list > run some of the really large clusters out there. ?They have some good advice > for those scaling up. Thanks Joe! I really look forward to Mark Hahn and the others guiding me. This expansion is on the larger side for me. > It might be worth asking what your targeted per node budget is. I do not have a target per node but more of a $/performance budget. And for my codes I've found that Infy etc. just don't cut it. The additional cost does not squeeze out the extra performance. I've benchmarked several chips and configs and the current winner for our codes seems to be a Intel Nehalem E5520. Less than 3000 $/node. >24 port SDR > IB switches are available, and relatively inexpensive. Is there a approximate $$$ figure someone can throw out? These numbers have been pretty hard to get. >24 port SDR PCIe > cards are available and relatively inexpensive. Ditto. Any $ figures? All my calculations boosted up the $ price of a node to a point where the performance would have to be very stellar to warrant the spending. And really, the plain-vanilla Nehalem ethernet config is not doing too badly for us yet. My main concern now is scaling. -- Rahul From gus at ldeo.columbia.edu Thu Sep 3 08:19:33 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 03 Sep 2009 11:19:33 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> Message-ID: <4A9FDE85.5020008@ldeo.columbia.edu> Rahul Nabar wrote: >> 24 port SDR >> IB switches are available, and relatively inexpensive. > > Is there a approximate $$$ figure someone can throw out? These numbers > have been pretty hard to get. > >> 24 port SDR PCIe >> cards are available and relatively inexpensive. > > Ditto. Any $ figures? > > All my calculations boosted up the $ price of a node to a point where > the performance would have to be very stellar to warrant the spending. > And really, the plain-vanilla Nehalem ethernet config is not doing too > badly for us yet. My main concern now is scaling. > Hi Rahul See these small SDR switches: http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 And SDR HCA card: http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=12 We bought DDR though, but our cluster is small, one 36-port switch only. For a 300-node cluster you need to consider optical fiber for the IB uplinks, and switches with that capability, or buy the appropriate adapters. The regular IB cables are length-challenged, most likely can only be used for node-to-switch connections. Also, for Opteron, Supermicro (and probably others) has motherboards with onboard IB adapters, on 1U dual-node chassis. I wonder if there is something similar for Nehalem. 
I don't know about your computational chemistry codes,
but for climate/oceans/atmosphere (and probably for CFD)
IB makes a real difference w.r.t. Gbit Ethernet.
For us there was no point in trading a larger number of nodes for IB.
OTOH, if your codes run mostly intra-node,
there is no advantage in buying a fast interconnect,
but I would doubt your Chem codes are happy with 8 processes per job only.

Also, with IB, you could dedicate one of your nodes' Gbit Ether ports
to I/O only, with all MPI traffic using IB.

My $0.02
Gus Correa

---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------

From rpnabar at gmail.com  Thu Sep  3 09:28:39 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 3 Sep 2009 11:28:39 -0500
Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes
In-Reply-To: <4A9FDE85.5020008@ldeo.columbia.edu>
References: <4A9F42ED.5050600@scalableinformatics.com>
	<4A9FBD66.1060608@scalableinformatics.com>
	<4A9FDE85.5020008@ldeo.columbia.edu>
Message-ID: 

On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa wrote:
> See these small SDR switches:
>
> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13
> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10
>
> And SDR HCA card:
>

Thanks Gus! This info was very useful. A 24-port switch is $2400 and
the card $125. Thus each compute node would be approximately $300 more
expensive. (How about Infiniband cables? Are those special, and how
expensive? I did google but was overwhelmed by the variety available.)

This isn't bad at all, I think. If I base it on my current node price
it would require only about a 20% performance boost to justify this
investment. I feel Infiniband could deliver that. When I had calculated it
the economics was totally off; maybe I had wrong figures.

The price-scaling seems tough though. Stacking 24-port switches might
get a bit too cumbersome for 300 servers. But when I look at
corresponding 48 or 96 port switches the per-port price seems to shoot
up. Is that typical?

> For a 300-node cluster you need to consider
> optical fiber for the IB uplinks,

You mean compute-node-to-switch and switch-to-switch connections?
Again, any $$$ figures, ballpark?

> I don't know about your computational chemistry codes,
> but for climate/oceans/atmosphere (and probably for CFD)
> IB makes a real difference w.r.t. Gbit Ethernet.

I have a hunch (just a hunch) that the computational chemistry codes
we use haven't been optimized to take full advantage of the latency
benefits etc. Some of the stuff they do is pretty bizarre and
inefficient if you look at their source codes (e.g. writing to large
I/O files all the time). I know this ought to be fixed, but that seems
a problem for another day!
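(To spell out the rough ~$300/node arithmetic above -- the cable price is
a guess on my part:

  24-port SDR switch   ~$2400  ->  ~$100 per port
  SDR HCA                           ~$125
  copper cable (guess)              ~$50
  -----------------------------------------
  premium per node                  ~$275

plus whatever spine/uplink switching a ~300-port fabric needs on top of
the leaf switches, which is presumably where the per-port price starts to
climb.)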
-- Rahul From rpnabar at gmail.com Thu Sep 3 09:43:28 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 11:43:28 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D056A7B@milexchmb1.mil.tagmclarengroup.com> References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> <4A9FDE85.5020008@ldeo.columbia.edu> <68A57CCFD4005646957BD2D18E60667B0D056A7B@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 3, 2009 at 11:41 AM, Hearns, John wrote: > > At this point you really, really should be getting a vendor or a few > vendors > in to do these pricings for you. That's part and parcel of pitching for > a > cluster of this size - you really, really should not be doing the donkey > work like this. > You should say 'we want to run codes X, Y,Z' we prefer processors > 'A,B,C' > we need 'N to M' terabytes of storage. Let your vendor work for the > money. That's true. That's how I have specced the rest of the cluster anyways. I'll go to my vendors and check what I get. -- Rahul From Shainer at mellanox.com Thu Sep 3 09:51:01 2009 From: Shainer at mellanox.com (Gilad Shainer) Date: Thu, 3 Sep 2009 09:51:01 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes References: <4A9F42ED.5050600@scalableinformatics.com><4A9FBD66.1060608@scalableinformatics.com><4A9FDE85.5020008@ldeo.columbia.edu> Message-ID: <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> > > See these small SDR switches: > > > > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct= 13 > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 > > > > And SDR HCA card: > > > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) You can find copper cables from around $30, so the $300 will include the cable too > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. You can always run your app on available users system and see the performance boost that you will be able to get. For example, you can use the center (free of charge) - http://www.hpcadvisorycouncil.com/cluster_center.php > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? It is the same as buying blades. If you get the switches fully populated, than it will be cost effective. There is a 324 port switch, which should be a good option too. > > For a 300-node cluster you need to consider > > optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? It all depends on the speed. If you are using IB SDR or DDR, copper cables will be enough. For QDR you can use passive copper up 7-8 meters, and active up to 12m, before you need to move to fiber. 
> > I don't know about your computational chemistry codes, > > but for climate/oceans/atmosphere (and probably for CFD) > > IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! On the same web site I have listed above, there are some best practices with apps performance, You can check them out and see if some of them are more relevant. From gus at ldeo.columbia.edu Thu Sep 3 10:25:01 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Thu, 03 Sep 2009 13:25:01 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <4A9F42ED.5050600@scalableinformatics.com> <4A9FBD66.1060608@scalableinformatics.com> <4A9FDE85.5020008@ldeo.columbia.edu> Message-ID: <4A9FFBED.6070300@ldeo.columbia.edu> Rahul Nabar wrote: > On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa wrote: >> See these small SDR switches: >> >> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13 >> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 >> >> And SDR HCA card: >> > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) > Hi Rahul IB cables (0.5-8m,$40-$109): http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=2 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1&idcategory=2 http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2&idcategory=2 etc ... > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. > > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. It probably will. I will defer any comments to the network pros on the list. Here is a suggestion. I would guess that if you don't intend to run the codes, say, on more than 24-36 nodes at once, you might as well not stack all the small IB switches. I.e., you could divide the cluster IB-wise into smaller units, of perhaps 36 nodes or so, with 2-3 switches serving each unit. Not sure how to handle the IB subnet(s) manager in such a configuration, but there may be ways around. This scheme may take some scheduler configuration to handle MPI job submission, but it may save you money and hardware/cabling complexity, and still let you run MPI programs with a substantial number of processes. You can still fully connect the 300 nodes through Gbit Ether, for admin and I/O purposes, stacking 48-port GigE switches. IB is a separate (set of) network(s), which I assume will be dedicated to MPI only. You may want to check the 36-port IB switches also, but IIRR they are only DDR and QDR, not SDR, and somewhat more expensive. > But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? 
> I was told the current IB switch price threshold is 36-port. Above that it gets too expensive, the cost-effective solution is stacking smaller switches. I'm just passing the information/gossip along. >> For a 300-node cluster you need to consider >> optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? > I would guess you may need optical fiber for switch-switch connections. Depending on the distance, of course, say, across two racks, if this type of connection is needed. Regular IB cables are probably able handle the node-switch links, if the switches are distributed across the racks. >> I don't know about your computational chemistry codes, >> but for climate/oceans/atmosphere (and probably for CFD) >> IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! > Not only your Chem codes. Brute force I/O is rampant here also. Some codes take pains to improve MPI communication on the domain decomposition side, with asynchronous communication, etc, then squander it all by letting everybody do I/O in unison. (Hence, keep in mind Joshua's posting about educating users and adjusting codes to do I/O gently.) I hope this helps. Gus Correa --------------------------------------------------------------------- Gustavo Correa Lamont-Doherty Earth Observatory - Columbia University Palisades, NY, 10964-8000 - USA --------------------------------------------------------------------- From rpnabar at gmail.com Thu Sep 3 15:56:43 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 3 Sep 2009 17:56:43 -0500 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: On Thu, Sep 3, 2009 at 3:16 PM, Greg Kurtzer wrote: Thanks for the comments Greg! > If you were using Perceus..... No. I've never used Perceus before and although it sounds interesting this seems like a bad time to try something new! > > The file system needs to be built to handle the load of the apps. 300 > nodes means you can go from the low end (Linux RAID and NFS) to a > higher end NFS solution, or upper end of a parallel file system or > maybe even one of each (NFS and parallel) as they solve some different > requirements. What exactly do you mean by a "parallel" file system? Something like GPFS? That's IBM proprietory though isn't it? On the other hand NFS seems pretty archaic. I've seen quite a few installations use Lustre. I am planning to play with that. Something in the OpenSource world to keep costs down. -- Rahul From skylar at cs.earlham.edu Thu Sep 3 16:25:36 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Thu, 03 Sep 2009 16:25:36 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: <4AA05070.2030900@cs.earlham.edu> Rahul Nabar wrote: > What exactly do you mean by a "parallel" file system? Something like > GPFS? 
That's IBM proprietory though isn't it? On the other hand NFS > seems pretty archaic. I've seen quite a few installations use Lustre. > I am planning to play with that. Something in the OpenSource world to > keep costs down. > > GPFS actually supports exporting the filesystem using clustered NFS. You have to run your clients over NFSv4, though. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From eugen at leitl.org Fri Sep 4 01:17:22 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 4 Sep 2009 10:17:22 +0200 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo Message-ID: <20090904081722.GN4508@leitl.org> http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html Some perspective to this DIY storage server mentioned at Storagemojo Thursday, September 3. 2009 I've received yesterday some mails/tweets with hints to a "Thumper for poor" DIY chassis. Those mails asked me for an opinion towards this piece of hardware and if it's a competition to our X4500/X4540. Those questions arised after Robin Harris wrote his article "Cloud storage for $100 a terabyte", which referred to the company Backblaze, which constructed a storage server on its own and described it on their blog in the article "Petabytes on a budget: How to build cheap cloud storage". Sorry, that this article took so long and there may be a higher rate of typos, as my sinusitis came back with a vengeance ... right in the second week of my vacation. But now this rather long article is ready :-) At first: No, it isn't a system comparable to an X4540 ... even without the considerations of DIY versus Tier-1 vendor. I have a rather long opinion about it, but let's say one thing at first: I see several problems, but i think it fits their need, so it's an optimal design for them and they designed it to be the optimum for them. I assume, many problems are addressed in the application logic. The nice thing at custom-build is the fact, that you can build a system exactly for your needs. And the Backblaze system is a system reduced to the minimum. This device is that cheap because it cuts several corners. That's okay for them. But for general purpose this creates problems. I want to share my concerns just to show you, that you can't compare this to a X4540 device. And even more important: I have to deny the conclusions of the Backblaze people. This isn't a good design, even when you just need cheap storage, when you don't own a middleware that does a lot of stuff that ZFS would do in the filesystem for example in the hardware. On the other side it supports my arguments in regard of the waning importance of RAID controllers. The more intelligent your application is, the less intelligent your storage needs to be. So ... what are my objections to this DIY device: * The DIY Thumper has no power-distribution grid. So when one PSU fails, all devices connected at this power supply will fail. In the case PSU2 fails, the system board is away, thus the machine fails. Game over ... until power comes back. * Connected to the last problem: Given the disk layout, the power-distribution isn't correct. They use it with RAID6, but RAID6 just protects you against 2 failures. 
I don't see a sensible layout in three RAID6 groups, that would allow the system to loose 25 disks at once. A more reasonable RAID Level would be RAID10, but there you have 5 disks without a partner in the other PSU failure domain. * I don't know if i consider a foam sleeve around the disks and some nylon screws as enough vibration dampening, especially when your hard disks. I'm looking forward to the next article they announced which was announced to cover this topic. It will be even more interesting to hear more about it in the future because of the performance and the longevity of the disks in such an environment. Just an example for the real world: Once we found out that disks near to a fan were a tad slower than the ones far away from the fan. This led to changes to the vibration handling in that system. * This baby cries for ZFS. So much capacity, no battery backup RAID controller, only 10^14 disks. But i see the reason, why this choice wasn't feasible for them: Since a few weeks ago, the OpenSolaris SATA framework hadn't support for port multiplier. This was introduced with the putback of PSARC/2009/394 to OpenSolaris. But now it's integrated. And given, that this baby just speaks HTTPS to the outside and the software relies on Tomcat, it should be a piece of cake to move to Opensolaris and ZFS now. * This design isn't really performance oriented. As they use Port multiplier to couple their disks to cheap SATA PCIe/PCI controller, one 3 GBit/s interface has to feed 5 disks. One ST31500341AS delivers round about 120 MByte/s (saw several benchmarks suggesting such a value). Five of them deliver 600 MByte/s, a little bit less than 6 GBit/s. So each SATA channel is oversubscribed by a factor of two. * Even more important, three of the connections to the port extenders are coupled to a standard PCI-Port. One PCI-Conventional 3.0 port (didn't find an information what the board provides, thus i assumed the fastest, source is the german wikipedia page about PCI) is capable to deliver round-about 4 Gigabit/second (to be exact 4,226 GBit/s). Thus you connected 18 GBit/s worth of hard disks at 4 GBit/s worth of connectivity. * I have similar objections for the PCIe connection for SATA-cards. Those ports are PCIe at 1x. One PCIe 1x port has a theoretical throughput of 250 MByte/s. So such a port would be fully loaded by just two hard disks. But this baby connects ten disks to a single lane of PCIe. * Of course those hard disks doesn't run at max speed all the time, i assume the load pattern will be very random in the special use case of Backblaze. But this leads to a high mechanical load to the disks and to some additional objections. Based on the manual of the hard disk, i see two problems here: o The ST31500341AS is a desktop disk. Not even one of this nearline disks like we use in the X4500/X4540. When you look in the disk manual, all reliability calculations were done on the basis of 2400 hours of operation per year. But a year has 8760 hours. o The reliability considerations of Seagate assume a desktop usage pattern, not a server usage pattern. o o Seagate writes in their manual itself: "The AFR and MTBF will be degraded if used in an enterprise application". But given the long credits list at their end, i assume they've read the manual and considered this in their choice of hard disks. o There is another important point about the reliability of the disks: The AFR and the MTBF for the 7200.11 is valid for a surrounding temperature of 25 degrees celcius. 
Running it above this temperature reduces the MTBF and increases the AFR. Other harddisks build with enterprise usage in mind use another normal temperature vastly higher. * But due to the usage of RAID-6 those disks will see a high throughput in any case. RAID6 relies on a READ/MODIFY/WRITE cycle due to the nature of RAID6. So you write vastly more than just the modified data to disc. This may even interfere with the sparse throughput of the system. We've introduced RAIDZ, RAIDZ2 and RAIDZ3 to circumvent this kind of problems * No battery backup for the caches, but RAID6 ... well ... "Warning ... write holes ahead" * This system uses a Desktop Board, the DG43NB, thus system resources are a little bit sparse on this board. Just 1 processor and just 4 GB of RAM. I find the later one a little bit problematic. For general purpose a lot of more memory would be feasible. There are good reasons to have 32 GB or 64 GB in a X4540. Without a large amount of cache, you aren't able to shave off a little bit of the IOPS load to get back to a moderate load, thus the choice of a desktop disks gets even more problematic here. I think, Robin Harris is correct with his comment, that this system is a DC-3. It flies, it can transport goods and passengers from A to B in a reasonable, but not fast speed but don't forget your parachutes ;-) It's the same with this storage, this hw needs the parachute in form of the software in front of the device. But, and this is one of the key take aways for you ... even when other systems are more expensive, they are not overpriced. At first don't compare the mentioned list prices with the street prices for components. Second: Of course you can save an dollar at one or the other place, but: The seagate hard disk costs you 100 Euro at a big german computer online-shop, the HUA721010KLA330 (aka Hitachi Ultrastar A7K1000 1TB) costs you roundabout 200 Euro after a search at Google. Just using other (in my opinion correct for general purpose) disks, would double the price despite offering less storage. And even this price isn't indicative, as most often there are special agreements between drive manufacturers and system manufactures because of quality standards, quality management and conditions. The technical differences of the UltraStar: 1 errors in 10^15, qualified for 24/7 operations by the manufacturer, qualified for a enterprise work pattern (and even here only a lighter one) and 1.2 Million Hours MTBF normalized on 40 degrees (AFAIK) instead of 0.7 million Hours at 25 degrees. Quality costs. Period. The same for a desktop board in the DIY-"Thumper" instead of a custom build board for optimal performance (a SATA controller for each disk or using 8x lane PCIe for 8 disks instead of 1x lane PCIe for 10 disk e.g.). I'm pretty sure Sun could build an equally priced system, when you take the bare metal of the X4500 chassis and rip out all the specialities of the X4500/X4540 systems. But such a system with so many corners left wouldn't a be a system you expect from Sun. And yes, the X4540 has less capacity at the moment, but i think it's not far too fetched, that the X4540 gets 2TB drives as soon as they reached the same quality standards and qualification as the current drives givinh the X4540 a capacity of 96 TB. To close this article: It's about making decision. Application and hardware has to be seen as one. 
When your application is capable of overcoming the limitations and
problems of such ultra-cheap storage (and the software of Backblaze seems
to have these capabilities), such a DIY thing may be a good solution for
you. If you have to run normal applications without these capabilities,
the general-purpose system looks like a much better road in my opinion.

Posted by Joerg Moellenkamp in English, Solaris, Sun, Technology, The IT
Business at 15:22

From deadline at eadline.org  Fri Sep  4 12:49:24 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Fri, 4 Sep 2009 15:49:24 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
Message-ID: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>

It is not a real "book" (it is short), but some
people may be interested in:

HPC For Dummies

http://www.sun.com/x64/ebooks/hpc.jsp

You have to register for a copy.
The author is some HPC hack.

--
Doug

From gus at ldeo.columbia.edu  Fri Sep  4 13:11:53 2009
From: gus at ldeo.columbia.edu (Gus Correa)
Date: Fri, 04 Sep 2009 16:11:53 -0400
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
Message-ID: <4AA17489.1080404@ldeo.columbia.edu>

Douglas Eadline wrote:
> It is not a real "book" (it is short), but some
> people may be interested in:
>
> HPC For Dummies
>
> http://www.sun.com/x64/ebooks/hpc.jsp
>
> You have to register for a copy.
> The author is some HPC hack.
>

I got interested, and I tried.
Unfortunately it requires Windows or Mac OS to be downloaded and read.
Weird requirement for an HPC audience.
Linux folks are out.
Why not a simple PDF file?

Gus Correa

From hahn at mcmaster.ca  Fri Sep  4 13:24:17 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Fri, 4 Sep 2009 16:24:17 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
	<4AA17489.1080404@ldeo.columbia.edu>
Message-ID: 

> Weird requirement for an HPC audience.

s/Weird/Asinine/

From deadline at eadline.org  Fri Sep  4 13:57:35 2009
From: deadline at eadline.org (Douglas Eadline)
Date: Fri, 4 Sep 2009 16:57:35 -0400 (EDT)
Subject: [Beowulf] HPC for Dummies
In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu>
References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org>
	<4AA17489.1080404@ldeo.columbia.edu>
Message-ID: <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org>

I was not aware of that!

I wrote this back in February and have not seen
the final version. I will mention something
to the people at AMD and Sun. I suspect it
has more to do with Wiley than with Sun and AMD.
Maybe a printed/pdf version will show up.

--
Doug

> Douglas Eadline wrote:
>> It is not a real "book" (it is short), but some
>> people may be interested in:
>>
>> HPC For Dummies
>>
>> http://www.sun.com/x64/ebooks/hpc.jsp
>>
>> You have to register for a copy.
>> The author is some HPC hack.
>>
>
> I got interested, and I tried.
> Unfortunately it requires Windows or Mac OS to be downloaded and read.
> Weird requirement for an HPC audience.
> Linux folks are out.
> Why not a simple PDF file?
> > Gus Correa > -- Doug From gus at ldeo.columbia.edu Fri Sep 4 14:31:40 2009 From: gus at ldeo.columbia.edu (Gus Correa) Date: Fri, 04 Sep 2009 17:31:40 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> Message-ID: <4AA1873C.3080603@ldeo.columbia.edu> Hi Douglas I didn't mean to spoil your chances for the Pulitzer Prize, and I really hope your book will become a bestseller! :) Considering what the technical and textbook market is today, where the same books get a new edition every year, with the same or reduced content, a new cover, a password protected web site, and a doubled price tag, your guess about Wiley making the access to the book sample hard may be correct. Thank you for the book sample anyway, which I hope will become available to all as pdf file at some point. Gus Correa Douglas Eadline wrote: > I was not aware of that! > > I wrote this back in February and have not seen > the final version. I will mention something > to the people at AMD and Sun. I suspect it > has more to with Wiley than with Sun and AMD. > Maybe a printed/pdf version will show up. > > -- > Doug > > >> Douglas Eadline wrote: >>> It is not a real "book" (it is short), but some >>> people may be interested in: >>> >>> HPC For Dummies >>> >>> http://www.sun.com/x64/ebooks/hpc.jsp >>> >>> You have to register for a copy. >>> The author is some HPC hack. >>> >> I got interested, and I tried. >> Unfortunately it requires Windows or Mac OS to be downloaded and read. >> Weird requirement for an HPC audience. >> Linux folks are out. >> Why not a simple PDF file? >> >> Gus Correa >> > > From deadline at eadline.org Fri Sep 4 17:51:30 2009 From: deadline at eadline.org (Douglas Eadline) Date: Fri, 4 Sep 2009 20:51:30 -0400 (EDT) Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA1873C.3080603@ldeo.columbia.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> <33910.192.168.1.213.1252097855.squirrel@mail.eadline.org> <4AA1873C.3080603@ldeo.columbia.edu> Message-ID: <50870.192.168.1.213.1252111890.squirrel@mail.eadline.org> > Hi Douglas > > I didn't mean to spoil your chances for the Pulitzer Prize, > and I really hope your book will become a bestseller! :) > > Considering what the technical and textbook market is today, > where the same books get a new edition every year, > with the same or reduced content, a new cover, > a password protected web site, and a doubled price tag, > your guess about Wiley making the access to the book sample > hard may be correct. > > Thank you for the book sample anyway, > which I hope will become available > to all as pdf file at some point. Thanks, but I should mention, it is not a "real" book. Or, not the book I would have written if I had more pages. It is a promotional "book" which I was given approximately 40 pages to explain HPC. I tried to cram as much as I could into it and make it interesting to non-HPC people (i.e. industrial HPC) It is similar to the "Virtualization For Dummies" book AMD produced a year or so ago. I should say that if there is enough interest in this topic Wiley may want me to do a "full book." As I never got a final copy, I to will have to figure out how to download and read it :) -- Doug > > Gus Correa > > > > Douglas Eadline wrote: >> I was not aware of that! 
>> >> I wrote this back in February and have not seen >> the final version. I will mention something >> to the people at AMD and Sun. I suspect it >> has more to with Wiley than with Sun and AMD. >> Maybe a printed/pdf version will show up. >> >> -- >> Doug >> >> >>> Douglas Eadline wrote: >>>> It is not a real "book" (it is short), but some >>>> people may be interested in: >>>> >>>> HPC For Dummies >>>> >>>> http://www.sun.com/x64/ebooks/hpc.jsp >>>> >>>> You have to register for a copy. >>>> The author is some HPC hack. >>>> >>> I got interested, and I tried. >>> Unfortunately it requires Windows or Mac OS to be downloaded and read. >>> Weird requirement for an HPC audience. >>> Linux folks are out. >>> Why not a simple PDF file? >>> >>> Gus Correa >>> >> >> > -- Doug From richard.walsh at comcast.net Sat Sep 5 11:30:32 2009 From: richard.walsh at comcast.net (richard.walsh at comcast.net) Date: Sat, 5 Sep 2009 18:30:32 +0000 (UTC) Subject: [Beowulf] Re: Wake on LAN supported on both built-in interfaces ... ?? In-Reply-To: Message-ID: <126594175.7208511252175432775.JavaMail.root@sz0135a.emeryville.ca.mail.comcast.net> >----- Original Message ----- >From: "David Mathog" >Subject: Wake on LAN supported on both built-in interfaces ... ?? > >richard.walsh at comcast.net wrote: > >> I have a head node that am trying to get WOL set up on. >> >> It is a SuperMicro motherboard (X8DTi-F) with two built >> in interfaces (eth0, eth1). I am told by SuperMicro support >> that both interfaces support WOL fully, but when I probe them >> with ethtool only eth0 indicates that it supports WOL with: >> >> ..... > >That board has "Intel? 82576 Dual-Port Gigabit Ethernet" > >and Intel provides some information on that here: > >http://edc.intel.com/Link.aspx?id=2372 > >where it says: > > Wake-on-LAN support: > Packet recognition and wake-up for LAN on motherboard > applications without software configuration > >and nothing more. That is ambiguous, it requires that at least one >interface support WOL, but it does not say explicitly that both do. >Most likely the hardware does support on both ports but the driver is >confused somehow by the dual chip. > >Try contacting the author of the linux driver and/or Intel directly. David/All, Here is some follow up on this WOL question. I did contact the driver folks and did some further checking. Standby power has to be supplied to both ports, but no one could assure me that it was for my motherboard. The suggestion was that because there was nothing in the EEPROM/BIOS to activate or switch to the other port that is perhaps was not. Economy of design and logic suggest that it should not be needed anyway, and upon further thought I simply swapped interface names with the MAC addresses in the ifcfg-eth0 and ifcfg-eth1 files, and then swapped my cables and rebooted, which was easy to do. There are half a dozen was to set or fix your interface names with specific MAC addresses/ports. Here is a good reference for doing this: http://www.science.uva.nl/research/air/wiki/LogicalInterfaceNames?show_comments=1#comments So, I have WOL working ... ;-) ... as I want. Thanks, rbw -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prentice at ias.edu Tue Sep 8 14:18:00 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Sep 2009 17:18:00 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> Message-ID: <4AA6CA08.6090200@ias.edu> Douglas Eadline wrote: > It is not a real "book" (it is short), but some > people may be interested in: > > HPC For Dummies > > http://www.sun.com/x64/ebooks/hpc.jsp > > You have to register for a copy. > The author is some HPC hack. > Wasn't AMD giving this away for free at SC08? -- Prentice From prentice at ias.edu Tue Sep 8 14:26:33 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 08 Sep 2009 17:26:33 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA6CA08.6090200@ias.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA6CA08.6090200@ias.edu> Message-ID: <4AA6CC09.3080909@ias.edu> Prentice Bisbal wrote: > Douglas Eadline wrote: >> It is not a real "book" (it is short), but some >> people may be interested in: >> >> HPC For Dummies >> >> http://www.sun.com/x64/ebooks/hpc.jsp >> >> You have to register for a copy. >> The author is some HPC hack. >> > > Wasn't AMD giving this away for free at SC08? Nevermind. It was different. Must have been the virtualization for dummies book. -- Prentice From michf at post.tau.ac.il Tue Sep 1 01:08:20 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:08:20 +0300 Subject: [Beowulf] GPU question In-Reply-To: References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <20090901110820.1533b3ca@vivalunalitshi.luna.local> On Mon, 31 Aug 2009 16:13:55 +0200 Jonathan Aquilina wrote: > >One thing that's not mentioned out loud by NVIDIA (I have read only in > >CUDA programming manual) is that if the video system needs more memory > >that's not available(say you change resolution, while you're waiting > >for your process to finish), it will crash your cuda app, so I advise > >you to use a second card to display (if you have a tesla solution, > >you certainly have a "second" display card). If you are running > >remotly, this i an non issue (framebuffers don't need much memory > >neither change resolution). > in this regard then why waste a pci-x slot when u can get one that has > graphics integrated onto the board leaving the slots free to use for data > processing. is there any difference in performance in a motherboard that has > the graphics card integrated and one that does not? ?The Problem is that I've never ran into an onboard card that's capable of doing real hpc work. Laptops can come with a relatively strong nvidia chip but not enough for hpc (can't handle the wattage and cooling among other things). Motherboards usually come with a cheap intel. And it's not like you can use the graphics slot for anything else. Possibly in a few years when the Intel vision for larabee will come along where larabee is integrated on the board (although I don't think that it will happen before pci-e 3 which is supposed to handle the same speed. By the way, nvidia do say that it's better not to use the main card for cuda if you intend to do real hpc (don't remember where I read it though) and if the second card is not a tesla the suggest a quadro as the main card. 
They claim that you get better performance (and if you intend to also do glsl, quadro support opengl gpu affinity and I think that it also supports opengl context not on the main card. Tesla apparantly also supports opengl but it's not official). By the way, if you ask nvidia, they suggest for deployment to use only tesla and quadro as they claim that they are designed to handle 24/7 work and that g200 will downclock when it gets hot. The quadros or ridiculously expensive though. From michf at post.tau.ac.il Tue Sep 1 01:20:01 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:20:01 +0300 Subject: [Beowulf] GPU question In-Reply-To: <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> Message-ID: <20090901112001.23110413@vivalunalitshi.luna.local> On Sun, 30 Aug 2009 04:35:30 +0500 amjad ali wrote: > Hello all, specially Gil Brandao > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > Which one is more "natural" and "practical" way? > Does a program written for any one of the above will work fine on the other? > or we have to re-program for the other? > If you use mpi to have several processes, each controlling one gpu then the same code would work in both scenarios. The first scenario though will make communication easier and would allow you to avoid mpi. Cuda runs asynchronously so you can do work on the pc in parallel and control all gpus from one process. If the pcs are powerful enough, the communication is low enough and you want to combine cpu and gpu that both do work than the second option may work better. It will make cooling simpler and you can use smaller PSUs. Note that each of these monsters (g200, tesla, quadro) can take over 200W and solve your heating problem for the winter if you're in Alaska. Also take note that there are big differences in terms of communication (speed and latency). If you put all cards in one pc, communication between them is gpu via pci-e to memory. In several pcs you have gpu (pc a) -- pci-e --> memory (pc a) -- network --> memory (pc b) --> gpu (pc b) > Regards. > > On Sat, Aug 29, 2009 at 5:48 PM, wrote: > > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali wrote: > > > Hello All, > > > > > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than ONE GPUs > > > (e.g. Nvidia C1060) > > > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > > > 6) Cluster of PCs with each node having more than one CPUs and ONE > > GPU > > > > > > 7) Cluster of PCs with each node having ONE CPU and more than ONE > > GPUs > > > > > > 8) Cluster of PCs with each node having more than one CPUs and more > > > than ONE GPUs. > > > > > > > > > > > > Which of these are good/realistic/practical; which are not? Which are > > quite > > > ?natural? to use for CUDA based programs? > > > > > > > CUDA is kind of new technology, so I don't think there is a "natural > > use" yet, though I read that there people doing CUDA+MPI and there are > > papers on CPU+GPU algorithms. 
> > > > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally good for > > > some/all of these setups or we need to write different CUDA based > > programs > > > for each of these setups to get good efficiency? > > > > > > > There is no "one size fits all" answer to your question. If you never > > developed with CUDA, buy one GPU an try it. If it fits your problems, > > scale it with the approach that makes you more comfortable (but > > remember that scaling means: making bigger problems or having more > > users). If you want a rule of thumb: your code must be > > _truly_parallel_. If you are buying for someone else, remember that > > this is a niche. The hole thing is starting, I don't thing there isn't > > many people that needs much more 1 or 2 GPUs. > > > > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > > > > put it on hold until OpenCL takes of (in the real sense, not in > > "standards papers" sense), otherwise you will have to learn another > > technology that even fewer people knows. > > > > > > Gil Brandao > > From michf at post.tau.ac.il Tue Sep 1 01:58:40 2009 From: michf at post.tau.ac.il (Micha Feigin) Date: Tue, 1 Sep 2009 11:58:40 +0300 Subject: [Beowulf] GPU question In-Reply-To: <4A9BFA3B.5030809@ldeo.columbia.edu> References: <428810f20908290042k22430e4ci6543495e9be86ab3@mail.gmail.com> <428810f20908291635i2d873f9q2735fde88e602920@mail.gmail.com> <4A9BFA3B.5030809@ldeo.columbia.edu> Message-ID: <20090901115840.19c42248@vivalunalitshi.luna.local> On Mon, 31 Aug 2009 12:28:43 -0400 Gus Correa wrote: > Hi Amjad > > 1. Beware of hardware requirements, specially on your existing > computers, which may or may not fit a CUDA-ready GPU. > Otherwise you may end up with a useless lemon. > > A) Not all NVidia graphic cards are CUDA-ready. > NVidia has lists telling which GPUs are CUDA-ready, > which are not. > All newer cards are CUDA ready but with different levels of support. basically everything from geforce 8000 and up will support cuda (even the 40$ cards by the way). g80/g90 cards (geforce 8000 and 9000 series) will only do single precision, have relatively small memory (256-768 mb), have stricter coalescing requirements (working with main card memory efficiently is harder), have limited atomic operation support g200 cards (geforce 200 series) do double precision, can reach a lot more memory (1gb on the g285, 1800mb on the g295), have more cores and higher memory bandwidth and have much better atomic operation support. the g285 has 240 cores, the g295 has to GPUs on board giving 480 cores. These are rather nice for development if you want to work on your main card in your pc and have a low budget. They are not targeted for hpc by NVidia by the way. tesla, this is basically almost the same as the g285 (240 cores) but has 4gb memory and no graphics output. You can get 4 of these in a dedicated rack mount (s1070). They are designed to run cuda, although unofficially the support opengl according to the guy in charge of them at NVidia. They are much more expensive than the g285 though. quadro, these are designed for business graphics for render farms, CAD and high dynamic range work. They have support for high dynamic range imaging and anti aliasing in hardware and are the only cards by nvidia that support gpu affinity for opengl (for glsl). There are unofficial hacks to achieve this if I'm not mistaken under linux with the g200 but no way to do it under windows. 
There is also a rack mount to connect four of these to one PC (quadro plex) Another thing to note is that the teslas and quadros are manufactures and supported by nvidia and designed for 24/7 deployment. For the geforce series, nvidia makes the GPU chip but not the cards and they get very angry and offended if you suggest putting one of these into a deployment system. > B) Check all the GPU hardware requirements in detail: motherboard, > PCIe version and slot, power supply capacity and connectors, etc. > See the various GPU models on NVidia site, and > the product specs from the specific vendor you choose. > > C) You need a free PCIe slot, most likely 16x, IIRR. > I couldn't find any cuda supported card that works on anything less than pci-e x 16. I found reference to one of the cheap cards (don't remember which one, I thing geforce 8400) that has a version for something less, but could actually find one. > D) Most GPU card models are quite thick, and take up its > own PCIe slot and cover the neighbor slot, which cannot be used. > Hence, if your motherboard is already crowded, make sure > everything will fit. The high end ones take two slots, one for the cooling, I thing that the g295 actually takes 3 slots if memory serves. > > For rackmount a chassis you may need at least 2U height. > On a tower PC chassis this shouldn't be a problem. > You may need some type of riser card if you plan to mount the GPU > parallel to the motherboard. > You also need appropriate cooling to take care of the ridiculous amount of heat. > E) If I remember right, you need PCIe version 1.5 (?) > or version 2 on your motherboard. > Most cards are pci-e 2 > F) You also need a power supply with enough extra power to feed > the GPU beast. > The GPU model specs should tell you how much power you need. > Most likely a 600W PS or larger, specially if you have a dual socket > server motherboard with lots of memory, disks, etc to feed. > They also take their own power input from the psu (two to three inputs) for additional power unlike the cards that take all their power from the pci-e slot so you need a supported psu. The psu also needs to be strong enough (around 200w per cards, the g285 says you need at least 550w for a single card) > G) Depending on the CUDA-ready GPU card, > the low end ones require 6-pin PCIe power connectors > from the power supply. > The higher end models require 8-pin power supply PCIe connectors. > You may find and buy molex-to-PCIe connector adapters also, > so that you can use the molex (i.e. ATA disk power connectors) > if your PS doesn't have the PCIe connectors. > However, you need to have enough power to feed the GPU and the system, > no matter what. > > *** > > 2. Before buying a lot of hardware, I would experiment first with a > single GPU on a standalone PC or server (that fits the HW requirements), > to check how much programming it takes, > and what performance boost you can extract from CUDA/GPU. > > CUDA requires quite a bit of logistics of > shipping data between memory, GPU, CPU, > etc. > It is perhaps more challenging to program than, say, > parallelizing a serial program with MPI, for instance. > Codes that are heavy in FFTs or linear algebra operations are probably > good candidates, as there are CUDA libraries for both. > There is a steep learning curve as you need to understand the hardware to get the most out of your code. I find it easier to code than mpi but I guess that is personal. You code is usually good for CUDA if you need to do the same thing a lot of times. 
If you code has a lot of logic (a lot of if clauses), a lot of atomic operations or complex data structures that it probably won't transfer well. Also complex (non-ordered) memory reading/writing paradigms can have a significant performance hit (reading is easier to cope with that writing) > At some point only 32-bit floating point arrays would take advantage of > CUDA/GPU, but not 64-bit arrays. > The latter would > require additional programming to change between 64/32 bit > when going to and coming back from the GPU. > Not sure if this still holds true, > newer GPU models may have efficient 64-bit capability, > but it is worth checking this out, including if performance for > 64-bit is as good as for 32-bit. > g200 and up (including tesla and quadro) have 64bit floating point support but it is much less efficient than 32bit. If memory serves it's a ratio of about 1:5. What nvidia calls cores are actually FPUs. these FPUs are 32bit and it combines these somehow to get 64bit arithmetic. ATI are better at 64bit arithmeric performance but ati streams are much more limited, are much harder to code and the documentation is VERY scarce. If you use glsl and double precision it's better to go with ATI. Maybe once OpenCL is mature enough it will also be an option but the g300 will probably be on the market by then and they may change the balance. geforce 8000 and 9000 only do single precission > 3. PGI compilers version 9 came out with "GPU directives/pragmas" > that are akin to the OpenMPI directives/pragmas, > and may simplify the use of CUDA/GPU. > At least before the promised OpenCL comes out. > Check the PGI web site. > > Note that this will give you intra-node parallelism exploring the GPU, > just like OpenMP does using threads on the CPU/cores. > I saw one of these, don't remember if it was PGI. No experience with it though. >From CUDA exprience though, there would be a lot of things that it would be hard for such a compiler to achieve. BTW, there is also a matlab toolbox called jacket from accelereyes that allows you to do cuda from matlab. The numbers are not as good as they advertize (truth in advertizing, they also provide an OpenGL visualization functions and for their code they use opengl and for the matlab version they use surf, from testing the difference is huge) > 4. CUDA + MPI may be quite a challenge to program. > > I hope this helps, > Gus Correa > > amjad ali wrote: > > Hello all, specially Gil Brandao > > > > Actually I want to start CUDA programming for my |C.I have 2 options to do: > > 1) Buy a new PC that will have 1 or 2 CPUs and 2 or 4 GPUs. > > 2) Add 1 GPUs to each of the Four nodes of my PC-Cluster. > > > > Which one is more "natural" and "practical" way? > > Does a program written for any one of the above will work fine on the > > other? or we have to re-program for the other? > > > > Regards. > > > > On Sat, Aug 29, 2009 at 5:48 PM, > > wrote: > > > > On Sat, Aug 29, 2009 at 8:42 AM, amjad ali > > wrote: > > > Hello All, > > > > > > > > > > > > I perceive following computing setups for GP-GPUs, > > > > > > > > > > > > 1) ONE PC with ONE CPU and ONE GPU, > > > > > > 2) ONE PC with more than one CPUs and ONE GPU > > > > > > 3) ONE PC with one CPU and more than ONE GPUs > > > > > > 4) ONE PC with TWO CPUs (e.g. Xeon Nehalems) and more than > > ONE GPUs > > > (e.g. 
Nvidia C1060) > > > > > > 5) Cluster of PCs with each node having ONE CPU and ONE GPU > > > > > > 6) Cluster of PCs with each node having more than one CPUs > > and ONE GPU > > > > > > 7) Cluster of PCs with each node having ONE CPU and more > > than ONE GPUs > > > > > > 8) Cluster of PCs with each node having more than one CPUs > > and more > > > than ONE GPUs. > > > > > > > > > > > > Which of these are good/realistic/practical; which are not? Which > > are quite > > > ?natural? to use for CUDA based programs? > > > > > > > CUDA is kind of new technology, so I don't think there is a "natural > > use" yet, though I read that there people doing CUDA+MPI and there are > > papers on CPU+GPU algorithms. > > > > > > > > IMPORTANT QUESTION: Will a cuda based program will be equally > > good for > > > some/all of these setups or we need to write different CUDA based > > programs > > > for each of these setups to get good efficiency? > > > > > > > There is no "one size fits all" answer to your question. If you never > > developed with CUDA, buy one GPU an try it. If it fits your problems, > > scale it with the approach that makes you more comfortable (but > > remember that scaling means: making bigger problems or having more > > users). If you want a rule of thumb: your code must be > > _truly_parallel_. If you are buying for someone else, remember that > > this is a niche. The hole thing is starting, I don't thing there isn't > > many people that needs much more 1 or 2 GPUs. > > > > > > > > Comments are welcome also for AMD/ATI FireStream. > > > > > > > put it on hold until OpenCL takes of (in the real sense, not in > > "standards papers" sense), otherwise you will have to learn another > > technology that even fewer people knows. > > > > > > Gil Brandao > > > > > > > > ------------------------------------------------------------------------ > > > > _______________________________________________ > > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From brs at admin.usf.edu Thu Sep 3 10:40:52 2009 From: brs at admin.usf.edu (Smith, Brian) Date: Thu, 3 Sep 2009 13:40:52 -0400 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> References: <4A9F42ED.5050600@scalableinformatics.com><4A9FBD66.1060608@scalableinformatics.com><4A9FDE85.5020008@ldeo.columbia.edu> <9FA59C95FFCBB34EA5E42C1A8573784F02053A65@mtiexch01.mti.com> Message-ID: Gilad, Where are you finding cables for $30? The lowest I've been able to find 1M is in the $60 price range. I have a project going right now that would benefit greatly from $30 cables. 
-Brian -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Gilad Shainer Sent: Thursday, September 03, 2009 12:51 PM To: Rahul Nabar; Gus Correa Cc: Bewoulf Subject: RE: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes > > See these small SDR switches: > > > > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct= 13 > > http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10 > > > > And SDR HCA card: > > > > Thanks Gus! This info was very useful. A 24port switch is $2400 and > the card $125. Thus each compute node would be approximately $300 more > expensive. (How about infiniband cables? Are those special and how > expensive. I did google but was overwhelmed by the variety available.) You can find copper cables from around $30, so the $300 will include the cable too > This isn't bad at all I think. If I base it on my curent node price > it would require only about a 20% performance boost to justify this > investment. I feel Infy could deliver that. When I had calculated it > the economics was totally off; maybe I had wrong figures. You can always run your app on available users system and see the performance boost that you will be able to get. For example, you can use the center (free of charge) - http://www.hpcadvisorycouncil.com/cluster_center.php > The price-scaling seems tough though. Stacking 24 port switches might > get a bit too cumbersome for 300 servers. But when I look at > corresponding 48 or 96 port switches the per-port-price seems to shoot > up. Is that typical? It is the same as buying blades. If you get the switches fully populated, than it will be cost effective. There is a 324 port switch, which should be a good option too. > > For a 300-node cluster you need to consider > > optical fiber for the IB uplinks, > > You mean compute-node-to-switch and switch-to-switch connections? > Again, any $$$ figures, ballpark? It all depends on the speed. If you are using IB SDR or DDR, copper cables will be enough. For QDR you can use passive copper up 7-8 meters, and active up to 12m, before you need to move to fiber. > > I don't know about your computational chemistry codes, > > but for climate/oceans/atmosphere (and probably for CFD) > > IB makes a real difference w.r.t. Gbit Ethernet. > > I have a hunch (just a hunch) that the computational chemistry codes > we use haven't been optimized to get the full advantage of the latency > benefits etc. Some of the stuff they do is pretty bizarre and > inefficient if you look at their source codes (writing to large I/O > files all the time eg.) I know this ought to be fixed but there that > seems a problem for another day! On the same web site I have listed above, there are some best practices with apps performance, You can check them out and see if some of them are more relevant. 
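Once the HCAs and cables arrive, the standard OFED diagnostics are an easy way to confirm that every port actually trained at the expected rate before you start benchmarking. A quick sketch (tool names are from the usual OFED/infiniband-diags packages; exact output varies by stack version):

ibstat           # per-HCA port state and rate: 10 = SDR, 20 = DDR, 40 = QDR
ibv_devinfo      # firmware level and port attributes
ibhosts          # HCAs the subnet manager can see
ibswitches       # switches the subnet manager can see
ibchecknet       # basic sweep for link/port errors across the fabric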
_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From gmkurtzer at gmail.com Thu Sep 3 13:16:48 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 3 Sep 2009 13:16:48 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: Message-ID: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> On Wed, Sep 2, 2009 at 11:18 PM, Mark Hahn wrote: >> That brings me to another important question. Any hints on speccing >> the head-node? > > I think you imply a single, central admin/master/head node. ?this is a very > bad idea. ?first, it's generally a bad idea to have users on a fileserver. > ?next, it's best to keep cluster-infrastructure > (monitoring, management, pxe, scheduling) on a dedicated admin machine. > for 300 compute nodes, it might be a good idea to provide more than one > login node (for editing, compilation, etc). To expand on Mark's comment... I would SPEC >=2 systems for head/masters and either spread the load of the required services (e.g. management, monitoring and other sysadmin tasks and put scheduling on the other) OR put all of the services on a single master and then run a shadow master for redundancy. I would not put users on either of these systems. If you were using Perceus..... I would either create an interactive VNFS capsule (include compilers, additional libs, etc..) or make a large more bloated compute VNFS capsule and use that on all of the nodes. In this scenario, all nodes could run stateless *and* diskful so if you need to change the number of interactive nodes you can do it with a simple command sequence: # perceus vnfs import /path/to/interactive.vnfs # perceus node set vnfs interactive n000[0-4] and/or # perceus vnfs import /path/to/compute.vnfs # perceus node set vnfs compute n0[004-299] Have your cake and eat it too. :) The file system needs to be built to handle the load of the apps. 300 nodes means you can go from the low end (Linux RAID and NFS) to a higher end NFS solution, or upper end of a parallel file system or maybe even one of each (NFS and parallel) as they solve some different requirements. -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From gmkurtzer at gmail.com Thu Sep 3 17:19:33 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Thu, 3 Sep 2009 17:19:33 -0700 Subject: [Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes In-Reply-To: References: <571f1a060909031316m6f40d268n95ae943f470c8f81@mail.gmail.com> Message-ID: <571f1a060909031719l65a152d3q1fb79c8ea97b9c3d@mail.gmail.com> On Thu, Sep 3, 2009 at 3:56 PM, Rahul Nabar wrote: > On Thu, Sep 3, 2009 at 3:16 PM, Greg Kurtzer wrote: > > Thanks for the comments Greg! Sure thing. Glad to offer what I can. > >> If you were using Perceus..... > > No. I've never used Perceus before and although it sounds interesting > this seems like a bad time to try something new! Unless it makes your job easier as you scale up. ;) Feel free to check out: http://www.linux-mag.com/id/6386 http://www.linux-mag.com/id/7239 > >> >> The file system needs to be built to handle the load of the apps. 
300 >> nodes means you can go from the low end (Linux RAID and NFS) to a >> higher end NFS solution, or upper end of a parallel file system or >> maybe even one of each (NFS and parallel) as they solve some different >> requirements. > > What exactly do you mean by a "parallel" file system? Something like > GPFS? That's IBM proprietory though isn't it? On the other hand NFS > seems pretty archaic. I've seen quite a few installations use Lustre. > I am planning to play with that. Something in the OpenSource world to > keep costs down. Yes, GPFS is IBM's commercial file system and Lustre is a free solution. Both are very complicated components to the cluster that will take a large investment to do properly (either initial purchase cost and the hidden cost of administration or just a lot of hidden costs). If cost is really an issue, *and* if the applications don't require a parallel file system then why not make your job easier with the use of a quality Network Attached Storage solution (NAS) and use NFS? In either case if you look around you can find people that may even have premade Perceus VNFS capsules (the equivalent of an installer or preconfigured disk image) for Lustre servers, clients, and various other system roles. Best of luck! Greg -- Greg Kurtzer http://www.infiscale.com/ http://www.perceus.org/ http://www.caoslinux.org/ From sabujp at gmail.com Fri Sep 4 16:43:04 2009 From: sabujp at gmail.com (Sabuj Pattanayek) Date: Fri, 4 Sep 2009 18:43:04 -0500 Subject: [Beowulf] where do I set PBSACLUSEGROUPLIST before starting pbs_server from torque 2.4.0b1? Message-ID: Hi, I can't seem to figure out how/where to set the environment variable PBSACLUSEGROUPLIST. This page: http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml says about acl_groups: NOTE: If the PBSACLUSEGROUPLIST variable is set in the pbs_server environment, acl_groups will be check against all groups of which the job user is a member. I tried exporting that variable with: export PBSACLUSEGROUPLIST export PBSACLUSEGROUPLIST=1 export PBSACLUSEGROUPLIST=true before starting pbs_server and pbs_sched but I still can't submit to queues where I belong to the secondary group for which acl_groups is set. Otherwise, I've verified that acl_groups is working because I can only submit to a queue where my primary group is the same as that set in acl_groups for that queue. Anyone have any ideas on how to get pbs_server to look at all the groups that I'm a member of, supposedly using the environment variable mentioned above? Thanks, Sabuj Pattanayek From lukasz at mbi.ucla.edu Fri Sep 4 13:11:28 2009 From: lukasz at mbi.ucla.edu (Lukasz Salwinski) Date: Fri, 04 Sep 2009 13:11:28 -0700 Subject: [Beowulf] HPC for Dummies In-Reply-To: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> Message-ID: <4AA17470.6070501@mbi.ucla.edu> Douglas Eadline wrote: > It is not a real "book" (it is short), but some > people may be interested in: > > HPC For Dummies > > http://www.sun.com/x64/ebooks/hpc.jsp > > You have to register for a copy. > The author is some HPC hack. > Here's a note I've just dropped them: >Sun Microsystems wrote: >> >> Thank you for your interest in our HPC for Dummies ebook. If you >>have any questions, please contact us at HPC_Questions at sun.com. >> >> >> Sun Microsystems, Inc., 18 Network Circle, M/S: UMPK18-124, Attn: >> Global eMarketing, Menlo Park, CA 94025 USA >> Copyright 2009 Sun Microsystems, Inc. All rights reserved. 
> >I'm sorry but I'm not interested any more as there's no way to get >to the ebook from linux. wish i knew it before registering. lukasz -- ------------------------------------------------------------------------- Lukasz Salwinski PHONE: 310-825-1402 UCLA-DOE Institute for Genomics & Proteomics FAX: 310-206-3914 UCLA, Los Angeles EMAIL: lukasz at mbi.ucla.edu ------------------------------------------------------------------------- From mdublin at genomeweb.com Fri Sep 4 13:18:12 2009 From: mdublin at genomeweb.com (mdublin) Date: Fri, 4 Sep 2009 16:18:12 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA17489.1080404@ldeo.columbia.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA17489.1080404@ldeo.columbia.edu> Message-ID: <7279B23D-7E1E-40AE-A2EF-3925DE49E1B0@genomeweb.com> I'm on OSX, installed the Adobe Digital Additions app, and I still can't figure out how to download it, thanks Sun for making me feel like a child..... On Sep 4, 2009, at 4:11 PM, Gus Correa wrote: > Douglas Eadline wrote: >> It is not a real "book" (it is short), but some >> people may be interested in: >> HPC For Dummies >> http://www.sun.com/x64/ebooks/hpc.jsp >> You have to register for a copy. >> The author is some HPC hack. > > I got interested, and I tried. > Unfortunately it requires Windows or Mac OS to be downloaded and read. > Weird requirement for an HPC audience. > Linux folks are out. > Why not a simple PDF file? > > Gus Correa > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Matthew Dublin Senior Writer Genome Technology 1-212-651-5638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From trainor at presciencetrust.org Fri Sep 4 13:59:14 2009 From: trainor at presciencetrust.org (Douglas J. Trainor) Date: Fri, 4 Sep 2009 20:59:14 +0000 (GMT) Subject: [Beowulf] HPC for Dummies Message-ID: <1048939568.16294.1252097954740.JavaMail.mail@webmail09> An HTML attachment was scrubbed... URL: From margretthompson81 at yahoo.com Fri Sep 4 17:17:31 2009 From: margretthompson81 at yahoo.com (margret thompson) Date: Fri, 4 Sep 2009 17:17:31 -0700 (PDT) Subject: [Beowulf] HPC for Dummies Message-ID: <466061.58715.qm@web44703.mail.sp1.yahoo.com> The actual PDF is at http://acs.libredigital.com/books/ede6ba36-5cb7-4b71-8568-4a1d34451ece.pdf but unfortunately it's encumbered with some sort of encryption.? Who will figure out the key first?? Too bad none of us have any massive supercomputers to bruteforce it...? oh.? Wait. Other hints: http://acs.libredigital.com/prodfulfillment/URLLink.acsm?action=enterorder&ordersource=John%20Wiley%20%26%20Sons%2C%20Inc%2E&orderid=8772F613%2DCFB0%2D6C7F%2DC28711A50A6FEBE7&resid=urn%3Auuid%3Aede6ba36%2D5cb7%2D4b71%2D8568%2D4a1d34451ece&gbauthdate=Fri%2C%204%20Sep%20200918%3A45%3A41&dateval=1252086341&gblver=4&auth=6b3702ae1908fad5014043d98a44b4aaae5b0c73 http://lto.libredigital.com/js/ADEBadgeLauncher.js http://lto.libredigital.com/?SUN_AMD_HighPerformanceComputingForDummies -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mdublin at genomeweb.com Tue Sep 8 14:31:43 2009 From: mdublin at genomeweb.com (mdublin) Date: Tue, 8 Sep 2009 17:31:43 -0400 Subject: [Beowulf] HPC for Dummies In-Reply-To: <4AA6CC09.3080909@ias.edu> References: <38883.192.168.1.213.1252093764.squirrel@mail.eadline.org> <4AA6CA08.6090200@ias.edu> <4AA6CC09.3080909@ias.edu> Message-ID: Think it was the Idiot's Guide to Dummies.... On Sep 8, 2009, at 5:26 PM, Prentice Bisbal wrote: > > > > Prentice Bisbal wrote: >> Douglas Eadline wrote: >>> It is not a real "book" (it is short), but some >>> people may be interested in: >>> >>> HPC For Dummies >>> >>> http://www.sun.com/x64/ebooks/hpc.jsp >>> >>> You have to register for a copy. >>> The author is some HPC hack. >>> >> >> Wasn't AMD giving this away for free at SC08? > > Nevermind. It was different. Must have been the virtualization for > dummies book. > > -- > Prentice > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > Matthew Dublin Senior Writer Genome Technology 1-212-651-5638 -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Wed Sep 9 10:40:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 9 Sep 2009 12:40:23 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? Message-ID: Our new cluster aims to have around 300 compute nodes. I was wondering what is the largest setup people have tested NFS with? Any tips or comments? There seems no way for me to say if it will scale well or not. I have been warned of performance hits but how bad will they be? Infiniband is touted as a solution but the economics don't work out. My question is this: Assume each of my compute nodes have gigabit ethernet AND I specify the switch such that it can handle full line capacity on all ports. Will there still be performance hits as I start adding compute nodes? Why? Or is it unrealistic to configure a switching setup with full line capacities on 300 ports? If not NFS then Lustre etc options do exist. But the more I read about those the more I am convinced that those open another big can of worms. Besides, if NFS would work I do not want to switch. -- Rahul From landman at scalableinformatics.com Wed Sep 9 11:52:18 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 09 Sep 2009 14:52:18 -0400 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA7F962.500@scalableinformatics.com> Rahul Nabar wrote: > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. You might also want to look at GlusterFS. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From coutinho at dcc.ufmg.br Wed Sep 9 12:12:04 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 9 Sep 2009 16:12:04 -0300 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? 
In-Reply-To: References: Message-ID: My two cents: 2009/9/9 Rahul Nabar > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Yes. > Why? Because final NFS server bandwidth will be the bandwidth of the most limited device, be it disk, network interface or switch. Even if you have a switch capable of fill line capacity for all 300 nodes, you must put a insanely fast interface in your NFS server and a giant pool of disks to have a decent bandwidth if all nodes access NFS at the same time. But depending on the way people run applications in your cluster, only a small set of nodes will access NFS at the same time and a Ethernet 10Gb with tens of disks will be enough. > Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > This will be good for MPI but it will not help much your NFS server problem. > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. > > -- > Rahul > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzaletnev at yandex.ru Tue Sep 8 19:53:26 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed, 09 Sep 2009 06:53:26 +0400 Subject: [Beowulf] NFS server for a small cluster Message-ID: <393061252464806@webmail90.yandex.ru> Hello, list I'm going to prepare NFS server for my small PS3-cluster. So, I've a question: if I use 4 SATAII ports for 250 GB disks in software RAID0, is the performance of CPU critical? Witch one should I use Core Duo/1 MB cache or Core2Duo/6 MB cache and how much of DDR2-800 RAM, 2 GB or 8 GB? I'm going to use it for transitional/ unsteady state CFD simulations. Thank you for any suggestions. -- Dmitry Zaletnev From pscadmin at avalon.umaryland.edu Wed Sep 9 12:23:30 2009 From: pscadmin at avalon.umaryland.edu (psc) Date: Wed, 09 Sep 2009 15:23:30 -0400 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <200909091900.n89J07U5031683@bluewest.scyld.com> References: <200909091900.n89J07U5031683@bluewest.scyld.com> Message-ID: <4AA800B2.7060607@avalon.umaryland.edu> I wonder what would be the sensible biggest cluster possible based on 1GB Ethernet network . And especially how would you connect those 1GB switches together -- now we have (on one of our four clusters) Two 48 ports gigabit switches connected together with 6 patch cables and I just ran out of ports for expansion and wonder where to go from here as we already have four clusters and it would be great to stop adding cluster and start expending them beyond number of outlets on the switch/s .... 
NFS and 1GB Ethernet works great for us and we want to stick with it , but we would love to find a way how to overcome the current "switch limitation". ... I heard that there are some "stackable switches" .. in any case -- any idea , suggestion will be appreciated. thanks!! psc > From: Rahul Nabar > Subject: [Beowulf] how large of an installation have people used NFS > with? would 300 mounts kill performance? > To: Beowulf Mailing List > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. > > From gmkurtzer at gmail.com Wed Sep 9 12:32:03 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Wed, 9 Sep 2009 12:32:03 -0700 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com> On Wed, Sep 9, 2009 at 10:40 AM, Rahul Nabar wrote: > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > > I have been warned of performance hits but how bad will they be? > Infiniband is touted as a solution but the economics don't work out. > My question is this: > > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > > If not NFS then Lustre etc options do exist. But the more I read about > those the more I am convinced that those open another big can of > worms. Besides, if NFS would work I do not want to switch. NFS itself doesn't have any hard limits and I have seen clusters well over a thousand nodes using it. With that said, you also need to consider your application and user requirements and the budget including administration costs as you architect your resource. As an aside note, generally the more specialized or non-standard the implementation, the more pressure you will put on administration costs. Keep in mind that the requirements of the system and budget need to define the architecture of the system. NFS is a good choice and can be suitable for systems much larger then 300 nodes. *BUT* that would depend on what you are doing with the cluster, application IO requirements, usage patterns, user needs, reliability/uptime goals, etc... Hope that helps. ;) -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. 
- http://www.infiscale.com From Greg at keller.net Wed Sep 9 13:38:45 2009 From: Greg at keller.net (Greg Keller) Date: Wed, 9 Sep 2009 15:38:45 -0500 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <200909091900.n89J07U6031683@bluewest.scyld.com> References: <200909091900.n89J07U6031683@bluewest.scyld.com> Message-ID: > Date: Wed, 9 Sep 2009 12:40:23 -0500 > From: Rahul Nabar > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or > comments? There seems no way for me to say if it will scale well or > not. > "It all depends" -- Anonymous Cluster expert I routinely run NFS with 300+ nodes, but "it all depends" on the applications' IO profiles. For example, Lot's of nodes reading and writing different files in a generically staggered fashion, may not be a big deal. 300 nodes writing to the same file at the same time... ouch! If you buy generic enough hardware you can hedge your bet, and convert to Gluster or Luster or eventually pNFS if things get ugly. But not all NFS servers are created equal, and a solid purpose built appliance may handle loads a general purpose linux NFS server won't. > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. > Will there still be performance hits as I start adding compute nodes? > Why? Or is it unrealistic to configure a switching setup with full > line capacities on 300 ports? > The bottleneck is more likely the File-server's Nic and/or it's Back- end storage performance. If the file-server is 1GbE attached then having a strong network won't help NFS all that much. 10GbE attached will keep up with a fair number of raided disks on the back-end. Load the NFS server up with a lot of RAM and you could keep a lot of nodes happy if they are reading a common set of files in parallel. Until you get to parallel FS options, it's hard to imagine the switching infrastructure being the bottleneck so long as it supports the 1 or 10GbE performance from the IO node. If you expect heavy MPI usage on the Ethernet side, then non-blocking and low latency issues become relevant, but for IO it only needs to accommodate the slowest link... the IO node. Hope this helps, Greg From jmdavis1 at vcu.edu Wed Sep 9 14:10:11 2009 From: jmdavis1 at vcu.edu (Mike Davis) Date: Wed, 09 Sep 2009 17:10:11 -0400 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <4AA819B3.3050503@vcu.edu> psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. 
> in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > > When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible, We use 144 and 160 port Gig e switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well. We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches. -- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University "Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton From coutinho at dcc.ufmg.br Wed Sep 9 14:50:53 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Wed, 9 Sep 2009 18:50:53 -0300 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: 2009/9/9 psc > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > Stackable switches are small switches 16 to 48 ports that have proprietary high bandwidth uplinks to connect switches of the same type. Typically these connections are pairs of 10Gbps (as they are full-duplex, sometimes vendors say that they are 20Gbps) cables that connect all switches in ring configuration. This solution is cheaper than a modular switch, but has limited bandwidth. > thanks!! > psc > > > From: Rahul Nabar > > Subject: [Beowulf] how large of an installation have people used NFS > > with? would 300 mounts kill performance? > > To: Beowulf Mailing List > > Message-ID: > > > > Content-Type: text/plain; charset=ISO-8859-1 > > > > Our new cluster aims to have around 300 compute nodes. I was wondering > > what is the largest setup people have tested NFS with? Any tips or > > comments? There seems no way for me to say if it will scale well or > > not. > > > > I have been warned of performance hits but how bad will they be? > > Infiniband is touted as a solution but the economics don't work out. > > My question is this: > > > > Assume each of my compute nodes have gigabit ethernet AND I specify > > the switch such that it can handle full line capacity on all ports. > > Will there still be performance hits as I start adding compute nodes? > > Why? 
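As a rough illustration of that limit (numbers assumed for a typical 48-port edge switch with two 10 Gb stacking links, not any particular product):

# worst case: all 48 edge ports talking to nodes on other switches in the stack
echo "scale=1; (48*1) / (2*10)" | bc    # => 2.4, i.e. roughly 2.4:1 oversubscription on the stack links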
Or is it unrealistic to configure a switching setup with full > > line capacities on 300 ports? > > > > If not NFS then Lustre etc options do exist. But the more I read about > > those the more I am convinced that those open another big can of > > worms. Besides, if NFS would work I do not want to switch. > > > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Wed Sep 9 15:24:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 9 Sep 2009 15:24:57 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <20090909222457.GB11329@bx9.net> On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network. People build multi-thousand node clusters like that. In the HPC world, oil-and-gas and some other industrial applications don't need more than 1 gbit. Of couse these days they use 10 gbit to connect 1gbit switches. In the non-HPC world, companies with huge datacenters often have thousands to many thousands of nodes in a single layer-2 network with 1gbit to the nodes. That's limited only by the size of the mac addr tables in your switches. When you have to split your cluster into layer-3 chunks, the bandwidth between chunks sucks, but most clusters in the non-HPC world are limited to 2000-4000 nodes anyway by various software limitations. -- greg From skylar at cs.earlham.edu Wed Sep 9 16:15:35 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed, 09 Sep 2009 16:15:35 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <20090909222457.GB11329@bx9.net> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090909222457.GB11329@bx9.net> Message-ID: <4AA83717.6090207@cs.earlham.edu> Greg Lindahl wrote: > On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > > >> I wonder what would be the sensible biggest cluster possible based on >> 1GB Ethernet network. >> > > People build multi-thousand node clusters like that. In the HPC world, > oil-and-gas and some other industrial applications don't need more > than 1 gbit. Of couse these days they use 10 gbit to connect 1gbit > switches. > > In the non-HPC world, companies with huge datacenters often have > thousands to many thousands of nodes in a single layer-2 network with > 1gbit to the nodes. That's limited only by the size of the mac addr > tables in your switches. When you have to split your cluster into > layer-3 chunks, the bandwidth between chunks sucks, but most clusters > in the non-HPC world are limited to 2000-4000 nodes anyway by various > software limitations. > > How do people with large layer-2 networks deal with broadcast storms? I've been in situations even on small networks where a node goes haywire and starts spewing broadcast traffic and slows everything in its broadcast domain down. 
The probability and impact of that goes up with larger networks, and it seems like even the baseline chatter from ARP could be significant. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From Jaime at servepath.com Wed Sep 9 15:12:14 2009 From: Jaime at servepath.com (Jaime Requinton) Date: Wed, 9 Sep 2009 15:12:14 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA819B3.3050503@vcu.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <4AA819B3.3050503@vcu.edu> Message-ID: Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents... -----Original Message----- From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mike Davis Sent: Wednesday, September 09, 2009 2:10 PM To: psc Cc: beowulf at beowulf.org Subject: Re: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ... I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > > When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible, We use 144 and 160 port Gig e switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well. We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches. -- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University "Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From lindahl at pbm.com Wed Sep 9 17:25:25 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Wed, 9 Sep 2009 17:25:25 -0700 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? 
In-Reply-To: <4AA83717.6090207@cs.earlham.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090909222457.GB11329@bx9.net> <4AA83717.6090207@cs.earlham.edu> Message-ID: <20090910002525.GB3121@bx9.net> On Wed, Sep 09, 2009 at 04:15:35PM -0700, Skylar Thompson wrote: > How do people with large layer-2 networks deal with broadcast storms? You don't cause them, and you monitor for them. I haven't seen a wild transmitter in a while. Monitoring means you'll at least know it's there; quickly dignosing which node is the bad one can be challenging. More common is pilot error. Trying to talk to nodes which are down, for example, causes an arp every time. So don't do that too often. Modern L3 switches can't discard broadcast packets very quickly, which makes the issue much more challenging than you might think. -- greg From hahn at mcmaster.ca Wed Sep 9 20:59:22 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 9 Sep 2009 23:59:22 -0400 (EDT) Subject: [Beowulf] NFS server for a small cluster In-Reply-To: <393061252464806@webmail90.yandex.ru> References: <393061252464806@webmail90.yandex.ru> Message-ID: > I'm going to prepare NFS server for my small PS3-cluster. So, I've a > question: if I use 4 SATAII ports for 250 GB disks in software RAID0, is > the performance of CPU critical? raid0 copies data around, but is not particularly cpu-intensive (raid5 is worse; r6 worser). but the network is just gigabit, right? which means that the bandwidth is only about 1 disk work anyway (a modern disk sustains > 130 MB/s on the outer half...) > Witch one should I use Core Duo/1 MB cache > or Core2Duo/6 MB cache and how much of DDR2-800 RAM, 2 GB or 8 GB? extra ram doesn't help a fileserver much except when you can read-cache effectively (either file contents or metadata.) I wouldn't sweat cache size, and probably not ram size either (unless your working set really is < 8GB.) From skylar at cs.earlham.edu Wed Sep 9 21:10:23 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Wed, 09 Sep 2009 21:10:23 -0700 Subject: [Beowulf] NFS server for a small cluster In-Reply-To: References: <393061252464806@webmail90.yandex.ru> Message-ID: <4AA87C2F.9090109@cs.earlham.edu> Mark Hahn wrote: > extra ram doesn't help a fileserver much except when you can read-cache > effectively (either file contents or metadata.) I wouldn't sweat cache > size, and probably not ram size either (unless your working set really > is < 8GB.) > _______________________________________________ One thing extra RAM can buy a file server is aggregating writes into contiguous blocks. I'm not sure what the sweet spot as far as RAM size is, and I suspect strongly that the effect is dependent on the number and size of your files, so your YMMV. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature URL: From hahn at mcmaster.ca Wed Sep 9 22:11:38 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 10 Sep 2009 01:11:38 -0400 (EDT) Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: > Our new cluster aims to have around 300 compute nodes. I was wondering > what is the largest setup people have tested NFS with? Any tips or well, 300 is no problem at all. 
though if you're talking to a single Gb-connected server, you can't home for much BW per node... > comments? There seems no way for me to say if it will scale well or > not. it's not to hard to figure out some order-of-magnitude bandwidth requirements. how many nodes need access to a single namespace at once? do jobs drop checkpoints of a known size periodically? faster/more ports on a single NFS server gets you fairly far (hundreds of MB/s), but you can also agregate across multiple NFS servers (if you don't need all the IO in a single directory...) > I have been warned of performance hits but how bad will they be? NFS is fine at hundreds of nodes. nodes can generate a fairly high load of, for instance, getattr calls, but that can be mitigated some with an acregmin setting. > Infiniband is touted as a solution but the economics don't work out. depends on how much bandwidth you need... > Assume each of my compute nodes have gigabit ethernet AND I specify > the switch such that it can handle full line capacity on all ports. but why? your fileservers won't support saturating all nodes links at once, so why a full-bandwidth fabric? the fabric backbone only needs to match the capacity of the storage (I'd guess 10G would be reasonable, unless you really ramp up the number of fat fileservers.) or do you mean the fabric is full-bandwidth to optimally support MPI? > If not NFS then Lustre etc options do exist. But the more I read about yes - I wouldn't resort to Lustre until it was clear that NFS wouldn't do. Lustre does a great job of scaling content bandwidth and capacity all within a single namespace. but NFS, even several instances, is indeed a lot simpler... From hearnsj at googlemail.com Wed Sep 9 22:53:50 2009 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 10 Sep 2009 06:53:50 +0100 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <9f8092cc0909092253u2cee121fie3809899d06650ba@mail.gmail.com> As the guys say, it depends. NFS can be very successful on that size of cluster, if attention is paid to the performance of the NFS server and the bandwidth you have to it. SGI Enhanced NFS as a for instance. Please let me add to your list though - contact your Panasas salesman. I really think you could use a supported, commercial solution here. From hearnsj at googlemail.com Wed Sep 9 22:54:58 2009 From: hearnsj at googlemail.com (John Hearns) Date: Thu, 10 Sep 2009 06:54:58 +0100 Subject: [Beowulf] Nehalem EX Message-ID: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> Does anyone on the list have a feeling for how close Nehalem EX systems are? A bit of heat and light generated earlier int he year, but nothing being unveiled yet. From bill at cse.ucdavis.edu Wed Sep 9 23:24:13 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 09 Sep 2009 23:24:13 -0700 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA89B8D.2020403@cse.ucdavis.edu> Mark Hahn wrote: >> Our new cluster aims to have around 300 compute nodes. I was wondering >> what is the largest setup people have tested NFS with? Any tips or > > well, 300 is no problem at all. though if you're talking to a single > Gb-connected server, you can't home for much BW per node... True. 
Then again in the days of 20+ GB/sec memory systems, 8GB/sec pci-e busses, fast raid controllers and quad ethernet pci-e/motherboards there's not much reason to connect a single GigE to a file server these days. Hell even many of the cheap 1U systems come with quad ethernet. Sun ships in on many (all?) of their systems. I was just looking at a single socket xeon lynnfield board with 4 GigE from supermicro. I suspect the supermicro quad port motherboard costs less than an intel quad port pci-e card ($422 at newegg). If you have 4 48 port GigE switches connected via the interconnect cable it would make sense to give each switch a subnet and direct connect one GigE per switch. Of course file servers themselves are cheap, at anything much more than 16 disks I usually build multiple servers. In the comparisons I've made so far 3 16 disk servers compare rather well against a single 48 disk server. Usually cheaper as well, granted it does take 9U instead of 3-4U. Granted without a fancy parallel file system you can't load balance file servers or network links. Having more file servers and multiple uplink per server can certainly substantially improve your average throughput. With that said 3 16 disk servers make a good building block for a parallel file system of your choice if you change your mind later on. Often we have different groups of users contributing to a cluster and politically it's nice to separate them off on their own server that by definition get 100% of the disk they paid for, and their use of their uplink or disk doesn't effect the other file file servers. This gives said group options that they wouldn't have with a larger shared server. >> comments? There seems no way for me to say if it will scale well or >> not. > > it's not to hard to figure out some order-of-magnitude bandwidth > requirements. how many nodes need access to a single namespace at > once? do jobs drop checkpoints of a known size periodically? > faster/more ports on a single NFS server gets you fairly far (hundreds > of MB/s), but you can also agregate across multiple NFS > servers (if you don't need all the IO in a single directory...) Indeed, this kind of thing works quite well for us with 180-230 node clusters. >> Assume each of my compute nodes have gigabit ethernet AND I specify >> the switch such that it can handle full line capacity on all ports. > > but why? your fileservers won't support saturating all nodes links at > once, so why a full-bandwidth fabric? the fabric backbone only needs to Agreed, it's a waste unless MPI or related needs it. > match the capacity of the storage (I'd guess 10G would be reasonable, I've been looking at 10G for NFS/file servers (we already use sdr and ddr for MPI), but so far the cost seems to favor not putting too many disks in a single box and using more than one uplink. So instead of 1 48 disk file server with a 10G uplink we end up with 3 16 disk servers with 4 GigE uplinks each. It also avoids each individual file server not being as mission critical. I'd be a bit nervous if an expensive cluster depended on a single piece equipment. My theory goes if you have just one you spent too much on that piece of equipment. That way we leave the switch <-> switch connections entirely for MPI and not for trying to spread the single 10G connection for all I/O across 4 switches. > unless you really ramp up the number of fat fileservers.) > or do you mean the fabric is full-bandwidth to optimally support MPI? Hopefully. >> If not NFS then Lustre etc options do exist. 
But the more I read about > > yes - I wouldn't resort to Lustre until it was clear that NFS wouldn't do. > Lustre does a great job of scaling content bandwidth and capacity all > within a single namespace. but NFS, even several instances, is indeed > a lot simpler... Agreed. NFS works well in the 200 node range if you aren't too I/O intensive. Sometimes it's a bit more complicated because you end up staging to a local disk for better performance. But it's stable and keeps our users happy. I'm watching the parallel file system space closely and would certainly design around it if we ended up with a significantly larger (or more I/O intensive) cluster. From kilian.cavalotti.work at gmail.com Thu Sep 10 00:25:09 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 10 Sep 2009 09:25:09 +0200 Subject: [Beowulf] Nehalem EX In-Reply-To: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> References: <9f8092cc0909092254w48f1d94dk367cfb2630751c7@mail.gmail.com> Message-ID: Hi John, On Thu, Sep 10, 2009 at 7:54 AM, John Hearns wrote: > Does anyone on the list have a feeling for how close Nehalem EX systems are? > A bit of heat and light generated earlier int he year, but nothing > being unveiled yet. The original release date was somewhere in 2nd half of 2009, but it seems to have been pushed a little bit since, to the beginning of 2010. Nothing official, though. Cheers, -- Kilian From henning.fehrmann at aei.mpg.de Thu Sep 10 00:39:57 2009 From: henning.fehrmann at aei.mpg.de (Henning Fehrmann) Date: Thu, 10 Sep 2009 09:39:57 +0200 Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with? In-Reply-To: <4AA800B2.7060607@avalon.umaryland.edu> References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> Message-ID: <20090910073957.GA8487@gretchen.aei.mpg.de> Hi On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . Hmmm, may I cheat and use a 10Gb core switch? If you setup a cluster with few thousand nodes you have to ask yourself whether this network should be non-blocking or not. For a non blocking network you need the right core-switch technology. Unfortunately, there are not many vendors out there which provide non-blocking Ethernet based core switches but I am aware of at least two. One provides or will provide 144 10Gb Ethernet ports. Another one sells switches with more than 1000 1 GB ports. You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. It might be also possible to cross connect two of these core-switches with the help of some smaller switches so that one ends up with 288 10Gb ports and, in principle, one might connect 2880 nodes in a non-blocking way, but we did not have the possibility to test it successfully yet. One of problems is that the internal hash table can not store that many mac addresses. Anyway, one probably needs to change the mac addresses of the nodes to avoid an overflow of the hash tables. An overflow might cause arp storms. Once this works one runs into some smaller problems. One of them is the arp cache of the nodes. It should be adjusted to hold as many mac addresses as you have nodes in the cluster. 
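On Linux that adjustment is a couple of sysctls on the neighbour (ARP) table; a sketch with illustrative values for a few thousand hosts (pick thresholds comfortably above your node count):

# /etc/sysctl.conf on every node and file server, then apply with: sysctl -p
net.ipv4.neigh.default.gc_thresh1 = 4096     # below this many entries, no garbage collection at all
net.ipv4.neigh.default.gc_thresh2 = 8192     # soft limit
net.ipv4.neigh.default.gc_thresh3 = 16384    # hard limit on cached neighbour entries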
> And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". With NFS you can nicely test the setup. Use one NFS server and let all nodes write different files into it and look what happens. Cheers, Henning From rpnabar at gmail.com Thu Sep 10 08:19:50 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 10 Sep 2009 10:19:50 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <4AA7F962.500@scalableinformatics.com> References: <4AA7F962.500@scalableinformatics.com> Message-ID: On Wed, Sep 9, 2009 at 1:52 PM, Joe Landman wrote: > Rahul Nabar wrote: > >> If not NFS then Lustre etc options do exist. But the more I read about >> those the more I am convinced that those open another big can of >> worms. Besides, if NFS would work I do not want to switch. > > You might also want to look at GlusterFS. Thanks Joe! From all the reports and advice it seems I'll start with NFS and a good muscular storage device and then if NFS fails under load move to glusterfc, lustre etc. Per se there seems no way of knowing if NFS will keep up or not a priori. -- Rahul From rpnabar at gmail.com Thu Sep 10 08:32:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 10 Sep 2009 10:32:00 -0500 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: On Wed, Sep 9, 2009 at 2:12 PM, Bruno Coutinho wrote: > Because final NFS server bandwidth will be the bandwidth of the most limited > device, be it disk, network interface or switch. > Even if you have a switch capable of fill line capacity for all 300 nodes, > you must put a insanely fast interface in your NFS server and a giant pool > of disks to have a decent bandwidth if all nodes access NFS at the same > time. I'm thinking of having multiple 10GigE uplinks between the switch and the NFS server. The actual storage is planned to reside on a box of SAS disks. Approx 15 disks. THe NFS server is planned with at least two RAID cards with multiple SAS connections to the box. But that's just my planning. The question is do people have numbers. What I/O throughputs are your NFS devices giving? I want to get a feel for what my I/O performance envelope should be like. What kind of I/O gurrantees are available? Any vendors around want to comment? On the other hand just multiplying NFS clients by their peak bandwidth (300 x 1 GB) is an overkill. THat is a very unlikely situation. What are typical workloads like? Given x NFS mounts in a computational environment with a y GB uplink each what's the factor on the net loading of the central storage? Any back of the envelope numbers? > But depending on the way people run applications in your cluster, only a > small set of nodes will access NFS at the same time and a Ethernet 10Gb with > tens of disks will be enough. strace profiling shows that app1 has very little NFS I/O. App2 has about 10% runtime devoted to NFS I/O. Multiple seeks only. 
More reads than writes. (All this thanks to Jeff Layton's excellent strace analyzer and profiling help.)

-- Rahul

From rpnabar at gmail.com Thu Sep 10 08:36:25 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 10:36:25 -0500
Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com>
References: <571f1a060909091232j31eb3e7cldd72cdd46399d8ce@mail.gmail.com>
Message-ID:

On Wed, Sep 9, 2009 at 2:32 PM, Greg Kurtzer wrote: > NFS itself doesn't have any hard limits and I have seen clusters well > over a thousand nodes using it.

Thanks Greg! That is very reassuring to know! :) I myself had an installation with 256 NFS mounts but these were ancient clusters which were essentially "groups of single cpu PCs".

The "well over a 1000 node NFS clusters" that Greg refers to: any masters of such installations around on this list? If so I'd give an arm and a leg and more to be in touch and grab your tips and comments. Whenever I mention "300 nodes", "Gigabit Ethernet" and NFS in the same breath people look at me as if I was a madman. :)

> As an aside note, generally the more specialized or non-standard the > implementation, the more pressure you will put on administration > costs.

Exactly. Hence I want NFS to keep things simple, ergo cheap.

> Keep in mind that the requirements of the system and budget need to > define the architecture of the system. NFS is a good choice and can be > suitable for systems much larger then 300 nodes. *BUT* that would > depend on what you are doing with the cluster, application IO > requirements, usage patterns, user needs, reliability/uptime goals, > etc...

I see too many invocations of the "it depends" rule of HPC everywhere I go! :)

-- Rahul

From rpnabar at gmail.com Thu Sep 10 08:44:41 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 10:44:41 -0500
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID:

On Wed, Sep 9, 2009 at 3:38 PM, Greg Keller wrote: >> > "It all depends" -- Anonymous Cluster expert

Thanks Greg. And I hate that anonymous expert. He's the bane of my current existence. I even get nightmares with his ghastly face in them. :)

> I routinely run NFS with 300+ nodes, but "it all depends" on the > applications' IO profiles.

50% of projected runtime is with an application with negligible reads and writes (VASP). The other 50% goes to an app (DACAPO) which strace shows to be spending 10% of its runtime on I/O. Mostly seeks. More reads than writes. Multiple small reads and writes. All cores doing I/O, not a central master core.

>For example, Lot's of nodes reading and writing > different files in a generically staggered fashion,

How do you enforce the staggering? Do people write staggered I/O codes themselves? Or can one alleviate this problem by scheduler settings?

> Lustre or eventually pNFS if things get ugly. But not all NFS servers are > created equal, and a solid purpose built appliance may handle loads a > general purpose linux NFS server won't.

Disk array connected to generic Linux server? Or standalone Fileserver? Recommendations?

What exactly does a "solid purpose built appliance" offer that a Generic Linux server (well configured) connected to an array of disks does not offer?
> The bottleneck is more likely the File-server's Nic and/or it's Back-end > storage performance. If the file-server is 1GbE attached then having a > strong network won't help NFS all that much. 10GbE attached will keep up > with a fair number of raided disks on the back-end. Load the NFS server up > with a lot of RAM and you could keep a lot of nodes happy if they are > reading a common set of files in parallel.

Yup; I'm going for at least 24 GB RAM and twin 10 GigE cards connecting the file server to the switch.

-- Rahul

From landman at scalableinformatics.com Thu Sep 10 09:18:00 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Thu, 10 Sep 2009 12:18:00 -0400
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID: <4AA926B8.6060702@scalableinformatics.com>

Rahul Nabar wrote: >> Lustre or eventually pNFS if things get ugly. But not all NFS servers are >> created equal, and a solid purpose built appliance may handle loads a >> general purpose linux NFS server won't. > > Disk array connected to generic Linux server? Or standalone > Fileserver? Recommendations?

At least one company on this list sells some nice fast storage boxen. I am biased of course, as I work there ...

> What exactly does a "solid purpose built appliance" offer that a > Generic Linux server (well configured) connected to an array of disks > does not offer?

"It depends". Your off the shelf Linux servers aren't very well designed for high performance file service. You would either need to go to a special purpose-built server, or the pure purpose-built appliance boxen. The latter often have some additional features you may or may not find useful, at a price you may or may not be willing to pay. The former, depending upon whom you speak with, will provide excellent performance at reasonable prices for your use case.

>> The bottleneck is more likely the File-server's Nic and/or it's Back-end >> storage performance. If the file-server is 1GbE attached then having a >> strong network won't help NFS all that much. 10GbE attached will keep up >> with a fair number of raided disks on the back-end. Load the NFS server up >> with a lot of RAM and you could keep a lot of nodes happy if they are >> reading a common set of files in parallel. > > Yup; I'm going for at least 24 GB RAM and twin 10 GigE cards > connecting the file server to the switch.

FWIW: I didn't post it to this list when we did this, but we had a single client and server show 1 GB/s (954 MB/s really, I rounded up) over a single single-mode fibre running NFS.

"Who says you can't do Gigabyte per second NFS? I keep hearing this. It's not true though. See below.

NFS client: Scalable Informatics Delta-V (ΔV) 4 unit
NFS server: Scalable Informatics JackRabbit 4 unit.
(you can buy these units today from Scalable Informatics and its partners)
10GbE: single XFP fibre between two 10GbE NICs.

This is NOT a clustered NFS result.

root at dv4:~# mount | grep data2
10.1.3.1:/data on /data2 type nfs (rw,intr,rsize=262144,wsize=262144,tcp,addr=10.1.3.1)
root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d -v
N=32 gigabytes will be written in total each thread will output 8.000 gigabytes page size ... 4096 bytes number of elements per buffer ... 2097152 number of buffers per file ...
512 Thread=3: time = 33.665s IO bandwidth = 243.337 MB/s Thread=2: time = 33.910s IO bandwidth = 241.580 MB/s Thread=1: time = 34.262s IO bandwidth = 239.101 MB/s Thread=0: time = 34.244s IO bandwidth = 239.226 MB/s Naive linear bandwidth summation = 963.244 MB/s More precise calculation of Bandwidth = 956.404 MB/s " The machine running the code has 8GB of RAM, so writing 32 GB is far outside of its cache. The remote system, (the 10.1.3.1 unit) has a native local disk performance of about 1.6/2.0 GB/s read/write. So yes, with the right system, you can get a nice bit of performance out of it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From landman at scalableinformatics.com Thu Sep 10 09:32:22 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 10 Sep 2009 12:32:22 -0400 Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: Message-ID: <4AA92A16.3080401@scalableinformatics.com> Rahul Nabar wrote: > I'm thinking of having multiple 10GigE uplinks between the switch and > the NFS server. The actual storage is planned to reside on a box of > SAS disks. Approx 15 disks. THe NFS server is planned with at least > two RAID cards with multiple SAS connections to the box. ugh ... Why are you designing it ahead of time? Why not take your requirements and needs and use that to dictate the design? > But that's just my planning. The question is do people have numbers. > What I/O throughputs are your NFS devices giving? I want to get a Depending upon workload, you can get performance ranging from 100MB/s through GB+/s. > feel for what my I/O performance envelope should be like. What kind of > I/O gurrantees are available? Any vendors around want to comment? You want a guarantee of I/O performance? For an arbitrary I/O pattern and load? So if you suddenly start random seeking with 4kB reads, you still want to hit 1+GB/s with these 4kB random seek and reads? Not sure if anyone would be willing to guarantee a particular rate for any workload. We have found well known benchmark codes (bonnie++ 1.0x and some of 1.9x) doing not so good I/O (long OS based pauses) where other codes seem fine. We use our io-bm code, fio, and a few others to bang on our systems. fio lets us model per unit workloads fairly nicely, io-bm lets us create a system/cluster-wide I/O hammer. > On the other hand just multiplying NFS clients by their peak bandwidth > (300 x 1 GB) is an overkill. THat is a very unlikely situation. What Each 1Gb interface can move about 120MB/s best case. So 300x 120MB/s => 3.6E+4 MB/s . This is likely to be overkill, as you report your highest IO utilization is about 10% of CPU (need to get what that translates to in MB/s, I'd suggest installing iftop on that machine and measuring when it is doing its 10% time in IO). > are typical workloads like? Given x NFS mounts in a computational > environment with a y GB uplink each what's the factor on the net > loading of the central storage? Any back of the envelope numbers? In the distant past, we used 8 nodes per GbE port for a port on the NFS server. This allowed us to serve up to 32 nodes with 4GbE ports, and the NFS servers weren't badly loaded. This ratio is a function of utilization of the links, the I/O duty cycle, etc. 
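If you want a rough feel for your own envelope before committing to hardware, fio is an easy place to start. A trivial streaming-write job file looks something like this (the path and sizes are placeholders; make the total size several times the client's RAM so you aren't just measuring cache):

[nfs-stream-write]
directory=/mnt/nfs/test
rw=write
bs=1m
size=8g
numjobs=4
ioengine=psync
group_reporting

Run it with "fio jobfile", then flip rw=write to rw=randread and shrink bs to 4k to see the other end of the spectrum. Watching iftop on the server while that runs tells you what the network side is doing at the same time.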
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615

From rpnabar at gmail.com Thu Sep 10 09:56:02 2009
From: rpnabar at gmail.com (Rahul Nabar)
Date: Thu, 10 Sep 2009 11:56:02 -0500
Subject: [Beowulf] how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To: <4AA92A16.3080401@scalableinformatics.com>
References: <4AA92A16.3080401@scalableinformatics.com>
Message-ID:

On Thu, Sep 10, 2009 at 11:32 AM, Joe Landman wrote: > Rahul Nabar wrote: > >> I'm thinking of having multiple 10GigE uplinks between the switch and >> the NFS server. The actual storage is planned to reside on a box of >> SAS disks. Approx 15 disks. THe NFS server is planned with at least >> two RAID cards with multiple SAS connections to the box. > > ugh ... Why are you designing it ahead of time? Why not take your > requirements and needs and use that to dictate the design?

Oh no! I wasn't meaning that I was designing this. Sorry. I gave my vendor the specs, and these were the options that came up. He means to design each component to balance loads. But realistically, for 300 compute-nodes a config close to what I wrote is to be expected, he said. Just giving an idea of where my current expectations are and where we will probably end up.

-- Rahul

From kus at free.net Thu Sep 10 10:52:08 2009
From: kus at free.net (Mikhail Kuzminsky)
Date: Thu, 10 Sep 2009 21:52:08 +0400
Subject: [Beowulf] Re:recommendations for a good ethernet switch for connecting ~300 compute nodes
Message-ID:

Many quantum chemical programs like Gamess-US or Gaussian perform well with local I/O to local HDDs, in which case you'll have a small NFS load even on such a large cluster.

Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Institute of Organic Chemistry RAS Moscow

From Greg at keller.net Thu Sep 10 11:13:02 2009
From: Greg at keller.net (Greg Keller)
Date: Thu, 10 Sep 2009 13:13:02 -0500
Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance?
In-Reply-To:
References: <200909091900.n89J07U6031683@bluewest.scyld.com>
Message-ID: <38C4F654-C13D-41BC-B879-C94C3892E199@keller.net>

On Sep 10, 2009, at 10:44 AM, Rahul Nabar wrote: > On Wed, Sep 9, 2009 at 3:38 PM, Greg Keller wrote: >> For example, Lot's of nodes reading and writing >> different files in a generically staggered fashion, > > How do you enforce the staggering? Do people write staggered I/O codes > themselves? Or can one alleviate this problem by scheduler settings?

Although there's probably a way to enforce it at the app level, or scheduler, all of that would require specific knowledge of what jobs (and nodes) are accessing what files, how, at what time. I was thinking that if it's largely embarrassingly parallel jobs that start/stop independently and have somewhat randomized IO, then there is some natural staggering. If the app starts on all nodes simultaneously and then they all start reading/writing the same files nearly simultaneously, then staggering is probably impossible and a parallel FS is worth investigating.

> >> Lustre or eventually pNFS if things get ugly. But not all NFS >> servers are >> created equal, and a solid purpose built appliance may handle loads a >> general purpose linux NFS server won't. > > Disk array connected to generic Linux server?
Or standalone > Fileserver? Recommendations? > > What exactly does a "solid purpose built appliance" offer that a > Generic Linux server (well configured) connected to an array of disks > does not offer?

Joe's post is spot on here. Don't let legend and lore scare you off; NFS can do great things on current generic and special purpose servers with the right config and software. There's nothing in your configuration and usage summary that screams NFS killer to me. If you use generic or special purpose servers, you can repurpose them as part of a parallel FS if you need to.

Purpose built *appliances* generally give you:
Simple setup and admin GUI
Replication and other fancy features HPCC doesn't normally care about
Zero flexibility if you change course and head towards a parallel FS
A singular support channel to complain to if things go badly (YMMV)

None of those matter to me more than the money they cost, so I buy standard servers and run standard linux NFS on internal raid controllers with no HA, and have occasional crashes and issues I can't resolve cleanly. We are perpetually looking for a "next step" to get better support/stability, but it's good enough for our 300 and 600 node systems at the moment.

> > -- > Rahul

Cheers! Greg

-------------- next part -------------- An HTML attachment was scrubbed... URL:

From Jaime at servepath.com Wed Sep 9 19:01:00 2009
From: Jaime at servepath.com (Jaime Requinton)
Date: Wed, 9 Sep 2009 19:01:00 -0700
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?
In-Reply-To:
References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <4AA819B3.3050503@vcu.edu>
Message-ID:

Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents... Forgot to paste the link: http://www.bestbuy.com/site/olspage.jsp?skuId=8891915&type=product&id=1212192931527&ref=06&loc=01&ci_src=14110944&ci_sku=8891915

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jaime Requinton
Sent: Wednesday, September 09, 2009 3:12 PM
To: Mike Davis; psc
Cc: beowulf at beowulf.org
Subject: RE: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?

Can you use this switch? You won't lose a port for uplink since it has fiber and/or copper uplink ports. Just my 10 cents...

-----Original Message-----
From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Mike Davis
Sent: Wednesday, September 09, 2009 2:10 PM
To: psc
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?

psc wrote: > I wonder what would be the sensible biggest cluster possible based on > 1GB Ethernet network . And especially how would you connect those 1GB > switches together -- now we have (on one of our four clusters) Two 48 > ports gigabit switches connected together with 6 patch cables and I just > ran out of ports for expansion and wonder where to go from here as we > already have four clusters and it would be great to stop adding cluster > and start expending them beyond number of outlets on the switch/s .... > NFS and 1GB Ethernet works great for us and we want to stick with it , > but we would love to find a way how to overcome the current "switch > limitation". ...
> I heard that there are some "stackable switches" .. > in any case -- any idea , suggestion will be appreciated. > > thanks!! > psc > >

When we started running clusters in 2000 we made the decision to use a flat networking model and a single switch if at all possible. We use 144 and 160 port Gig-E switches for two of our clusters. The overall performance is better and the routing less complex. Larger switches are available as well.

We try to go with a flat model as well for Infiniband. Right now we are using a 96 port Infiniband switch. When we add additional nodes to that cluster we will either move up to a 144 or 288 port chassis. Running the numbers I found the cost of the large chassis to be on par with the extra switches required to network using 24 or 36 port switches.

-- Mike Davis Technical Director (804) 828-3885 Center for High Performance Computing jmdavis1 at vcu.edu Virginia Commonwealth University

"Never tell people how to do things. Tell them what to do and they will surprise you with their ingenuity." George S. Patton

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

_______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From john.hearns at mclaren.com Thu Sep 10 01:57:35 2009
From: john.hearns at mclaren.com (Hearns, John)
Date: Thu, 10 Sep 2009 09:57:35 +0100
Subject: [Beowulf] Forget station wagons loaded with tapes
Message-ID: <68A57CCFD4005646957BD2D18E60667B0D17AA6E@milexchmb1.mil.tagmclarengroup.com>

Forget the old chestnut of a station wagon loaded with tapes. In the 21st Century, solid state drives are king. Plus carrier pigeons. http://www.henriska.com/blog/?p=615

The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy.

From pscadmin at avalon.umaryland.edu Thu Sep 10 05:28:55 2009
From: pscadmin at avalon.umaryland.edu (psc)
Date: Thu, 10 Sep 2009 08:28:55 -0400
Subject: [Beowulf] how large can we go with 1GB Ethernet? / Re: how large of an installation have people used NFS, with?
In-Reply-To: <20090910073957.GA8487@gretchen.aei.mpg.de>
References: <200909091900.n89J07U5031683@bluewest.scyld.com> <4AA800B2.7060607@avalon.umaryland.edu> <20090910073957.GA8487@gretchen.aei.mpg.de>
Message-ID: <4AA8F107.7010805@avalon.umaryland.edu>

Thank you all for the answers. Would you guys please share with me some good brands of those 200+ port 1GB Ethernet switches?

I think I'll leave our current clusters alone, but the new cluster I will design for about 500 to 1000 nodes --- I don't think that we will go much above that, since for big jobs our scientists use outside resources. We do all our calculations and analysis on the nodes and only send the final product to the frontend. Also, we don't run jobs across the nodes, so I don't need to get too creative with the network, besides being sure that I can expand the cluster without having the switches as a limitation (our current situation).

Thank you again!
Henning Fehrmann wrote: > Hi > > On Wed, Sep 09, 2009 at 03:23:30PM -0400, psc wrote: > >> I wonder what would be the sensible biggest cluster possible based on >> 1GB Ethernet network . >> > > Hmmm, may I cheat and use a 10Gb core switch? > > If you setup a cluster with few thousand nodes you have to ask yourself > whether this network should be non-blocking or not. > > For a non blocking network you need the right core-switch technology. > Unfortunately, there are not many vendors out there which provide > non-blocking Ethernet based core switches but I am aware of at least > two. One provides or will provide 144 10Gb Ethernet ports. Another one > sells switches with more than 1000 1 GB ports. > You could buy edge-switches with 4 10Gb uplinks and 48 1GB ports. If > you just use 40 of them you end up with a 1440 non-blocking 1Gb ports. > > It might be also possible to cross connect two of these core-switches > with the help of some smaller switches so that one ends up with 288 > 10Gb ports and, in principle, one might connect 2880 nodes in a > non-blocking way, but we did not have the possibility to test it > successfully yet. One of problems is that the internal hash table can > not store that many mac addresses. Anyway, one probably needs to change > the mac addresses of the nodes to avoid an overflow of the hash tables. > An overflow might cause arp storms. > > Once this works one runs into some smaller problems. One of them is the arp > cache of the nodes. It should be adjusted to hold as many mac addresses > as you have nodes in the cluster. > > > >> And especially how would you connect those 1GB >> switches together -- now we have (on one of our four clusters) Two 48 >> ports gigabit switches connected together with 6 patch cables and I just >> ran out of ports for expansion and wonder where to go from here as we >> already have four clusters and it would be great to stop adding cluster >> and start expending them beyond number of outlets on the switch/s .... >> NFS and 1GB Ethernet works great for us and we want to stick with it , >> but we would love to find a way how to overcome the current "switch >> limitation". >> > > With NFS you can nicely test the setup. Use one NFS server and let all > nodes write different files into it and look what happens. > > Cheers, > Henning > From orion at cora.nwra.com Thu Sep 10 15:44:32 2009 From: orion at cora.nwra.com (Orion Poplawski) Date: Thu, 10 Sep 2009 16:44:32 -0600 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo In-Reply-To: <20090904081722.GN4508@leitl.org> References: <20090904081722.GN4508@leitl.org> Message-ID: <4AA98150.3050505@cora.nwra.com> On 09/04/2009 02:17 AM, Eugen Leitl wrote: > > http://www.c0t0d0s0.org/archives/5899-Some-perspective-to-this-DIY-storage-server-mentioned-at-Storagemojo.html > > Some perspective to this DIY storage server mentioned at Storagemojo > > Thursday, September 3. 2009 > > I've received yesterday some mails/tweets with hints to a "Thumper for poor" > DIY chassis. Those mails asked me for an opinion towards this piece of > hardware and if it's a competition to our X4500/X4540. I'm waiting for the day I can buy a X4540 without disks. Or maybe it will still be too expensive... 
-- Orion Poplawski Technical Manager 303-415-9701 x222 NWRA/CoRA Division FAX: 303-415-9702 3380 Mitchell Lane orion at cora.nwra.com Boulder, CO 80301 http://www.cora.nwra.com From eugen at leitl.org Fri Sep 11 02:30:08 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 11 Sep 2009 11:30:08 +0200 Subject: [Beowulf] Some perspective to this DIY storage server mentioned at Storagemojo In-Reply-To: <4AA98150.3050505@cora.nwra.com> References: <20090904081722.GN4508@leitl.org> <4AA98150.3050505@cora.nwra.com> Message-ID: <20090911093008.GU9828@leitl.org> On Thu, Sep 10, 2009 at 04:44:32PM -0600, Orion Poplawski wrote: > I'm waiting for the day I can buy a X4540 without disks. Or maybe it > > will still be too expensive... I've also become sufficiently annoyed at Sun for their unwillingness or inability to ship hotplug drive carriers without (premium-priced) drives in them to switch to Supermicro. Below is maybe not Thumper, but you can put 2 TByte drives at 100 EUR/TByte costs into them yourself, giving you 50 TByte raw storage (or 48 TByte raw storage, and 4x 2.5" SSDs for hybrid storage ZIL/L2ARC cache with Opensolaris) in 4U of rack space. http://www.supermicro.com/products/nfo/chassis_storage.cfm [...] SC848A - 24 Hot-swap HDDs in 4U (Quad Motherboard Support) Optimized for enterprise-level heavy-capacity storage applications, Supermicro's SC848 Chassis supports 4-way serverboards that demand high volume I/O or computational usage and features 24 hot-swap 3.5" SAS/SATA hard drive trays and 2 fixed internal hard drive bays in a 4U space. The SC848 design offers maximum HDD per space ratio in a 4U form factor, high power efficiency (up to 88%) with (2+1) redundant 1800W power supply, optimized HDD signal trace routing and improved HDD tray design to dampen HDD vibrations and maximize performance. SC846A/TQ/E1/E2 - 24 Hot-swap HDDs in 4U Optimized for enterprise-level heavy-capacity storage applications, Supermicro's SC846 Chassis features 24 hot-swap 3.5" SAS/SATA hard drive trays and 2 fixed internal hard drive bays in a 4U space. The SC846 design offers maximum HDD per space ratio in a 4U form factor, high power efficiency, optimized HDD signal trace routing and improved HDD tray design to dampen HDD vibrations and maximize performance. Equipped with a 900W or Gold Level 1200W (93%+) high-efficiency redundant power supply and 5 hot-plug redundant cooling fans, the SC846 is a reliable and maintenance-free storage workhorse system. From rpnabar at gmail.com Sat Sep 12 08:10:43 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 10:10:43 -0500 Subject: [Beowulf] filesystem metadata mining tools Message-ID: As the number of total files on our server was exploding (~2.5 million / 1 Terabyte) I wrote a simple shell script that used find to tell me which users have how many. So far so good. But I want to drill down more: *Are there lots of duplicate files? I suspect so. Stuff like job submission scripts which users copy rather than link etc. (fdupes seems puny for a job of this scale) *What is the most common file (or filename) *A distribution of filetypes (executibles; netcdf; movies; text) and prevalence. *A distribution of file age and prevelance (to know how much of this material is archivable). Same for frequency of access; i.e. maybe the last access stamp. * A file size versus number plot. i.e. Is 20% of space occupied by 80% of files? etc. 
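The crudest thing I can think of is to extend that find into a one-pass metadata dump and then slice it with awk, something like this (untested sketch, paths made up):

find /home -type f -printf '%U %s %T@ %A@ %p\n' > /tmp/fsmeta.txt

# files and GB per user
awk '{n[$1]++; b[$1]+=$2} END {for (u in n) printf "%s: %d files, %.1f GB\n", u, n[u], b[u]/2^30}' /tmp/fsmeta.txt

# how much of the space sits in files over 100MB
awk '{t+=$2; if ($2 > 1e8) big+=$2} END {printf "%.0f%% of bytes in files >100MB\n", 100*big/t}' /tmp/fsmeta.txt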
I've used cushion plots in the past (sequiaview; pydirstat) but those seem more desktop oriented than suitable for a job like this. Essentially I want to data mine my file usage to strategize. Are there any tools for this? Writing a new find each time seems laborious. I suspect forensics might also help identify anomalies in usage across users which might be indicative of other maladies. e.g. a user who had a runaway job write a 500GB file etc. Essentially are there any "filesystem metadata mining tools"? -- Rahul From skylar at cs.earlham.edu Sat Sep 12 11:34:25 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 12 Sep 2009 11:34:25 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <4AABE9B1.9020502@cs.earlham.edu> Rahul Nabar wrote: > As the number of total files on our server was exploding (~2.5 million > / 1 Terabyte) I > wrote a simple shell script that used find to tell me which users have how > many. So far so good. > > But I want to drill down more: > > *Are there lots of duplicate files? I suspect so. Stuff like job submission > scripts which users copy rather than link etc. (fdupes seems puny for > a job of this scale) > > *What is the most common file (or filename) > > *A distribution of filetypes (executibles; netcdf; movies; text) and > prevalence. > > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > files? etc. > > I've used cushion plots in the past (sequiaview; pydirstat) but those > seem more desktop oriented than suitable for a job like this. > > Essentially I want to data mine my file usage to strategize. Are there any > tools for this? Writing a new find each time seems laborious. > > I suspect forensics might also help identify anomalies in usage across > users which might be indicative of other maladies. e.g. a user who had a > runaway job write a 500GB file etc. > > Essentially are there any "filesystem metadata mining tools"? > > What OS is this on? If you have dtrace available you can use that to at least gather data on new files coming in, which could reduce your search scope considerably. It obviously doesn't directly answer your question, but it might make it easier to use the existing tools. Depending on what filesystem you have you might be able to query the filesystem itself for this data. On GPFS, for instance, you can write a policy that would move all files older than, say, three months to a different storage pool. You can then run that policy in a preview mode to see what files would have been moved. The policy scan on GPFS is quite a bit faster than running a find against the entire filesystem, so it's a definite win. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 251 bytes Desc: OpenPGP digital signature URL: From coutinho at dcc.ufmg.br Sat Sep 12 12:59:57 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Sat, 12 Sep 2009 16:59:57 -0300 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <4AABE9B1.9020502@cs.earlham.edu> References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: This tool do can do part of what you want: http://www.chiark.greenend.org.uk/~sgtatham/agedu/ This display files by size and color file by type. http://gdmap.sourceforge.net/ Perhaps agedu can handle large subsets of your files, but gdmap is desktop oriented. 2009/9/12 Skylar Thompson > Rahul Nabar wrote: > > As the number of total files on our server was exploding (~2.5 million > > / 1 Terabyte) I > > wrote a simple shell script that used find to tell me which users have > how > > many. So far so good. > > > > But I want to drill down more: > > > > *Are there lots of duplicate files? I suspect so. Stuff like job > submission > > scripts which users copy rather than link etc. (fdupes seems puny for > > a job of this scale) > > > > *What is the most common file (or filename) > > > > *A distribution of filetypes (executibles; netcdf; movies; text) and > > prevalence. > > > > *A distribution of file age and prevelance (to know how much of this > > material is archivable). Same for frequency of access; i.e. maybe the > last > > access stamp. > > > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > > files? etc. > > > > I've used cushion plots in the past (sequiaview; pydirstat) but those > > seem more desktop oriented than suitable for a job like this. > > > > Essentially I want to data mine my file usage to strategize. Are there > any > > tools for this? Writing a new find each time seems laborious. > > > > I suspect forensics might also help identify anomalies in usage across > > users which might be indicative of other maladies. e.g. a user who had a > > runaway job write a 500GB file etc. > > > > Essentially are there any "filesystem metadata mining tools"? > > > > > What OS is this on? If you have dtrace available you can use that to at > least gather data on new files coming in, which could reduce your search > scope considerably. It obviously doesn't directly answer your question, > but it might make it easier to use the existing tools. > > Depending on what filesystem you have you might be able to query the > filesystem itself for this data. On GPFS, for instance, you can write a > policy that would move all files older than, say, three months to a > different storage pool. You can then run that policy in a preview mode > to see what files would have been moved. The policy scan on GPFS is > quite a bit faster than running a find against the entire filesystem, so > it's a definite win. > > -- > -- Skylar Thompson (skylar at cs.earlham.edu) > -- http://www.cs.earlham.edu/~skylar/ > > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From james.p.lux at jpl.nasa.gov Sat Sep 12 16:02:10 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Sat, 12 Sep 2009 16:02:10 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: Message-ID: On 9/12/09 8:10 AM, "Rahul Nabar" wrote: > As the number of total files on our server was exploding (~2.5 million > / 1 Terabyte) I > wrote a simple shell script that used find to tell me which users have how > many. So far so good. > > But I want to drill down more: > > *Are there lots of duplicate files? I suspect so. Stuff like job submission > scripts which users copy rather than link etc. (fdupes seems puny for > a job of this scale) > > *What is the most common file (or filename) > > *A distribution of filetypes (executibles; netcdf; movies; text) and > prevalence. > > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. > > * A file size versus number plot. i.e. Is 20% of space occupied by 80% of > files? etc. > Another useful application for such a tool would be to get better KLOC counts of source code trees. I find that our trees have lots of duplication among branches (e.g. Everyone has a "test.c" for unit test in with their modules, and all of them are pretty similar) From rpnabar at gmail.com Sat Sep 12 18:19:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 20:19:14 -0500 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: On Sat, Sep 12, 2009 at 2:59 PM, Bruno Coutinho wrote: > This tool do can do part of what you want: > http://www.chiark.greenend.org.uk/~sgtatham/agedu/ > > This display files by size and color file by type. > http://gdmap.sourceforge.net/ Thanks Bruno! agedu sounds very promising. I just installed it. Just remains to be seen how well it scales for large filesystems. I was using gdmap before but it is very Desktop oriented, you are right. -- Rahul From rpnabar at gmail.com Sat Sep 12 18:22:10 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Sat, 12 Sep 2009 20:22:10 -0500 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <4AABE9B1.9020502@cs.earlham.edu> References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: On Sat, Sep 12, 2009 at 1:34 PM, Skylar Thompson wrote: > What OS is this on? Thanks Skylar! Linux. RedHat. >If you have dtrace available you can use that to at I don't but let me try to install it. > Depending on what filesystem you have ext3 I need to see if this can be queried. -- Rahul From skylar at cs.earlham.edu Sat Sep 12 19:22:11 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sat, 12 Sep 2009 19:22:11 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: <4AABE9B1.9020502@cs.earlham.edu> Message-ID: <4AAC5753.9040301@cs.earlham.edu> Rahul Nabar wrote: > On Sat, Sep 12, 2009 at 1:34 PM, Skylar Thompson wrote: > >> What OS is this on? >> > > Thanks Skylar! Linux. RedHat. > >> If you have dtrace available you can use that to at >> > > I don't but let me try to install it. > I believe there's an alpha/beta-level release of both a dtrace kernel module and the user-space tools for Linux. There's also SystemTap which should give you similar data. I've only used dtrace on Solaris and haven't used SystemTap at all, so YMMV. >> Depending on what filesystem you have >> > > ext3 > > I need to see if this can be queried. 
> I don't think so, but you might be able to accomplish the same thing at the application level. If you have a limited set of applications, you could have them write the metadata you need into a database as they create and update files. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From stuartb at 4gh.net Fri Sep 11 12:39:48 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Fri, 11 Sep 2009 15:39:48 -0400 (EDT) Subject: [Beowulf] Intra-cluster security Message-ID: We are working with a couple small clusters (6-8 nodes) and will soon be working with some much larger cluster/supercomputer systems. We are currently using SGE 6.2 for job queuing. We use kerberos for authentication and ssh for system access. What are peoples thoughts about secure communications between the nodes of a cluster? I see a cluster as a single computational resource and would like to see flexibility of communications between the nodes of the cluster. There seem to be a couple of approaches: - Old style rsh/rlogin. Not acceptable for me. - Kerberos with ssh works fine for interactive users, but doesn't seem to translate well to a queuing environment. Or am I missing something? - Each user creates a password-less ssh private key, puts the public key in the authorized_hosts file and has relatively unfettered ssh access between nodes (nfs shared home directory helps a lot). This seems to be the most common approach. It is end-user setup/training intensive (I suppose it could be automated/audited). I consider it dangerous to encourage use of password-less ssh keys. - It looks like SGE has some new functionality for using certificates and its own certificate authority. I haven't looked closely at this yet. It looks like each user has a password-less private certificate and the authorization comes from not having the certificate revoked. This seems almost equivalent to the password-less ssh key solution. - It looks like I can configure the cluster systems to handle local ssh transparently. This would involve setting setuid/setgid on ssh, building cluster wide authorized_keys files and other things. I haven't studied this closely but there are a few references available (http://www.snailbook.com/faq/trusted-host-howto.auto.html among others). I favor this last solution as being the most user transparent. I find is surprising that none of the cluster distributions seem to use this method. I would like some feedback as to how well this works in practice and whether there are any obvious or non-obvious gotchas people might have already encountered. Thanks, Stuart Barkley -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From hearnsj at googlemail.com Sun Sep 13 01:07:56 2009 From: hearnsj at googlemail.com (John Hearns) Date: Sun, 13 Sep 2009 09:07:56 +0100 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <9f8092cc0909130107sf75f2c5n90cc6454d0e47078@mail.gmail.com> 2009/9/11 Stuart Barkley : > > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). ?This > seems to be the most common approach. ?It is end-user setup/training > intensive (I suppose it could be automated/audited). 
I consider it > dangerous to encourage use of password-less ssh keys. Yes, I would agree this is the most common approach. You can automate it by having a script which runs when you first login to the cluster (Oscar does this). You can also use shosts trusts. A script which loops through cluster nodes and runs an ssh-keyscan is useful. Re. security its the armadillo principle - hard on the outside, soft on the inside. From glykos at mbg.duth.gr Sun Sep 13 02:46:47 2009 From: glykos at mbg.duth.gr (Nicholas M Glykos) Date: Sun, 13 Sep 2009 12:46:47 +0300 (EEST) Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: Hi Stuart, > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). This > seems to be the most common approach. It is end-user setup/training > intensive (I suppose it could be automated/audited). A quick note to say that in the case of the perceus/warewulf/slurm combination as distributed with CaosNSA, you not only get the automation you've mentioned, but you can also restrict user access to individual nodes (this is through a pam module for slurm that only allows ssh access to those nodes that a user has active jobs on). Nicholas -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/ From nixon at nsc.liu.se Sun Sep 13 03:31:41 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Sun, 13 Sep 2009 12:31:41 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: (Stuart Barkley's message of "Fri\, 11 Sep 2009 15\:39\:48 -0400 \(EDT\)") References: Message-ID: Stuart Barkley writes: > - Kerberos with ssh works fine for interactive users, but doesn't seem > to translate well to a queuing environment. Or am I missing > something? It's quite possible to use, but you do get a ticket expiry problem. > - Each user creates a password-less ssh private key, puts the public > key in the authorized_hosts file and has relatively unfettered ssh > access between nodes (nfs shared home directory helps a lot). This > seems to be the most common approach. Yes, this is common. And a really, really BAD IDEA. Do not do this. Bad, bad, BAD. > I consider it dangerous to encourage use of password-less ssh keys. Yes, very much so. And your users will discover that they can copy that passphrase-less private key to their personal workstation and get password-less access to the cluster. (Yes, they will.) And then the key will get stolen. (Yes, it will.) And then you get http://www.us-cert.gov/current/archive/2008/09/08/archive.html#ssh_key_based_attacks Of course, you can disallow ssh key authentication from external machines to mitigate the problem, but that's just a band-aid for a mis-engineered system. In case I didn't come across clearly, let me repeat that: DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! (There are some exceptions, of course, like when you want to run things in batch from cron, and similar. But then you must, must, must use proper limitations for that key in authorized_keys.) > - It looks like I can configure the cluster systems to handle local > ssh transparently. This would involve setting setuid/setgid on ssh, > building cluster wide authorized_keys files and other things. 
I > haven't studied this closely but there are a few references available > (http://www.snailbook.com/faq/trusted-host-howto.auto.html among > others). This is the way to go. All our systems are set up this way. Works just fine. You just need a mechanism for maintaining host keys and ssh_known_hosts. (And remember that this doesn't work for root - you need separately set up ~root/.shosts and ~root/.ssh/known_hosts if you want it.) Oh, and DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! Do the Internet a service and scan your users' home directories for passphrase-less private ssh keys. This is as easy as running # grep -L ENCRYPTED /home/*/.ssh/id_?sa Delete all such keys that don't have a good reason for existence. (Yes, we do so on all our systems.) -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From ashley at pittman.co.uk Sun Sep 13 04:00:33 2009 From: ashley at pittman.co.uk (Ashley Pittman) Date: Sun, 13 Sep 2009 12:00:33 +0100 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <1252839633.3887.0.camel@alpha> On Sat, 2009-09-12 at 10:10 -0500, Rahul Nabar wrote: > *A distribution of file age and prevelance (to know how much of this > material is archivable). Same for frequency of access; i.e. maybe the last > access stamp. I thought access stamps were a thing of the past and everyone ran with "noatime" these days? Ashley. -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk From reuti at staff.uni-marburg.de Sun Sep 13 04:18:02 2009 From: reuti at staff.uni-marburg.de (Reuti) Date: Sun, 13 Sep 2009 13:18:02 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: Hi, Am 11.09.2009 um 21:39 schrieb Stuart Barkley: > We are working with a couple small clusters (6-8 nodes) and will soon > be working with some much larger cluster/supercomputer systems. We > are currently using SGE 6.2 for job queuing. We use kerberos for > authentication and ssh for system access. > > What are peoples thoughts about secure communications between the > nodes of a cluster? I see a cluster as a single computational > resource and would like to see flexibility of communications between > the nodes of the cluster. > > There seem to be a couple of approaches: > > - Old style rsh/rlogin. Not acceptable for me. I wouldn't be concerned by this per se, but I also disabled it (in / etc/xinetd.d/rsh) because of another reason: - I setup ssh hostbased authentication, but limit it to admin staff with "AllowGroups admin". The ssh-keysign has to be set suid as you mention below, I can also send you a rough outline of the necessary steps. - Users can login to any node by using an interactive job in SGE. For this special interactive queue I set h_cpu=60, hence they can't abuse it. 
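The rough outline of the hostbased setup in the first point is something like this (file locations differ between distributions, so treat it as a sketch):

- in /etc/ssh/sshd_config on all nodes: HostbasedAuthentication yes (plus the AllowGroups restriction mentioned above)
- an shosts.equiv file (commonly /etc/shosts.equiv) on all nodes, listing every node, one hostname per line
- in /etc/ssh/ssh_config on all nodes: HostbasedAuthentication yes and EnableSSHKeysign yes
- ssh-keysign made suid root, e.g. chmod u+s /usr/lib/ssh/ssh-keysign (the path varies)
- all host keys collected into /etc/ssh/ssh_known_hosts, e.g. with ssh-keyscan over the node list

Back to the interactive jobs: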
SGE in turn will either use: a) traditional rsh, but on a random port selected by SGE (this I have right now) b) a "builtin" method in the newer versions of SGE c) plain ssh (but this will lack correct accounting), directed to a special sshd_config (because of the AllowGroups rule in the default one) d) ssh with a recompiled SGE with -tight-ssh flag for correct accounting If you have more than one queue in addition to this interactive queue on each system, you can't limit the maximum number of slots any longer in the exechost definition (as than the interactive queue to peek around would also be taken into account), but set it up in an RQS (resource quota set) like: limit queues !login.q hosts {@dualquad} to slots=8 -- Reuti PS: I saw on Debian, that their "su" will not only set the user, but also remove any imposed cpu time limit by making an su to oneself. For the SGE queue to impose the h_cpu limit this must be disabled then. I don't know, whether this is still the Debian default. From landman at scalableinformatics.com Sun Sep 13 07:06:40 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 13 Sep 2009 10:06:40 -0400 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <4AACFC70.6040205@scalableinformatics.com> I started writing a long response to this, decrying security theatre in the face of real issues, but thought better of it. Much shorter version with free advice. Leif Nixon wrote: > Stuart Barkley writes: > >> - Kerberos with ssh works fine for interactive users, but doesn't seem >> to translate well to a queuing environment. Or am I missing >> something? > > It's quite possible to use, but you do get a ticket expiry problem. > >> - Each user creates a password-less ssh private key, puts the public >> key in the authorized_hosts file and has relatively unfettered ssh >> access between nodes (nfs shared home directory helps a lot). This >> seems to be the most common approach. > > Yes, this is common. And a really, really BAD IDEA. Do not do this. Bad, > bad, BAD. > >> I consider it dangerous to encourage use of password-less ssh keys. > > Yes, very much so. And your users will discover that they can copy that > passphrase-less private key to their personal workstation and get > password-less access to the cluster. (Yes, they will.) And then the key > will get stolen. (Yes, it will.) And then you get > > http://www.us-cert.gov/current/archive/2008/09/08/archive.html#ssh_key_based_attacks I won't fisk this, other than to note most of the exploits we have cleaned up for our customers, have been windows based attack vectors. Contrary to the implication here, the ssh-key attack vector, while a risk, isn't nearly as dangerous as others, in active use, out there. http://www.darknet.org.uk/2008/08/puttyhijack-v10-hijack-sshputty-connections-on-windows/ Real security is security in depth. Its understanding real risks, and mitigating the same, or making the downside of the compromise as small as possible. Leif had a suggestion further down about careful management of keys, that is eminently reasonable. You don't leave your house keys under the door mat, if you care about security that is. Same principle applies here. Fake security, aka security theatre (c.f. http://en.wikipedia.org/wiki/Security_theater ) are things you get when people want to seem like they are doing something, even if the thing doesn't help, or worse, gives you a false sense of security. See every anti-virus/anti-phishing package out there for windows. 
If you think you are safe because you are running them, you are sadly mistaken. I'd argue that security theatre is more dangerous than the real threats. Threats can be mitigated. The danger is in using theatrics and pronouncements rather than practical measures. As John Hearns pointed out, hard on the outside soft on the inside. Doesn't help with clouds, though you can do IPsec to IPsec bridging of virtual private clusters (we do this for our customers). Assume multiple attack vectors, and that the bad guys and gals are going for your weak links. You need a realistic assessment of what your weak links are, they will be exploited. Most IT managers are fearful of this conversation, many are patently in denial about it. Regardless, the successful attacks we have seen and cleaned up after all came from *inside* organizations. Where they have been thwarted, has been due to other good practices. Where they have been successful, they have had success due to very very bad practices. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From reuti at Staff.Uni-Marburg.DE Sun Sep 13 08:03:37 2009 From: reuti at Staff.Uni-Marburg.DE (Reuti) Date: Sun, 13 Sep 2009 17:03:37 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <2194C7C3-B708-49B9-951D-644FE86839FF@staff.uni-marburg.de> Am 13.09.2009 um 12:31 schrieb Leif Nixon: > > This is the way to go. All our systems are set up this way. Works just > fine. You just need a mechanism for maintaining host keys and > ssh_known_hosts. (And remember that this doesn't work for root - you > need separately set up ~root/.shosts and ~root/.ssh/known_hosts if you > want it.) > > Oh, and DO NOT USE PASSPHRASE-LESS PRIVATE KEYS! > > Do the Internet a service and scan your users' home directories for > passphrase-less private ssh keys. This is as easy as running > > # grep -L ENCRYPTED /home/*/.ssh/id_?sa > > Delete all such keys that don't have a good reason for existence. > (Yes, > we do so on all our systems.) I agree. And to have it still convenient between multiple clusters I guide my students to use just one passphrase protected key and an ssh- agent in additions. There is nice Howto about it: http://unixwiz.net/techtips/ssh-agent-forwarding.html But: even with a passphrase the ssh-key should be protected as much as possible. Once someone has the private key, any offline brute- force to get the passphrase won't take long I fear. They could just try to recreate the public part of the key with: ssh-keygen -y which is completely offline, as this will also need the passphrase to be entered. -- Reuti From skylar at cs.earlham.edu Sun Sep 13 09:48:23 2009 From: skylar at cs.earlham.edu (Skylar Thompson) Date: Sun, 13 Sep 2009 09:48:23 -0700 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: <1252839633.3887.0.camel@alpha> References: <1252839633.3887.0.camel@alpha> Message-ID: <4AAD2257.407@cs.earlham.edu> Ashley Pittman wrote: > On Sat, 2009-09-12 at 10:10 -0500, Rahul Nabar wrote: > >> *A distribution of file age and prevelance (to know how much of this >> material is archivable). Same for frequency of access; i.e. maybe the last >> access stamp. >> > > I thought access stamps were a thing of the past and everyone ran with > "noatime" these days? > > Ashley. 
> > Are there any studies showing the overhead of atime updates? I've heard anecdotal evidence saying that it makes a big difference, and the same going the other way. FWIW, you can disable atime updates on a per-file basis in Linux and FreeBSD using chattr. -- -- Skylar Thompson (skylar at cs.earlham.edu) -- http://www.cs.earlham.edu/~skylar/ -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 260 bytes Desc: OpenPGP digital signature URL: From nixon at nsc.liu.se Sun Sep 13 10:58:23 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Sun, 13 Sep 2009 19:58:23 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: <4AACFC70.6040205@scalableinformatics.com> (Joe Landman's message of "Sun\, 13 Sep 2009 10\:06\:40 -0400") References: <4AACFC70.6040205@scalableinformatics.com> Message-ID: Joe Landman writes: > I won't fisk this, other than to note most of the exploits we have > cleaned up for our customers, have been windows based attack vectors. > Contrary to the implication here, the ssh-key attack vector, while a > risk, isn't nearly as dangerous as others, in active use, out there. I'm really hoping you aren't accusing me of security theatre. This may be a case of differences between user communitites - while I have seen one or maybe two cases where windows-related attacks were involved, I have seen dozens and dozens of cases where ssh key theft was involved. I have a blacklist of literally hundreds of stolen ssh keys from a very large number of sites, and I dearly miss a key revocation mechanism in ssh. We try to educate our users to use either a good strong password or to use ssh keys together with the ssh agent and agent forwarding, so that the private key never needs to leave the user's personal workstation. > Fake security, aka security theatre (c.f. > http://en.wikipedia.org/wiki/Security_theater ) are things you get > when people want to seem like they are doing something, even if the > thing doesn't help, or worse, gives you a false sense of security. See > every anti-virus/anti-phishing package out there for windows. If you > think you are safe because you are running them, you are sadly > mistaken. And on our side of the fence, we get things like Trusted IRIX, with a really elaborate, checkbox-compliant permissions system. Of course, since it was built on IRIX, any serious attacker would cut through it like a hot knife through molten butter, but there obviously wasn't a checkbox for that. -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility From bill at cse.ucdavis.edu Sun Sep 13 11:34:13 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Sun, 13 Sep 2009 11:34:13 -0700 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: Message-ID: <4AAD3B25.8090900@cse.ucdavis.edu> Stuart Barkley wrote: > - Each user Very dangerous way to say it. Ideally you do everything possible to minimize the work of the user, that way they can't get it wrong. > creates a password-less ssh private key, puts the public I'm a fan of password-less private keys. Before the screaming begins, let me explain the wrong way to do it. Rocks creates a password-less key for each user, plops it in ~/.ssh. Unfortunately they seem very resistant to suggestions on fixing this. 
The main problem is that if someone leaves themselves logged in, someone else could slurp the private key and have access forever, even if the user tries to be security conscious and uses a large passphrase that they keep secure. You could however point the compute nodes to a different keystore which the head node does not look at. That way even if stolen it doesn't get you cluster access.
> key in the authorized_hosts file and has relatively unfettered ssh
> access between nodes (nfs shared home directory helps a lot). This
I don't recommend allowing users to populate .ssh; instead I suggest managing it yourself (the admin). Users tend to only add keys when they upgrade a laptop, buy a new one, lose a laptop, get compromised, etc. So there could be keys that end up lost, shared, or compromised. By forcing users to have just one key you reduce your exposure. The last thing you want to hear is "oh, that access is not from my current key..."
> seems to be the most common approach. It is end-user setup/training
> intensive (I suppose it could be automated/audited). I consider it
> dangerous to encourage use of password-less ssh keys.
An alternative is to use host-based ssh auth for access inside the cluster; this depends on either more labor-intensive management of keys, or automating the install/reinstall node process.
> - It looks like I can configure the cluster systems to handle local
> ssh transparently. This would involve setting setuid/setgid on ssh,
> building cluster wide authorized_keys files and other things. I
> haven't studied this closely but there are a few references available
> (http://www.snailbook.com/faq/trusted-host-howto.auto.html among
> others). Looks pretty straight forward to me.
> I favor this last solution as being the most user transparent. I find
> it surprising that none of the cluster distributions seem to use this
> method. I would like some feedback as to how well this works in
> practice and whether there are any obvious or non-obvious gotchas
> people might have already encountered.
Works for us; I share your surprise that it's not more popular.

From landman at scalableinformatics.com Sun Sep 13 12:13:19 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sun, 13 Sep 2009 15:13:19 -0400 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: <4AACFC70.6040205@scalableinformatics.com> Message-ID: <4AAD444F.3030605@scalableinformatics.com> Leif Nixon wrote:
> Joe Landman writes:
>
>> I won't fisk this, other than to note most of the exploits we have
>> cleaned up for our customers, have been windows based attack vectors.
>> Contrary to the implication here, the ssh-key attack vector, while a
>> risk, isn't nearly as dangerous as others, in active use, out there.
>
> I'm really hoping you aren't accusing me of security theatre.
Nope. I thought I made it clear that I wasn't (and if not, then let me re-iterate that I am not accusing you of this). I am noting that there may be something of an overhyping of this vulnerability from where we sit. YMMV.
> This may be a case of differences between user communities - while I
> have seen one or maybe two cases where windows-related attacks were
Likely it is a difference. Most attacks we see are windows related, exploiting the inherent weakness of that platform, and its relative ease of compromise, in order to compromise harder-to-take-down systems. Why break through the heavily fortified door when the window (pun un-intended) is so easy to crack?
This is the nature (outside of incessant ssh probes) of all of the exploits we have seen be successful at our customers' sites.
> involved, I have seen dozens and dozens of cases where ssh key theft was
> involved. I have a blacklist of literally hundreds of stolen ssh keys
> from a very large number of sites, and I dearly miss a key revocation
> mechanism in ssh.
>
> We try to educate our users to use either a good strong password or to
> use ssh keys together with the ssh agent and agent forwarding, so that
> the private key never needs to leave the user's personal workstation.
We have started hearing about malware-infected USB dongles. If you have a password equivalent stored on your workstation ... it is at risk.
>
>> Fake security, aka security theatre (c.f.
>> http://en.wikipedia.org/wiki/Security_theater ) are things you get
>> when people want to seem like they are doing something, even if the
>> thing doesn't help, or worse, gives you a false sense of security. See
>> every anti-virus/anti-phishing package out there for windows. If you
>> think you are safe because you are running them, you are sadly
>> mistaken.
>
> And on our side of the fence, we get things like Trusted IRIX, with a
> really elaborate, checkbox-compliant permissions system. Of course,
> since it was built on IRIX, any serious attacker would cut through it
> like a hot knife through molten butter, but there obviously wasn't a
> checkbox for that.
Trusted computing, trusted Irix, etc. are examples of what I am talking about. You have a sense of security. Whether it's warranted or not is a completely separate question. Most of our users are companies, research universities, etc. We hear horror stories from admins on compromises. We do get an occasional call from a customer, wondering how a system behind a firewall could be compromised (remember that theatre and false sense of security?). Forensic examination showed us the path in, happily riding along the same connection that the user had, grabbing their keystrokes, and replaying them. Installing bits, and attempting rootkits. I have a nice little collection of rootkit detritus and dejecta, as well as logs of what the cracker attempted, all while getting in via the same compromised machine the legitimate user logged in to. It didn't really get bad ... until the user typed the root password in. No, it wasn't bad until then; most of the defenses held. Their cluster, they have root. We tried warning them that there was no conceivable scenario in which they ever needed to be root. We were ignored. Their IT staff was none too pleased. I wrote up a whole series of posts on it, detailing everything (apart from the victim's name/id/location/university) so that some others could learn and protect themselves. My descriptions managed to get me ... moderated ... by someone who claimed I was being alarmist ... for posting the gory details and making suggestions to the same community on how to avoid it. I am simply saying that what we see may be different, and that I hear far too many "one-size-fits-all" security prescriptions, which often fail to deter attacks, and provide what I think is a false sense of security if you follow them and ignore the other issues. I see too much of the "if we install a firewall, we will be secure" mindset running about. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc.
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hearnsj at googlemail.com Sun Sep 13 23:19:58 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 07:19:58 +0100 Subject: [Beowulf] Switch recommendations and 10G to the desktop? Message-ID: <9f8092cc0909132319v180f72bdndf95dd3bcc7b4403@mail.gmail.com> I'm looking for recommendations for 1 48 port, or two stacked 24 port, switches for desktop users. The aim is to bond 2xgigabit connections. I would have normally first thought Nortel for this job. Thoughts? Secondly, do folks here have much experience of 10gig to the desktop? Distance is a bit too far for copper CX4, and fibre is pricey. I note Extreme networks have a switch with 10G over cat6, I guess there are others. Thoughts again please! From nixon at nsc.liu.se Mon Sep 14 00:45:42 2009 From: nixon at nsc.liu.se (Leif Nixon) Date: Mon, 14 Sep 2009 09:45:42 +0200 Subject: [Beowulf] Intra-cluster security In-Reply-To: <4AAD444F.3030605@scalableinformatics.com> (Joe Landman's message of "Sun\, 13 Sep 2009 15\:13\:19 -0400") References: <4AACFC70.6040205@scalableinformatics.com> <4AAD444F.3030605@scalableinformatics.com> Message-ID: Joe Landman writes: > Leif Nixon wrote: >> Joe Landman writes: >> >>> I won't fisk this, other than to note most of the exploits we have >>> cleaned up for our customers, have been windows based attack vectors. >>> Contrary to the implication here, the ssh-key attack vector, while a >>> risk, isn't nearly as dangerous as others, in active use, out there. >> >> I'm really hoping you aren't accusing me of security theatre. > > Nope. I thought I made it clear that I wasn't (and if not, then let > me re-iterate that I am not accusing you of this). Good. 8^) > I am noting that the there may be something of an overhyping of this > vulnerability from where we sit. YMMV. Well, it *is* being actively exploited on a big scale. It's not just a theoretical thing. > Likely it is a difference. Most attacks we see are windows related, > exploiting the inherent weakness of that platform, and is relative > ease of compromise in order to compromise harder to take down systems. > Why break through the heavily fortified door when the window (pun > un-intended) is so easy to crack? This is the nature (outside of > incessant ssh probes) of all of the exploits we have seen be > successful at our customers sites. That's interesting. I haven't seen many cross-OS attacks. My theory has always been that the mainstream windows evil-doer has lots and lots of easy targets, and there is no point for him to spend the energy to learn how to attack these weird Linux clusters. I can't say I'd love to be proven wrong. 8^) 8^/ > I wrote up a whole series of posts on it, detailing everything (apart > from the victims name/id/location/university) so that some others > could learn and protect themselves. My descriptions managed to get me > ... moderated ... by someone who claimed I was being alarmist ... for > posting the gory details and making suggestions to the same community > on how to avoid it. Too bad. The community needs more war stories. There is too much covering up. 
> I am simply saying that what we see may be different, and that I hear
> far too many "one-size-fits-all" security prescriptions, which often
> fail to deter attacks, and provide what I think is a false sense of
> security if you follow them and ignore the other issues. I see too
> much of the "if we install a firewall, we will be secure" mindset running
> about.
Exactly. Or, on the other hand, "firewalls are an inherently bad solution; all endpoints should be properly secured and should not have to rely on a firewall." Rigid dogma is always bad. (Except, of course, when it comes to DELETING ALL THOSE PASSPHRASE-LESS KEYS!) -- / Swedish National Infrastructure for Computing Leif Nixon - Security officer < National Supercomputer Centre \ Nordic Data Grid Facility

From smulcahy at atlanticlinux.ie Mon Sep 14 01:31:52 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Mon, 14 Sep 2009 09:31:52 +0100 Subject: [Beowulf] Intra-cluster security In-Reply-To: References: <4AACFC70.6040205@scalableinformatics.com> <4AAD444F.3030605@scalableinformatics.com> Message-ID: <4AADFF78.5000006@atlanticlinux.ie> Leif Nixon wrote:
>> I wrote up a whole series of posts on it, detailing everything (apart
>> from the victim's name/id/location/university) so that some others
>> could learn and protect themselves. My descriptions managed to get me
>> ... moderated ... by someone who claimed I was being alarmist ... for
>> posting the gory details and making suggestions to the same community
>> on how to avoid it.
>
> Too bad. The community needs more war stories. There is too much
> covering up.
I strongly agree with this. Real war stories would give real admins a better chance of prioritising their efforts to address real problems (rather than whatever is currently being pushed by one vendor or another). In practice, I think a lot of security problems come from layer 8 - if you can deal with that, the rest is easy :) -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway)

From hearnsj at googlemail.com Mon Sep 14 11:55:35 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 19:55:35 +0100 Subject: [Beowulf] Switch recommendations and 10G to the desktop? In-Reply-To: <4AAE4011.1080408@tamu.edu> References: <9f8092cc0909132319v180f72bdndf95dd3bcc7b4403@mail.gmail.com> <4AAE4011.1080408@tamu.edu> Message-ID: <9f8092cc0909141155q61a81875q1c3d18fe6da044f4@mail.gmail.com> 2009/9/14 Gerry Creager :
> Nortel still has switches available, and they'd be my first choice, as well,
> but I'm concerned about support... and delivery!
>
> Force10 S-series bears looking at, as does the Nexus line from (shudder)
> Cisco. I consider the Cisco offerings to have barely enough backplane to be
> non-blocking.
Gerry, thanks for that. Talking with our supplier today, it looks like I'll go for Force10 - but keeping the heid screwed I'll stay with 1Gbps.

From hearnsj at googlemail.com Mon Sep 14 11:58:38 2009 From: hearnsj at googlemail.com (John Hearns) Date: Mon, 14 Sep 2009 19:58:38 +0100 Subject: [Beowulf] PCOIP graphics Message-ID: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> HPCwire today had an article on BOXX workstations which use something called PCOIP to transfer the graphics display over a slowish network connection. I'm quite interested in this, and it looks like you can buy a card plus a separate box to do this job from http://www.teradici.com/ Anyone out there used one of these boxes 'in the wild'?
I have looked at the software equivalents - ie VirtualGL but not had much success.

From coutinho at dcc.ufmg.br Mon Sep 14 15:12:30 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Mon, 14 Sep 2009 19:12:30 -0300 Subject: [Beowulf] PCOIP graphics In-Reply-To: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> References: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> Message-ID: 2009/9/14 John Hearns
> HPCwire today had an article on BOXX workstations which use something
> called PCOIP to transfer
> the graphics display over a slowish network connection.
Nomachine NX does this too. You can download a client and test on their servers. http://www.nomachine.com/
> I'm quite interested in this, and it looks like you can buy a card
> plus a separate box to do this job from http://www.teradici.com/
>
> Anyone out there used one of these boxes 'in the wild'?
>
> I have looked at the software equivalents - ie VirtualGL but not had
> much success.
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

From mathog at caltech.edu Mon Sep 14 15:40:25 2009 From: mathog at caltech.edu (David Mathog) Date: Mon, 14 Sep 2009 15:40:25 -0700 Subject: [Beowulf] Re: Intra-cluster security Message-ID: Our NFS file server does double duty, serving both the compute nodes on the inside and some workstations on the outside. That is somewhat analogous to the intra-cluster situation. It turns out that updates some time in the last year introduced an issue where the lock manager stopped respecting the ports which were supposedly assigned to it. It took me a long time to notice this since there wasn't very much file locking going on between the workstations and the file server. However, anybody who used gnome (which I don't) would have seen it, since gnome does some file locking at startup, and this bug was causing it to start very slowly. Normally this port assignment issue wouldn't be a cluster issue, since one doesn't normally run a firewall between the file server and the compute nodes. However, it would be a problem for intra-cluster file sharing if you do firewall those connections. To see if your file server has been bitten run rpcinfo -p on it, and if nlockmgr isn't in the right place, welcome to the club. For more information, see for instance: https://bugzilla.redhat.com/show_bug.cgi?id=434795 https://bugs.launchpad.net/ubuntu/+source/nfs-utils/+bug/28706 Long story short, the only way around this that I know of at present is to put: options lockd nlm_udpport=4001 nlm_tcpport=4001 (or whatever port you want) in /etc/modprobe.conf or an equivalent location and restart the NFS server. OK, make that: umount all mounts on the clients, restart the server, and remount on all the clients. Regards, David Mathog mathog at caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech
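A minimal sketch of that workaround plus a way to verify it took effect (the port number 4001 comes from the recipe above; the subnet and the iptables rules are illustrative assumptions for a firewalled intra-cluster setup, not part of the original recipe):

  # /etc/modprobe.conf (or an equivalent location), then restart the NFS server
  options lockd nlm_udpport=4001 nlm_tcpport=4001

  # after unmounting the clients, restarting the server and remounting, confirm the pin:
  rpcinfo -p | grep nlockmgr        # nlockmgr should now show port 4001 for tcp and udp

  # with the port fixed, an intra-cluster firewall can allow it explicitly, e.g.:
  iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 4001 -j ACCEPT
  iptables -A INPUT -s 10.0.0.0/24 -p udp --dport 4001 -j ACCEPT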
From rpnabar at gmail.com Mon Sep 14 15:43:50 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Sep 2009 17:43:50 -0500 Subject: [Beowulf] switching capacity terminology confusion Message-ID: I was totally confused by the spec sheet (one example link below) for switches. What is the difference between "Switching Fabric Capacity 288Gbps" and "User traffic capacity 176 Gbps"? The user traffic numbers seem to be a lot lower than the switching fabric numbers. Is this some sort of overhead? Or..... If this weren't enough there is another mysterious number: "Stacking Capacity 96 Gbps" which is even lower than the others. How does one interpret all of these? I started out with my simplistic view of "a gigabit eth switch that supports full line capacity on all ports". http://www.force10networks.com/products/pdf/Force10-S25N-S50N-FTOS.pdf -- Rahul

From rpnabar at gmail.com Mon Sep 14 15:52:07 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 14 Sep 2009 17:52:07 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: On Mon, Sep 14, 2009 at 5:43 PM, Rahul Nabar wrote:
> I was totally confused by the spec sheet (one example link below) for
> switches. What is the difference between "Switching Fabric Capacity
> 288Gbps" and "User traffic capacity 176 Gbps"? The user traffic
> numbers seem to be a lot lower than the switching fabric numbers. Is
> this some sort of overhead? Or.....
As if to make life even more difficult, a comparable Dell switch here has an additional characteristic too: "Forwarding Rate 131 Mpps" How does that tie in to the big picture? http://www.dell.com/downloads/global/products/pwcnt/en/PC_6200Series_proof1.pdf -- Rahul

From hearnsj at googlemail.com Mon Sep 14 23:36:44 2009 From: hearnsj at googlemail.com (John Hearns) Date: Tue, 15 Sep 2009 07:36:44 +0100 Subject: [Beowulf] PCOIP graphics In-Reply-To: References: <9f8092cc0909141158n4036b960h25b622f429e808ed@mail.gmail.com> Message-ID: <9f8092cc0909142336k77710f16n2852f8dab0644a4d@mail.gmail.com> 2009/9/14 Bruno Coutinho :
>
> 2009/9/14 John Hearns
>>
>> HPCwire today had an article on BOXX workstations which use something
>> called PCOIP to transfer
>> the graphics display over a slowish network connection.
>
> Nomachine NX does this too.
> You can download a client and test on their servers.
> http://www.nomachine.com/
>
Bruno, I am a big fan of Nomachine, and I use it on my personal workstation. However, in my experience Nomachine does not handle OpenGL accelerated graphics - there is a recipe on their site for doing that, but it did not work for me. Also, and I hesitate to do a company down who I think are generally marvellous, they were not interested in doing an IA64 version for our Prism visualization system. If anyone is using Nomachine NX successfully to run a remote display of a decently specced visualization box running Linux and OpenGL then I'd really like to hear about it.

From cap at nsc.liu.se Tue Sep 15 02:58:21 2009 From: cap at nsc.liu.se (Peter Kjellstrom) Date: Tue, 15 Sep 2009 11:58:21 +0200 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: <200909151158.21995.cap@nsc.liu.se> On Tuesday 15 September 2009, Rahul Nabar wrote: ...
> As if to make life even more difficult, a comparable Dell switch here
> has an additional characteristic too:
>
> "Forwarding Rate 131 Mpps" How does that tie in to the big picture?
This is packet rate (packets per second), not bandwidth (bytes per second). /Peter
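For a sense of scale, a back-of-the-envelope check assuming worst-case, minimum-size 64-byte Ethernet frames (each occupies 64 + 20 bytes on the wire once preamble and inter-frame gap are counted):

  10^9 bit/s / (84 bytes x 8 bit/byte) = 10^9 / 672 ~= 1.488 Mpps per line-rate gigabit port
  48 ports x 1.488 Mpps ~= 71.4 Mpps
  one 10 GbE port at line rate ~= 14.88 Mpps

so the quoted 131 Mpps comfortably covers 48 gigabit ports even with the smallest possible packets; the forwarding-rate figure starts to matter once 10 GbE uplinks or larger port counts enter the picture.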
From prentice at ias.edu Tue Sep 15 11:26:56 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Tue, 15 Sep 2009 14:26:56 -0400 Subject: [Beowulf] filesystem metadata mining tools In-Reply-To: References: Message-ID: <4AAFDC70.4010805@ias.edu> I have used perl in the past to gather summaries of file usage like this. The details are fuzzy (it was a couple of years ago), but I think I did a 'find -ls' to a text file and then used perl to parse the file and add up the various statistics. I wasn't gathering as many statistics as you, but it was pretty easy to write for a novice perl programmer like me. Prentice Rahul Nabar wrote:
> As the number of total files on our server was exploding (~2.5 million
> / 1 Terabyte) I
> wrote a simple shell script that used find to tell me which users have how
> many. So far so good.
>
> But I want to drill down more:
>
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
>
> *What is the most common file (or filename)
>
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
>
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
>
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
>
> I've used cushion plots in the past (sequoiaview; pydirstat) but those
> seem more desktop oriented than suitable for a job like this.
>
> Essentially I want to data mine my file usage to strategize. Are there any
> tools for this? Writing a new find each time seems laborious.
>
> I suspect forensics might also help identify anomalies in usage across
> users which might be indicative of other maladies. e.g. a user who had a
> runaway job write a 500GB file etc.
>
> Essentially are there any "filesystem metadata mining tools"?
>
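A minimal sketch along those lines, using GNU find and awk instead of perl (the field layout, paths and cutoffs are arbitrary choices; note that the access-time column is meaningless if the filesystems are mounted noatime, as discussed earlier in this thread):

  # dump the metadata once: owner, size, atime, mtime, path
  find /home -type f -printf '%u %s %A@ %T@ %p\n' > /tmp/filemeta.txt

  # per-user file count and total bytes
  awk '{n[$1]++; b[$1]+=$2}
       END {for (u in n) printf "%-12s %10d files %18.0f bytes\n", u, n[u], b[u]}' /tmp/filemeta.txt

  # crude age cut: how much has not been read in over a year
  awk -v now=$(date +%s) '(now - $3) > 365*24*3600 {old++; oldb+=$2}
       END {printf "%d files, %.0f bytes not accessed in a year\n", old, oldb}' /tmp/filemeta.txt

The same dump can be re-mined for size distributions, extensions, duplicates by name, and so on without re-walking the filesystem each time.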
From davidramirezmolina at gmail.com Mon Sep 14 11:04:51 2009 From: davidramirezmolina at gmail.com (David Ramirez) Date: Mon, 14 Sep 2009 13:04:51 -0500 Subject: [Beowulf] Virtualization in head node ? Message-ID: Still a newbie in HPC, in the first stages of building a Beowulf cluster (8 nodes). I wonder if anybody out there has used Linux virtual machines in the head node, just to be able to experiment with different configurations & deployments and jump back without much effort if things go bad. Considering (out of experience in desktop) VMWare or Sun Virtualbox. Any hints, comments ? David Ramirez Grad Student CIS Prairie View A&M University, Texas -- | David Ramirez Molina | davidramirezmolina at gmail.com | Houston, Texas - USA Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años

From reuti at Staff.Uni-Marburg.DE Tue Sep 15 16:05:44 2009 From: reuti at Staff.Uni-Marburg.DE (Reuti) Date: Wed, 16 Sep 2009 01:05:44 +0200 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> On 14.09.2009 at 20:04, David Ramirez wrote:
> Still a newbie in HPC, in the first stages of building a Beowulf
> cluster (8 nodes).
>
> I wonder if anybody out there has used Linux virtual machines in
> the head node, just to be able to experiment with different
> configurations & deployments and jump back without much effort if
> things go bad. Considering (out of experience in desktop) VMWare or
> Sun Virtualbox. Any hints, comments ?
I was also just thinking about it in the last couple of days. In the past, often a dedicated login server and a dedicated file server were used to split the load. Putting the login node in a virtual machine will still keep the users away from the file server, but allows you to combine them in one piece of metal when it's powerful enough. Or maybe even one virtual login node per user, if you don't want anyone to see the actions of others. To operate Sun VirtualBox w/o the graphical interface is possible, and you can also direct the virtual console to any remote machine using "rdesktop" as client on any platform you like. -- Reuti
> David Ramirez
> Grad Student CIS
> Prairie View A&M University, Texas
>
> --
> | David Ramirez Molina
> | davidramirezmolina at gmail.com
> | Houston, Texas - USA
>
> Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf

From lindahl at pbm.com Tue Sep 15 16:21:11 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 15 Sep 2009 16:21:11 -0700 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: Message-ID: <20090915232111.GB16891@bx9.net> On Mon, Sep 14, 2009 at 05:52:07PM -0500, Rahul Nabar wrote:
> "Forwarding Rate 131 Mpps" How does that tie in to the big picture?
Most layer 3 devices are not capable of forwarding full line-rate traffic of tiny packets. You should go hunt down a lab report on switch testing to see these kinds of details discussed. -- g

From hearnsj at googlemail.com Tue Sep 15 22:44:54 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 16 Sep 2009 06:44:54 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> Message-ID: <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> 2009/9/16 Reuti :
> To operate Sun VirtualBox w/o the graphical interface is possible, and you
> can also direct the virtual console to any remote machine using "rdesktop"
> as client on any platform you like.
I agree re. Virtualbox - I'm evaluating it for desktop use, not for the purpose suggested here, and it's quite impressive. The latest version offers direct access to accelerated 3D graphics on the host. Also, I put a virtual machine to sleep yesterday when running a CAD-style application, and started it up later; the application picked itself up mid-stride and continued on. Well, maybe I'm easily impressed.

From award at uda.ad Wed Sep 16 00:23:07 2009 From: award at uda.ad (Alan Ward) Date: Wed, 16 Sep 2009 09:23:07 +0200 Subject: RS: [Beowulf] Virtualization in head node ? References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> Message-ID: I have been working quite a lot with VBox, mostly for server stuff. I agree it can be quite impressive, and has some nice features (e.g. do not stop a machine, sleep it - and wake up pretty fast). On the other hand, we found that anything that has to do with disk access is pretty slow, especially when working with a local disk image file.
Cheers, -Alan

-----Original Message----- From: beowulf-bounces at beowulf.org on behalf of John Hearns Sent: Wed 16/09/2009 07:44 To: Beowulf Mailing List Subject: Re: [Beowulf] Virtualization in head node ? 2009/9/16 Reuti :
> To operate Sun VirtualBox w/o the graphical interface is possible, and you
> can also direct the virtual console to any remote machine using "rdesktop"
> as client on any platform you like.
I agree re. Virtualbox - I'm evaluating it for desktop use, not for the purpose suggested here, and it's quite impressive. The latest version offers direct access to accelerated 3D graphics on the host. Also, I put a virtual machine to sleep yesterday when running a CAD-style application, and started it up later; the application picked itself up mid-stride and continued on. Well, maybe I'm easily impressed. _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From tjrc at sanger.ac.uk Wed Sep 16 02:34:48 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 16 Sep 2009 10:34:48 +0100 Subject: RS: [Beowulf] Virtualization in head node ? In-Reply-To: References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> Message-ID: <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> On 16 Sep 2009, at 8:23 am, Alan Ward wrote:
>
> I have been working quite a lot with VBox, mostly for server stuff.
> I agree it can be quite impressive, and has some nice features (e.g.
> do not stop a machine, sleep it - and wake up pretty fast).
>
> On the other hand, we found that anything that has to do with disk
> access is pretty slow, especially when working with a local disk
> image file.
I think that's pretty standard for most virtualisation, whichever vendor it comes from. The I/O is fairly sub-optimal. I've had a fair bit of experience now of various VMware flavours. The I/O performance of the desktop versions is fairly shocking; this is presumably largely down to the fact that desktops and laptops tend to have fairly slow I/O to start with, and the virtualisation penalty is very noticeable. Our production virtualisation system uses dual-fabric SAN-attached storage (EVA5000), ESX 4.0 as the hypervisor, and we're running about 20 virtual machines per physical host. Most of these applications are not I/O heavy, but really trivial benchmarking using hdparm indicates I/O bandwidth within the VM of about half what it would be if the machine were physical. Very unscientific test, though. I should do some proper testing with bonnie++... Virtual disk performance in ESX 4.0 definitely feels better than ESX 3.5, but that's largely because they've got rid of some fairly serious brokenness in memory handling in the hypervisor which was leading to unnecessary swapping of the VMs. ESX 4.0 also has a new guest paravirtual SCSI driver which is supposed to improve virtual disk performance by about 20%, but I have yet to test that. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
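For anyone repeating that kind of rough host-versus-guest comparison, a minimal sketch (run the same commands on the physical host and inside a VM; the device, directory and user below are placeholders, and the directory must be writable by the test user):

  # crude buffered sequential read; average a few runs
  hdparm -t /dev/sda

  # more thorough: bonnie++ run as a non-root user, default dataset is about twice RAM
  bonnie++ -d /scratch/bench -u nobody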
From eugen at leitl.org Wed Sep 16 03:27:30 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 16 Sep 2009 12:27:30 +0200 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 Message-ID: <20090916102730.GU9828@leitl.org> Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA drives (WD RE4, 2 TByte), right?

[oracle at localhost data]$ bonnie++ -d /data/blah
Writing with putc()...done
Writing intelligently...done
Rewriting...done
Reading with getc()...done
Reading intelligently...done
start 'em...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
localhost.lo 47320M 82406  99 466201  46 124266  17 73482  96 541644  33 415.4   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  5488   4 +++++ +++  4292   0  3883   3 +++++ +++  2580   1
localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1

-- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE

From smulcahy at atlanticlinux.ie Wed Sep 16 04:18:27 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Wed, 16 Sep 2009 12:18:27 +0100 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916102730.GU9828@leitl.org> References: <20090916102730.GU9828@leitl.org> Message-ID: <4AB0C983.1070509@atlanticlinux.ie> Hi, I'm not sure what else people read from bonnie++ results but I normally focus on the sequential output block (which I think of as "block write speed") and the sequential input block (which I think of as "block read speed"). Smarter folk on this list may be able to provide a more scientific analysis of your results than that though. In your case, I'm reading the results below as a block write speed of 455 MB/sec and a block read speed of 528 MB/sec, which seems pretty good to me (unless my math has failed me). What kind of performance are you expecting? -stephen Eugen Leitl wrote:
> Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA
> drives (WD RE4, 2 TByte), right?
>
> [oracle at localhost data]$ bonnie++ -d /data/blah
> Writing with putc()...done
> Writing intelligently...done
> Rewriting...done
> Reading with getc()...done
> Reading intelligently...done
> start 'em...done...done...done...
> Create files in sequential order...done.
> Stat files in sequential order...done.
> Delete files in sequential order...done.
> Create files in random order...done.
> Stat files in random order...done.
> Delete files in random order...done.
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > localhost.lo 47320M 82406 99 466201 46 124266 17 73482 96 541644 33 415.4 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5488 4 +++++ +++ 4292 0 3883 3 +++++ +++ 2580 1 > localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1 > > -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From eugen at leitl.org Wed Sep 16 04:41:15 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 16 Sep 2009 13:41:15 +0200 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <4AB0C983.1070509@atlanticlinux.ie> References: <20090916102730.GU9828@leitl.org> <4AB0C983.1070509@atlanticlinux.ie> Message-ID: <20090916114115.GK9828@leitl.org> On Wed, Sep 16, 2009 at 12:18:27PM +0100, stephen mulcahy wrote: > I'm not sure what else people read from bonnie++ results but I normally I realize people mostly prefer IOZone, or similar. > focus on the sequential output block (which I think of as "block write > speed" and sequential input block (which I think of as "block read > speed"). Smarter folk on this list may be able to provide a more > scientific analysis of your results than that though. > > In your case, I'm reading the results below as a block write speed of > 455 MB/sec and a block read speed of 528 MB/sec which seems pretty good Thanks, this was the kind of comment I was looking for. The raw aggregate disk speed is roughly 770 MB/sec, so this result is not all that bad. This is Linux md RAID 10, CentOS 5.3 [oracle at localhost data]$ cat /proc/mdstat Personalities : [raid10] md4 : active raid10 sdp[13] sdo[12] sdn[11] sdm[10] sdl[9] sdk[8] sdj[7] sdi[6] sdh[5] sdg[4] sdf[3] sde[2] sdd[1] sdc[0] 13674601472 blocks 64K chunks 2 near-copies [14/14] [UUUUUUUUUUUUUU] unused devices: [oracle at localhost data]$ uname -a Linux localhost.localdomain 2.6.18-128.el5xen #1 SMP Wed Jan 21 11:12:42 EST 2009 x86_64 x86_64 x86_64 GNU/Linux Dual-socket Nehalem E5520 @ 2.27GHz, 24 GByte RAM. > to me (unless my math has failed me). What kind of performance are you > expecting? I expected slightly more, but this is adequate. Thanks again! -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From landman at scalableinformatics.com Wed Sep 16 05:32:02 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 16 Sep 2009 08:32:02 -0400 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916102730.GU9828@leitl.org> References: <20090916102730.GU9828@leitl.org> Message-ID: <4AB0DAC2.2020600@scalableinformatics.com> Eugen Leitl wrote: > Below bonnie++ stats are pretty bad for a RAID 10 of 14 SATA > drives (WD RE4, 2 TByte), right? Well, we'd suggest using fio rather than bonnie++, but I'll save that for a post somewhere else. > > [oracle at localhost data]$ bonnie++ -d /data/blah [...] 
> Version 1.03 ------Sequential Output------ --Sequential Input- --Random- > -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- > Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP > localhost.lo 47320M 82406 99 466201 46 124266 17 73482 96 541644 33 415.4 0 > ------Sequential Create------ --------Random Create-------- > -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- > files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP > 16 5488 4 +++++ +++ 4292 0 3883 3 +++++ +++ 2580 1 > localhost.localdomain,47320M,82406,99,466201,46,124266,17,73482,96,541644,33,415.4,0,16,5488,4,+++++,+++,4292,0,3883,3,+++++,+++,2580,1 Your block output is 466MB/s, for something like 7 disks (14 disk RAID10). So This means you are getting 66.6 MB/s per drive for write, and 77.4 MB/s per drive for read. The 2TB RE4 drives can read at about 110 MB/s and write at about 105 MB/s (from our tests in the lab). So you are in the range of 60 to 70% efficiency. Not bad. RAID10 tends to be more efficient than RAID6 or RAID5 (at least the software versions of these). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From smulcahy at atlanticlinux.ie Wed Sep 16 05:40:19 2009 From: smulcahy at atlanticlinux.ie (stephen mulcahy) Date: Wed, 16 Sep 2009 13:40:19 +0100 Subject: [Beowulf] bad bonnie++ in 14-drive RAID 10 In-Reply-To: <20090916114115.GK9828@leitl.org> References: <20090916102730.GU9828@leitl.org> <4AB0C983.1070509@atlanticlinux.ie> <20090916114115.GK9828@leitl.org> Message-ID: <4AB0DCB3.9020502@atlanticlinux.ie> Eugen Leitl wrote: > On Wed, Sep 16, 2009 at 12:18:27PM +0100, stephen mulcahy wrote: > >> I'm not sure what else people read from bonnie++ results but I normally > > I realize people mostly prefer IOZone, or similar. Personally, I like bonnie++ for the general overview it gives. It's a shame there isn't a public collection of good bonnie++ date people can use to compare their results with others (for a general, is my hardware performing in the right ballpark, type check) >> In your case, I'm reading the results below as a block write speed of >> 455 MB/sec and a block read speed of 528 MB/sec which seems pretty good > > Thanks, this was the kind of comment I was looking for. The raw > aggregate disk speed is roughly 770 MB/sec, so this result is not all > that bad. Like I said, maybe wait for some of the storage experts to wake up and respond and you may get a more accurate analysis of your results. -stephen -- Stephen Mulcahy Atlantic Linux http://www.atlanticlinux.ie Registered in Ireland, no. 376591 (144 Ros Caoin, Roscam, Galway) From rockwell at pa.msu.edu Tue Sep 15 12:45:29 2009 From: rockwell at pa.msu.edu (Tom Rockwell) Date: Tue, 15 Sep 2009 15:45:29 -0400 Subject: [Beowulf] XEON power variations Message-ID: <4AAFEED9.6040302@pa.msu.edu> Hi, Intel assigns the same power consumption to different clockspeeds of L, E, X series XEON. All L series have the same rating, all E series etc. So, taking their numbers, the fastest of each type will always have the best performance per watt. And there is no power consumption penalty for buying the fastest clockspeed of each type. Vendor's power calculators reflect this (dell.com/calc for example). To me this seems like marketing simplification... Anybody know any different, e.g. 
have you seen other numbers from vendors or tested systems yourself? Is the power consumption of a system with an E5502 CPU really the same as one with an E5540? Thanks, Tom Rockwell

From dzaletnev at yandex.ru Tue Sep 15 15:55:15 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Wed, 16 Sep 2009 02:55:15 +0400 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: <229971253055315@webmail34.yandex.ru> When you install CentOS 5.3, you get a Xen virtual machine for free, with a nice interface, and in it, internal-network and NAT-to-the-outside-world modes work simultaneously, which is not the case with Sun xVM VirtualBox. I have never used VMWare because of its $189 price; people say it's a good VM. But what for, if there is CentOS 5.3 with Xen, the industry's best emulator/VM? Sincerely, Dmitry Zaletnev, Saint-Petersburg, Russia
> Still a newbie in HPC, in the first stages of building a Beowulf cluster (8 nodes).
> I wonder if anybody out there has used Linux virtual machines in the head node, just to be able to experiment with different configurations & deployments and jump back without much effort if things go bad. Considering (out of experience in desktop) VMWare or Sun Virtualbox. Any hints, comments ?
> David Ramirez
> Grad Student CIS
> Prairie View A&M University, Texas
> --
> | David Ramirez Molina
> | davidramirezmolina at gmail.com
> | Houston, Texas - USA
> Ancora Imparo (Aún aprendo) - Michelangelo a los 80 años
>

From david.ritch.lists at gmail.com Wed Sep 16 05:25:36 2009 From: david.ritch.lists at gmail.com (David B. Ritch) Date: Wed, 16 Sep 2009 08:25:36 -0400 Subject: RS: [Beowulf] Virtualization in head node ? In-Reply-To: <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> References: <354FA500-0364-471D-A305-5D101D4040B5@staff.uni-marburg.de> <9f8092cc0909152244t60bd1f5t883c8638dd6b67d9@mail.gmail.com> <8B848D15-066C-436B-BB3E-84A8FC27C9B1@sanger.ac.uk> Message-ID: <4AB0D940.1050908@gmail.com> At the RedHat Summit a couple of weeks ago, RH said that with a switch from Xen to KVM and lots of tuning, they were able to get the I/O overhead down to 5%. I thought that was pretty impressive. They also introduced a new product, RedHat Enterprise Virtualization, which is supposed to support process migration and all the other niceties that we've come to expect from virtualization. I haven't played with it yet, but it sounds quite interesting. I'd be interested to hear of anyone else's experiences with these. David On 9/16/2009 5:34 AM, Tim Cutts wrote:
>
> On 16 Sep 2009, at 8:23 am, Alan Ward wrote:
>
>> I have been working quite a lot with VBox, mostly for server stuff. I
>> agree it can be quite impressive, and has some nice features (e.g. do
>> not stop a machine, sleep it - and wake up pretty fast).
>>
>> On the other hand, we found that anything that has to do with disk
>> access is pretty slow, especially when working with a local disk image
>> file.
>
> I think that's pretty standard for most virtualisation, whichever
> vendor it comes from. The I/O is fairly sub-optimal. I've had a fair
> bit of experience now of various VMware flavours. The I/O performance
> of the desktop versions is fairly shocking; this is presumably largely
> down to the fact that desktops and laptops tend to have fairly slow
> I/O to start with, and the virtualisation penalty is very noticeable.
> > Our production virtualisation system uses dual-fabric SAN-attached > storage (EVA5000), ESX 4.0 as the hypervisor, and we're running about > 20 virtual machines per physical host. Most of these applications are > not I/O heavy, but really trivial benchmarking using hdparm indicates > I/O bandwidth within the VM of about half that if the machine were > physical. Very unscientific test, though. I should do some proper > testing with bonnie++... > > Virtual disk performance in ESX 4.0 definitely feels better than ESX > 3.5, but that's largely because they've got rid of some fairly serious > brokenness in memory handling in the hypervisor which was leading to > unnecessary swapping of the VMs. > > ESX 4.0 also has a new guest paravirtual SCSI driver which is supposed > to improve virtual disk performance by about 20% but I have yet to > test that. > > Tim > > From stuartb at 4gh.net Wed Sep 16 05:35:50 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 08:35:50 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: Message-ID: On Mon, 14 Sep 2009 at 14:04 -0000, David Ramirez wrote: > Still a newbie in HPC, in the first stages of building a Beowulf > cluster (8 nodes). Also a newbie to HPC, but now accumulating systems very quickly. > I wonder if anybody out there has used Linux virtual machines in the > head node, just to be able to experiment with different > configurations & deployments and jump back without much effort if > things go bad. Considering (out of experience in desktop) VMWare or > Sun Virtualbox. Any hints, comments I simulated a complete rocks system early on with VMs. I had 4 compute nodes and one head node. It was very useful for understanding some of the configuration basics. I never actually ran a problem on this system. I used Ubuntu server on a quad core system with kvm for this purpose and found it quite acceptable. I experimented with running our sge master as a VM on a login node but decided I preferred running this VM one one of our Xen hosts. We are using CentOS and Xen for production VMs. Once I figured out a prototype Xen configuration file it became pretty easy to run VMs for infrastructure purposes. Our system will consists of several compute clusters which need to share a common infrastructure. Rocks and other out of the box clustering systems are useful to study, but none of them look sufficient to handle our needs. Having an experimental VM structure is a good way to study the various systems. Stuart -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From tjrc at sanger.ac.uk Wed Sep 16 07:37:23 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 16 Sep 2009 15:37:23 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <229971253055315@webmail34.yandex.ru> References: <229971253055315@webmail34.yandex.ru> Message-ID: On 15 Sep 2009, at 11:55 pm, Dmitry Zaletnev wrote: > When install CentOS 5.3, you get Xen virtual machine for free, with > a nice interface, and in it, modes with internal network and NAT to > outside world work simultaneously, witch is not the case of Sun xVM > VirtualBox. Never used VMWare because of its value of $189, people > say it's a good VM. But whatfor, if there is CentOS 5.3 with Xen, > the industry best emulator/VM? VMware has some free versions; the pay-for versions have a number of extra features which are generally missing from the competitors, and some of which are quite shiny. 
The automated hot migration of VMs to load balance, for example. Last time I looked, you could manually migrate Xen VMs from one host to another, but it wouldn't do it automatically. vSphere also has high availability and fault tolerance features; I use the HA but not the FT yet (FT is like Marathon for Windows - it runs two copies of the VM in lock-step on two hosts, so that if one of the physical servers dies, the VM doesn't even need to reboot. Obviously there's a significant performance penalty in this). The other thing I find useful in vSphere that isn't yet present in Xen (at least last time I looked) was the ability to give particular users fine-grained access to their VM. For example, I used to have to give some users sudo access to their machines, and generally I can get around that now by allowing them to reboot their virtual machine instead, and they no longer need sudo at all. I consider this an improvement. More contentious is the memory deduplication trick. I can see arguments both for and against this. VMware's workstations products, and Xen, and presumably other hypervisors, give the VM as much RAM as you configure it with, regardless of whether it's going to use it or not. ESX can be configured to do this too, but by default it doesn't, and allows you to overcommit memory. It pays for this partly by deduplicating memory pages. Here's the output from the esxtop monitor program on one of our VMware servers (an HP BL490 blade server with 72GB of RAM): 3:29:06pm up 6 days 23:40, 161 worlds; MEM overcommit avg: 0.00, 0.00, 0.00 PMEM /MB: 73718 total: 618 cos, 967 vmk, 30880 other, 41252 free VMKMEM/MB: 72164 managed: 4329 minfree, 5980 rsvd, 65700 ursvd, high state COSMEM/MB: 69 free: 1239 swap_t, 1239 swap_f: 0.00 r/s, 0.00 w/s NUMA /MB: 36582 (11703), 36034 (29548) PSHARE/MB: 3953 shared, 121 common: 3832 saving SWAP /MB: 13 curr, 0 target: 0.00 r/s, 0.00 w/s MEMCTL/MB: 0 curr, 0 target, 25748 max The PSHARE row is the key one here; it's identified 3953 MB of memory pages which are the same on various machines, and is using only 121 MB to store them, saving 3832 MB of RAM. Our VMs are very heterogeneous; there are CentOS, Scientific Linux, Debian 4, Debian 5, SLES 10 SP2, Windows XP (both 32 and 64-bit), Windows Server 2003 (both 32 and 64- bit), Solaris... if they were more homogeneous, I'm sure the PSHARE saving would be much higher. Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From rgb at phy.duke.edu Wed Sep 16 09:01:22 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 16 Sep 2009 12:01:22 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009, Tim Cutts wrote: > > On 15 Sep 2009, at 11:55 pm, Dmitry Zaletnev wrote: > >> When install CentOS 5.3, you get Xen virtual machine for free, with a nice >> interface, and in it, modes with internal network and NAT to outside world >> work simultaneously, witch is not the case of Sun xVM VirtualBox. Never >> used VMWare because of its value of $189, people say it's a good VM. But >> whatfor, if there is CentOS 5.3 with Xen, the industry best emulator/VM? 
> > VMware has some free versions; the pay-for versions have a number of extra > features which are generally missing from the competitors, and some of which > are quite shiny. The automated hot migration of VMs to load balance, for I'll also speak out in favor of VMware, as I use it pretty extensively (and had negative experiences the first few times I tried Xen). I haven't tried what is it, KVM, only because of a lack of time and energy. The primary advantage of VMware at the server level is probably its management interface, which is quite powerful and intuitive. In the latest server edition, it is web-based which gives you extremely easy ways of performing remote server management. I think it is an ideal way of running Windows servers where you can't live without them -- put e.g. Centos in a rock-solid, conservative, stripped, firewalled configuration on a multicore multiprocessor big memory server, create as many Windows Server VMs as you need and/or the machine supports, and you get the ability to do pretty much anything you want remotely (such as "hard reboot" a hung VM, checkpoint the VMs, stop the VMs and back up the then-hard VM images, "freeze" the VM so that on a reboot it goes back to the last pristine saved VM image, forgetting all the data and viruses and so on that might have accumulated in the meantime) on a very stable and secure base. It also gives you some interesting ways of accomplishing failover, as you can imagine, as the backed up VM images are quite portable and can be moved around in a VMware server farm. It's good for linux VMs too, don't get me wrong -- in fact, you can download a whole bunch of "canned" preconfigured VMs for e.g. mail server, web server etc. that only require you to boot them, adjust the configuration to fit your local requirements, and you're done. These prebuilt VMs can easily be made into a sandbox, or put outside your security boundary to do various chores with something even stronger than a chroot relative to "inside" servers and clients, on a single piece of hardware. I just don't think it is possible for one to propagate from inside a VM to the host OS or into other VMs, as everything is private unless you very deliberately set up sharing; they are really separate systems (although I'm guessing that somebody with root on the toplevel system would still own the world, but who knows?). But that's not where I use VMware the most. It's "Workstation" product is just awesome. It gives you what I would argue is a SLIGHTLY BETTER control interface to any or all VMs you might want to run on your personal desktop or laptop. I used to install Win/Lin dual boot against the not-rare-enough times I absolutely had to use Win for something, and of course this forces you to lose Lin. It also made it difficult and cumbersome to e.g. play Win games through an emulator or after a reboot. It was wasteful of resources, and of course BOTH Win AND Lin want to control the boot process, and getting a dual boot to actually work was a pain then and remains an even bigger pain today, since Vista-of-Evil doesn't want to play AT ALL nice in a dual boot environment -- I have yet to get it to work although I haven't tried infinitely hard I admit because I hate it anyway. Now it is trivial. Pop up the VMware console, boot up my XPPro VM and who needs Vista? XPPro will run forever on the virtualized hardware interface as long as I can get linux to boot and run devices on the toplevel system. 
If I change machines, my XPPro VM can go with me without all of the tedious crap from Windows Update and phone calls to Windows service people that don't know what you're talking about or what to do about it once they do. Diablo II expansion, a click away, snapshot/suspendable and resumable INDEFINITELY. Various window apps I have to have to work, sometimes, rarely. I can also run Fedora and Debian on the same machine in case I want to develop on both. I can set up a sandbox personal webserver on which to do web development without exposing my personal data to cracking and theft. Now for the bad news. VMware has its own share of "problems" with e.g. the rapidly varying linux kernel, and they are not always rapidly resolved. 6.5.2 would only run under e.g. Fedora 11 with a tedious patch and some flakes for most of its existence as a revision. 6.5.3 runs fine on Fedora 11, but the install RPM is broken and only a real linux geek can ease it through an install (by hand-making the target of the hung build script and then gently killing the hung steps of the make until the script frees up and concludes). I've heard of larger problems with bleeding edge kernels. For a server these usually aren't a problem, but for workstations and laptops (far more likely to run one of the dynamic, less debugged but more niftily tricked out distros) it can be. And yeah, it costs money, although Duke has a site license (finally) so it won't cost ME money any more, or at least not much. But I've paid for a Workstation license out of pocket. It's worth it. Unless/until Xen or KVM or something else comes out with a similarly powerful and tricked out console and ease of use and (still, overall) reliability, VMware will be on my personal laptops for the rest of time. It's just too useful a tool to live without, if you are a serious computer geek who develops software, webware, does consulting, plays games, needs multiple OS's but only want to carry one box and don't want to have to reconfigure reboot to get to them. "This message was paid for by the VMware corporation..." -- (not, just kidding, kidding:-) which now owes me at LEAST a couple of free copies of Workstation for the unsolicited testimonial that I would guess I will receive when hell freezes over... rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From gerry.creager at tamu.edu Wed Sep 16 09:28:17 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 16 Sep 2009 11:28:17 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <20090915232111.GB16891@bx9.net> References: <20090915232111.GB16891@bx9.net> Message-ID: <4AB11221.4060902@tamu.edu> Greg Lindahl wrote: > On Mon, Sep 14, 2009 at 05:52:07PM -0500, Rahul Nabar wrote: > >> "Forwarding Rate 131 Mpps" How does that tie in to the big picture? > > Most layer 3 devices are not capable of forwarding full line-rate > traffic of tiny packets. You should go hunt down a lab report on > switch testing to see these kinds of details discussed. A couple of vendors went after the Really Tiny Packet market some time back, among them Anritsu. Fujitsu resorbed Anritsu's switch-making capabilities, so it you want a real good idea of good small packet performance peruse the specs on the Fujitsu switches. The guiding document on packet switching is (or was) RFC2544. 
In looking at the Force10 S50 vs the Dell switch, I'd go with the increased performance specs in the Force10, even though the S50 isn't really Force10 silicon, if I recall correctly. I've several S50s in my data center, hammer the fool out of them, and am happy. Prior to them, we used Foundry EdgeIron1G switches for our gigabit-connected clusters. They worked well. For our newer gigabit-connected cluster we went with the HP 5412zl, and have been happy.

I'd not recommend cheap switches: they can bite you if you go too cheap, and the result is poor MPI and I/O performance.

gerry

From jlb17 at duke.edu Wed Sep 16 10:12:22 2009
From: jlb17 at duke.edu (Joshua Baker-LePain)
Date: Wed, 16 Sep 2009 13:12:22 -0400 (EDT)
Subject: [Beowulf] Virtualization in head node ?
In-Reply-To:
References: <229971253055315@webmail34.yandex.ru>
Message-ID:

On Wed, 16 Sep 2009 at 12:01pm, Robert G. Brown wrote

> Unless/until Xen or KVM or something else comes out with a similarly
> powerful and tricked out console and ease of use and (still, overall)
> reliability, VMware will be on my personal laptops for the rest of time.
> It's just too useful a tool to live without, if you are a serious
> computer geek who develops software, webware, does consulting, plays
> games, needs multiple OS's but only want to carry one box and don't want
> to have to reconfigure reboot to get to them.

I was a dyed-in-the-wool VMware user until quite recently, too, but the pain of keeping it running on "current" distros (read: Fedora) finally forced me to look elsewhere. I think you'll be pleasantly surprised by VirtualBox if you give it a shot.

Then again, who knows what Oracle will do with it...

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF

From ashley at pittman.co.uk Wed Sep 16 10:39:27 2009
From: ashley at pittman.co.uk (Ashley Pittman)
Date: Wed, 16 Sep 2009 18:39:27 +0100
Subject: [Beowulf] Virtualization in head node ?
In-Reply-To:
References:
Message-ID: <1253122767.3677.14.camel@alpha>

On Mon, 2009-09-14 at 13:04 -0500, David Ramirez wrote:
> Still a newbie in HPC, in the first stages of building a Beowulf
> cluster (8 nodes).
>
> I wonder if anybody out there has used Linux virtual machines in the
> head node, just to be able to experiment with different configurations
> & deployments and jump back without much effort if things go bad.

I do this all the time. I tend to run either a virtual frontend and N virtual compute nodes locally on a server here, or more commonly I just start a number of Amazon EC2 instances as compute nodes and run the frontend functionality on node zero. For parallel jobs the compute performance is dreadful (I over-commit the virtual CPUs), but for experimenting and testing different setups it's ideal; I typically run 128-process jobs at a cost of $0.40 per hour. The software configuration and deployment is exactly the same on virtual machines as it is on physical machines unless you have any non-ethernet devices.

As for running production systems on virtual machines, I'd be happy to do that for the front-end as well; compute nodes should not be considered for this, however. On medium-sized clusters there is typically a "head node" which runs the management software, the resource manager and such, and a second front-end machine known as a "login node" that users log in to in order to compile code, submit jobs and perform cluster I/O. On smaller machines these two roles are more often rolled into one machine.
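[A minimal sketch of what that split can look like when both roles are guests on one physical box, using the libvirt Python bindings; the domain names, memory size, disk image paths and XML below are invented, and assume a KVM/qemu host with pre-built guest images.]

# Sketch only: define and start a "management" and a "login" guest
# with the libvirt Python bindings (python-libvirt).  Domain names,
# sizes and image paths are invented for illustration.
import libvirt

DOMAIN_XML = """
<domain type='kvm'>
  <name>%(name)s</name>
  <memory>2097152</memory>
  <vcpu>2</vcpu>
  <os><type arch='x86_64'>hvm</type></os>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/%(name)s.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <interface type='bridge'><source bridge='br0'/></interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")
for name in ("cluster-mgmt", "cluster-login"):
    dom = conn.defineXML(DOMAIN_XML % {"name": name})
    dom.create()    # boot the guest
    print("%s running: %s" % (name, dom.isActive()))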
Virtualisation allows you to separate out these two roles again onto separate VMs. If budget allows, getting two physical machines and running one management VM and two login VMs across them strikes me as a good solution for providing resilience. I'm sure a case could be made for running ten login instances here, but I'm not sure of the benefits myself.

Ashley Pittman.

--
Ashley Pittman, Bath, UK.
Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk

From sbyna at nec-labs.com Wed Sep 16 03:30:54 2009
From: sbyna at nec-labs.com (Suren Byna)
Date: Wed, 16 Sep 2009 06:30:54 -0400
Subject: [Beowulf] [hpc-announce] CfP: Special Issue of JPDC on "Data Intensive Computing", Submission: Jan 15th 2010
Message-ID:

Call for Papers: Special Issue of Journal of Parallel and Distributed Computing on "Data Intensive Computing"
---------------------------------------------------------------------------

Data intensive computing is posing many challenges in exploiting the parallelism of current and upcoming computer architectures. Data volumes of applications in the fields of sciences and engineering, finance, media, online information resources, etc. are expected to double every two years over the next decade and beyond. With this continuing data explosion, it is necessary to store and process data efficiently by utilizing the enormous computing power that is available in the form of multi/manycore platforms. There is no doubt in the industry and research community that the importance of data intensive computing has been rising and that it will remain among the foremost fields of research. This rise brings up many research issues, in the form of capturing and accessing data effectively and quickly, processing it while still achieving high performance and high throughput, and storing it efficiently for future use. Programming for high-performance data intensive computing is an important and challenging issue. Expressing the data access requirements of applications and designing programming language abstractions to exploit parallelism are immediate needs. Application- and domain-specific optimizations are also part of a viable solution in data intensive computing. While these are only a few examples of the issues, research in data intensive computing has become quite intense during the last few years, yielding strong results.

This special issue of the Journal of Parallel and Distributed Computing (JPDC) is seeking original, unpublished research articles that describe recent advances and efforts in the design and development of data intensive computing functionalities and capabilities that will benefit many applications.
Topics of interest include (but are not limited to): * Data-intensive applications and their challenges * Storage and file systems * High performance data access toolkits * Fault tolerance, reliability, and availability * Meta-data management * Remote data access * Programming models, abstractions for data intensive computing * Compiler and runtime support * Data capturing, management, and scheduling techniques * Future research challenges of data intensive computing * Performance optimization techniques * Replication, archiving, preservation strategies * Real-time data intensive computing * Network support for data intensive computing * Challenges and solutions in the era of multi/many-core platforms * Stream computing * Green (Power efficient) data intensive computing * Security and protection of sensitive data in collaborative environments Guide for Authors Papers need not be solely abstract or conceptual in nature: proofs and experimental results can be included as appropriate. Authors should follow the JPDC manuscript format as described in the "Information for Authors" at the end of each issue of JPDC or at http://ees.elsevier.com/jpdc/ . The journal version will be reviewed as per JPDC review process for special issues. Important Dates: Paper Submission : January 15, 2010 Notification of Acceptance/Rejection : May 31, 2010 Final Version of the Paper : September 15, 2010 Submission Guidelines All manuscripts and any supplementary material should be submitted through Elsevier Editorial System (EES) at http://ees.elsevier.com/ jpdc . Authors must select "Special Issue: Data Intensive Computing" when they reach the "Article Type" step in the submission process. First time users must register themselves as Author. For the latest details of the JPDC special issue see http://www.cs.iit.edu/~suren/jpdc Guest Editors: Dr. Surendra Byna NEC Labs America E-mail: sbyna at nec-labs.com Prof. Xian-He Sun Illinois Institute of Technology E-mail: sun at cs.iit.edu From stuartb at 4gh.net Wed Sep 16 11:33:06 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 14:33:06 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009 at 12:01 -0000, Robert G. Brown wrote: > XPPro will run forever on the virtualized hardware interface as long > as I can get linux to boot and run devices on the toplevel system. > If I change machines, my XPPro VM can go with me without all of the > tedious crap from Windows Update and phone calls to Windows service > people that don't know what you're talking about or what to do about > it once they do. Are you sure about this? We have some interest in being able to archive complete installations for future use (e.g. +5-10 years). I'm skeptical of the existing trend of virtualization to handle all of the needs for product activation or other software licensing schemes. The following is speculation not facts. If you move an existing VM within the same virtualization and cpu technology you may be able to get away without reactivation or obtaining a new license key. MAC addresses can be set in several virtual environment which can help in some cases. However, a lot of other things can impact product activation and licensing checks. Different virtual environments provide different emulated devices. 
Emulated disk serial numbers, BIOS versions, cpu family, cpu stepping, processor flags and other unknown things may be included in the product activation or licensing checks. It may be that some of the processor emulation technologies can provide this functionality. qemu can emulate a number of hardware systems but again only in specific configurations which may differ from a real world licensed configuration. Stuart From bill at cse.ucdavis.edu Wed Sep 16 13:36:40 2009 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Wed, 16 Sep 2009 13:36:40 -0700 Subject: [Beowulf] XEON power variations In-Reply-To: <4AAFEED9.6040302@pa.msu.edu> References: <4AAFEED9.6040302@pa.msu.edu> Message-ID: <4AB14C58.4080303@cse.ucdavis.edu> Tom Rockwell wrote: > Hi, > > Intel assigns the same power consumption to different clockspeeds of L, > E, X series XEON. All L series have the same rating, all E series etc. > So, taking their numbers, the fastest of each type will always have the > best performance per watt. Wrong, well they might. But not because the power use is the same. > And there is no power consumption penalty > for buying the fastest clockspeed of each type. No marketing visible penalty. Intel doesn't want you buying their low profit cheap chips instead of the high profit expensive chips because of the power you save. > Vendor's power > calculators reflect this (dell.com/calc for example). To me this seems > like marketing simplification... Anybody know any different, e.g. have > you seen other numbers from vendors or tested systems yourself? Try silicon mechanics, their configurator shows the system power for different configs. > Is the power consumption of a system with an E5502 CPU really the same > as one with an E5540? A random dual socket at SI shows 197 watts for a 5502 and 259 watts for a 5540. Intel/AMD often just bin them by power for marketing reasons. It's just funny business. So for instance the 3 new lynnfields are <= 95 watts, yet in testing the slowest is more power efficient they their "efficient" line which includes the 9550S which is rated as under <= 65 watts. So while intel isn't lying, it seems clear they are avoiding posting real numbers to make the faster clocked cpus more attractive. From rgb at phy.duke.edu Wed Sep 16 15:12:52 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Wed, 16 Sep 2009 18:12:52 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: On Wed, 16 Sep 2009, Stuart Barkley wrote: > On Wed, 16 Sep 2009 at 12:01 -0000, Robert G. Brown wrote: > >> XPPro will run forever on the virtualized hardware interface as long >> as I can get linux to boot and run devices on the toplevel system. >> If I change machines, my XPPro VM can go with me without all of the >> tedious crap from Windows Update and phone calls to Windows service >> people that don't know what you're talking about or what to do about >> it once they do. > > Are you sure about this? We have some interest in being able to > archive complete installations for future use (e.g. +5-10 years). > I'm skeptical of the existing trend of virtualization to handle all of > the needs for product activation or other software licensing schemes. Obviously, I have no idea. How could I? But it is plausible that it would work over a 10 year frame. 
There is really very, very little reason to change the virtualized hardware interface; it is usually chosen to be a wrapper that simulates something with boring, common, simple drivers OR it is a passthru. Passthru's obviously won't usually work -- if you had an OS from a decade ago that didn't know about USB, I'd guess that the USB drivers would either not work or would actively break it. OTOH, you can deconfigure USB passthru. I'd say that there is an excellent chance of running VMs for a 5+ year time frame, decreasing out to 10 but still quite possible at 10, especially for "vanilla" VMs that e.g. just use VGA, a basic network, and nothing else. You're obviously at serious risk of the CPU itself going away at the ten year mark. Who knows if even 32 bit emulation will exist in a decade, or if even 64 bit emulation will still exist? Who knows if the instruction set will have changed? Getting something that downshift emulates a 32 bit intel CPU on a 128 bit 16 core 2020 CPU with an entirely new instruction set... well, that might be a problem. > The following is speculation not facts. > > If you move an existing VM within the same virtualization and cpu > technology you may be able to get away without reactivation or > obtaining a new license key. MAC addresses can be set in several > virtual environment which can help in some cases. Usually this is the case, and often it is even legal, if you don't run two copies of one VM at once -- VMs are a good way to get failover and generally are viewed as a reinstall of a system from backup on new hardware, which is usually permitted or at least tolerated by any company in the server business. MS gets pissy IIRC about virtualizing Vista or better versions of Windows -- I think you have to have at least a business or professional class Vista license before it is legal to virtualize it. Of course this is shooting themselves in the foot, because it means that people trash Vista Home and install XPPro as a VM instead, stretching out that support lifetime still more. In a VM nobody cares -- if you boot a frozen VM image and mount space from elsewhere from data, even if they REMOVE all support and update streams you can't credibly get a virus on it. > However, a lot of other things can impact product activation and > licensing checks. Different virtual environments provide different > emulated devices. Emulated disk serial numbers, BIOS versions, cpu > family, cpu stepping, processor flags and other unknown things may be > included in the product activation or licensing checks. > > It may be that some of the processor emulation technologies can > provide this functionality. qemu can emulate a number of hardware > systems but again only in specific configurations which may differ > from a real world licensed configuration. Yeah, I don't really view it as a way of bypassing licensing, and in my case since Duke has a site license for Windows as well cloning an XPPro VM is totally legal for me. I did it yesterday -- it was awesome. A straight copy of the VM over, boot it on the new system AND on the old system, there it is right down to my (cough cough) copy of DII expansion. Which, um, I don't plan to play on more than one VM at a time...;-) rgb > > Stuart > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > Robert G. 
Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From stuartb at 4gh.net Wed Sep 16 16:51:30 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Wed, 16 Sep 2009 19:51:30 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: This is drifting off topic but I want to clarify two points: - I'm not advocating violating any licensing agreement. I am interested in aspects of environments which interact with license management code. - I suspect attempting to move a Windows VM between two different VM implementations is troublesome. For example, moving between VMWare and VirtualBox or between major versions of a single emulation environment. I suspect these systems present substantially different environments which would trigger the Microsoft Product Activation code. I am curious about people's experience in moving virtual machines between VM implementations. I'm also interested in archival aspects of virtual machines and expected future operational capabilities. I'm glad to see relatively successful emulations of early systems I've used (IBM 1130, DEC-10). In addition to archiving source code and data files, does anyone worry about archiving hardware or other items to ensure reproducibility of results in the future? Stuart Barkley From rpnabar at gmail.com Thu Sep 17 06:08:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 08:08:39 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <4AB11221.4060902@tamu.edu> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: On Wed, Sep 16, 2009 at 11:28 AM, Gerry Creager wrote: > silicon, if I recall correctly. ?I've several S50s in my data center, hammer > the fool out of them, and am happy. Thanks Gerry! I have been getting many great reviews on Force10. Maybe I will seriously consider them. >Prior to them, we used Foundry > EdgeIron1G switches for our gigabit-connected clusters. ?They worked well. > ?For our newer gigabit-connected cluster we went with the HP 5412zl, and > have been happy. > > I'd not recommend cheap switches: They can bite you if you go too cheap and > result in poor MPI and I/O performance. On the other end of the spectrum is Cisco. Their gear seems at such a huge $$ premium with respect to the other vendors and when I ask why the best answer I get is "Cisco is the market leader in switches". They won't show me which of their parameters make a Cisco switch better than the rest. -- Rahul From hahn at mcmaster.ca Thu Sep 17 07:47:56 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Thu, 17 Sep 2009 10:47:56 -0400 (EDT) Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: > the best answer I get is "Cisco is the market leader in switches". > They won't show me which of their parameters make a Cisco switch > better than the rest. well, one parameter is "market share". another is "brand". cisco owners will also often advocate benchmarking based on the "comfort-level" parameter, but IMO this result overlaps two others. it's remarkably hard to find comparisons on old-fashioned, boring metrics like latency, backplane bandwidth/pps or mtbf. every for-cisco decision I've ever seen has been based on the first category of parameters. 
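[For the latency column at least, a crude but apples-to-apples number is easy to collect yourself: bounce small UDP packets between two nodes on the switch under test and look at the median round trip. A minimal sketch follows; the port number and packet count are arbitrary, and this is a rough probe, not an RFC 2544 tester.]

#!/usr/bin/env python
# Crude switch/NIC round-trip latency probe.  Run "latency.py server"
# on one node and "latency.py client <host>" on another attached to
# the same switch.  Port and count are arbitrary choices.
import socket, sys, time

PORT = 5001
COUNT = 1000

def server():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", PORT))
    while True:                      # echo everything back
        data, addr = s.recvfrom(64)
        s.sendto(data, addr)

def client(host):
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(1.0)
    rtts = []
    for i in range(COUNT):
        t0 = time.time()
        s.sendto(b"x" * 32, (host, PORT))
        s.recvfrom(64)
        rtts.append(time.time() - t0)
    rtts.sort()
    print("median RTT: %.1f us" % (rtts[COUNT // 2] * 1e6))

if __name__ == "__main__":
    if sys.argv[1] == "server":
        server()
    else:
        client(sys.argv[2])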
From rpnabar at gmail.com Thu Sep 17 14:37:26 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 16:37:26 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: On Thu, Sep 17, 2009 at 9:47 AM, Mark Hahn wrote: >> the best answer I get is "Cisco is the market leader in switches". >> They won't show me which of their parameters make a Cisco switch >> better than the rest. > > well, one parameter is "market share". ?another is "brand". > cisco owners will also often advocate benchmarking based on the > "comfort-level" parameter, but IMO this result overlaps two others. > > it's remarkably hard to find comparisons on old-fashioned, boring > metrics like latency, backplane bandwidth/pps or mtbf. > It is! Nobody will tell me those. And each switch manufacturer seems to use different metrics making apples-for-apples comparisons hard. I have the feeling that buying the backbone is going to be the most in-exact part of the venture. Greg previously mentioned looking at 3rd party lab reports. But I haven't found anything useful+ current yet. Any pointers? Lots of home-hardware lab reports but nothing on the HPC / enterprise switching side. -- Rahul From rpnabar at gmail.com Thu Sep 17 15:21:21 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 17:21:21 -0500 Subject: [Beowulf] storage server hardware considerations In-Reply-To: References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> Message-ID: On Wed, Jul 15, 2009 at 8:57 PM, Mark Hahn wrote: > if you can't saturate gigabit with very modest raids, > you're doing something wrong. ?a single ultra-cheap disk > these days (seagate 7200.12 500G, $60 or so) will hit close to 135 MB/s on > outer tracks and average > 105 over > the whole disk. ?I'm guessing that you're losing performance > due to either bad controllers (avoid HW raid on anything that's not fairly > recent) or a combination of raid6 and a write-heavy workload... Mark in a recent discussion recommends avoiding HW RAID. I am not exactly sure what the caveat is about. Does this mean that software RIAD under Linux outperform a dedicated RAID controllers? I was speccing out a DAS box with a storage server. The storage server had external RAID cards supposed to work with the SAS disks in the box. Is this a config to be frowned upon? Maybe I am misreading Mark's caveat. Any comments? -- Rahul From rpnabar at gmail.com Thu Sep 17 17:13:54 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 17 Sep 2009 19:13:54 -0500 Subject: [Beowulf] storage server hardware considerations In-Reply-To: <20090716063658.GM23524@leitl.org> References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> <20090716063658.GM23524@leitl.org> Message-ID: On Thu, Jul 16, 2009 at 1:36 AM, Eugen Leitl wrote: > If it has to be cheap I'd take e.g. a Sun with 8x 2.5" SATA drives, > using 300 GByte WD VelociRaptors as a stripe over mirrors, or 15 > krpm SAS drives. > > Right now you can populate a SuperMicro chassis with 16x SATA 3.5" > achieving e.g. a 24 GByte dual-socket Nehalem with 32 TByte raw > storage (WD RE4; 4.8 TByte with WD VelociRaptor) for about > 6.2 kEUR sans VAT. Eugen, your setup seems very close to what I might go for. The cost seems attractive too. Do you know what kind of IOPS and MB/sec you are getting on this setup? 
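[For rough MB/sec and IOPS numbers on a box like that, a quick probe along the following lines is enough for a first comparison; the scratch path and sizes are arbitrary, and because there is no O_DIRECT here the page cache will flatter anything that fits in RAM.]

#!/usr/bin/env python
# Quick-and-dirty storage probe: streaming-write MB/s and random-read
# IOPS.  The test file path and sizes are arbitrary; results are only
# indicative (no O_DIRECT, so cache effects are not excluded).
import os, random, time

PATH = "/data/scratch/testfile"     # made-up path on the array
FILE_MB = 2048                      # write 2 GB
BLOCK = 1024 * 1024

# --- sequential write ---
t0 = time.time()
f = open(PATH, "wb")
buf = b"\0" * BLOCK
for i in range(FILE_MB):
    f.write(buf)
f.flush()
os.fsync(f.fileno())
f.close()
print("seq write: %.1f MB/s" % (FILE_MB / (time.time() - t0)))

# --- random 4k reads ---
fd = os.open(PATH, os.O_RDONLY)
t0 = time.time()
n = 10000
for i in range(n):
    os.lseek(fd, random.randrange(FILE_MB * BLOCK // 4096) * 4096, 0)
    os.read(fd, 4096)
os.close(fd)
print("random 4k read: %.0f IOPS" % (n / (time.time() - t0)))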
-- Rahul From gerry.creager at tamu.edu Thu Sep 17 19:18:06 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu, 17 Sep 2009 21:18:06 -0500 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <4AB2EDDE.1080005@tamu.edu> Joshua Baker-LePain wrote: > On Wed, 16 Sep 2009 at 12:01pm, Robert G. Brown wrote > >> Unless/until Xen or KVM or something else comes out with a similarly >> powerful and tricked out console and ease of use and (still, overall) >> reliability, VMware will be on my personal laptops for the rest of time. >> It's just too useful a tool to live without, if you are a serious >> computer geek who develops software, webware, does consulting, plays >> games, needs multiple OS's but only want to carry one box and don't want >> to have to reconfigure reboot to get to them. > > I was a dyed-in-the-wool vmware user until quite recently, too, but the > pain of keeping it running on "current" distros (read: Fedora) finally > forced me to look elsewhere. I think you'll be pleasantly surprised by > VirtualBox if you give it a shot. > > Then again, who knows what Oracle will do with it... I'm not sure I'd TRY to keep it running on Fedora. Too bleeding edge for my clusters! -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From gerry.creager at tamu.edu Thu Sep 17 20:58:52 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Thu, 17 Sep 2009 22:58:52 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> Message-ID: <4AB3057C.3060607@tamu.edu> Rahul Nabar wrote: > On Wed, Sep 16, 2009 at 11:28 AM, Gerry Creager wrote: >> silicon, if I recall correctly. I've several S50s in my data center, hammer >> the fool out of them, and am happy. > > Thanks Gerry! I have been getting many great reviews on Force10. Maybe > I will seriously consider them. > >> Prior to them, we used Foundry >> EdgeIron1G switches for our gigabit-connected clusters. They worked well. >> For our newer gigabit-connected cluster we went with the HP 5412zl, and >> have been happy. >> >> I'd not recommend cheap switches: They can bite you if you go too cheap and >> result in poor MPI and I/O performance. > > On the other end of the spectrum is Cisco. Their gear seems at such a > huge $$ premium with respect to the other vendors and when I ask why > the best answer I get is "Cisco is the market leader in switches". > They won't show me which of their parameters make a Cisco switch > better than the rest. With the POSSIBLE exception of the newer Nexus line from Cisco, I can't think of a reason I'd put a Cisco-labeled switch in my data center... except for a Linksys for non-critical applications. 
-- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From eugen at leitl.org Fri Sep 18 01:15:25 2009 From: eugen at leitl.org (Eugen Leitl) Date: Fri, 18 Sep 2009 10:15:25 +0200 Subject: [Beowulf] storage server hardware considerations In-Reply-To: References: <4A5E0740.7080209@cora.nwra.com> <4A5E107A.4050804@scalableinformatics.com> <4A5E16BB.9090900@cora.nwra.com> <20090716063658.GM23524@leitl.org> Message-ID: <20090918081525.GO9828@leitl.org> On Thu, Sep 17, 2009 at 07:13:54PM -0500, Rahul Nabar wrote: > Eugen, your setup seems very close to what I might go for. The cost > seems attractive too. Do you know what kind of IOPS and MB/sec you are > getting on this setup? I've posted the bonnie++ data the other day (14x RE4 as RAID 10, 2x RE4 as RAID 1 root). I've used two LSI SAS3081E-R. No idea about IOPS, which benchmark should I run? An interesting variation on this is to use a 24-drive chassis (with 2x3.5" internally, populated with 4x Intel SSD for ZIL and L2ARC), use 3x LSI SAS3081E-R and to use Opensolaris with zfs on top of that. There's a thread on zfs-discuss@ ongoing right now: http://mail.opensolaris.org/pipermail/zfs-discuss/2009-September/thread.html Some relevant links: http://blogs.sun.com/relling/tags/mttdl http://blogs.sun.com/relling/entry/a_story_of_two_mttdl http://blogs.sun.com/brendan/entry/test http://blogs.sun.com/ahl/entry/expand_o_matic_raid_z -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From rpnabar at gmail.com Fri Sep 18 05:11:41 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 18 Sep 2009 07:11:41 -0500 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <4AB3057C.3060607@tamu.edu> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager wrote: > > With the POSSIBLE exception of the newer Nexus line from Cisco, I can't > think of a reason I'd put a Cisco-labeled switch in my data center... except > for a Linksys for non-critical applications. Nexus is expensive though! They tell me that it approaches latencies of Infiniband and is close to the operations of a lossless switch. That's what I got from their sales guy. But again, we weren't ready to pay the premium. -- Rahul From rgb at phy.duke.edu Fri Sep 18 05:15:35 2009 From: rgb at phy.duke.edu (Robert G. Brown) Date: Fri, 18 Sep 2009 08:15:35 -0400 (EDT) Subject: [Beowulf] Virtualization in head node ? In-Reply-To: <4AB2EDDE.1080005@tamu.edu> References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: On Thu, 17 Sep 2009, Gerry Creager wrote: >> I was a dyed-in-the-wool vmware user until quite recently, too, but the >> pain of keeping it running on "current" distros (read: Fedora) finally >> forced me to look elsewhere. I think you'll be pleasantly surprised by >> VirtualBox if you give it a shot. >> >> Then again, who knows what Oracle will do with it... > > I'm not sure I'd TRY to keep it running on Fedora. Too bleeding edge for my > clusters! I don't use Fedora on clusters, I use it on laptops, where bleeding edge is often necessary. 
I just got and reinstalled a Studio 17 Dell (which came with VoEvil, of course) and it wouldn't even boot the F10 install image (at least not without a lot more energy than I had to put into it). F11 it booted, and installed, flawlessly. From what Google turned up, Ubuntu will work too. The VMware hassle on F11 (and Ubuntu -- actually on current-gen kernels in general) has been the exception rather than the rule and seems to be due to a surprising lag between recent major changes in some of the kernel sources, plus the shift in Fedora from OSS to ALSA-only with OSS emulation a deprecated, difficult to restore option. But I will try VBox at my next reasonable opportunity. On servers I run Centos or RHEL (licenses and all) as the vendor of the software requires. Generally Centos on top, then VMware, then RHEL VMs. Works fine. The only bad thing I've seen about Centos in the past is the dark side of a long term freeze -- some very useful tools and libraries have been in rapid development (notably the GSL and Yum). RHEL 4 just sucked in this regard, with up2date instead of yum, and an early, broken version of the GSL. Fedora is too fast, RHEL too slow. What can you do? rgb Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu From tjrc at sanger.ac.uk Fri Sep 18 06:13:36 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Fri, 18 Sep 2009 14:13:36 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <8F1ABB96-9E6A-4595-9565-A38A41D339CB@sanger.ac.uk> On 18 Sep 2009, at 1:15 pm, Robert G. Brown wrote: > On Thu, 17 Sep 2009, Gerry Creager wrote: > >>> I was a dyed-in-the-wool vmware user until quite recently, too, >>> but the pain of keeping it running on "current" distros (read: >>> Fedora) finally forced me to look elsewhere. I think you'll be >>> pleasantly surprised by VirtualBox if you give it a shot. >>> Then again, who knows what Oracle will do with it... >> >> I'm not sure I'd TRY to keep it running on Fedora. Too bleeding >> edge for my clusters! > > I don't use Fedora on clusters, I use it on laptops, where bleeding > edge > is often necessary. I just got and reinstalled a Studio 17 Dell > (which > came with VoEvil, of course) and it wouldn't even boot the F10 install > image (at least not without a lot more energy than I had to put into > it). F11 it booted, and installed, flawlessly. From what Google > turned > up, Ubuntu will work too. Ah, OK, so I can understand the VMware pain from that side. But the pain we were talking about was maintaining old OS services for a long time, and of course that's hopefully less difficult; as long as VMware don't change the virtual hardware too much, we should be fine (and so far they've been very good at maintaining backward compatibility). I still take the point (that someone made, sorry I don't remember who) that there may still be licensing issues for services built on proprietary operating systems and such, but in my view that's a good argument for building such services on open source software in the first place. "Doctor, it hurts when I poke this sharp stick in my eye"... 
:-) > The VMware hassle on F11 (and Ubuntu -- actually on current-gen > kernels > in general) has been the exception rather than the rule and seems to > be > due to a surprising lag between recent major changes in some of the > kernel sources, plus the shift in Fedora from OSS to ALSA-only with > OSS > emulation a deprecated, difficult to restore option. But I will try > VBox at my next reasonable opportunity. > On servers I run Centos or RHEL (licenses and all) as the vendor of > the > software requires. Generally Centos on top, then VMware, then RHEL > VMs. > Works fine. The only bad thing I've seen about Centos in the past is > the dark side of a long term freeze -- some very useful tools and > libraries have been in rapid development (notably the GSL and Yum). > RHEL 4 just sucked in this regard, with up2date instead of yum, and an > early, broken version of the GSL. Fedora is too fast, RHEL too slow. > What can you do? I'm not sure there's any perfect answer to that one. The Debian family of distros have a similar problem. Debian stable changes too slowly, testing is too fast. Ubuntu seem to have a reasonable compromise; two updates a year if you want bleeding edge, and LTS releases every so often for those for whom stability is everything. The only problem with the debian family, of course, is struggles with ISV support, although that is coming, slowly. VMware now fully support Debian as well as Ubuntu as ESX guests, which has made my life much easier. They don't seem to support CentOS, but I just lie to VMware and tell it the machine is running Red Hat, and it seems to behave fine. Our solution at Sanger to the stable vs uptodate argument has basically been to go with Debian stable, and maintain our own repository of backported packages for when we need something more recent. Fortunately the number of packages we've had to backport or patch has been fairly small. Regards, Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From h-bugge at online.no Fri Sep 18 06:31:56 2009 From: h-bugge at online.no (=?ISO-8859-1?Q?H=E5kon_Bugge?=) Date: Fri, 18 Sep 2009 15:31:56 +0200 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> On Sep 18, 2009, at 14:11 , Rahul Nabar wrote: > On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager > wrote: > > > Nexus is expensive though! They tell me that it approaches latencies > of Infiniband and is close to the operations of a lossless switch. > That's what I got from their sales guy. But again, we weren't ready to > pay the premium. You might check out the reports from the Tolly Group (www.tolly.com), they used to evaluate different eth switches. Not sure how un-biased they are though. H?kon From prentice at ias.edu Fri Sep 18 10:37:43 2009 From: prentice at ias.edu (Prentice Bisbal) Date: Fri, 18 Sep 2009 13:37:43 -0400 Subject: [Beowulf] Ubuntu/laptops In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <4AB3C567.5040400@ias.edu> Robert G. Brown wrote: > On Thu, 17 Sep 2009, Gerry Creager wrote: > > I don't use Fedora on clusters, I use it on laptops, where bleeding edge > is often necessary. 
I just got and reinstalled a Studio 17 Dell (which > came with VoEvil, of course) and it wouldn't even boot the F10 install > image (at least not without a lot more energy than I had to put into > it). F11 it booted, and installed, flawlessly. From what Google turned > up, Ubuntu will work too. Off-topic: From my experience, Ubuntu works fantastic on laptops, better than anything else, except *maybe* OS X. -- Prentice From john.hearns at mclaren.com Thu Sep 17 01:24:20 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 17 Sep 2009 09:24:20 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D2C9AAA@milexchmb1.mil.tagmclarengroup.com> If you move an existing VM within the same virtualization and cpu technology you may be able to get away without reactivation or obtaining a new license key. MAC addresses can be set in several virtual environment which can help in some cases. I get your point here. I deal a lot with licensed ISV type software - each has its own foibles, though most use some flavour of FlexLM. Maybe we should look at this the other way round - software licensing should be changed to cope with virtual machines being commonplace. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Thu Sep 17 02:13:56 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 17 Sep 2009 10:13:56 +0100 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D2C9B4B@milexchmb1.mil.tagmclarengroup.com> Passthru's obviously won't usually work -- if you had an OS from a decade ago that didn't know about USB, I'd guess that the USB drivers would either not work or would actively break it. OTOH, you can deconfigure USB passthru. For what its worth, my current desktop is a 'recycled' XP workstation - actually quite nice as it has SCSI drives. I had the idea of using the existing XP partition under Virtualbox (you can do this - you can point Virtualbox to a real disk partition and boot it up). I got the OS to boot up - but it fails as the 'real' install of XP never had the drivers for USB keyboards. You would have to boot it up for real with a PS2 keyboard and mouse, then load up the correct drivers. I didn't take it much further than that. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gmkurtzer at gmail.com Fri Sep 18 08:22:29 2009 From: gmkurtzer at gmail.com (Greg Kurtzer) Date: Fri, 18 Sep 2009 08:22:29 -0700 Subject: [Beowulf] Virtualization in head node ? In-Reply-To: References: <229971253055315@webmail34.yandex.ru> <4AB2EDDE.1080005@tamu.edu> Message-ID: <571f1a060909180822u4bfd4493w786c133f6bd1fcf7@mail.gmail.com> On Fri, Sep 18, 2009 at 5:15 AM, Robert G. Brown wrote: [snip] > On servers I run Centos or RHEL (licenses and all) as the vendor of the > software requires. ?Generally Centos on top, then VMware, then RHEL VMs. > Works fine. 
?The only bad thing I've seen about Centos in the past is > the dark side of a long term freeze -- some very useful tools and > libraries have been in rapid development (notably the GSL and Yum). > RHEL 4 just sucked in this regard, with up2date instead of yum, and an > early, broken version of the GSL. ?Fedora is too fast, RHEL too slow. > What can you do? > We have tried maintaining that balance with Caos Linux, and focusing now on Caos NSA (Node Server Appliance) which makes it a decent HPC solution (at least that is what we are going for) for many of the reasons that we have stated. Warning, it is not intended to be anything aside from a cluster/server type of solution so don't try it on your laptop. ;) http://www.caoslinux.org/ Greg -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From lindahl at pbm.com Fri Sep 18 16:31:37 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Fri, 18 Sep 2009 16:31:37 -0700 Subject: [Beowulf] memory for sale Message-ID: <20090918233137.GB17778@bx9.net> If you were thinking, Hey, my DDR2 FBDimm cluster could use a memory upgrade, I have 2800+ 2GB modules I'm about to Ebay. Apacer 78.AKGAB.425 2GB FBD PC2-5300 CL5 Our accountant tells us it's probably a bad idea to show revenue before our launch... -- greg From niftyompi at niftyegg.com Sat Sep 19 16:58:28 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Sat, 19 Sep 2009 16:58:28 -0700 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> Message-ID: <20090919235828.GA3449@compegg> On Fri, Sep 18, 2009 at 07:11:41AM -0500, Rahul Nabar wrote: > On Thu, Sep 17, 2009 at 10:58 PM, Gerry Creager wrote: > > > > With the POSSIBLE exception of the newer Nexus line from Cisco, I can't > > think of a reason I'd put a Cisco-labeled switch in my data center... except > > for a Linksys for non-critical applications. > > > Nexus is expensive though! They tell me that it approaches latencies > of Infiniband and is close to the operations of a lossless switch. > That's what I got from their sales guy. But again, we weren't ready to > pay the premium. I suspect one important point is the difference between data center needs and some types of Beowulf cluster needs. Today a data center needs lots of management bells and whistles while a cluster locally needs flat out packet switching and minimum management overhead. A modern business data closet is full of a tangled mix of services and regulated activities that requires more than just data switching. In this regard Cisco (and others too) has some very valuable product offerings. A small but full feature switch can act as the gate keeper filtering packets interfacing to the Internet, campus and routing services. In some cases it can also provide audit that net Nannie policy may require. i.e. it is part of the campus IT service foundation. The cluster itself needs very little in the way of special services and can be setup and managed as a homogeneous soft gooey center with a hard crusty outside. A "simple" but fast switch with enough ports seems sufficient. NFS traffic (fast cascading funnel tree) can be different than say MPI traffic with all hosts communicating at the same time with all the neighbors (one big cross bar). Your cluster design may well shape your switch benchmark testing. A quick look at Nexus data sheets tells me that you are paying for future expansion. 
The chassis behind the cards is fast, very fast. Heck call Cisco and ask for an evaluation sample, or evaluation discount. ;-) -- T o m M i t c h e l l Found me a new hat, now what? From jellogum at gmail.com Tue Sep 22 12:01:53 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 22 Sep 2009 12:01:53 -0700 Subject: [Beowulf] How do I work around this patent? Message-ID: I joined this community many years ago to learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study. This problem is related to the boring world of business and IP patents. I like patents, but lately I wonder... RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. District Court for the Eastern District of Texas, Tyler Division, for violating patent 7181690. The patent is described as a method for enabling users to interact in a virtual space through avatars." Online read on patents: http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 Can someone help me to better understand how these patents interact with the open source bazaar method of programing, Linux, the law, GIS systems with meta data that is essentially 3-D access for a user's avatar, etc? I am having flow chart issues that are not flowing... and I am now back to the world of research (patents) when I would rather be writing and compiling software. -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From lindahl at pbm.com Tue Sep 22 13:10:57 2009 From: lindahl at pbm.com (Greg Lindahl) Date: Tue, 22 Sep 2009 13:10:57 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090922201057.GA27344@bx9.net> > This problem is related to > the boring world of business and IP patents. I like patents, but lately I > wonder... This is fairly off-topic for this list, but: It's basically impossible to write any significant program these days without infringing on dozens or hundreds of patents. The standard legal advice to software startups is to not read any patents, in order to avoid willful infringement. *If* you get sued, then it's worth looking at the patent in question to see if you can work around it. You can see this process in action in Linux with the argument over workarounds for the "long names in FAT filesystems" patent. The area you're apparenetly interested in, virtual worlds, likely has a zillion patents with a lot of overlap. The situation is the same for things like distributed filesystems, compilers, and perhaps MPI. -- greg From james.p.lux at jpl.nasa.gov Tue Sep 22 13:55:11 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 22 Sep 2009 13:55:11 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On Behalf Of Jeremy Baker Sent: Tuesday, September 22, 2009 12:02 PM To: beowulf at beowulf.org Subject: [Beowulf] How do I work around this patent? I joined this community many years ago to learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study.? 
This problem is related to the boring world of business and IP patents. I like patents, but lately I wonder... RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. District Court for the Eastern District of Texas, Tyler Division, for violating patent 7181690. The patent is described as a method for enabling users to interact in a virtual space through avatars." Online read on patents: http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 Can someone help me to better understand how these patents interact with the open source bazaar method of programing, Linux, the law, GIS systems with meta data that is essentially 3-D access for a user's avatar, etc? I am having flow chart issues that are not flowing... and I am now back to the world of research (patents) when I would rather be writing and compiling software. --- First off, you should know that only the claims determine what the patent covers. The rest of the patent is just useful information. You need to decide if your application "reads on" the claims. Hiring a patent attorney used to doing this kind of analysis is useful.. a few hundred bucks well spent. Note there's a hierarchy of claims here.. Claim 1 is a big claim, and then, 2,3,4,and 5 hang on Claim 1. Glancing through the claims, it looks like they are patenting a scheme very much like described in Michael Crichton's "Disclosure", especially with avatars for other users. Or any of a number of other multi user schemes. Maybe your implementation doesn't read on the claims. I note that most of the claims specifically reference "less than all the other users'" etc. If your implementation has your local client receiving ALL the other user info, then this patent doesn't apply. (Claims 1,6, 9, 10, 11, 15, and 18 all have the "less than all" wording, the rest are subordinate claims) In any case, if what you are doing happens to match the claims, you can always try to break the patent (i.e. find a prior disclosure of what's being patented.. a description in a novel might actually be good enough.. consult your patent attorney). Or, you can patent something yourself, and then offer to cross license with the holder here. Maybe worlds.com would be willing? Whether you are doing open source or not doesn't really have any influence on whether something is infringing. If you're doing something described by the claims, you're infringing. Note that this patent was originally applied for in the mid 90s.. going to be dicey on the prior art. From niftyompi at niftyegg.com Tue Sep 22 16:07:41 2009 From: niftyompi at niftyegg.com (Nifty Tom Mitchell) Date: Tue, 22 Sep 2009 16:07:41 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> On Tue, Sep 22, 2009 at 01:55:11PM -0700, Lux, Jim (337C) wrote: > Subject: [Beowulf] How do I work around this patent? > > I joined this community many years ago to learn about GRID computing... > > "...Worlds.com filed a lawsuit .... for violating patent 7181690. This sounds like this is a patent for implementing the dictionary definition of an avatar. The dictionary definition may provide prior art ;-) and narrow their applicability. 
Reading through it the implementation includes bits I know or suspect to be in well known programs like Microsoft Flight simulator, the SGI "dog" multiuser flight simulator an SGI paper airplane demo that prunes the 3D space to render and interact with all combined with bits of centralized "Go" and "Chess" game servers that have been out there almost as long as the internet. And Big Bertha networked progressive slot machines too.. Greg and Jim's comments are spot on. Greg has his name on some clever patents, I do not know abut Jim. One of the critical points in a patent is that it not be obvious. So the point that you should not look at patents is spot on. If you reinvent the idea with trivial effort - one point to you. So.. Unplug your development stations from the Internet and go back to work in isolation on a private internet. Document your design and go see a patent attorney with your design. Update the design and send him/her updates on a regular basis. In some cases he does not need to read them, just date and file them. In your design document comment on all the moving parts, trivial, clever, novel, critical to the product etc. A good one may also see value in things you might dismiss. Keep the inventor list up to date too. One IMPORTANT point is the moment (date time stamp) that your code is seen live outside of the lab. Alpha and Beta testers can start the clock for you on some critical bits. Same for investor disclosure without NDA etc... demos for the kids etc. Good legal advice can help on all these bits. -- T o m M i t c h e l l Found me a new hat, now what? From james.p.lux at jpl.nasa.gov Tue Sep 22 16:12:44 2009 From: james.p.lux at jpl.nasa.gov (Lux, Jim (337C)) Date: Tue, 22 Sep 2009 16:12:44 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> Message-ID: > Greg and Jim's comments are spot on. > Greg has his name on some clever patents, I do not know abut Jim. > I don't know that it's necessarily clever, but US 5,971,765 is mine... It's certainly unique... (and has been litigated, too..) From rpnabar at gmail.com Tue Sep 22 20:51:40 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 22 Sep 2009 23:51:40 -0400 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <20090919235828.GA3449@compegg> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> <20090919235828.GA3449@compegg> Message-ID: On Sat, Sep 19, 2009 at 7:58 PM, Nifty Tom Mitchell wrote: > > The cluster itself needs very little in the way of special services and can be > setup and managed as a homogeneous soft gooey center with a hard crusty > outside. ? A "simple" but fast switch with enough ports seems sufficient. > NFS traffic (fast cascading funnel tree) can be different than say MPI traffic > with all hosts communicating at the same time with all the neighbors (one big cross bar). > Your cluster design may well shape your switch benchmark testing. Yup, I guess we need to wait for a "simple computing" switch model for HPC just like siilar offerings on the compute server side recently. I've had not much success getting any significant discounts or evaluation switches off my local Cisco Vendors. Now if only there are any Cisco powers-that-be on this mailing list............ 
:-) -- Rahul From rpnabar at gmail.com Tue Sep 22 20:53:55 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Tue, 22 Sep 2009 23:53:55 -0400 Subject: [Beowulf] Re: switching capacity terminology confusion In-Reply-To: <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> References: <20090915232111.GB16891@bx9.net> <4AB11221.4060902@tamu.edu> <4AB3057C.3060607@tamu.edu> <12A7C6D6-56AF-4591-9BDF-154CA6BCB553@online.no> Message-ID: On Fri, Sep 18, 2009 at 9:31 AM, H?kon Bugge wrote: > You might check out the reports from the Tolly Group (www.tolly.com), they > used to evaluate different eth switches. Not sure how un-biased they are > though. Thanks Hakon! The Tolly group is definately a good lead. i found some Dell reviews on there. But Cisco switches seem non existant. The three reviews that I could find were from back in the 90's! -- Rahul From eugen at leitl.org Wed Sep 23 06:26:05 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 23 Sep 2009 15:26:05 +0200 Subject: [Beowulf] Microsoft acquires the technology assets of Interactive Supercomputing (ISC) Message-ID: <20090923132605.GZ27331@leitl.org> FYI http://blogs.technet.com/windowsserver/archive/2009/09/21/microsoft-has-acquired-the-technology-assets-of-interactive-supercomputing-isc.aspx Microsoft acquires the technology assets of Interactive Supercomputing (ISC) Hello everyone, Today, I?m very excited to announce that Microsoft has acquired the technology assets of Interactive Supercomputing (ISC), a company that specializes in bringing the power of parallel computing to the desktop and making high performance computing more accessible to end users. This move represents our ongoing commitment to parallel computing and high performance computing (HPC) and will bring together complementary technologies that will help simplify the complexity and difficulty of expressing problems that can be parallelized. ISC?s products and technology enable faster prototyping, iteration, and deployment of large-scale parallel solutions, which is well aligned with our vision of making high performance computing and parallel computing easier, both on the desktop and in the cluster. Bill Blake, CEO of ISC, is bringing over a team of industry leading experts on parallel and high performance computing that will join the Microsoft team at the New England Research & Development Center in Cambridge, MA. He and I are both excited to start working together on the next generation of technology for researchers, analysts, and engineers, as well as those who have yet to be exposed to the benefits of parallel computing and HPC technologies or may have thought they were out of reach. We have recently begun plans to integrate ISC technologies into future versions of Microsoft products and will provide more information over the coming months on where and how that integration will occur. Beginning immediately, Microsoft will provide support for ISC?s current Star-P customers and we are committed to continually listening to customer needs as we develop the next generation of HPC and parallel computing technologies. I?m looking forward to the opportunities our two combined groups have to greatly improve the capability, performance, and accessibility of parallel computing and HPC technologies. You can find more information on HPC and parallel computing at Microsoft in these links and stay up to date on integration news and updates at Microsoft Pathways, our acquisition information site. 
Kyril Faenov General Manager, High Performance & Parallel Computing Technologies Filed under: HPC, High Performance Computing, windows hpc server 2008, Parallel Computing From hahn at mcmaster.ca Wed Sep 23 13:28:03 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 23 Sep 2009 16:28:03 -0400 (EDT) Subject: [Beowulf] Microsoft acquires the technology assets of Interactive Supercomputing (ISC) In-Reply-To: <20090923132605.GZ27331@leitl.org> References: <20090923132605.GZ27331@leitl.org> Message-ID: > Microsoft acquires the technology assets of Interactive Supercomputing (ISC) ... > Filed under: HPC, High Performance Computing, windows hpc server 2008, > Parallel Computing filed under: yikes!, resistance-is-futile, borgasm, sigh. From trainor at presciencetrust.org Tue Sep 22 20:52:06 2009 From: trainor at presciencetrust.org (Douglas J. Trainor) Date: Wed, 23 Sep 2009 03:52:06 +0000 (GMT) Subject: [Beowulf] How do I work around this patent? Message-ID: <168146350.113478.1253677926162.JavaMail.mail@webmail11> An HTML attachment was scrubbed... URL: From m at pavis.biodec.com Wed Sep 23 01:18:10 2009 From: m at pavis.biodec.com (m at pavis.biodec.com) Date: Wed, 23 Sep 2009 10:18:10 +0200 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: Message-ID: <20090923081810.GB28861@pavis.biodec.com> * Jeremy Baker (jellogum at gmail.com) [090922 21:17]: > > Can someone help me to better understand how these patents interact with > the open source bazaar method of programing, Linux, the law, GIS systems > with meta data that is essentially 3-D access for a user's avatar, etc? I > am having flow chart issues that are not flowing... and I am now back to > the world of research (patents) when I would rather be writing and > compiling software. > you just ignore it, as everybody should, then move to a country where software patents simply do not exist (like Europe, for example: yes, there is a thing called the European Patent Office; no, those pieces of paper that they issue have no value whatsoever) -- .*. finelli /V\ (/ \) -------------------------------------------------------------- ( ) Linux: Friends don't let friends use Piccolosoffice ^^-^^ -------------------------------------------------------------- From kc0hwa at gmail.com Wed Sep 23 09:19:31 2009 From: kc0hwa at gmail.com (J Lee Hughes) Date: Wed, 23 Sep 2009 11:19:31 -0500 Subject: [Beowulf] Re: Beowulf Digest, Vol 67, Issue 31 In-Reply-To: <200909231329.n8NDTM3W013688@bluewest.scyld.com> References: <200909231329.n8NDTM3W013688@bluewest.scyld.com> Message-ID: 1. With a cluster, can the controller power up nodes when they are needed, and power them down when they are not needed? 2. What is the difference between a computer cluster, a computer ray, and a computer grid? ============================== J Lee Hughes K C 0 H W A 73 ============================= Do what you can every day! Learn what you can every day! Life is good! ============================= Mike Ditka - "If God had wanted man to play soccer, he wouldn't have given us arms."
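On point 1, powering nodes up and down in practice usually means scripting their BMCs (or smart PDUs). A minimal sketch with ipmitool, assuming the nodes carry IPMI management cards reachable over the network; the hostname, username and password below are only placeholders:

  # power a node on or off through its BMC (placeholder host and credentials)
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power on
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power soft
  # query the current state
  ipmitool -I lanplus -H node01-bmc -U admin -P secret chassis power status

Schedulers that can park idle nodes generally just call commands along these lines, or the equivalent on an intelligent power distribution unit, from their power-saving hooks.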
On Wed, Sep 23, 2009 at 8:29 AM, wrote: > Send Beowulf mailing list submissions to > beowulf at beowulf.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://www.beowulf.org/mailman/listinfo/beowulf > or, via email, send a message with subject or body 'help' to > beowulf-request at beowulf.org > > You can reach the person managing the list at > beowulf-owner at beowulf.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Beowulf digest..." > > > Today's Topics: > > 1. How do I work around this patent? (Jeremy Baker) > 2. Re: How do I work around this patent? (Greg Lindahl) > 3. RE: How do I work around this patent? (Lux, Jim (337C)) > 4. Re: How do I work around this patent? (Nifty Tom Mitchell) > 5. RE: How do I work around this patent? (Lux, Jim (337C)) > 6. Re: Re: switching capacity terminology confusion (Rahul Nabar) > 7. Re: Re: switching capacity terminology confusion (Rahul Nabar) > 8. Microsoft acquires the technology assets of Interactive > Supercomputing (ISC) (Eugen Leitl) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 22 Sep 2009 12:01:53 -0700 > From: Jeremy Baker > Subject: [Beowulf] How do I work around this patent? > To: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset="iso-8859-1" > > I joined this community many years ago to learn about GRID computing when I > was studying biology and the Linux file system, with future goals to write > interesting open source programs. It's the future and I just hit a wall in > the design process of writing code for my study. This problem is related > to > the boring world of business and IP patents. I like patents, but lately I > wonder... > > RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. > District Court for the Eastern District of Texas, Tyler Division, for > violating patent 7181690. The patent is described as a method for enabling > users to interact in a virtual space through avatars." > > Online read on patents: > http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 > > http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 > > Can someone help me to better understand how these patents interact with > the > open source bazaar method of programing, Linux, the law, GIS systems with > meta data that is essentially 3-D access for a user's avatar, etc? I am > having flow chart issues that are not flowing... and I am now back to the > world of research (patents) when I would rather be writing and compiling > software. > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: > http://www.scyld.com/pipermail/beowulf/attachments/20090922/645d9a8b/attachment-0001.html > > ------------------------------ > > Message: 2 > Date: Tue, 22 Sep 2009 13:10:57 -0700 > From: Greg Lindahl > Subject: Re: [Beowulf] How do I work around this patent? > To: beowulf at beowulf.org > Message-ID: <20090922201057.GA27344 at bx9.net> > Content-Type: text/plain; charset=us-ascii > > > This problem is related to > > the boring world of business and IP patents. I like patents, but lately I > > wonder... > > This is fairly off-topic for this list, but: > > It's basically impossible to write any significant program these days > without infringing on dozens or hundreds of patents. The standard > legal advice to software startups is to not read any patents, in order > to avoid willful infringement. 
*If* you get sued, then it's worth > looking at the patent in question to see if you can work around it. > You can see this process in action in Linux with the argument over > workarounds for the "long names in FAT filesystems" patent. > > The area you're apparenetly interested in, virtual worlds, likely has > a zillion patents with a lot of overlap. The situation is the same for > things like distributed filesystems, compilers, and perhaps MPI. > > -- greg > > > > > ------------------------------ > > Message: 3 > Date: Tue, 22 Sep 2009 13:55:11 -0700 > From: "Lux, Jim (337C)" > Subject: RE: [Beowulf] How do I work around this patent? > To: Jeremy Baker , "beowulf at beowulf.org" > > Message-ID: > > > > Content-Type: text/plain; charset="iso-8859-1" > > From: beowulf-bounces at beowulf.org [mailto:beowulf-bounces at beowulf.org] On > Behalf Of Jeremy Baker > Sent: Tuesday, September 22, 2009 12:02 PM > To: beowulf at beowulf.org > Subject: [Beowulf] How do I work around this patent? > > I joined this community many years ago to learn about GRID computing when I > was studying biology and the Linux file system, with future goals to write > interesting open source programs. It's the future and I just hit a wall in > the design process of writing code for my study. This problem is related to > the boring world of business and IP patents. I like patents, but lately I > wonder... > > RE: "...Worlds.com filed a lawsuit Dec. 24 against NCSoft in the U.S. > District Court for the Eastern District of Texas, Tyler Division, for > violating patent 7181690. The patent is described as a method for enabling > users to interact in a virtual space through avatars." > > Online read on patents: > http://www.google.com/patents?id=wv5-AAAAEBAJ&dq=7,181,690 > > http://www.google.com/patents/about?id=BYoGAAAAEBAJ&dq=6,219,045 > > Can someone help me to better understand how these patents interact with > the open source bazaar method of programing, Linux, the law, GIS systems > with meta data that is essentially 3-D access for a user's avatar, etc? I am > having flow chart issues that are not flowing... and I am now back to the > world of research (patents) when I would rather be writing and compiling > software. > > > > --- > > First off, you should know that only the claims determine what the patent > covers. The rest of the patent is just useful information. You need to > decide if your application "reads on" the claims. Hiring a patent attorney > used to doing this kind of analysis is useful.. a few hundred bucks well > spent. Note there's a hierarchy of claims here.. Claim 1 is a big claim, > and then, 2,3,4,and 5 hang on Claim 1. > > Glancing through the claims, it looks like they are patenting a scheme very > much like described in Michael Crichton's "Disclosure", especially with > avatars for other users. Or any of a number of other multi user schemes. > > Maybe your implementation doesn't read on the claims. I note that most of > the claims specifically reference "less than all the other users'" etc. If > your implementation has your local client receiving ALL the other user info, > then this patent doesn't apply. (Claims 1,6, 9, 10, 11, 15, and 18 all have > the "less than all" wording, the rest are subordinate claims) > > > In any case, if what you are doing happens to match the claims, you can > always try to break the patent (i.e. find a prior disclosure of what's being > patented.. a description in a novel might actually be good enough.. consult > your patent attorney). 
Or, you can patent something yourself, and then > offer to cross license with the holder here. Maybe worlds.com would be > willing? > > Whether you are doing open source or not doesn't really have any influence > on whether something is infringing. If you're doing something described by > the claims, you're infringing. > > Note that this patent was originally applied for in the mid 90s.. going to > be dicey on the prior art. > > > > > > ------------------------------ > > Message: 4 > Date: Tue, 22 Sep 2009 16:07:41 -0700 > From: Nifty Tom Mitchell > Subject: Re: [Beowulf] How do I work around this patent? > To: "Lux, Jim (337C)" > Cc: "beowulf at beowulf.org" > Message-ID: <20090922230741.GA9189 at tosh2egg.ca.sanfran.comcast.net> > Content-Type: text/plain; charset=us-ascii > > On Tue, Sep 22, 2009 at 01:55:11PM -0700, Lux, Jim (337C) wrote: > > Subject: [Beowulf] How do I work around this patent? > > > > I joined this community many years ago to learn about GRID computing... > > > > > "...Worlds.com filed a lawsuit .... for violating patent 7181690. > > This sounds like this is a patent for implementing the dictionary > definition of an avatar. The dictionary definition may provide prior art > ;-) > and narrow their applicability. > > Reading through it the implementation includes bits I know or suspect to > be in well known programs like Microsoft Flight simulator, the SGI "dog" > multiuser flight simulator an SGI paper airplane demo that prunes the 3D > space to render and interact with all combined with bits of centralized > "Go" and "Chess" game servers that have been out there almost as long > as the internet. And Big Bertha networked progressive slot machines too.. > > Greg and Jim's comments are spot on. > Greg has his name on some clever patents, I do not know abut Jim. > > One of the critical points in a patent is that it not be obvious. > So the point that you should not look at patents is spot on. If > you reinvent the idea with trivial effort - one point to you. > > So.. Unplug your development stations from the Internet and go back > to work in isolation on a private internet. Document your design and > go see a patent attorney with your design. Update the design and send > him/her updates on a regular basis. In some cases he does not need to > read them, just date and file them. In your design document comment on > all the moving parts, trivial, clever, novel, critical to the product etc. > A good one may also see value in things you might dismiss. > Keep the inventor list up to date too. > > One IMPORTANT point is the moment (date time stamp) that your code is > seen live outside of the lab. Alpha and Beta testers can start the clock > for you on some critical bits. Same for investor disclosure without NDA > etc... > demos for the kids etc. > > Good legal advice can help on all these bits. > > > > -- > T o m M i t c h e l l > Found me a new hat, now what? > > > > ------------------------------ > > Message: 5 > Date: Tue, 22 Sep 2009 16:12:44 -0700 > From: "Lux, Jim (337C)" > Subject: RE: [Beowulf] How do I work around this patent? > To: Nifty Tom Mitchell > Cc: "beowulf at beowulf.org" > Message-ID: > > > > Content-Type: text/plain; charset="us-ascii" > > > Greg and Jim's comments are spot on. > > Greg has his name on some clever patents, I do not know abut Jim. > > > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > It's certainly unique... (and has been litigated, too..) 
> > > > ------------------------------ > > Message: 6 > Date: Tue, 22 Sep 2009 23:51:40 -0400 > From: Rahul Nabar > Subject: Re: [Beowulf] Re: switching capacity terminology confusion > To: Nifty Tom Mitchell > Cc: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Sat, Sep 19, 2009 at 7:58 PM, Nifty Tom Mitchell > wrote: > > > > > The cluster itself needs very little in the way of special services and > can be > > setup and managed as a homogeneous soft gooey center with a hard crusty > > outside. A "simple" but fast switch with enough ports seems sufficient. > > NFS traffic (fast cascading funnel tree) can be different than say MPI > traffic > > with all hosts communicating at the same time with all the neighbors (one > big cross bar). > > Your cluster design may well shape your switch benchmark testing. > > Yup, I guess we need to wait for a "simple computing" switch model for > HPC just like siilar offerings on the compute server side recently. > I've had not much success getting any significant discounts or > evaluation switches off my local Cisco Vendors. Now if only there are > any Cisco powers-that-be on this mailing list............ :-) > > -- > Rahul > > > > ------------------------------ > > Message: 7 > Date: Tue, 22 Sep 2009 23:53:55 -0400 > From: Rahul Nabar > Subject: Re: [Beowulf] Re: switching capacity terminology confusion > To: H?kon Bugge > Cc: beowulf at beowulf.org > Message-ID: > > Content-Type: text/plain; charset=ISO-8859-1 > > On Fri, Sep 18, 2009 at 9:31 AM, H?kon Bugge wrote: > > > You might check out the reports from the Tolly Group (www.tolly.com), > they > > used to evaluate different eth switches. Not sure how un-biased they are > > though. > > Thanks Hakon! The Tolly group is definately a good lead. i found some > Dell reviews on there. But Cisco switches seem non existant. The three > reviews that I could find were from back in the 90's! > > -- > Rahul > > > > ------------------------------ > > Message: 8 > Date: Wed, 23 Sep 2009 15:26:05 +0200 > From: Eugen Leitl > Subject: [Beowulf] Microsoft acquires the technology assets of > Interactive Supercomputing (ISC) > To: Beowulf at beowulf.org > Message-ID: <20090923132605.GZ27331 at leitl.org> > Content-Type: text/plain; charset=utf-8 > > > FYI > > > http://blogs.technet.com/windowsserver/archive/2009/09/21/microsoft-has-acquired-the-technology-assets-of-interactive-supercomputing-isc.aspx > > Microsoft acquires the technology assets of Interactive Supercomputing > (ISC) > > Hello everyone, > > Today, I???m very excited to announce that Microsoft has acquired the > technology assets of Interactive Supercomputing (ISC), a company that > specializes in bringing the power of parallel computing to the desktop and > making high performance computing more accessible to end users. This move > represents our ongoing commitment to parallel computing and high > performance > computing (HPC) and will bring together complementary technologies that > will > help simplify the complexity and difficulty of expressing problems that can > be parallelized. ISC???s products and technology enable faster > prototyping, > iteration, and deployment of large-scale parallel solutions, which is well > aligned with our vision of making high performance computing and parallel > computing easier, both on the desktop and in the cluster. 
> > Bill Blake, CEO of ISC, is bringing over a team of industry leading experts > on parallel and high performance computing that will join the Microsoft > team > at the New England Research & Development Center in Cambridge, MA. He and > I > are both excited to start working together on the next generation of > technology for researchers, analysts, and engineers, as well as those who > have yet to be exposed to the benefits of parallel computing and HPC > technologies or may have thought they were out of reach. > > We have recently begun plans to integrate ISC technologies into future > versions of Microsoft products and will provide more information over the > coming months on where and how that integration will occur. Beginning > immediately, Microsoft will provide support for ISC???s current Star-P > customers and we are committed to continually listening to customer needs > as > we develop the next generation of HPC and parallel computing technologies. > I???m looking forward to the opportunities our two combined groups have to > greatly improve the capability, performance, and accessibility of parallel > computing and HPC technologies. > > You can find more information on HPC and parallel computing at Microsoft in > these links and stay up to date on integration news and updates at > Microsoft > Pathways, our acquisition information site. > > Kyril Faenov > > General Manager, High Performance & Parallel Computing Technologies > > Filed under: HPC, High Performance Computing, windows hpc server 2008, > Parallel Computing > > > ------------------------------ > > _______________________________________________ > Beowulf mailing list > Beowulf at beowulf.org > http://www.beowulf.org/mailman/listinfo/beowulf > > > End of Beowulf Digest, Vol 67, Issue 31 > *************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken at kschuster.org Wed Sep 23 09:52:42 2009 From: ken at kschuster.org (ken at kschuster.org) Date: Wed, 23 Sep 2009 09:52:42 -0700 (PDT) Subject: [Beowulf] Re: How do I work around this patent? Message-ID: <659673.73527.qm@web56405.mail.re3.yahoo.com> "learn about GRID computing when I was studying biology and the Linux file system, with future goals to write interesting open source programs. It's the future and I just hit a wall in the design process of writing code for my study." ? Jeremy, For me the first question to be addressed is motive.? Are you planning to?distribute your work in such a way as to profit from it in a monetary sense?? If you are?not then the issue of patent is basically moot.? The primary purpose of a patent is to protect the creator from intellectual and financial harm.? If you are going to distribute the material give create to those that you know contributed, which you should do whether it is patented or not.? If you are not going to have a financial benefit then you are not financially harming the other party.? The caveat here is if the patent holder has produced the material in such a manner that they are being financially rewarded and you start to distribute something similar to it for free then you are hurting them financially.? This is a very rough overview and I am sure any bad lawyer would add 50 pages to clarify what I have said. ? I do agree with Greg's proceed in ignorance concept.? If you are not trying to infringe and do not intentional infringe you will mental proceed better and have a lesser case if one were to be brought.? 
I may know that the speed limit is 35 but I will still try to go 45 (;>) ? FYI - I am not a practicing attorney and this was not my primary field of legal studies.? It is JMHO. Ken// -------------- next part -------------- An HTML attachment was scrubbed... URL: From hearnsj at googlemail.com Wed Sep 23 14:11:13 2009 From: hearnsj at googlemail.com (John Hearns) Date: Wed, 23 Sep 2009 22:11:13 +0100 Subject: [Beowulf] Re: Beowulf Digest, Vol 67, Issue 31 In-Reply-To: References: <200909231329.n8NDTM3W013688@bluewest.scyld.com> Message-ID: <9f8092cc0909231411l444abdd8ye4d4a8dd807e4c8a@mail.gmail.com> 2009/9/23 J Lee Hughes : > 1. > with a cluster > can the controller powered up node's' when it is need > and > can the controller? powered down node's' when it is not need That depends. The answer is yes if a) the nodes (servers) in the cluster are equipped with IPMI management cards or b) the nodes (servers) are powered from intelligent mains power distribution units The question I think you are asking is that if the cluster load is low then will nodes (servers) be powered down. The answer is yes, some schedulers (load levellers) offer this facility. > 2. > that is the different between a computer cluster, computer ray, and computer > grid! A compute cluster would be a set of nodes (servers) in one physical location, linked by a LAN or a cluster interconnect. It is presented to the user as one set of resources. A grid is a geographical diverse set of resources - a users task may be moved to the most appropriate part of the grid to be executed. A ray - search me. Never heard the term. From jellogum at gmail.com Wed Sep 23 15:09:57 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed, 23 Sep 2009 15:09:57 -0700 Subject: [Beowulf] Re: How do I work around this patent? In-Reply-To: <659673.73527.qm@web56405.mail.re3.yahoo.com> References: <659673.73527.qm@web56405.mail.re3.yahoo.com> Message-ID: Yes, open source and free to access, but other elements to the study related to the project may result in fees for service that are both lawful and expensive to manage. [Still training the horse to pull its first cart...] Hypothetical use of free program could create a micro economy dependent upon unknown user activity, so it's fair to say that a lawsuit will appear if people use it. The main study (program) is based upon science ideas to explore aspects not associated with the game. For example, the MMORPG is a desired device to "harvest" gamers to advance BOINC packets, for social problem solving, to present educational material within the engine to help mature these young and growing gamers who will be adults someday, etc. all dressed in a "game." I study programming for the love of it, so I model problems for reasons other than profit. I wonder if/how this patent will impact robots (user) in a work environment (server managing coordinate system), systems modeling real world phenomena in virtual worlds (weather, DoD simulations, GIS innovations, car apps for GPS... in traffic, real estate virtual tours, etc.)? MMORPG's will likely be just the beginning... and the case may swing in my favor if major industry gets hit for fees to conduct routine day to day business using robots in a coordinate system, or when airports get nailed for a scalable virtual 3-D environment to handle multiusers (planes) employing a server. Who knows? 
I anticipate that the patents will result in Worlds.com discovering that virtual worlds and server side management is an obvious function dependent upon hardware configurations, that their roylaties will appear only when someone uses specific proprietary hardware and its related software to facilite a desired mission statement employing said hardware; in other words, the hardware of the server will dictate their claim's success, that the court will see through the attempt to patent a concept. I may be wrong, and perhaps someone will patent frequencies used over phones that correlate to selling a commodity vs just saying hello... (same multiple users use same computers and servers to post blogs or use Skype, but when using same stuff to play 3-D environment and/or chat it is patentable?). Observing the SCO vs IBM lawsuit, from its very beginning, and researching to understand its progression, I know this issue will go on for a few years... that the money involved will manipulte the court to its advantage... Apple IIe, 1987, I got in trouble at home, in high school, for not doing my homework because I was trying to build a 3-D engine to make a game I had start in '84 based upon D&D. Virtual multiuser environment was/is obvious to me... I trust prior art is out there because my code has long been lost. Maybe a simple work around might be a server for 11-D world ;) Thank you folks, for the input. I am settled to pause work and if I continue, and get to the bazaar level, I will release any finished work in a country where it is lawful. I'm not interested in breaking the law. On Wed, Sep 23, 2009 at 9:52 AM, wrote: > "learn about GRID computing when I was studying biology and the Linux file > system, with future goals to write interesting open source programs. It's > the future and I just hit a wall in > the design process of writing code for my study." > > Jeremy, > For me the first question to be addressed is motive. Are you planning > to distribute your work in such a way as to profit from it in a monetary > sense? If you are not then the issue of patent is basically moot. The > primary purpose of a patent is to protect the creator from intellectual and > financial harm. If you are going to distribute the material give create to > those that you know contributed, which you should do whether it is patented > or not. If you are not going to have a financial benefit then you are not > financially harming the other party. The caveat here is if the patent > holder has produced the material in such a manner that they are being > financially rewarded and you start to distribute something similar to it for > free then you are hurting them financially. This is a very rough overview > and I am sure any bad lawyer would add 50 pages to clarify what I have said. > > I do agree with Greg's proceed in ignorance concept. If you are not trying > to infringe and do not intentional infringe you will mental proceed better > and have a lesser case if one were to be brought. I may know that the speed > limit is 35 but I will still try to go 45 (;>) > > FYI - I am not a practicing attorney and this was not my primary field of > legal studies. It is JMHO. 
Ken// > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From niftyompi at niftyegg.com Wed Sep 23 17:19:53 2009 From: niftyompi at niftyegg.com (NiftyOMPI Tom Mitchell) Date: Wed, 23 Sep 2009 17:19:53 -0700 Subject: [Beowulf] How do I work around this patent? In-Reply-To: References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> Message-ID: <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> Well we already knew that Jim was a "whirl wind" outstanding in his field. On Tue, Sep 22, 2009 at 4:12 PM, Lux, Jim (337C) wrote: >> Greg and Jim's comments are spot on. >> Greg has his name on some clever patents, I do not know abut Jim. >> > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > It's certainly unique... (and has been litigated, too..) > -- NiftyOMPI T o m M i t c h e l l From jellogum at gmail.com Wed Sep 23 20:05:21 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Wed, 23 Sep 2009 23:05:21 -0400 Subject: [Beowulf] How do I work around this patent? In-Reply-To: <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> References: <20090922230741.GA9189@tosh2egg.ca.sanfran.comcast.net> <88815dc10909231719h11261a4ai39d90621f656700e@mail.gmail.com> Message-ID: Virtual world = coordinate system indexing objects; avatar = human or computer/machine that may seek selective real-time coordinate data about objects; server = any computer process or device; object = memory handling a script process or 3-D collection of points or both; such is how I read the patent. If this is a correct interpretation of the patent, it sounds like the Linux file system to me, facilitating multiple users/processes for a single box or a cluster or a GRID. Please correct me if I am wrong. I thought it was the expert farmer who was outstanding in his field? On Wed, Sep 23, 2009 at 8:19 PM, NiftyOMPI Tom Mitchell < niftyompi at niftyegg.com> wrote: > Well we already knew that Jim was a "whirl wind" outstanding in his field. > > On Tue, Sep 22, 2009 at 4:12 PM, Lux, Jim (337C) > wrote: > >> Greg and Jim's comments are spot on. > >> Greg has his name on some clever patents, I do not know abut Jim. > >> > > > > I don't know that it's necessarily clever, but US 5,971,765 is mine... > > It's certainly unique... (and has been litigated, too..) > > > > > > -- > NiftyOMPI > T o m M i t c h e l l > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From eugen at leitl.org Thu Sep 24 00:01:54 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 24 Sep 2009 09:01:54 +0200 Subject: [Beowulf] Facebook: Yes, We Need 100-GigE Message-ID: <20090924070154.GG27331@leitl.org> http://www.lightreading.com/document.asp?doc_id=181899 Facebook: Yes, We Need 100-GigE September 16, 2009 | Craig Matsumoto | Comments (13) It's become clich? to say that companies like Facebook would use 100-Gbit/s Ethernet right now if they had it. But it helps when someone from Facebook actually shows up and hammers on that point. 
Facebook network engineer Donn Lee did that yesterday, pleading his case at a technology seminar on 40- and 100-Gbit/s Ethernet, hosted in Santa Clara, Calif., by The Ethernet Alliance . Representatives from Google (Nasdaq: GOOG) and the Amsterdam Internet Exchange B.V. (AMS-IX) gave similar pleas, but Lee's presentation included some particularly sobering numbers. He said it's reasonable to think Facebook will need its data center backbone fabric to grow to 64 Tbit/s total capacity by the end of next year. How to build such a thing? Lee said his ideal Ethernet box would have 16-Tbit/s switching capacity and 80 100-Gbit/s Ethernet ports or 800 10-Gbit/s Ethernet ports. No such box exists commercially, and Lee is reluctant to go build his own. That leaves him with an unpleasant alternative. Lee drew up a diagram of what Facebook's future data center fabric -- that is, the interconnection of its switch/routers -- would look like if he had to use today's equipment and 10-Gbit/s Ethernet. Instead of the familiar criss-crossing mesh diagram, he got a solid wall of black, signifying just how many connections he'd need. "I would say anybody in the top 25 Websites easily has this problem," he said later. (Lee didn't say anything about how long it would take just to plug in all those fibers. Maybe that job could be created by funds from the U.S. Recovery Act.) Lee also showed charts showing the disconnect between Facebook's wish list and the market. Facebook needed 512 10-Gbit/s Ethernet ports per chassis in 2007 and is likely to need 1,500 in 2010. No chassis offers more than 200 ports, he said. Even though Lee is a veteran of Google and Cisco Systems Inc. (Nasdaq: CSCO), you might wonder if he's just one renegade engineer who doesn't represent the Facebook norm. Not really. It turns out Facebook has only five network engineers -- although Lee said that's a 20 percent increase from the spring of 2008 [math note: which means they had approximately 4.165 engineers at that time]. Even though 100-Gbit/s development started four years ago, Lee thinks it came too late, and that's got him worried about the next generation. He's pulling for 400-Gbit/s Ethernet discussions to start right away. "Let's start the work that doesn't require money, now," he said. "If we have the standard, we can build the product later. I don't mind using an old standard." He might get his wish. The Optoelectronics Industry Development Association (OIDA) is already organizing meetings with an aim toward getting federal money for terabit Ethernet research, said John D'Ambrosia, a Force10 Networks Inc. scientist who helped organize yesterday's event. Of course, money is a major obstacle to the next wave of Ethernet. During an open commenting and Q&A session, multiple audience members pointed out that optical components margins are too thin to support advanced research at many companies, and that carriers are seeing their big, expensive networks getting used to make money for over-the-top services. "There's no revenue in all that bandwidth increase," one audience member commented, citing the carrier case in particular. 
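For what it's worth, the wish-list numbers hang together if "switching capacity" is counted the usual full-duplex way (an assumption about the vendor arithmetic, not something Lee spells out): 80 ports x 100 Gbit/s is 8 Tbit/s in each direction, i.e. 16 Tbit/s counting both directions, and 800 ports x 10 Gbit/s comes to the same 16 Tbit/s. That is the same both-directions bookkeeping that drove the "switching capacity terminology confusion" thread earlier.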
From john.hearns at mclaren.com Thu Sep 24 01:54:26 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 09:54:26 +0100 Subject: [Beowulf] Facebook: Yes, We Need 100-GigE In-Reply-To: <20090924070154.GG27331@leitl.org> References: <20090924070154.GG27331@leitl.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A314@milexchmb1.mil.tagmclarengroup.com> http://www.lightreading.com/document.asp?doc_id=181899 Facebook: Yes, We Need 100-GigE September 16, 2009 | Craig Matsumoto | Comments (13) It's become clich? to say that companies like Facebook would use 100-Gbit/s Ethernet right now if they had it. But it helps when someone from Facebook actually shows up and hammers on that point. "Hold on there Bald Eagle" Isn't this a case of every problem being a nail if you only have a hammer? What is all this bandwidth being USED for? Surely on the Beowulf list we should be looking at this from the angle of parallel processing - if your huge central pipe is maxed out, then maybe distributing the problem among many smaller, cheaper pipes (and processing units) is the way to go. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Thu Sep 24 03:40:09 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 11:40:09 +0100 Subject: [Beowulf] Nhalem EX Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Found some details on the Nehalem EX: http://www.semiaccurate.com/2009/08/25/intel-details-becton-8-cores-and- all/ Internal ring buses? How long till you lot are benchmarking them and claiming your code is taking too long because the data is moving round the bus in the wrong direction :-) I thought understanding L1, 2 and 3 caches was hard enough, without having to think about rings. Ah well. Toroids on chip next? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From kilian.cavalotti.work at gmail.com Thu Sep 24 04:56:23 2009 From: kilian.cavalotti.work at gmail.com (Kilian CAVALOTTI) Date: Thu, 24 Sep 2009 13:56:23 +0200 Subject: [Beowulf] Nhalem EX In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Thu, Sep 24, 2009 at 12:40 PM, Hearns, John wrote: > Internal ring buses? How long till you lot are benchmarking them and > claiming your > code is taking too long because the data is moving round the bus in the > wrong direction :-) Well, if you add cache coloring (http://en.wikipedia.org/wiki/Cache_coloring) to the mix, you can pretty much have the whole DC metro running in you cores. :) > I thought understanding L1, 2 and 3 caches was hard enough, without > having to think about rings. Since creating a monolithic 24MB L3 cache would have make it slow as a slug, they basically added a second level of L2 cache, local to each core, and connected them together with a bidirectionnal bus, so that "if any core needs a byte from any other cache, it is no more than 4 ring hops to the right cache slice." 
It looks a bit like HT Assist (http://www.bit-tech.net/news/hardware/2009/06/01/amd-launches-6-core-istanbul-opteron-proces/1)? Except it's in-chip rather than inter-CPUs. And it's supposed to behave like a large shared L3. > Ah well. Toroids on chip next? Further down, in the posted article: "The transistor count of 2.3 billion backs that up. To make it all work, the center of the chip has a block called the router. It is a crossbar switch that connects all internal and external channels, up to eight at a time." The chip itself is becoming a NUMA-like system, with its own internal network, a crossbar switch and its own internal topology. At some time, if the number of cores continues to grow, it wouldn't be that surprising to see some locality emerge, in the form of local clusters of cores, tightly coupled on a bus ring, and interconnected to other cluster of cores through QPI links (intra- or inter-chips). Network architectures as we see them today at the Infiniband interconnect level could very well make their way into the chips. So yes, toroids, why not? :) "With 4 QPI links, 8 memory channels, 8 cores, 8 cache slices, 2 memory controllers, 2 cache agents, 2 home agents and a pony, this chip is getting quite complex." You bet! Cheers, -- Kilian From john.hearns at mclaren.com Thu Sep 24 05:13:21 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Thu, 24 Sep 2009 13:13:21 +0100 Subject: [Beowulf] Nhalem EX In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> The chip itself is becoming a NUMA-like system, with its own internal network, a crossbar switch and its own internal topology. At some time, if the number of cores continues to grow, it wouldn't be that surprising to see some locality emerge, in the form of local clusters of cores, tightly coupled on a bus ring, Kilian, I agree. I was just being a bit jocular about the complexity of this thing. Yes, it?s a little NUMA system on a chip - which is quite exciting really. Take a four-socket one of these, and a bucketload of RAM, and you have a pretty nice personal system. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From eugen at leitl.org Thu Sep 24 05:50:21 2009 From: eugen at leitl.org (Eugen Leitl) Date: Thu, 24 Sep 2009 14:50:21 +0200 Subject: [Beowulf] Intel Light Peak: 10-100 GBit/s optical consumer LAN Message-ID: <20090924125021.GV27331@leitl.org> (But will it be a packet-switchable protocol over that physical link, or yet another USB brain damage?) http://news.cnet.com/8301-30685_3-10360047-264.html September 23, 2009 12:54 PM PDT Intel's Light Peak: One PC cable to rule them all by Stephen Shankland The Light Peak technology sends signals with infrared light over optical fibers. (Credit: Intel) SAN FRANCISCO--Intel unveiled technology called Light Peak that it hopes ultimately will replace the profusion of different cables sprouting from today's PCs with a single type of fiber-optic link. Dadi Perlmutter, the newly promoted co-general manager of Intel's Architecture Group, demonstrated Light Peak at the Intel Developer Forum here and said components for the technology, though not Light Peak-enabled PCs, will be ready in 2010. 
"We hope to see one single cable," Perlmutter said, adding that one thing getting in the way of smaller laptops is the profusion of cable ports around the systems' edges. This prototype PC has the Light Peak controller and optical connector that sends signals down a single white optical cable. This prototype PC has the Light Peak controller and optical connector that sends signals down a single white optical cable. (Credit: Stephen Shankland/CNET) In a demonstration, Perlmutter showed a PC connected to a monitor across the stage showing high-definition video sent over a Light Peak optical cable. The cable can be as long as 100 meters and can carry data at 10 gigabits per second in both directions simultaneously, though Intel expects it will reach 100 gigabits per second in the next decade, said Jason Ziller, Intel's director of optical input-ouput program office, in an interview. The company envisions Light Peak as a replacement for the cables that currently lead to monitors, external drives, scanners, and just about anything else that plugs in to a computer. A PC could have a number of Light Peak ports for different devices, or a connection could lead to a hub--perhaps an external monitor--with multiple connections of its own, Ziller said. It's not clear how much the technology will cost or how many years it will take to become mainstream. And wireless communication technology--Intel itself has promoted Ultra-Wideband (UWB) for years--offers the attraction of getting rid of some cables altogether. The Light Peak technology handles multiple communication protocols at the same time, with quality-of-service provisions to ensure high-priority traffic such as video get preferred treatment, he said. Intel's Dadi Perlmutter traces the Light Peak cable from a PC to a monitor on the other side of the stage. Light Peak can traverse distances up to 100 meters. Intel's Dadi Perlmutter traces the Light Peak cable from a PC to a monitor on the other side of the stage. Light Peak can traverse distances up to 100 meters. (Credit: Stephen Shankland/CNET) In addition, Intel said it's working on bundling the optical fiber with copper wire so Light Peak can be used to power devices plugged into the PC, he said. The cables themselves are durable, Ziller said: "You can tie a knot in it and it'll still work." Intel has a lot of clout in the computing marketplace, but building support for a radical new connection that could replace DVI, DisplayPort, USB, Firewire, HDMI, and any number of other connections would require broad industry support. Intel's taking the usual approach to tackling that problem: "We're working with the industry to standardize it," Ziller said. Intel has been briefing other companies for "the last few months," and now is trying to get the standards process started in earnest with partners including companies in the computing, consumer electronics, and telephone handset markets, he said. Ziller wouldn't say who else is participating in the effort, but Intel published a statement of support from Sony, which has a lot of clout of its own in many markets. "Sony is excited about the potential for Light Peak technology that Intel has been developing, and believe it could enable a new generation of high-speed device connectivity," said Ryosuke Akahane, vice president of Sony's Vaio Business Group. So will Light Peak become a universal port? "Intel's long-term vision is you could get to that," Ziller said. 
From coutinho at dcc.ufmg.br Thu Sep 24 11:05:26 2009 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Thu, 24 Sep 2009 15:05:26 -0300 Subject: [Beowulf] Nhalem EX In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D48A498@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D48A5CC@milexchmb1.mil.tagmclarengroup.com> Message-ID: 2009/9/24 Hearns, John > > > The chip itself is becoming a NUMA-like system, with its own internal > network, a crossbar switch and its own internal topology. At some > time, if the number of cores continues to grow, it wouldn't be that > surprising to see some locality emerge, in the form of local clusters > of cores, tightly coupled on a bus ring, > > > Kilian, I agree. I was just being a bit jocular about the complexity of > this thing. > Yes, it?s a little NUMA system on a chip - which is quite exciting really. > Take a four-socket one of these, and a bucketload of RAM, and you have a > pretty nice > personal system. > > As the number of cores increase, NUMA multicores will inevitably appear, but I never though that it would happen that soon. > > The contents of this email are confidential and for the exclusive use of > the intended recipient. If you receive this email in error you should not > copy it, retransmit it, use it or disclose its contents but should return it > to the sender immediately and delete your copy. > > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Thu Sep 24 14:54:39 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 16:54:39 -0500 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: <4AA926B8.6060702@scalableinformatics.com> References: <200909091900.n89J07U6031683@bluewest.scyld.com> <4AA926B8.6060702@scalableinformatics.com> Message-ID: On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman wrote: > > root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d ?-v In order to see how good (or bad) my current operating point is I was trying to replicate your test. But what is "io-bm.exe"? Is that some proprietary code or could I have it to run a similar test? -- Rahul From landman at scalableinformatics.com Thu Sep 24 14:55:11 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 24 Sep 2009 17:55:11 -0400 Subject: [Beowulf] Re: how large of an installation have people used NFS with? would 300 mounts kill performance? In-Reply-To: References: <200909091900.n89J07U6031683@bluewest.scyld.com> <4AA926B8.6060702@scalableinformatics.com> Message-ID: <4ABBEABF.2080103@scalableinformatics.com> Rahul Nabar wrote: > On Thu, Sep 10, 2009 at 11:18 AM, Joe Landman > wrote: > >> root at dv4:~# mpirun -np 4 ./io-bm.exe -n 32 -f /data2/test/file -r -d -v > > In order to see how good (or bad) my current operating point is I was > trying to replicate your test. But what is "io-bm.exe"? Is that some > proprietary code or could I have it to run a similar test? > its a tool we wrote for testing. We are releasing it soon. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 24 18:16:20 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 20:16:20 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? Message-ID: I now ran bonnie++ but have trouble figuring out if my perf. stats are up to the mark or not. My original plan was to only estimate the IOPS capabilities of my existing storage setup. But then again I am quite ignorant about the finer nuances. Hence I thought maybe I should post the stats. here and if anyone has comments I'd very much appreciate hearing them. In any case, maybe my stats help someone else sometime! I/O stats on live HPC systems seem hard to find. Data posted below. Since this is an NFS store I ran bonnie++ from both a NFS client compute node and the server. (head node) Server side bonnie++ http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html Client side bonnie++ http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html Caveat: The cluster was in production so there is a chance of externalities affecting my data. (am trying it hard to explain why some stats seem better on the client run than the server run) Subsidary Goal: This setup had 23 clients for NFS. In a new cluster that I am setting up we want to scale this up about 250 clients. Hence want to estimate what sort of performance I'll be looking for in the Storage. (I've found most conversations with vendors pretty non-productive with them weaving vague terms and staying as far away from quantitative estimates as is possible.) (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB RAM; Dell 6248 switches. Port bonding on client servers) -- Rahul From landman at scalableinformatics.com Thu Sep 24 19:06:33 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Thu, 24 Sep 2009 22:06:33 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: Message-ID: <4ABC25A9.6060907@scalableinformatics.com> Rahul Nabar wrote: > I now ran bonnie++ but have trouble figuring out if my perf. stats are > up to the mark or not. My original plan was to only estimate the IOPS > capabilities of my existing storage setup. But then again I am quite Best way to get IOPs data in a "standard" manner is to run the type of test that generates 8k random reads. I'd suggest not using bonnie++. It is, honestly, not that good for HPC IO performance measurement. I have lots of caveats on it, having used it for a while as a test, while looking ever more deeply at it. I've found fio (http://freshmeat.net/projects/fio/) to be an excellent testing tool for disk systems. To use it, compile it (requires libaio), and then run it as fio input.fio For a nice simple IOP test, try this: [random] rw=randread size=4g directory=/data iodepth=32 blocksize=8k numjobs=16 nrfiles=1 group_reporting ioengine=sync loops=1 This file will do 4GB of IO into a directory named /data, using an IO depth of 32, a block size of 8k (the IOP measurement standard) with random reads as the major operation, using standard unix IO. We have 16 simultaneous jobs doing IO, each job using 1 file. 
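Every parameter in that job file can also be passed straight on the fio command line, for anyone who would rather not write the file out first; a minimal sketch, assuming a reasonably current fio build and the same /data directory as above:

  # command-line form of the [random] job file
  fio --name=random --rw=randread --size=4g --directory=/data \
      --iodepth=32 --bs=8k --numjobs=16 --nrfiles=1 \
      --group_reporting --ioengine=sync --loops=1
  # group_reporting folds the 16 jobs into one summary; the bandwidth and
  # completion-latency (clat) figures there, plus the iops figure on builds
  # that print one, are the numbers to compare between runs.

Running it once against the local RAID volume on the server and once against the NFS mount from a client node gives a direct random-read comparison of the two paths.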
It will aggregate all the information from each job and report it, and it will run once. We use this to model bonnie++ and other types of workloads. It provides a great deal of useful information. > ignorant about the finer nuances. Hence I thought maybe I should post > the stats. here and if anyone has comments I'd very much appreciate > hearing them. In any case, maybe my stats help someone else sometime! > I/O stats on live HPC systems seem hard to find. It looks like channel bonding isn't helping you much. Is your server channel bonded? Clients? Both? > > Data posted below. Since this is an NFS store I ran bonnie++ from both > a NFS client compute node and the server. (head node) > > Server side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html > > Client side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html > > > Caveat: The cluster was in production so there is a chance of > externalities affecting my data. (am trying it hard to explain why > some stats seem better on the client run than the server run) > > Subsidary Goal: This setup had 23 clients for NFS. In a new cluster > that I am setting up we want to scale this up about 250 clients. Hence > want to estimate what sort of performance I'll be looking for in the > Storage. (I've found most conversations with vendors pretty > non-productive with them weaving vague terms and staying as far away > from quantitative estimates as is possible.) Heh ... depends on the vendor. We are pretty open and free with our numbers (to our current/prospective customers), and our test cases. Shortly we are releasing the io-bm code for people to test single and parallel IO, and publishing our results as we obtain them. > (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM > disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB > RAM; Dell 6248 switches. Port bonding on client servers) What RAID adapter and drives? I am assuming some sort of Dell unit. What is the connection from the server to the network ... single gigabit (ala Rocks clusters), or 10 GbE, or channel bonded gigabit? -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Thu Sep 24 19:34:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 21:34:14 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABC25A9.6060907@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:06 PM, Joe Landman wrote: > Best way to get IOPs data in a "standard" manner is to run the type of test > that generates 8k random reads. THanks again Joe! I'll run that one. Got to figure out the exact command line. Bonnie is complicated. > I've found fio (http://freshmeat.net/projects/fio/) to be an excellent > testing tool for disk systems. Ok, I have fio. Actually downloaded that after reading some of your comments on your blog. :) Unfortunately the things a beast. Couldn't figure out how to use it. And I really didn't want to do a PhD on disk I/O. Thanks much for your recipie. I am going to try that now. > It looks like channel bonding isn't helping you much. ?Is your server No? From which numbers? > channel bonded? ?Clients? ?Both? Both. >> > Heh ... 
depends on the vendor. ?We are pretty open and free with our numbers > (to our current/prospective customers), and our test cases. True. I shouldn't generalize. But most vendors still. I wish they'd scrap all their "whitepapers". Vendor whitepapers (to me) seem the kind of document that strives to attain the minimum information density. Probably a good reading for the clueless non-technical guys sitting in top-management. Give me quantitative benchmarks or spec sheets any day! (Ok, ok, I am probably venting here; but talking knowledgably with vendors has been hard!) > > What RAID adapter and drives? ?I am assuming some sort of Dell unit. Correct. A Dell Power Connect with an internal RAID card and drives. >What is > the connection from the server to the network Three channel bonded eth connections. >... single gigabit (ala Rocks > clusters), or 10 GbE, or channel bonded gigabit? Nope. No 10 Gig E on this cluster. -- Rahul From rpnabar at gmail.com Thu Sep 24 19:35:12 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Thu, 24 Sep 2009 21:35:12 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:34 PM, Rahul Nabar wrote: > Correct. A Dell Power Connect with an internal RAID card and drives. > Sorry. PowerEdge, I meant to type, -- Rahul From laytonjb at att.net Fri Sep 25 06:52:19 2009 From: laytonjb at att.net (Jeff Layton) Date: Fri, 25 Sep 2009 09:52:19 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: <4ABCCB13.3070807@att.net> OK, I've had enough of Rahul's posting to this list so I thought I would publically respond to his comments since they directly affect me, my integrity, and the company I work for during the day. >> Heh ... depends on the vendor. We are pretty open and free with our numbers >> (to our current/prospective customers), and our test cases. >> > > True. I shouldn't generalize. But most vendors still. I wish they'd > scrap all their "whitepapers". Vendor whitepapers (to me) seem the > kind of document that strives to attain the minimum information > density. Probably a good reading for the clueless non-technical guys > sitting in top-management. Give me quantitative benchmarks or spec > sheets any day! (Ok, ok, I am probably venting here; but talking > knowledgably with vendors has been hard!) > I have sent Rahul several whitepapers about products that we are proposing. They discuss several aspects of what you _could_ achieve for various configurations. The problem is that we are proposing a specific configuration and have given him estimates of the performance. But without actually building it and testing it we don't know the exact performance. However, this is the case for any customer and any configuration. Plus Rahul keeps asking about what benchmarks to run on his current cluster to tell us something about the proposed cluster. The problem is that he doesn't want to run the benchmarks I have suggested. 
Rahul has not been able to grasp that storage performance depends upon a great number of things:
- The number and type of disks
- The RAID configuration
- How the drives are connected to the host node
- The exact benchmarks
I have tried to talk to him about what he is likely to see in performance but he has not, to this point, been able to understand that it takes a lot of _details_ to specify exact performance. I don't want to take a public list to pick a fight but I don't want Rahul to bad mouth me, my reputation, and the company I work for without some sort of rebuttal. Jeff From rpnabar at gmail.com Fri Sep 25 09:12:08 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 11:12:08 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABCCB13.3070807@att.net> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> Message-ID: On Fri, Sep 25, 2009 at 8:52 AM, Jeff Layton wrote: >I don't want to take a public list to pick a fight but I don't want Rahul to >bad mouth me, my reputation, and the company I work for without some >sort of rebuttal. Jeff: I'm sorry that you take this as bad mouthing you. I sincerely apologize as this wasn't the intention at all. I am speaking with several vendors and my comments weren't directed at you. It is true though that we have been a major purchaser of Dell systems. Sometimes we are happy and sometimes we are not. And this goes to all vendors. Now there is one thing (in hindsight) that I think *I am* guilty of: I've been posting a lot of questions on the list. Many of you guys are very well paid professionals and I ought not to be abusing your time for free. I apologize if I have. On the other hand let me make my position clearer: Buying a good system for my institution is a goal, sure. But a lot of this benchmarking that I do here is from a purely scientific curiosity I have in understanding what I am buying or working with. Building, selling, or maintaining Beowulf systems is *not* my primary occupation. But all through my graduate student stint I've worked on these and I cannot help that I have a curiosity for this stuff. Clusters, linux and Beowulf and all the related stuff is fascinating. I just want to know more about the systems I am using. I am willing to do my homework but sometimes post to get opinions from other experts that know better. Of course, I do realize that this is free advice from peers and I have no expectation of more. Again, I apologize if I hurt any feelings. That wasn't my intention at all. -- Rahul From mwill at penguincomputing.com Fri Sep 25 11:27:29 2009 From: mwill at penguincomputing.com (Michael Will) Date: Fri, 25 Sep 2009 11:27:29 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? References: <4ABC25A9.6060907@scalableinformatics.com><4ABCCB13.3070807@att.net> Message-ID: <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Rahul, From what you posted to the list there is no need to apologize. If there was something going on behind the curtains personally between you and Jeff then that is not really a concern to the list. Keep asking questions and keep curious, and keep posting back results that you find. In fact I find it very interesting, especially since technology progresses so fast, and estimates based on experience expire rather quickly.
Maybe I should mention some pieces of advice on how to deal with vendors that I base on my experience of being on the other side of the fence as a sales engineer for many years: Engage multiple vendors and let them propose technical solutions to your use cases / performance / application goals. Some will have quite technically skilled sales engineering resources, and Jeff Layton for sure is somebody with deep technical experience. Treat public whitepapers as marketing material that will show the solution in the best light for the best use case and not as a hard technical information. They are still interesting to skim over as they describe specific solutions and their intended properties. Then engage in a personal technical dialog with vendors software engineers and try to get to a more technically sound solution plus estimate of what it can really do, what its potential drawbacks and advantages are. I know vendors don't like this (I work for one) but it is in your best interest to talk through the alternative designs you got from other vendors with the sales engineer and have him shoot it down to the best of his ability. Keep in mind he is trying to sell you his solution, so he won't necessarily highlight the good in the alternative design, but he will highlight the weaknesses that the other designer did not choose to focus on. Bring up those points with the alternative designer to see if they can rebut it quickly, and also do your own research about it. Then consolidate the designs and base your choice of vendor not just on the cheapest price of components but on how qualified the design was, because a lot of value is in getting the right solution and ongoing support, not just the right parts. Let all vendors know why you choose design / vendor X so they can adjust their offerings in the future to fit your needs. Best of luck - Michael Will, HPC Software Engineer at Penguin Computing -----Original Message----- From: beowulf-bounces at beowulf.org on behalf of Rahul Nabar Sent: Fri 9/25/2009 9:12 AM To: Jeff Layton Cc: Beowulf Mailing List Subject: Re: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? On Fri, Sep 25, 2009 at 8:52 AM, Jeff Layton wrote: >I don't want to take a public list to pick a fight but I don't want Rahul to >bad mouth me, my reputation, and the company I work for without some >sort or rebuttal. Jeff: I'm sorry that you take this as bad mouthing you. I sincerely apologize as this wasn't the intention at all. I am speaking with several vendors and my comments weren't directed at you. It is true though that we have been a major purchaser of Dell systems. Sometimes we are happy and sometimes we are not . And this goes to all vendors. Now there is one thing (in hindsight) that I think *I am* guilty of: I've been posting a lot of questions on the list. Many of you guys are very well paid professionals and I ought not to be abusing your time for free. I apologize if I have. On the other hand let me make my position clearer: Buying a good system for my institution is a goal, sure. But a lot of this benchmarking that I do here is from a purely scientific curiosity I have in understanding what I am buying or working with. Building, selling, or maintaining Beowulf systems is *not* my primary occupation. But all through my graduate student stint I've worked on these and I cannot help that I have a curiosity for this stuff. Clusters, linux and Beowulf and all the related stuff is fascinating. 
I just want to know more about the systems I am using. I am willing to do my homework but sometimes post to get opinions from other experts that know better. Of course, I do realize that this is free advice from peers and I have no expectation of more. Again, I apologize if I hurt any feelings. That wasn't my intention at all. -- Rahul _______________________________________________ Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Fri Sep 25 11:34:42 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 13:34:42 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Message-ID: On Fri, Sep 25, 2009 at 1:27 PM, Michael Will wrote: > Some will have quite technically skilled sales engineering resources, and > Jeff Layton for sure is somebody with deep > technical experience. Absolutely. Jeff has been a great help so far. Even before he responded to my questions here I've been an active follower of his articles on the Linux Magazine. Great stuff! > > Then consolidate the designs and base your choice of vendor not just on the > cheapest price of components but on how qualified > the design was, because a lot of value is in getting the right solution and > ongoing support, not just the right parts. Let all vendors > know why you choose design / vendor X so they can adjust their offerings in > the future to fit your needs. > Great advice Michael! It is very much appreciated. I've been getting such good advice on this list that I wouldn't know what to do without it. Advice from guys like yourself, Jeff Layton and Joe Landman have helped me a lot and I am myself amazed at how much more (I think) I know now than just a few months ago! Thanks again! -- Rahul From mwill at penguincomputing.com Fri Sep 25 11:35:45 2009 From: mwill at penguincomputing.com (Michael Will) Date: Fri, 25 Sep 2009 11:35:45 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? References: <4ABC25A9.6060907@scalableinformatics.com><4ABCCB13.3070807@att.net><433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> Message-ID: <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> By the way, if you have to work through a purchasing departement that only looks at the cheapest price, but you want to go with a specific vendor because of the superior solution and technical expertise, then you can engage the sales engineer / sales person in defining a uniqueness about the solution that cannot be met by the other vendors, it's called writing a 'sole source justification' letter. That way you can make sure to get the best value in terms of performance / technical ability and support. even if your institutions purchasing departments metrics are not taking that into consideration. 
Michaeld -----Original Message----- From: Rahul Nabar [mailto:rpnabar at gmail.com] Sent: Fri 9/25/2009 11:34 AM To: Michael Will Cc: Jeff Layton; Beowulf Mailing List Subject: Re: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? On Fri, Sep 25, 2009 at 1:27 PM, Michael Will wrote: > Some will have quite technically skilled sales engineering resources, and > Jeff Layton for sure is somebody with deep > technical experience. Absolutely. Jeff has been a great help so far. Even before he responded to my questions here I've been an active follower of his articles on the Linux Magazine. Great stuff! > > Then consolidate the designs and base your choice of vendor not just on the > cheapest price of components but on how qualified > the design was, because a lot of value is in getting the right solution and > ongoing support, not just the right parts. Let all vendors > know why you choose design / vendor X so they can adjust their offerings in > the future to fit your needs. > Great advice Michael! It is very much appreciated. I've been getting such good advice on this list that I wouldn't know what to do without it. Advice from guys like yourself, Jeff Layton and Joe Landman have helped me a lot and I am myself amazed at how much more (I think) I know now than just a few months ago! Thanks again! -- Rahul -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Fri Sep 25 11:47:59 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 13:47:59 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABCCB13.3070807@att.net> <433093DF7AD7444DA65EFAFE3987879C90F7C7@orca.penguincomputing.com> <433093DF7AD7444DA65EFAFE3987879C90F7C9@orca.penguincomputing.com> Message-ID: On Fri, Sep 25, 2009 at 1:35 PM, Michael Will wrote: > By the way, if you have to work through a purchasing departement that only > looks at the cheapest price, but > you want to go with a specific vendor because of the superior solution and > technical expertise, then you can Thanks! I guess I have not faced the problem that you describe (yet). My problem is still easier. It is convincing *myself* which is the right solution. Or more of " how to compare various vendor solutions?" It was easy for the compute nodes. But the storage and backbone are still not clear (to me). One of the reasons is that it is easy to buy several compute nodes piecewise and test the actual codes. But how does one simulate scaleup for estimating performance of switches and backbone. Of course, over the past weeks several useful suggestions have been offered but I am still not sure. Perhaps I am just not doing my homework! :) -- Rahul From rpnabar at gmail.com Fri Sep 25 12:38:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 14:38:51 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABC25A9.6060907@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Thu, Sep 24, 2009 at 9:06 PM, Joe Landman wrote: > I've found fio (http://freshmeat.net/projects/fio/) to be an excellent > testing tool for disk systems. ?To use it, compile it (requires libaio), and > then run it as > > ? ? ? 
?fio input.fio > > For a nice simple IOP test, try this: > > [random] > rw=randread > size=4g > directory=/data > iodepth=32 > blocksize=8k > numjobs=16 > nrfiles=1 > group_reporting > ioengine=sync > loops=1 > > > We use this to model bonnie++ and other types of workloads. ?It provides a > great deal of useful information. Thanks for this "fio" lead Joe. The test seems to run great. I tried this exact same input suite from one of my NFS clients. Detailed results below but I think it reports a IOPS of around 600 (if I read it correctly!) Not bad huh? This is a 5 disk RAID5. 10k RPM 300 GB SAS drives. I'm about to kill the test prematurely , though, since it seems to indicate that it wants to run for 6 more hours! :) I hope this still gave me a good benchmark, though, since I have already run it for about 20 minutes. One mysterios aspect: Is it really showing a 4.6 Gbits/sec throughput. That'd be way too high, I'd have thought! I'm going to do another similar run from the NFS server to isolate out the effect of the disk versus network. I'll post those soon in case it interests any other users. #################################################################### random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=32 ... random: (g=0): rw=randread, bs=8K-8K/8K-8K, ioengine=sync, iodepth=32 Starting 16 processes random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) random: Laying out IO file(s) (1 file(s) / 4096MiB) Jobs: 16 (f=16): [rrrrrrrrrrrrrrrr] [1.4% done] [4,647K/0K /s] [567/0 iops] [eta 06h:23m:08s] ########################################################################### -- Rahul From rpnabar at gmail.com Fri Sep 25 13:54:32 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 15:54:32 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: On Fri, Sep 25, 2009 at 2:38 PM, Rahul Nabar wrote: >> We use this to model bonnie++ and other types of workloads. ?It provides a >> great deal of useful information. > > More details from the fio benchmark..... 
################################################################################################## fio: terminating on signal 2 random: (groupid=0, jobs=16): err= 0: pid=26509 read : io=12,391MiB, bw=3,732KiB/s, iops=466, runt=3399938msec clat (msec): min=3, max=1,122, avg=23.06, stdev= 4.38 bw (KiB/s) : min= 7, max= 621, per=9.34%, avg=348.67, stdev=29.11 cpu : usr=0.04%, sys=0.17%, ctx=3736755, majf=0, minf=33036 IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued r/w: total=1586090/0, short=0/0 lat (msec): 4=0.01%, 10=0.38%, 20=17.11%, 50=69.74%, 100=12.01% lat (msec): 250=0.73%, 500=0.02%, 750=0.01%, 1000=0.01%, 2000=0.01% Run status group 0 (all jobs): READ: io=12,391MiB, aggrb=3,732KiB/s, minb=3,732KiB/s, maxb=3,732KiB/s, mint=3399938msec, maxt=3399938msec ################################################################################################## -- Rahul From landman at scalableinformatics.com Fri Sep 25 13:59:37 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Fri, 25 Sep 2009 16:59:37 -0400 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <4ABC25A9.6060907@scalableinformatics.com> Message-ID: <4ABD2F39.1050005@scalableinformatics.com> Rahul Nabar wrote: > On Fri, Sep 25, 2009 at 2:38 PM, Rahul Nabar wrote: > >>> We use this to model bonnie++ and other types of workloads. It provides a >>> great deal of useful information. >> > > More details from the fio benchmark..... > > ################################################################################################## > fio: terminating on signal 2 > > random: (groupid=0, jobs=16): err= 0: pid=26509 > read : io=12,391MiB, bw=3,732KiB/s, iops=466, runt=3399938msec > clat (msec): min=3, max=1,122, avg=23.06, stdev= 4.38 > bw (KiB/s) : min= 7, max= 621, per=9.34%, avg=348.67, stdev=29.11 > cpu : usr=0.04%, sys=0.17%, ctx=3736755, majf=0, minf=33036 > IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% > issued r/w: total=1586090/0, short=0/0 > > lat (msec): 4=0.01%, 10=0.38%, 20=17.11%, 50=69.74%, 100=12.01% > lat (msec): 250=0.73%, 500=0.02%, 750=0.01%, 1000=0.01%, 2000=0.01% Looks like your IOP latency is around 50ms. If you think about this, it seems a little high, even for a RAID5. I'll do some measurements here on our big units and we can compare. 466 IOP is ok ... basically 4 drives will give you this (5 RAID5 drives -> 4 drives of data). If you are IOP bound, you can do better, if this matters. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From rpnabar at gmail.com Fri Sep 25 14:08:00 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Fri, 25 Sep 2009 16:08:00 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? 
In-Reply-To: <4ABD2F39.1050005@scalableinformatics.com> References: <4ABC25A9.6060907@scalableinformatics.com> <4ABD2F39.1050005@scalableinformatics.com> Message-ID: On Fri, Sep 25, 2009 at 3:59 PM, Joe Landman wrote: > Looks like your IOP latency is around 50ms. If you think about this, it > seems a little high, even for a RAID5. I'll do some measurements here on > our big units and we can compare. Thanks again Joe! About the latency: It could have something to do with the fact that the overall latency on our network has been pretty crappy (I believe). Round trip ping-pong times in the 150 microsec range. Not sure. > 466 IOP is ok ... basically 4 drives will give you this (5 RAID5 drives -> 4 > drives of data). If you are IOP bound, you can do better, if this matters. Sure, I'd have to figure out *if* I (or rather all my apps.) am IOP driven. Not sure, I don't have a good answer for you. Of course, maybe I am just missing a crucial simple step in figuring that out. Do let me know. On the other hand regarding our expansion plans: Our code is working OK on this client that has the 466 IOPS. Ergo any larger storage+network solution that can provide *at least* 466 IOPS ought to work for us. Of course, assuming IOP is a good metric in the first place. But that's back to the same point. -- Rahul From hahn at mcmaster.ca Fri Sep 25 15:09:11 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 25 Sep 2009 18:09:11 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? Message-ID: Hi all, I'm sure you've noticed that disks are incredibly cheap, obscenely large and remarkably fast (at least in bandwidth). the "cheap" part is the only one of these that's really an issue, since the question becomes: how to keep storage infrastructure cost (overhead) from dominating the system cost? the backblaze people took a great swing at this - their solution is really centered on the 5-disk port-multiplier backplanes. (I would love to hear from anyone who has experience with PM's, btw.) but since 1U nodes are still the most common HPC building block, and most of them support 4 LFF SATA disks with very little added cost (esp using the chipset's integrated controller), is there a way to integrate them into a whole-cluster filesystem?
- obviously want to minimize the interference of remote IO to a node's jobs. for serial jobs, this is almost moot. for loosely-coupled parallel jobs (whether threaded or cross-node), this is probably non-critical. even for tight-coupled jobs, perhaps it would be enough to reserve a core for admin/filesystem overhead.
- iscsi/ataoe approach: export the local disks via a low-level block protocol and raid them together on dedicated fileserving node(s). not only does this address the probability of node failure, but a block protocol might be simple enough to avoid deadlock (ie, job does IO, allocating memory for pagecache then network packets, which may by chance wind up triggering network activity back to the same node, and more allocations for the underlying disk IO.)
- distributed filesystem (ceph? gluster? please post any experience!) I know it's possible to run oss+ost services on a lustre client, but not recommended because of the deadlock issue.
- this is certainly related to more focused systems like google/mapreduce. but I'm mainly looking for more general-purpose clusters - the space would be used for normal files, and definitely mixed read/write with something close to normal POSIX semantics...
thanks, mark hahn.
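To make the iscsi/ataoe idea above a bit more concrete: the following is only a rough sketch of the AoE variant, not something tested for this thread, and the interface, device names, shelf/slot numbering and RAID level are all assumptions (vblade/vbladed and aoetools are the usual userspace pieces).

  # on each compute node: export the spare local disk as an AoE target
  # ($NODE_ID must be unique per node; eth1 and /dev/sdb are assumptions)
  vbladed $NODE_ID 1 eth1 /dev/sdb

  # on the dedicated fileserving node: discover the targets and RAID them
  modprobe aoe
  aoe-discover
  mdadm --create /dev/md0 --level=6 --raid-devices=8 \
        /dev/etherd/e0.1 /dev/etherd/e1.1 /dev/etherd/e2.1 /dev/etherd/e3.1 \
        /dev/etherd/e4.1 /dev/etherd/e5.1 /dev/etherd/e6.1 /dev/etherd/e7.1
  mkfs.xfs /dev/md0
  mount /dev/md0 /export/scratch    # then re-export over NFS or similar

RAID6 across the exported disks is one way to ride out a node (or disk) dropping away, at the cost of two disks' worth of capacity; jumbo frames on the GigE fabric are commonly recommended for AoE.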
From jlb17 at duke.edu Fri Sep 25 15:32:36 2009 From: jlb17 at duke.edu (Joshua Baker-LePain) Date: Fri, 25 Sep 2009 18:32:36 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: On Fri, 25 Sep 2009 at 6:09pm, Mark Hahn wrote > but since 1U nodes are still the most common HPC building block, and most of > them support 4 LFF SATA disks with very little added cost (esp using the > chipset's integrated controller), is there a way to integrate them into a > whole-cluster filesystem? This is something I've considered/toyed-with/lusted after for a long while. I haven't pursued it as much as I could have because the clusters I've run to this point have generally run embarrassingly parallel jobs, and I train the users to cache data-in-progress to scratch space on the nodes. But there's a definite draw to a single global scratch space that scales automatically with the cluster itself. > - obviously want to minimize the interference of remote IO to a node's jobs. > for serial jobs, this is almost moot. for loosely-coupled parallel jobs > (whether threaded or cross-node), this is probably non-critical. even for > tight-coupled jobs, perhaps it would be enough to reserve a core for > admin/filesystem overhead. I'd also strongly consider a separate network for filesystem I/O. > - distributed filesystem (ceph? gluster? please post any experience!) I > know it's possible to run oss+ost services on a lustre client, but not > recommended because of the deadlock issue. I played with PVFS1 a bit back in the day. My impression at the time was they they were focused on MPI-IO, and the POSIX layer was a bit of an afterthought -- access with "regular" tools (tar, cp, etc) was pretty slow. I don't know what the situation is with PVFS2. Anyone? > - this is certainly related to more focused systems like google/mapreduce. > but I'm mainly looking for more general-purpose clusters - the space would > be used for normal files, and definitely mixed read/write with something > close to normal POSIX semantics... It seems we're after the same thing. -- Joshua Baker-LePain QB3 Shared Cluster Sysadmin UCSF From cousins at umit.maine.edu Fri Sep 25 15:47:22 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Fri, 25 Sep 2009 18:47:22 -0400 (EDT) Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <200909251614.n8PGE62V011185@bluewest.scyld.com> References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: Hi Rahul, I went through a fair amount of work with this sort of thing (specifying performance and then getting the vendor to bring it up to expectations when performance didn't come close) and I was happiest with Bonnie++ in terms of simplicity of use and the range of stats you get. I haven't kept up with benchmark tools over the last year though. They are benchmarks so as you often hear here: "it all depends". As in, it depends on what sort of applications you are running and whether you want to tune for IOPS or throughput. Sequential or Random. Etc. First thing is that I'd concentrate on the local (server side) performance and then once that is where you expect it work on the NFS side. One thing to try with bonnie++ is to run multiple instances at the same time. For our tests, one single instance of bonnie showed 560 MB/sec writes and 524 MB/sec reads. Going to 4 instances at the same time brought it up to an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. 
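To make the multiple-instance experiment concrete, here is a minimal sketch of one way to launch concurrent bonnie++ runs; the directories, size and user below are placeholders rather than Steve's actual scripts.

  # four bonnie++ instances against the same filesystem, then wait for all of them;
  # -s is the file size in MB (use ~2x RAM so the page cache doesn't flatter the numbers),
  # -n 0 skips the small-file tests, -u nobody is only needed when running as root
  for i in 1 2 3 4; do
      mkdir -p /data/bonnie.$i
      bonnie++ -d /data/bonnie.$i -s 32768 -n 0 -m run$i -u nobody \
          > bonnie.$i.out 2>&1 &
  done
  wait

The aggregate figure is then just the sum of the per-instance block read/write numbers.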
One note about bonding/trunking, check it closely to see that it is working the way you expect. We have a cluster with 14 racks of 20 nodes each rack with a 24 port switch at the top. Each of these switches has four ports trunked together back to the core switch. All nodes have two GbE ports but only eth0 was being used. It turns out that all eth0 MAC addresses in this cluster are even. The hashing algorithm on these switches (HP) only uses the last two bits of the MAC address for a total of four paths. Since all MAC's were even it went from four choices to two so we were only getting half the bandwidth. Once the server has the performance you want, I'd use Netcat from a number of clients at the same time to see if your network is doing what you want. Use netcat and bypass any disks (writing to /dev/null on the server and reading from /dev/zero on the client and vica versa) in order to test that bonding is working. You should be able to fill up the network pipes with aggregate tests from multiple nodes using netcat. Then, test out NFS. You can do this with netcat or with bonnie++ but again I'd recommend running it on multiple nodes at the same time. Good luck. It can be quite a process sorting through it all. I really just meant to comment on your use of only one instance of Bonnie++ on the server. Sorry to go beyond the scope of your question. You probably have already done these other things in a different way. Steve Rahul Nabar wrote: > I now ran bonnie++ but have trouble figuring out if my perf. stats are > up to the mark or not. My original plan was to only estimate the IOPS > capabilities of my existing storage setup. But then again I am quite > ignorant about the finer nuances. Hence I thought maybe I should post > the stats. here and if anyone has comments I'd very much appreciate > hearing them. In any case, maybe my stats help someone else sometime! > I/O stats on live HPC systems seem hard to find. > > Data posted below. Since this is an NFS store I ran bonnie++ from both > a NFS client compute node and the server. (head node) > > Server side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op.html > > Client side bonnie++ > http://dl.getdropbox.com/u/118481/io_benchmarks/bonnie_op_node25.html > > > Caveat: The cluster was in production so there is a chance of > externalities affecting my data. (am trying it hard to explain why > some stats seem better on the client run than the server run) > > Subsidary Goal: This setup had 23 clients for NFS. In a new cluster > that I am setting up we want to scale this up about 250 clients. Hence > want to estimate what sort of performance I'll be looking for in the > Storage. (I've found most conversations with vendors pretty > non-productive with them weaving vague terms and staying as far away > from quantitative estimates as is possible.) > > (Other specs: Gigabit ethernet. RAID5 array of 5 total SAS 10k RPM > disks. Total storage ~ 1.5 Terabyte; both server and client have 16GB > RAM; Dell 6248 switches. Port bonding on client servers) > > -- > Rahul From hahn at mcmaster.ca Fri Sep 25 15:59:49 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Fri, 25 Sep 2009 18:59:49 -0400 (EDT) Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: > users to cache data-in-progress to scratch space on the nodes. But there's a > definite draw to a single global scratch space that scales automatically with > the cluster itself. 
using node-local storage is fine, but really an orthogonal issue. if people are willing to do it, it's great and scales nicely. it doesn't really address the question of how to make use of 3-8 TB per node. we suggest that people use node-local /tmp, and like that name because it emphasizes the nature of the space. currently we don't sweat the cleanup of /tmp (in fact we merely have the distro-default 10-day tmpwatch). >> - obviously want to minimize the interference of remote IO to a node's >> jobs. >> for serial jobs, this is almost moot. for loosely-coupled parallel jobs >> (whether threaded or cross-node), this is probably non-critical. even for >> tight-coupled jobs, perhaps it would be enough to reserve a core for >> admin/filesystem overhead. > > I'd also strongly consider a separate network for filesystem I/O. why? I'd like to see some solid numbers on how often jobs are really bottlenecked on the interconnect (assuming something reasonable like DDR IB). I can certainly imagine it could be so, but how often does it happen? is it only for specific kinds of designs (all-to-all users?) >> - distributed filesystem (ceph? gluster? please post any experience!) I >> know it's possible to run oss+ost services on a lustre client, but not >> recommended because of the deadlock issue. > > I played with PVFS1 a bit back in the day. My impression at the time was yeah, I played with it too, but forgot to mention it because it is afaik still dependent on all nodes being up. admittedly, most of the alternatives also assume all servers are up... From alscheinine at tuffmail.us Fri Sep 25 17:05:36 2009 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Fri, 25 Sep 2009 19:05:36 -0500 Subject: [Beowulf] integrating node disks into a cluster filesystem? In-Reply-To: References: Message-ID: <4ABD5AD0.4050107@tuffmail.us> I have done only a few experiments with parallel file systems but I've run some benchmarks on each one I've encountered. With regard to Joshua Baker-LePain's comment > I played with PVFS1 a bit back in the day. My impression at the time was > they they were focused on MPI-IO, and the POSIX layer was a bit of an > afterthought -- access with "regular" tools (tar, cp, etc) was pretty slow. > I don't know what the situation is with PVFS2. Of the file systems I tested, PVFS2 with Myrinet, but just 8 nodes, was one of the best. I have the impression that all file systems have bugs; so when using a parallel file system that has not had a decade of development, you should only use it for scratch space. I was on the PVFS developer's mailing list for many years, the unending reports of bugs are scary. My guess is that other file systems have similar problems. Filesystems have subtle complexity. From what little I read, you cannot have both POSIX and an efficient parallel file system. If you plan on using the cluster for jobs that are not embarrassingly parallel, but really need parallelism, then it would be a good idea to not have the filesystem on the compute nodes, in order to avoid unbalanced computation -- for domain decomposition, just one laggard subdomain can slow down the entire calculation. > But there's a definite draw to a single global scratch space that > scales automatically with the cluster itself. Using a parallel filesystem efficiently is difficult, for example, avoiding hotspots. I've read that for large parallel jobs the "hits" on each storage node can be effectively random with collisions resulting in inefficient use of the HDDs.
So for any parallel filesystem the development of the program needs to use MPI-IO in a way that is flexible enough to deal with the specifics of the filesystem: block size, number of stripes and interconnection topology. Alan -- Alan Scheinine 200 Georgann Dr., Apt. E6 Vicksburg, MS 39180 Email: alscheinine at tuffmail.us Mobile phone: 225 288 4176 From jellogum at gmail.com Sat Sep 26 10:32:38 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sat, 26 Sep 2009 10:32:38 -0700 Subject: [Beowulf] What is best OS for GRID or clusters? Message-ID: Any significant reason to use SOLARIS over a Linux distro for development of software? Does the same C/C++ file compile well on both systems? If there are differences where can I find suggested reading material on the topic? Advice would be appreciated. Baker -------------- next part -------------- An HTML attachment was scrubbed... URL: From landman at scalableinformatics.com Sat Sep 26 11:09:04 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Sat, 26 Sep 2009 14:09:04 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <4ABE58C0.3080309@scalableinformatics.com> Jeremy Baker wrote: > Any significant reason to use SOLARIS over a Linux distro for > development of software? Most of our customers, if they are still using Solaris and haven't retired it, have marked it as a legacy platform. No new deployments, and a gradual phase out of existing ones. There are a few point solutions (Nexentastor) which are self contained appliances that don't factor into many of these discussions, but for the most part, Solaris use is on the decline. We wouldn't recommend it for clusters/grids/clouds, unless you have a hard dependency upon it, and no other choice. > Does the same C/C++ file compile well on both systems? If there are Well ... as long as it is well written C/C++, not using OS specific hacks, and the compilers are doing the right thing ... it should compile well on both. This said, we've seen (with Fortran anyway) some pretty gnarly things compile (which should have thrown errors) on different platforms. YMMV. > differences where can I find suggested reading material on the topic? > Advice would be appreciated. > Baker > -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From jellogum at gmail.com Sat Sep 26 20:34:29 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sat, 26 Sep 2009 20:34:29 -0700 Subject: [Beowulf] Re: How do I work around this patent? In-Reply-To: References: Message-ID: Linux patent friendly group "Open Invention NetworkSM is an intellectual property company that was formed to promote Linux ..." http://www.openinventionnetwork.com/about.php -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:09:33 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:09:33 -0700 Subject: [Beowulf] Bonsai cluster list Message-ID: Share your story about building a small cluster built with simple parts using modern hardware and/or software, recycled legacy technology, or unusual hacks*. *[The term "hacks" (aka hacker) is used in the traditional sense of lawful innovation, in contrast to the term "cracker" which refers to unlawful innovation.] 
-- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:55:08 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:55:08 -0700 Subject: Fwd: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: ---------- Forwarded message ---------- From: Jeremy Baker Date: Sun, Sep 27, 2009 at 9:54 AM Subject: Re: [Beowulf] What is best OS for GRID or clusters? To: Mark Hahn Sun Solaris Express Developer Edition, w/ Sun Studio compilers, Netbeans ... is a 4G DVD I obtained in the pulp Linux User & Developer (mag issue 76). For legacy hardware, it would work. Time is finite for me, I must pick a system, and presently I have plans to focus on GCC and BSD Linux, that is until a reasonable argument is made to switch, if there is a reason. Baker On Sat, Sep 26, 2009 at 1:05 PM, Mark Hahn wrote: > Any significant reason to use SOLARIS over a Linux distro for development >> of >> software? >> > > in the absence of any other info, it would be absurd to choose solaris. > solaris is merely yet another proprietary version of unix, used by a > vanishingly small fraction of people. generally only those with a fetish > for "commercial support". experienced users know that "support" is always > far short of the PhB dream of a warm cocoon of "everything just > works or someone else fixes it instantly". > > Does the same C/C++ file compile well on both systems? If there are >> > > it depends on the code. any code can be non-portable, but any portable > code will (by definition) compile and run on either OS. > > differences where can I find suggested reading material on the topic? >> > > well, the beowulf list isn't a good place to start. linux is the defacto > standard unix; solaris is merely something you can get from sun. there are > some bits of software which sun still provides only on solaris - the only > possible rational reason to use solaris is to get those (ZFS is really the > only one considered significant.) > -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -- Jeremy Baker SBN 634 337 College Hill Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Sun Sep 27 09:57:05 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Sun, 27 Sep 2009 09:57:05 -0700 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: EROS (Extremely Reliable Operating System) > http://www.eros-os.org/eros.html > > > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From gerry.creager at tamu.edu Sun Sep 27 16:21:00 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Sun, 27 Sep 2009 18:21:00 -0500 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <4ABFF35C.7000707@tamu.edu> Jeremy, I think you'll discover that the Beowulf list tends to comprise a number of folks who are engaged in high performance, or high throughput, computing already, or are coming into the fray, now interested in learning what is composed of the art of the possible. We've a nice assortment of knowledgable folk here, who offer their expertise freely, and whose knowledge is often complementary and extensible, in that one person's experiences and knowledge are often building blocks for another's explanation. 
We tend to run Linux, as a core OS of choice, for a variety of reasons. These include familiarity, experience and comfort levels, and in a number of cases, a systematic determination that it's the best choice for what we're doing. In this post, while apparently asking for opinions about the best OS for grid or cluster computing, you point out "yet another academic OS project" (which is not to dismiss it, but simply to categorize it). EROS, from an academic perspective, looks interesting, but currently impractical. You see, like you, I've a finite temporal resource, and am limited in my current job to a 168 hour work week (and by my wife and family to an even shorter one). I have invested a lot of time in *nix over the years, and have decided to my satisfaction that Linux is the best fit for my scientific efforts. Further (or better|worse, depending on outlook), I prefer CentOS these days for stability. You see, I've isolated clusters that have been running without updates for half a decade, because they're up and stable. I tend to create cluster environments that meet a particular need for performance or throughput, and which can then be administered as efficiently as possible... preferably meaning that neither I, nor my other administrators, have to spend much time with 'em. My real job isn't to play with clusters, OS's or administration, it's to obtain funding and do research using computational models. Please don't take this as a slight. Instead, I'm trying to give you a flavor of *some* of the folks here, and a basis for several of the replies. We're interested, and there are almost certainly folks on this list who've investigated all aspects of what you are asking about. I trust these to answer your queries much better than I can. And don't stop asking. But do realize that we tend to spend a lot of our time trying to get the work out the door rather than searching for the next great tool that could consume all our time learning whether it's practical. Finally, getting back to the query that started all of this, I suspect Linux, and NOT Solaris, would prove easier, by some margin. I recommend you spend a little time investigating NPACI Rocks (yes, I do use them for some clusters) as they have implementations using either Linux or Solaris, and someone's developing a Rocks Roll for grid use, or so I'm told. That could give you a fairly simple implementation path if that's what you're looking for. At first glance, EROS does not look like it's ready for prime time, so I'd not be looking that way. Of course, SOMEONE needs to try it in the cluster world, someday, but I don't have the time to be that person. Good luck in your studies, and welcome to the group! 
gerry Jeremy Baker wrote: > EROS (Extremely Reliable Operating System) > > > http://www.eros-os.org/eros.html > > > > > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > > > ------------------------------------------------------------------------ > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Gerry Creager -- gerry.creager at tamu.edu Texas Mesonet -- AATLT, Texas A&M University Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 From stuartb at 4gh.net Mon Sep 28 08:28:56 2009 From: stuartb at 4gh.net (Stuart Barkley) Date: Mon, 28 Sep 2009 11:28:56 -0400 (EDT) Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: > EROS (Extremely Reliable Operating System) > > http://www.eros-os.org/eros.html Looks like abandonware to me. There appears to have been no activity for well over 5 years. If I don't see any activity on a status page for over 2 years then I assume no one is caring for the code. Sometimes it is just mature code, but I expect that there should be some occasional updates. On the other hand, frequent updates (or never ending beta cycles) indicates immature code that I'm hesitant to put into production use. Stuart Barkley -- I've never been lost; I was once bewildered for three days, but never lost! -- Daniel Boone From tjrc at sanger.ac.uk Mon Sep 28 10:16:42 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Mon, 28 Sep 2009 18:16:42 +0100 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: On 28 Sep 2009, at 4:28 pm, Stuart Barkley wrote: >> EROS (Extremely Reliable Operating System) >> >> http://www.eros-os.org/eros.html > > Looks like abandonware to me. There appears to have been no activity > for well over 5 years. If I don't see any activity on a status page > for over 2 years then I assume no one is caring for the code. > > Sometimes it is just mature code, but I expect that there should be > some occasional updates. > > On the other hand, frequent updates (or never ending beta cycles) > indicates immature code that I'm hesitant to put into production use. It sounds like "I've completed my PhD now" code maintenance, to me. :-) I wrote a piece of software like that. In my case, it wasn't the basis of my thesis, rather it was a distraction from my PhD studies. But all the same, once my PhD was completed, I never really worked on that piece of software much again... Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From atp at piskorski.com Mon Sep 28 14:03:27 2009 From: atp at piskorski.com (Andrew Piskorski) Date: Mon, 28 Sep 2009 17:03:27 -0400 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <20090928210327.GA11172@piskorski.com> On Sun, Sep 27, 2009 at 09:57:05AM -0700, Jeremy Baker wrote: > EROS (Extremely Reliable Operating System) > > > http://www.eros-os.org/eros.html Is that a joke? EROS and object-capability operating systems in general are indeed interesting and potentially very useful, but what does it have to do with Beowulf clusters? 
I haven't heard of anyone using any capability-secure OS whatsoever on a Beowulf cluster. Any counter-examples would be interesting. Also, Jonathan Shapiro, the head of the EROS project, long ago switched to its successor Coyotos and BitC projects, and then earlier this year, left both Coyotos and academia entirely to work for Microsoft on their Midori project: http://www.coyotos.org/pipermail/bitc-dev/2009-April/001784.html http://www.coyotos.org/pipermail/coyotos-dev/2009-April/001867.html http://www.coyotos.org/pipermail/coyotos-dev/2009-July/001872.html I suppose CapROS (another EROS successor) might still be a live project: http://www.capros.org/ -- Andrew Piskorski http://www.piskorski.com/ From bmcnally at u.washington.edu Fri Sep 25 17:02:56 2009 From: bmcnally at u.washington.edu (Brian McNally) Date: Fri, 25 Sep 2009 17:02:56 -0700 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: <4ABD5A30.1080908@u.washington.edu> > One note about bonding/trunking, check it closely to see that it is > working the way you expect. We have a cluster with 14 racks of 20 nodes > each rack with a 24 port switch at the top. Each of these switches has > four ports trunked together back to the core switch. All nodes have two > GbE ports but only eth0 was being used. It turns out that all eth0 MAC > addresses in this cluster are even. The hashing algorithm on these > switches (HP) only uses the last two bits of the MAC address for a total > of four paths. Since all MAC's were even it went from four choices to > two so we were only getting half the bandwidth. I'd second testing to make sure bonding/trunking is working before you base other performance numbers on it. You may also want to consider different bonding modes if you have problems with balancing the traffic out. See: /usr/share/doc/kernel-doc-/Documentation/networking/bonding.txt Just getting bonding working in an optimal way can take some time. Use the port counters on your switches in conjunction with counters on your hosts to make sure traffic is going where you'd expect it to. > Once the server has the performance you want, I'd use Netcat from a > number of clients at the same time to see if your network is doing what > you want. Use netcat and bypass any disks (writing to /dev/null on the > server and reading from /dev/zero on the client and vica versa) in order > to test that bonding is working. You should be able to fill up the > network pipes with aggregate tests from multiple nodes using netcat. You may also consider using iperf for network testing. I used to do raw network tests like this but discovered that iperf is often easier to set up and use. -- Brian McNally From dzaletnev at yandex.ru Fri Sep 25 17:32:28 2009 From: dzaletnev at yandex.ru (Dmitry Zaletnev) Date: Sat, 26 Sep 2009 04:32:28 +0400 Subject: [Beowulf] integrating node disks into a cluster filesystem? Message-ID: <232261253925148@webmail48.yandex.ru> Mark, I use to make experiments with my toy cluster of PS3. and I'm interested in your ideas. PS3 has two network interfaces - GLAN NIC and Wi-fi. Available for running with my firmware 2.70 distros are: Yellow Dog Linux 6.1 NEW, PSUBUNTU (Ubuntu 9.04), Fedora 11. There're Allied Telesyn AT-GS900/8E switch and D-Link DIR-320 Wi-fi router with USB. Except this systems there're Core2Duo E8400/ 8GB RAM/ 1.5 TB HDD and Celeron 1.8/ 1 GB RAM/ 80 GB HDD. 
While I'm waiting when my partner-programmer realize ILP64-scheme in his CFD-package, PS3 stay without any work and they are ready for any experiments with their HDD's of 80 GB. I would prefer not to load their GLAN NICs with something except MPI, but may be it's possible to use wi-fi? There're two PS3's, but it's suffice for an experiment. Dmitry Zaletnev > > users to cache data-in-progress to scratch space on the nodes. But there's a > > definite draw to a single global scratch space that scales automatically with > > the cluster itself. > using node-local storage is fine, but really an orthogonal issue. > if people are willing to do it, it's great and scales nicely. > it doesn't really address the question of how to make use of > 3-8 TB per node. we suggest that people use node-local /tmp, > and like that name because it emphasizes the nature of the space. > currently we don't sweat the cleanup of /tmp (in fact we merely > have the distro-default 10-day tmpwatch). > > > - obviously want to minimize the interference of remote IO to a node's > > > jobs. > > > for serial jobs, this is almost moot. for loosely-coupled parallel jobs > > > (whether threaded or cross-node), this is probably non-critical. even for > > > tight-coupled jobs, perhaps it would be enough to reserve a core for > > > admin/filesystem overhead. > > I'd also strongly consider a separate network for filesystem I/O. > why? I'd like to see some solid numbers on how often jobs are really > bottlenecked on the interconnect (assuming something reasonable like DDR IB). > I can certainly imagine it could be so, but how often does it happen? > is it only for specific kinds of designs (all-to-all users?) > > > - distributed filesystem (ceph? gluster? please post any experience!) I > > > know it's possible to run oss+ost services on a lustre client, but not > > > recommended because of the deadlock issue. > > I played with PVFS1 a bit back in the day. My impression at the time was > yeah, I played with it too, but forgot to mention it because it is afaik > still dependent on all nodes being up. admittedly, most of the alternatives > also assume all servers are up... > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > From worringen at googlemail.com Sat Sep 26 11:55:13 2009 From: worringen at googlemail.com (Joachim Worringen) Date: Sat, 26 Sep 2009 20:55:13 +0200 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: Message-ID: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker wrote: > Any significant reason to use SOLARIS over a Linux distro for development > of software? > Yes, dtrace, documentation, interface stability, quality of community and many things more. I've worked on both platforms a lot, and dtrace alone can save you days when debugging and analysing software, no matter if kernel or userspace. > Does the same C/C++ file compile well on both systems? If there are > differences where can I find suggested reading material on the topic? > Which topic exactly? "Software development" is pretty wide a topic. But just to give you something: I recently bought "The Developer's Edge" and enjoyed it very much (see http://my.safaribooksonline.com/0595352510). Joachim -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From rpnabar at gmail.com Mon Sep 28 15:00:23 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:00:23 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: On Fri, Sep 25, 2009 at 5:47 PM, Steve Cousins wrote: > > Hi Rahul, Thanks for all those comments Steve! > > One thing to try with bonnie++ is to run multiple instances at the same > time. For our tests, one single instance of bonnie showed 560 MB/sec writes > and 524 MB/sec reads. Going to 4 instances at the same time brought it up to > an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. That's interesting. Multiple bonnie++ instances boost the aggregate performance? Why is that? Just curious. > > Once the server has the performance you want, I'd use Netcat from a number Thanks! I've never tried using netcat. That's a good lead for a tool to try. > > Good luck. It can be quite a process sorting through it all. I really just > meant to comment on your use of only one instance of Bonnie++ on the server. > Sorry to go beyond the scope of your question. You probably have already > done these other things in a different way. Not at all. What you suggest is very much within the scope of my current investigation. And no, I don't think I've tried the ideas you mention. So this is new and helpful. Thanks again! -- Rahul From rpnabar at gmail.com Mon Sep 28 15:02:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:02:14 -0500 Subject: [Beowulf] posting bonnie++ stats from our cluster: any comments about my I/O performance stats? In-Reply-To: <4ABD5A30.1080908@u.washington.edu> References: <200909251614.n8PGE62V011185@bluewest.scyld.com> <4ABD5A30.1080908@u.washington.edu> Message-ID: On Fri, Sep 25, 2009 at 7:02 PM, Brian McNally wrote: > I'd second testing to make sure bonding/trunking is working before you base > other performance numbers on it. You may also want to consider different > bonding modes if you have problems with balancing the traffic out. See: > > /usr/share/doc/kernel-doc-/Documentation/networking/bonding.txt Thanks Brian! I'll double check but I have tested in the past to make sure bonding is multiplying my traffic B/W. I am using bonding in the Adaptive Load Balancing (ALB) mode. I forget what tool I had used but I believe it was indeed netperf. Of course the advantage only comes when talking to two+ peers at the same time. -- Rahul From rpnabar at gmail.com Mon Sep 28 15:42:51 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Mon, 28 Sep 2009 17:42:51 -0500 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: <4ABFF35C.7000707@tamu.edu> References: <4ABFF35C.7000707@tamu.edu> Message-ID: > shorter one). I have invested a lot of time in *nix over the years, and have > decided to my satisfaction that Linux is the best fit for my scientific > efforts. Linux has worked great for us for 10+ years in various flavors. My apps are mainly computational Chemistry. >?Further (or better|worse, depending on outlook), I prefer CentOS > these days for stability. ?You see, I've isolated clusters that have been We've used CentOS. Great. No problems. Also toyed with Red Hat Enterprise in the past. Good too but the packages tend to be older. Some would argue that's the point so that one gets a stable system but sometimes it is irritating. Also you need to pay for the Licenses. I've had an OK experience with Fedora too. 
But some say this is too "unstable" for mainline HPC usage. Just my 2 cents. -- Rahul From cousins at umit.maine.edu Mon Sep 28 15:43:30 2009 From: cousins at umit.maine.edu (Steve Cousins) Date: Mon, 28 Sep 2009 18:43:30 -0400 (EDT) Subject: [Beowulf] ISO-8859-1?Q? about_?= my I/O In-Reply-To: References: <200909251614.n8PGE62V011185@bluewest.scyld.com> Message-ID: On Mon, 28 Sep 2009, Rahul Nabar wrote: > On Fri, Sep 25, 2009 at 5:47 PM, Steve Cousins wrote: >>> One thing to try with bonnie++ is to run multiple instances at the same >>> time. For our tests, one single instance of bonnie showed 560 MB/sec writes >>> and 524 MB/sec reads. Going to 4 instances at the same time brought it up to >>> an aggregate of ~600 MB/sec writes and ~950 MB/sec reads. >> > That's interesting. Multiple bonnie++ instances boost the aggregate > performance? Why is that? Just curious. In our case we have a number of LUNS (RAID5 volumes) that are striped together to make a single large volume (99 TB). By doing multiple tests at the same time it can take advantage of hitting more LUNS at the same time. So, this may not help you on your current server but it is worth a try, especially if you get a new storage server. >>> Once the server has the performance you want, I'd use Netcat from a >>> number >> > Thanks! I've never tried using netcat. That's a good lead for a tool to try. If you want, I can supply you with the scripts that I used to set up the ports on the clients and the server and run the tests. Steve From jorg.sassmannshausen at strath.ac.uk Tue Sep 29 07:09:13 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-15?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 29 Sep 2009 15:09:13 +0100 Subject: [Beowulf] large scratch space on cluster Message-ID: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Dear all, I was wondering if somebody could help me here a bit. For some of the calculations we are running on our cluster we need a significant amount of disc space. The last calculation crashed as the ~700 GB which I made available were not enough. So, I want to set up a RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. However, I was wondering whether it does make any sense to somehow 'export' that scratch space to other nodes (4 cores only). So, the idea behind that is, if I need a vast amount of scratch space, I could use the one in the 8 core node (the one I mentioned above). I could do that with nfs but I got the feeling it will be too slow. Also, I only got GB ethernet at hand, so I cannot use some other networks here. Is there a good way of doing that? Some words like i-scsi and cluster-FS come to mind but to be honest, up to now I never really worked with them. Any ideas? All the best J?rg -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. 
See http://www.gnu.org/philosophy/no-word-attachments.html From john.hearns at mclaren.com Tue Sep 29 09:29:22 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Tue, 29 Sep 2009 17:29:22 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> I was wondering if somebody could help me here a bit. For some of the calculations we are running on our cluster we need a significant amount of disc space. The last calculation crashed as the ~700 GB which I made available were not enough. So, I want to set up a RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. Sounds like a cluster I might have had something to do with in a past life... 700 gbytes! My advice - look closely at your software and see why it needs this scratch space, and what you can do to cut down on this. Also, let us know what code this is please. You're right about network transfer of scratch files like that - if at all possible, you should aim to use local scratch space on the nodes. $VENDOR (I think in Warwick!) should be very happy to help you there! The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Tue Sep 29 10:08:45 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Tue, 29 Sep 2009 13:08:45 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC23F1D.6000903@scalableinformatics.com> Hearns, John wrote: > I was wondering if somebody could help me here a bit. > For some of the calculations we are running on our cluster we need a > significant amount of disc space. The last calculation crashed as the ~700 GB > which I made available were not enough. So, I want to set up a RAID0 on one 8 > core node with 2 1.5 TB discs. So far, so good. > > > Sounds like a cluster I might have had something to do with in a past life... > > > 700 gbytes! My advice - look closely at your software and see why it needs this scratch space, > and what you can do to cut down on this. Heh... some of the coupled cluster GAMESS tests we have seen/run have used this much or more in scratch space. Single threaded readers/writers ... you either need a very fast IO device, or like John suggested, you need to examine what is getting read/written. 700GB @ 1GB/s takes 700 seconds, roughly 11m40s +/- some. 700GB @ 0.1GB/s takes 7000 seconds, roughly 116m40s +/- some (~2 hours). A RAID0 stripe of two drives off the motherboard will be closer to the second than the first ... > > Also, let us know what code this is please. > You're right about network transfer of scratch files like that - if at all possible, > you should aim to use local scratch space on the nodes. > $VENDOR (I think in Warwick!) should be very happy to help you there! I know those guys! (and they are good). -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. 
email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From atchley at myri.com Tue Sep 29 10:13:37 2009 From: atchley at myri.com (Scott Atchley) Date: Tue, 29 Sep 2009 13:13:37 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote: > However, I was wondering whether it does make any sense to somehow > 'export' > that scratch space to other nodes (4 cores only). So, the idea > behind that > is, if I need a vast amount of scratch space, I could use the one in > the 8 > core node (the one I mentioned above). I could do that with nfs but > I got the > feeling it will be too slow. Also, I only got GB ethernet at hand, > so I > cannot use some other networks here. Is there a good way of doing > that? Some > words like i-scsi and cluster-FS come to mind but to be honest, up > to now I > never really worked with them. > > Any ideas? > > All the best > > Jörg I am under the impression that NFS can saturate a gigabit link. If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org ) over Open-MX (http://www.open-mx.org). Scott From atchley at myri.com Tue Sep 29 10:39:05 2009 From: atchley at myri.com (Scott Atchley) Date: Tue, 29 Sep 2009 13:39:05 -0400 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <62275941-E2AC-46E4-A0A5-F4520FC32C2B@myri.com> Message-ID: <92933B44-8A80-49E0-86E3-4B1A19CD08FA@myri.com> On Sep 29, 2009, at 1:13 PM, Scott Atchley wrote: > On Sep 29, 2009, at 10:09 AM, Jörg Saßmannshausen wrote: > >> However, I was wondering whether it does make any sense to somehow >> 'export' >> that scratch space to other nodes (4 cores only). So, the idea >> behind that >> is, if I need a vast amount of scratch space, I could use the one >> in the 8 >> core node (the one I mentioned above). I could do that with nfs but >> I got the >> feeling it will be too slow. Also, I only got GB ethernet at hand, >> so I >> cannot use some other networks here. Is there a good way of doing >> that? Some >> words like i-scsi and cluster-FS come to mind but to be honest, up >> to now I >> never really worked with them. >> >> Any ideas? >> >> All the best >> >> Jörg > > I am under the impression that NFS can saturate a gigabit link. > > If for some reason that it cannot, you might want to try PVFS2 (http://www.pvfs.org > ) over Open-MX (http://www.open-mx.org). I should add that PVFS2 is meant to separate the metadata from IO and have multiple IO servers. You can run it on a single server with both metadata and IO, but it may not be much different than NFS. Scott From Craig.Tierney at noaa.gov Tue Sep 29 11:37:50 2009 From: Craig.Tierney at noaa.gov (Craig Tierney) Date: Tue, 29 Sep 2009 12:37:50 -0600 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> Message-ID: <4AC253FE.4080300@noaa.gov> Jörg Saßmannshausen wrote: > Dear all, > > I was wondering if somebody could help me here a bit. 
> For some of the calculations we are running on our cluster we need a > significant amount of disc space. The last calculation crashed as the ~700 GB > which I made available were not enough. So, I want to set up a RAID0 on one 8 > core node with 2 1.5 TB discs. So far, so good. > > However, I was wondering whether it does make any sense to somehow 'export' > that scratch space to other nodes (4 cores only). So, the idea behind that > is, if I need a vast amount of scratch space, I could use the one in the 8 > core node (the one I mentioned above). I could do that with nfs but I got the > feeling it will be too slow. Also, I only got GB ethernet at hand, so I > cannot use some other networks here. Is there a good way of doing that? Some > words like i-scsi and cluster-FS come to mind but to be honest, up to now I > never really worked with them. > You could do something crazy like dynamically create distributed filesystems using GlusterFS (or other Open Source FS) using the local storage of each node that the job is using. This way it is dedicated to your job, share it in your job, and not impact other jobs. Each node needs a disk, but that isn't too expensive. Also, you can skip the RAID part (unless it is for performance) because if the disk dies, it only affects that one node. We tried this for awhile. It worked ok (with GlusterFS), but then we got a good Lustre setup and the performance of the dynamic version didn't justify the effort and maintenance. However, on a smaller system where I don't have that many resources, I might try this again. Craig > Any ideas? > > All the best > > J?rg > -- Craig Tierney (craig.tierney at noaa.gov) From jellogum at gmail.com Tue Sep 29 13:56:50 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 29 Sep 2009 16:56:50 -0400 Subject: [Beowulf] Re: What is best OS for GRID or clusters? In-Reply-To: <4ABFF35C.7000707@tamu.edu> References: <4ABFF35C.7000707@tamu.edu> Message-ID: The quality of writing and thoughtful insight presented on this board has kept me coming back over the years as a reader, as I have observed and learned a great deal by reading various posts... I really appreciate the feed back. Thank you. Is this forum an appropriate place to discuss software concepts, issues, and questions related to modeling a problem to be implemented by a cluster, or is it mostly a place for shop talk to address hardware specs...? My goal is to spend less time with the hardware and more time modeling problems with software, but I am directing some effort to understand the mechanics of the involved hardware components so as to write good code... by trying to understand the nature of the machine, how it works, and how it fits together. This endeavor has spread my time thin, sometimes yielding information that I can not use nor understand, so I welcome criticism to help me focus. I'll try to keep my fuzzy CS questions limited. A little about me: For most of my twenties I lived out of a backpack hitching the country working various terrestrial and maritime jobs from coast to coast, and recently completed my college degree. Currently, I work odd jobs to make ends meet, and with my freetime I enjoy the topics of CS and science, illustrate art, and practise classical guitar. During my undergraduate studies at JSC VT, a college professor, Martin A. Walker now teaching chemistry at SUNY Potsdam, influenced me to install Linux on a system and to work on chemistry problems. 
I decided to pick an interesting problem that I could spend a long time developing, and choose a focus related to molecular biology and hard sciences with a smattering of math classes. I love the rain forest jungle of Linux, but I have become lost in it too... My first cluster will be constructed using recycled legacy x86 pcs, classical cluster w/ a Linux kernel, and perhaps I will start with a failover cluster before parallel high-throughput... Have a nice day, Jeremy On Sun, Sep 27, 2009 at 7:21 PM, Gerry Creager wrote: > Jeremy, > > I think you'll discover that the Beowulf list tends to comprise a number of > folks who are engaged in high performance, or high throughput, computing > already, or are coming into the fray, now interested in learning what is > composed of the art of the possible. > > We've a nice assortment of knowledgable folk here, who offer their > expertise freely, and whose knowledge is often complementary and extensible, > in that one person's experiences and knowledge are often building blocks for > another's explanation. > > We tend to run Linux, as a core OS of choice, for a variety of reasons. > These include familiarity, experience and comfort levels, and in a number of > cases, a systematic determination that it's the best choice for what we're > doing. > > In this post, while apparently asking for opinions about the best OS for > grid or cluster computing, you point out "yet another academic OS project" > (which is not to dismiss it, but simply to categorize it). EROS, from an > academic perspective, looks interesting, but currently impractical. > > You see, like you, I've a finite temporal resource, and am limited in my > current job to a 168 hour work week (and by my wife and family to an even > shorter one). I have invested a lot of time in *nix over the years, and have > decided to my satisfaction that Linux is the best fit for my scientific > efforts. Further (or better|worse, depending on outlook), I prefer CentOS > these days for stability. You see, I've isolated clusters that have been > running without updates for half a decade, because they're up and stable. I > tend to create cluster environments that meet a particular need for > performance or throughput, and which can then be administered as efficiently > as possible... preferably meaning that neither I, nor my other > administrators, have to spend much time with 'em. My real job isn't to play > with clusters, OS's or administration, it's to obtain funding and do > research using computational models. > > Please don't take this as a slight. Instead, I'm trying to give you a > flavor of *some* of the folks here, and a basis for several of the replies. > We're interested, and there are almost certainly folks on this list who've > investigated all aspects of what you are asking about. I trust these to > answer your queries much better than I can. And don't stop asking. But do > realize that we tend to spend a lot of our time trying to get the work out > the door rather than searching for the next great tool that could consume > all our time learning whether it's practical. > > Finally, getting back to the query that started all of this, I suspect > Linux, and NOT Solaris, would prove easier, by some margin. I recommend you > spend a little time investigating NPACI Rocks (yes, I do use them for some > clusters) as they have implementations using either Linux or Solaris, and > someone's developing a Rocks Roll for grid use, or so I'm told. 
That could > give you a fairly simple implementation path if that's what you're looking > for. At first glance, EROS does not look like it's ready for prime time, so > I'd not be looking that way. Of course, SOMEONE needs to try it in the > cluster world, someday, but I don't have the time to be that person. > > Good luck in your studies, and welcome to the group! > gerry > > Jeremy Baker wrote: > >> EROS (Extremely Reliable Operating System) >> >> >> http://www.eros-os.org/eros.html >> >> >> >> >> >> -- >> Jeremy Baker >> PO 297 >> Johnson, VT >> 05656 >> >> >> ------------------------------------------------------------------------ >> >> _______________________________________________ >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing >> To change your subscription (digest mode or unsubscribe) visit >> http://www.beowulf.org/mailman/listinfo/beowulf >> > > -- > Gerry Creager -- gerry.creager at tamu.edu > Texas Mesonet -- AATLT, Texas A&M University > Cell: 979.229.5301 Office: 979.458.4020 FAX: 979.862.3983 > Office: 1700 Research Parkway Ste 160, TAMU, College Station, TX 77843 > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jellogum at gmail.com Tue Sep 29 17:56:08 2009 From: jellogum at gmail.com (Jeremy Baker) Date: Tue, 29 Sep 2009 20:56:08 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> References: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> Message-ID: Re: dtrace, used for dynamic tracing of Solaris real time kernel and software behavior, as Wikipedia states it has been ported to other unix-like systems, I wonder if would one of these systems be a Linux kernel? Dtrace looks very useful. I'll look for it and will check my local library for the suggested reading material. Jeremy On Sat, Sep 26, 2009 at 2:55 PM, Joachim Worringen wrote: > On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker wrote: > >> Any significant reason to use SOLARIS over a Linux distro for development >> of software? >> > > Yes, dtrace, documentation, interface stability, quality of community and > many things more. I've worked on both platforms a lot, and dtrace alone can > save you days when debugging and analysing software, no matter if kernel or > userspace. > > >> Does the same C/C++ file compile well on both systems? If there are >> differences where can I find suggested reading material on the topic? >> > > Which topic exactly? "Software development" is pretty wide a topic. But > just to give you something: I recently bought "The Developer's Edge" and > enjoyed it very much (see http://my.safaribooksonline.com/0595352510). > > Joachim > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -- Jeremy Baker PO 297 Johnson, VT 05656 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jorg.sassmannshausen at strath.ac.uk Tue Sep 29 12:15:19 2009 From: jorg.sassmannshausen at strath.ac.uk (=?iso-8859-1?q?J=F6rg_Sa=DFmannshausen?=) Date: Tue, 29 Sep 2009 20:15:19 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <4AC23F1D.6000903@scalableinformatics.com> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk> <68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com> <4AC23F1D.6000903@scalableinformatics.com> Message-ID: <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> Hi Joe, thanks for the prompt reply. Actually, it is not GAMESS which is causing the problem but Molpro. The reason why it needs that much space is simply the size of the molecule and the functional ( CCSD(T) ). Both are not in favour of a fast job with little scratch space. I don't think there is much I can do in terms of the program side. If I want to run the job, I need more scratch space. I always thought that a RAID0 stripe is the best solution for fast and large scratch space? That is the reason why I thought of that. Besides, these are a large number of small files, so you don't read 700 GB at once. Else it would be an impossible task. ;-) I did that once before, and that was over a NFS share, and it acutally was working not too bad... until somebody triggered the power-switch and did not put it back quick enough so the UPS was running out of battery power :-( Besides, I have already contacted the $VENDORs ;-) All the best J?rg On Dienstag 29 September 2009 Joe Landman wrote: > Hearns, John wrote: > > I was wondering if somebody could help me here a bit. > > For some of the calculations we are running on our cluster we need a > > significant amount of disc space. The last calculation crashed as the > > ~700 GB which I made available were not enough. So, I want to set up a > > RAID0 on one 8 core node with 2 1.5 TB discs. So far, so good. > > > > > > Sounds like a cluster I might have had something to do with in a past > > life... > > > > > > 700 gbytes! My advice - look closely at your software and see why it > > needs this scratch space, and what you can do to cut down on this. > > Heh... some of the coupled cluster GAMESS tests we have seen/run have > used this much or more in scratch space. > > Single threaded readers/writers ... you either need a very fast IO > device, or like John suggested, you need to examine what is getting > read/written. > > 700GB @ 1GB/s takes 700 seconds, roughly 11m40s +/- some. > 700GB @ 0.1GB/s takes 7000 seconds, roughly 116m40s +/- some (~2 hours). > > A RAID0 stripe of two drives off the motherboard will be closer to the > second than the first ... > > > Also, let us know what code this is please. > > You're right about network transfer of scratch files like that - if at > > all possible, you should aim to use local scratch space on the nodes. > > $VENDOR (I think in Warwick!) should be very happy to help you there! > > I know those guys! (and they are good). > > -- > Joseph Landman, Ph.D > Founder and CEO > Scalable Informatics Inc. > email: landman at scalableinformatics.com > web : http://scalableinformatics.com > http://scalableinformatics.com/jackrabbit > phone: +1 734 786 8423 x121 > fax : +1 866 888 3112 > cell : +1 734 612 4615 -- ************************************************************* J?rg Sa?mannshausen Research Fellow University of Strathclyde Department of Pure and Applied Chemistry 295 Cathedral St. 
Glasgow G1 1XL email: jorg.sassmannshausen at strath.ac.uk web: http://sassy.formativ.net Please avoid sending me Word or PowerPoint attachments. See http://www.gnu.org/philosophy/no-word-attachments.html From brs at usf.edu Tue Sep 29 19:00:26 2009 From: brs at usf.edu (Brian Smith) Date: Tue, 29 Sep 2009 22:00:26 -0400 Subject: [Beowulf] What is best OS for GRID or clusters? In-Reply-To: References: <981e81f00909261155q7808e905vd147d83249af453d@mail.gmail.com> Message-ID: <750CC6F4-AE53-46D4-9878-B72B9C83417C@usf.edu> Not to my knowledge. Linux has systemtap instead: http://sourceware.org/systemtap/ Also, see here for a comparison: http://sourceware.org/systemtap/wiki/SystemtapDtraceComparison -Brian On Sep 29, 2009, at 8:56 PM, Jeremy Baker wrote: > Re: dtrace, used for dynamic tracing of Solaris real time kernel and > software behavior, as Wikipedia states it has been ported to other > unix-like systems, I wonder if would one of these systems be a Linux > kernel? Dtrace looks very useful. I'll look for it and will check my > local library for the suggested reading material. > > Jeremy > > > > On Sat, Sep 26, 2009 at 2:55 PM, Joachim Worringen > wrote: > On Sat, Sep 26, 2009 at 7:32 PM, Jeremy Baker > wrote: > Any significant reason to use SOLARIS over a Linux distro for > development of software? > > Yes, dtrace, documentation, interface stability, quality of > community and many things more. I've worked on both platforms a lot, > and dtrace alone can save you days when debugging and analysing > software, no matter if kernel or userspace. > > Does the same C/C++ file compile well on both systems? If there are > differences where can I find suggested reading material on the topic? > > Which topic exactly? "Software development" is pretty wide a topic. > But just to give you something: I recently bought "The Developer's > Edge" and enjoyed it very much (see http://my.safaribooksonline.com/0595352510 > ). > > Joachim > > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > > > -- > Jeremy Baker > PO 297 > Johnson, VT > 05656 > _______________________________________________ > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin > Computing > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -------------- next part -------------- An HTML attachment was scrubbed... URL: From rpnabar at gmail.com Tue Sep 29 22:23:14 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 00:23:14 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? Message-ID: Any good recommendation on a crash cart for a cluster room? My last cluster was small and we had the luxury of having a KVM + SIP connecting to each compute node. I doubt that will be feasible this time around , now that I have 200+ nodes. How do other sys admins handle this? Just a simple crash cart? Or there any other options that make life easier in the long run. Note that for my head nodes etc. I do plan on having a small (4 port) KVM in the main rack with its console and rackmount keyboard. I guess the crash cart will cause a duplication of this keyboard + monitor but that can't be avoided. Or can it? 
I'm not trying to be cheap here but just splurge a bit and get a solution that will make late hours of debugging a little more palatable for the sys-admins! -- Rahul From landman at scalableinformatics.com Tue Sep 29 22:38:14 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Sep 2009 01:38:14 -0400 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: <4AC2EEC6.4000902@scalableinformatics.com> Rahul Nabar wrote: > I'm not trying to be cheap here but just splurge a bit and get a > solution that will make late hours of debugging a little more > palatable for the sys-admins! Most decent nodes will have IPMI and kvm over IP built in. That and a reasonable serial concentrator will make your admins lives *much* easier. We help manage clusters/storage hundreds, thousands, and often continents away from us. A crash cart makes it infeasible to consider this. And if the admin gets paged by someone at 3am, I am sure they won't want to drive in and use the crash cart to find/fix the problem. Keep the crash cart, but don't spend more than $300 on it (monitor + keyboard/mouse and a cart that rolls). It isn't for management, its for last resort. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics, Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From hahn at mcmaster.ca Tue Sep 29 23:26:55 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Sep 2009 02:26:55 -0400 (EDT) Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: > Any good recommendation on a crash cart for a cluster room? My last I've got a rolling cart, 15" vga and keyboard, not a big deal. it's technically an AV cart - provides two other shelves, and enough space for a pad or some tools, even a node sometimes. > nodes. How do other sys admins handle this? Just a simple crash cart? > Or there any other options that make life easier in the long run. thank goodness for lan IPMI (bios, serial console redirection) - it keeps me out of the machineroom 99% of the time. heck, it keeps me out of the dozen other machinerooms that our ~30 clusters are in. > Note that for my head nodes etc. I do plan on having a small (4 port) > KVM in the main rack with its console and rackmount keyboard. I guess not worth it imo - I'd use the crash cart, since the need is so rare. > I'm not trying to be cheap here but just splurge a bit and get a > solution that will make late hours of debugging a little more > palatable for the sys-admins! ipmi ipmi ipmi. try to avoid the obnoxious nonstandard proprietary POS's that vendors push. you want remote power on/off/reset, then serial redirect and hopefully bios redirect too. then temperature monitoring, and SEL access. this should be in every entry-level server IMO. you know the vendor has jumped the shark if/when they provide xml-based scripting and offer licensed extended features rather than standard ipmi support... From beat at 0x1b.ch Tue Sep 29 23:30:09 2009 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 30 Sep 2009 08:30:09 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: Hello! 
Quoting (30.09.09 07:38): > Most decent nodes will have IPMI and kvm over IP built in. That and a > reasonable serial concentrator will make your admins lives *much* easier. +1 vote from me. > Keep the crash cart, but don't spend more than $300 on it (monitor + > keyboard/mouse and a cart that rolls). It isn't for management, its for > last resort. One of the best solutions I saw so far was a KVM drawer in one rack with a small KVM switch and a single, long cable per rack. Open the rack, attach the cable and you're done. Moving a cart through a typical datacenter with cables, boxes and other crap on the floor is usually a mess. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From john.hearns at mclaren.com Wed Sep 30 01:24:31 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 09:24:31 +0100 Subject: [Beowulf] large scratch space on cluster In-Reply-To: <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> References: <200909291509.13812.jorg.sassmannshausen@strath.ac.uk><68A57CCFD4005646957BD2D18E60667B0D524F0A@milexchmb1.mil.tagmclarengroup.com><4AC23F1D.6000903@scalableinformatics.com> <200909292015.19830.jorg.sassmannshausen@strath.ac.uk> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D525194@milexchmb1.mil.tagmclarengroup.com> Actually, it is not GAMESS which is causing the problem but Molpro. The reason why it needs that much space is simply the size of the molecule and the functional ( CCSD(T) ). Both are not in favour of a fast job with little scratch space. I don't think there is much I can do in terms of the program side. If I want to run the job, I need more scratch space. Yes, but do these scratch files have to be on a central disk server, or is it possible to have them local to the nodes? If local, you have a much easier problem. I can say that $VENDOR put in one very whizzy cluster which had four fast SCSI drives striped together for local scratch storage, for running a finite element code. It went like gangbusters. In your case, you could either put pairs of striped drives in each compute node. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From john.hearns at mclaren.com Wed Sep 30 01:27:27 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 09:27:27 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <4AC2EEC6.4000902@scalableinformatics.com> References: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> Most decent nodes will have IPMI and kvm over IP built in. That and a reasonable serial concentrator will make your admins lives *much* easier. I agree wholeheartedly with what Joe says. And Rahul, are you not talking to vendors who are telling you about their remote management and node imaging capabilities? By vendors, I do not mean your local Tier 1 salesman, who sells servers to normal businesses and corporations. I mean either a specialised cluster vendor, such as those on this list, or the HPC specialist team within (Sun, Dell, IBM, HP...) 
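Setting up such a pair is only a handful of commands. A minimal sketch with mdadm, assuming the two scratch disks show up as /dev/sdb and /dev/sdc and that /scratch is where you want it mounted (chunk size and filesystem are a matter of taste, and this of course wipes both disks):

    # build a two-disk RAID0 and put a filesystem on it
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.xfs /dev/md0
    mkdir -p /scratch
    mount /dev/md0 /scratch
    # record the array so it reassembles at boot
    mdadm --detail --scan >> /etc/mdadm.conf

RAID0 here is purely for speed and capacity - if one disk dies you lose the scratch contents, which is usually an acceptable trade for per-job temporary files.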
The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From lynesh at Cardiff.ac.uk Wed Sep 30 03:34:46 2009 From: lynesh at Cardiff.ac.uk (Huw Lynes) Date: Wed, 30 Sep 2009 11:34:46 +0100 Subject: [Beowulf] cables all over the floor In-Reply-To: References: Message-ID: <1254306886.2268.9.camel@w609.insrv.cf.ac.uk> On Wed, 2009-09-30 at 08:30 +0200, Beat Rubischon wrote: > Hello! > > One of the best solutions I saw so far was a KVM drawer in one rack with a > small KVM switch and a single, long cable per rack. Open the rack, attach > the cable and you're done. Moving a cart through a typical datacenter with > cables, boxes and other crap on the floor is usually a mess. ...counts to 10.... It sounds like these "typical" datacentres might benefit from an investment in cable management. Really expensive items like cable ties and labels cost less than KVMs. And then maybe some shelving to store all that "crap all over the floor". Thanks, Huw -- Huw Lynes | Advanced Research Computing HEC Sysadmin | Cardiff University | Redwood Building, Tel: +44 (0) 29208 70626 | King Edward VII Avenue, CF10 3NB From rpnabar at gmail.com Wed Sep 30 05:43:54 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:43:54 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: fullcluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> References: <4AC2EEC6.4000902@scalableinformatics.com> <68A57CCFD4005646957BD2D18E60667B0D5251A1@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 3:27 AM, Hearns, John wrote: > > And Rahul, are you not talking to vendors who are telling you about > their remote management and > node imaging capabilities? By vendors, I do not mean your local Tier 1 > salesman, who sells servers to normal businesses > and corporations. Thanks for all the suggestions guys! I am aware of IPMI and my hardware does support it (I think). Its just that I've never had much use for it all my past clusters being very small. -- Rahul From rpnabar at gmail.com Wed Sep 30 05:48:28 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:48:28 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <4AC2EEC6.4000902@scalableinformatics.com> Message-ID: On Wed, Sep 30, 2009 at 1:30 AM, Beat Rubischon wrote: > One of the best solutions I saw so far was a KVM drawer in one rack with a > small KVM switch and a single, long cable per rack. Open the rack, attach > the cable and you're done. Moving a cart through a typical datacenter with > cables, boxes and other crap on the floor is usually a mess. > Thanks! This is exactly what I will shop for then. I used the term "crash cart" in a more generic sense. I've seen crash-carts parked in cluster rooms before and they do look unwieldy. What you suggest seems a better option. -- Rahul From rpnabar at gmail.com Wed Sep 30 05:58:38 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 07:58:38 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? 
In-Reply-To: References: Message-ID: On Wed, Sep 30, 2009 at 1:26 AM, Mark Hahn wrote: > thank goodness for lan IPMI (bios, serial console redirection) - it keeps me > out of the machineroom 99% of the time. ?heck, it keeps me out of the dozen > other machinerooms that our ~30 clusters are in. I am still a bit confused about the exact config. Trying to clarify the picture! Question: Do you have: (a) a separate eth cable coming out of each server that takes the IPMI packets (b) or are the packets pushed out over the primary eth cable already consolidated at the eth card? In case my option-(a)-picture is correct, then it means a doubling of switch ports needed which wouldn't be so nice. Sorry, I probably sound a total luddite but no point pretending I know about the typical setup. [The stuff that I *have* used on servers in the past looked like this: a dongle that connects over the monitor-out and USB port; traps signals; converts them to I/P--> plugs into a separate switch--> console; But I suspect the solution you guys are all so happy about is not this but a better version! Yes, I know, I am a caveman! ] --- Rahul From rpnabar at gmail.com Wed Sep 30 06:13:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:13:52 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: On Wed, Sep 30, 2009 at 1:26 AM, Mark Hahn wrote: > >> Note that for my head nodes etc. I do plan on having a small (4 port) >> KVM in the main rack with its console and rackmount keyboard. I guess > > not worth it imo - I'd use the crash cart, since the need is so rare. Good point! If "IPMI over LAN" works so well I might as well get rid of this mini KVM. > > ipmi ipmi ipmi. ?try to avoid the obnoxious nonstandard proprietary POS's > that vendors push. Thanks for the comments Mark. I'll try and not offend any vendors this time around! :) I had stayed away from "remote manage" precisely because most of what I had heard seemed vendor specific, proprietary systems. I wasn't aware of this public implementation of remote management! >you want remote power on/off/reset, then serial > redirect and hopefully bios redirect too. ?then temperature monitoring, > and SEL access. ?this should be in every entry-level server IMO. > > you know the vendor has jumped the shark if/when they provide xml-based > scripting and offer licensed extended features rather than standard ipmi > support... There are indeed a large number of vendor specific GUI solutions out there. I had played with one of those and it sort of put me off "remote manage" [Again, no offense vendors.] But I'll definitely toy with getting "IPMI over LAN" especially since all of you are so unanimously happy about it!! -- Rahul From john.hearns at mclaren.com Wed Sep 30 06:19:48 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 14:19:48 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> > thank goodness for lan IPMI (bios, serial console redirection) - it keeps me > out of the machineroom 99% of the time. ?heck, it keeps me out of the dozen > other machinerooms that our ~30 clusters are in. I am still a bit confused about the exact config. Trying to clarify the picture! 
Question: Do you have: (a) a separate eth cable coming out of each server that takes the IPMI packets (b) or are the packets pushed out over the primary eth cable already consolidated at the eth card? It depends. Supermicro use the shared-socket approach (actually it is a bridge somewhere on the motherboard), or with Supermicro you can have a separate socket using a little cable with a minu-USB connector onto the IPMI card. Other manufacturers use (a) or (b). On a blade setup the IPMI is carried over the backplane Ethernet links. In case my option-(a)-picture is correct, then it means a doubling of switch ports needed which wouldn't be so nice. If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you do not need the same type of switches. What you need is some cheap 10/100 switches, one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. Then you just connect the switches together in a loop. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From rpnabar at gmail.com Wed Sep 30 06:23:52 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:23:52 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 8:19 AM, Hearns, John wrote: > It depends. Supermicro use the shared-socket approach (actually it is a bridge > somewhere on the motherboard), or with Supermicro you can have a separate > socket using a little cable with a minu-USB connector onto the IPMI card. > Other manufacturers use (a) or (b). > On a blade setup the IPMI is carried over the backplane Ethernet links. > > > If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you > do not need the same type of switches. What you need is some cheap 10/100 switches, > one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. > Then you just connect the switches together in a loop. > I like the shared socket approach. Building a separate IPMI network seems a lot of extra wiring to me. Admittedly the IPMI switches can be configured to be dirt cheap but it still feels like building a extra tiny road for one car a day when a huge highway with spare capacity exists right next door carrying thousands of cars. (Ok, cheesy analogy!) -- Rahul From john.hearns at mclaren.com Wed Sep 30 06:30:50 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 14:30:50 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> I like the shared socket approach. Building a separate IPMI network seems a lot of extra wiring to me. Admittedly the IPMI switches can be configured to be dirt cheap but it still feels like building a extra tiny road for one car a day when a huge highway with spare capacity exists right next door carrying thousands of cars. (Ok, cheesy analogy!) Errrr.... 
you missed all my Beowulf posts about the clashes with the IPMI ports and the ports used for 'rsh' connections on a cluster then? And all the shenanigans with setting sunrpc.min_resvport etc.? Having a separate, simple IPMI network which comes up when you power the racks up has a lot of advantages. 10/100 Netgear switches cost almost nothing, and getting another loom of Cat5 cables configured when the racks are being built is relatively easy. By the way, which hardware do you use? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From gerry.creager at tamu.edu Wed Sep 30 06:53:01 2009 From: gerry.creager at tamu.edu (Gerry Creager) Date: Wed, 30 Sep 2009 08:53:01 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC362BD.4020800@tamu.edu> Hearns, John wrote: > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be > configured to be dirt cheap but it still feels like building a extra > tiny road for one car a day when a huge highway with spare capacity > exists right next door carrying thousands of cars. (Ok, cheesy > analogy!) > > > Errrr.... you missed all my Beowulf posts about the clashes with the > IPMI ports > and the ports used for 'rsh' connections on a cluster then? And all the > shenanigans > with setting sunrpc.min_resvport etc.? > > Having a separate, simple IPMI network which comes up when you power the > racks up > has a lot of advantages. 10/100 Netgear switches cost almost nothing, > and getting > another loom of Cat5 cables configured when the racks are being built is > relatively easy. > > By the way, which hardware do you use? We've been down both paths. On our recent acquisition, we ended up with separate, dedicated IPMI ports, despite our spec stating we wanted shared socked ports. I bought 4 Netgear switches and added infrastructure cabling. Having been down both paths, now, in the last year (nothing is too old to have the memory clear in my mind) I definitely have decided the completely separate IPMI network plan is superior overall. I wish I could retrofit the Dell cluster to accomplish this, but it ain't gonna happen. It's a much cleaner (from a cluster management view) approach, IMNSHO. gerry From rpnabar at gmail.com Wed Sep 30 06:58:22 2009 From: rpnabar at gmail.com (Rahul Nabar) Date: Wed, 30 Sep 2009 08:58:22 -0500 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: On Wed, Sep 30, 2009 at 8:30 AM, Hearns, John wrote: > > By the way, which hardware do you use? Dell in the past. This time around still comparing vendors. 
-- Rahul From john.hearns at mclaren.com Wed Sep 30 07:07:02 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 15:07:02 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE1D2@milexchmb1.mil.tagmclarengroup.com> Dell in the past. This time around still comparing vendors. Dell refer to IPMI management as 'DRAC', not to be confused with http://www.drac.org.uk/ BTW, if you really don't like extra network cabling and choosing Ethernet switches, why not go for a blade solution? The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From landman at scalableinformatics.com Wed Sep 30 07:09:23 2009 From: landman at scalableinformatics.com (Joe Landman) Date: Wed, 30 Sep 2009 10:09:23 -0400 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: <4AC36693.3040801@scalableinformatics.com> Rahul Nabar wrote: > On Wed, Sep 30, 2009 at 8:19 AM, Hearns, John wrote: > >> It depends. Supermicro use the shared-socket approach (actually it is a bridge >> somewhere on the motherboard), or with Supermicro you can have a separate >> socket using a little cable with a minu-USB connector onto the IPMI card. >> Other manufacturers use (a) or (b). >> On a blade setup the IPMI is carried over the backplane Ethernet links. >> >> >> If you have a separate IPMI network (ILOM, DRAC, whatever they call it) you >> do not need the same type of switches. What you need is some cheap 10/100 switches, >> one in each rack. Say Netgear or D-Link. Not a central switch with a huge backbone capacity. >> Then you just connect the switches together in a loop. >> > > > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be Allow me to point out the contrary view. After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach. The extra cost/hassle of the extra cheap switch and wires is well worth the money. Why do we take this view? Many reasons, but some of the bigger ones are a) when the OS takes the port down, your IPMI no longer responds to arp requests. Which means ping, and any other service (IPMI) will fail without a continuous updating of the arp tables, or a forced hardwire of those ips to those mac addresses. b) IPMI stack bugs (what ... you haven't seen any? you must not be using IPMI ...). My favorite in recent memory (over the last year) was one where IPMI did some a DHCP and got itself wedged into a strange state. To unwedge it, we had to disconnect the IPMI network port, issue an mc reset cold, wait, and the plug it back in. Hard to do when the eth0 and IPMI share the same port. 
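For anyone who hasn't lived through this, the day-to-day commands are nothing exotic. A rough sketch with ipmitool (the hostname, user and password here are made up, and exact behaviour varies between BMC firmwares):

    # out-of-band, from the admin node, against the BMC's own address
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret chassis power status
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret sel list
    # cold-reset a wedged BMC; in-band on the node itself this is just
    # "ipmitool mc reset cold"
    ipmitool -I lanplus -H node042-ipmi -U admin -P secret mc reset cold

The point is that every one of these has to keep working when the host OS, or the shared port, is the thing that is broken.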
Of course I could also talk about the SOL (serial over lan) which didn't (grrrrrrrrrr) Short version, we advise everyone, including some on this list, to always use a second independent IPMI network. We make sure that anyone insisting upon one really truly understands what they are in for. I want to emphasize this. It is, in my opinion, one of the many false savings you can make in cluster design, to pull out the extra switch and wires for IPMI. Its false savings, in that you will likely eat up the cost/effort difference between the two variants in terms of excess labor, self-hair removal, ... Really ... its not worth the pain. Go with two nets. FWIW: most of the server class Supermicro boards (the Nehalems) now come with IPMI and kvm over IP built in, on a separate NIC. Some do share the NIC, we simply avoid using those boards in most cases. Note also: for real lights out capability, we configure alternative management paths. Again, it saves you time/effort/resources down the road for a modest/minimal investment up front. Switched PDUs and a serial port concentrator (or our management node with lots of serial ports ...). It makes life *sooo* much better when "b" strikes, and you need to de-wedgify a node or three, and you are too far to drive in. There is lots to be said for real lights out capability. Park one crash cart in a corner, and hope you will never have to use it. -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics Inc. email: landman at scalableinformatics.com web : http://scalableinformatics.com http://scalableinformatics.com/jackrabbit phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From tjrc at sanger.ac.uk Wed Sep 30 07:19:17 2009 From: tjrc at sanger.ac.uk (Tim Cutts) Date: Wed, 30 Sep 2009 15:19:17 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> Message-ID: On 30 Sep 2009, at 2:23 pm, Rahul Nabar wrote: > I like the shared socket approach. Building a separate IPMI network > seems a lot of extra wiring to me. Admittedly the IPMI switches can be > configured to be dirt cheap but it still feels like building a extra > tiny road for one car a day when a huge highway with spare capacity > exists right next door carrying thousands of cars. (Ok, cheesy > analogy!) Yes, but the tiny road is still useable by the emergency services when there's been a pileup on the main cariageway, and there are wreckage and bodies everywhere! Tim -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From eugen at leitl.org Wed Sep 30 07:20:18 2009 From: eugen at leitl.org (Eugen Leitl) Date: Wed, 30 Sep 2009 16:20:18 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> Message-ID: <20090930142018.GA27331@leitl.org> On Wed, Sep 30, 2009 at 08:58:22AM -0500, Rahul Nabar wrote: > > By the way, which hardware do you use? > > Dell in the past. This time around still comparing vendors. Are you happy with the IPMI on Supermicro? 
I've been unable to personally test a KVM + remote media IPMI from them, especially the one integrated into their newer motherboards. I was quite happy with Sun's eLOM. Less happy with HP and Dell (and Fujitsu-Siemens), which require extra licenses to unlock anything above baseline IPMI capability (oh, and don't get me started on them not selling empty drive caddies). -- Eugen* Leitl leitl http://leitl.org ______________________________________________________________ ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE From john.hearns at mclaren.com Wed Sep 30 07:31:15 2009 From: john.hearns at mclaren.com (Hearns, John) Date: Wed, 30 Sep 2009 15:31:15 +0100 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: <20090930142018.GA27331@leitl.org> References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com><68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> <20090930142018.GA27331@leitl.org> Message-ID: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com> Are you happy with the IPMI on Supermicro? I've been unable to personally test a KVM + remote media IPMI from them, Eugen, I think you're asking that one of me? I'm very, very happy with IPMI on SGI equipment. Rock solid reliability, and you can power cycle blades/IRUs/entire racks when you're sitting in your pyjamas. The contents of this email are confidential and for the exclusive use of the intended recipient. If you receive this email in error you should not copy it, retransmit it, use it or disclose its contents but should return it to the sender immediately and delete your copy. From beat at 0x1b.ch Wed Sep 30 07:39:14 2009 From: beat at 0x1b.ch (Beat Rubischon) Date: Wed, 30 Sep 2009 16:39:14 +0200 Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: Message-ID: Hello! Quoting (30.09.09 16:19): > Yes, but the tiny road is still useable by the emergency services when > there's been a pileup on the main cariageway, and there are wreckage > and bodies everywhere! I have already seen a lot of Beowulfs where the management network is so rarely used that nobody detects errors. When an emergency arrives, the operators stumble over screwed-up switches and unplugged cables. So one vote for the shared NIC from me. On Intel boards this has worked well since summer 2006 (Woodcrest, S5000 chipsets); other platforms started to be usable during 2008/2009. Older boards usually have BMCs which are too old to be really stable. But as always: YMMV. Beat -- \|/ Beat Rubischon ( 0^0 ) http://www.0x1b.ch/~beat/ oOO--(_)--OOo--------------------------------------------------- Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/ From hahn at mcmaster.ca Wed Sep 30 07:54:54 2009 From: hahn at mcmaster.ca (Mark Hahn) Date: Wed, 30 Sep 2009 10:54:54 -0400 (EDT) Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose? In-Reply-To: References: Message-ID: > (a) a separate eth cable coming out of each server that takes the IPMI packets yes. > (b) or are the packets pushed out over the primary eth cable already > consolidated at the eth card? I don't have any of these shared configs, but they exist. I'm not clear on how well they work. 
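either way, the setup side is small. a minimal sketch with ipmitool, in the spirit of the serial/bios redirection I mentioned (channel number, addresses, user and password are all placeholders, and you still need something like console=ttyS1,115200 on the kernel command line for the serial redirect to be worth having):

    # in-band, on the node: see how the BMC's LAN channel is set up
    ipmitool lan print 1
    # give the BMC a static address on the management subnet
    ipmitool lan set 1 ipsrc static
    ipmitool lan set 1 ipaddr 10.1.0.42
    ipmitool lan set 1 netmask 255.255.0.0
    # then, from the admin node, serial-over-lan and power control
    ipmitool -I lanplus -H 10.1.0.42 -U admin -P secret sol activate
    ipmitool -I lanplus -H 10.1.0.42 -U admin -P secret chassis power cycle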
Obviously IPMI is low-traffic, so I would think a shared config could work well.

> In case my option-(a)-picture is correct, then it means a doubling of
> switch ports needed, which wouldn't be so nice.

Switch ports are cheap; if yours are not, you're doing something wrong. Especially since IPMI is always 100bT (as far as I've seen), so we're talking about any old commodity switch: isn't $2-3/port worth it?

> Sorry, I probably sound a total luddite but no point pretending I know
> about the typical setup. [The stuff that I *have* used on servers in
> the past looked like this: a dongle that connects over the monitor-out
> and USB port; traps signals; converts them to I/P--> plugs into a
> separate switch--> console; But I suspect the solution you guys are

I've never used/seen a real KVM-over-IP (I have some KVM-over-cat5, though).

From gerry.creager at tamu.edu Wed Sep 30 08:26:43 2009
From: gerry.creager at tamu.edu (Gerry Creager)
Date: Wed, 30 Sep 2009 10:26:43 -0500
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com> <68A57CCFD4005646957BD2D18E60667B0D5AE171@milexchmb1.mil.tagmclarengroup.com> <20090930142018.GA27331@leitl.org> <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4AC378B3.4090206@tamu.edu>

Hearns, John wrote:

> > Are you happy with the IPMI on Supermicro? I've been unable to
> > personally test KVM + remote media over IPMI from them,
>
> Eugen, I think you're asking that one of me?
> I'm very, very happy with IPMI on SGI equipment. Rock-solid reliability,
> and you can power cycle blades/IRUs/entire racks while you're sitting in
> your pyjamas.

I've really been happy with the SuperMicro IPMI functionality. I'm still learning a few things about it, but it's been working fine.

gc

From landman at scalableinformatics.com Wed Sep 30 08:34:16 2009
From: landman at scalableinformatics.com (Joe Landman)
Date: Wed, 30 Sep 2009 11:34:16 -0400
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References: <68A57CCFD4005646957BD2D18E60667B0D5AE14C@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <4AC37A78.9070604@scalableinformatics.com>

Tim Cutts wrote:

> On 30 Sep 2009, at 2:23 pm, Rahul Nabar wrote:
>
>> I like the shared socket approach. Building a separate IPMI network
>> seems a lot of extra wiring to me. Admittedly the IPMI switches can be
>> configured to be dirt cheap, but it still feels like building an extra
>> tiny road for one car a day when a huge highway with spare capacity
>> exists right next door carrying thousands of cars. (OK, cheesy analogy!)
>
> Yes, but the tiny road is still usable by the emergency services when
> there's been a pileup on the main carriageway, and there's wreckage and
> bodies everywhere!

If this is happening in your computer room, you have bigger issues to worry about than crash carts ... just saying ... :)

> Tim

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From hahn at mcmaster.ca Wed Sep 30 12:13:29 2009
From: hahn at mcmaster.ca (Mark Hahn)
Date: Wed, 30 Sep 2009 15:13:29 -0400 (EDT)
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References:
Message-ID:

>> Yes, but the tiny road is still usable by the emergency services when
>> there's been a pileup on the main carriageway, and there's wreckage
>> and bodies everywhere!
>
> I've already seen a lot of Beowulfs where the management network is so
> rarely used that nobody detects errors. When an emergency arrives, the
> operators struggle over the screwed-up switches and unplugged cables.

I think that's insane (no offense intended), and the same goes for the comment about crash carts being hard to maneuver. The management network should be constantly used for monitoring; never testing or using it is a serious admin failure, IMO. Having enough obstructions in your machine room that you can't move a small cart around is also a serious failure: such a room would necessarily also have disastrous airflow, etc.

From lindahl at pbm.com Wed Sep 30 14:53:56 2009
From: lindahl at pbm.com (Greg Lindahl)
Date: Wed, 30 Sep 2009 14:53:56 -0700
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To: <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
References: <20090930142018.GA27331@leitl.org> <68A57CCFD4005646957BD2D18E60667B0D5AE210@milexchmb1.mil.tagmclarengroup.com>
Message-ID: <20090930215356.GC9237@bx9.net>

On Wed, Sep 30, 2009 at 03:31:15PM +0100, Hearns, John wrote:

> Are you happy with the IPMI on Supermicro? I've been unable to
> personally test KVM + remote media over IPMI from them,

I have 200+ nodes of Supermicro IPMI, and its suckage is lower than the average I've seen over the years. I've had a couple of units replaced because they were misbehaving (some weird configuration problem), I've had to update the firmware once (which was easy) to fix some bugs, and I've only had one mysteriously lock up. That's over more than a year.

I'm also a fan of a dedicated Ethernet for IPMI.

-- greg

From beat at 0x1b.ch Wed Sep 30 23:32:29 2009
From: beat at 0x1b.ch (Beat Rubischon)
Date: Thu, 01 Oct 2009 08:32:29 +0200
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
Message-ID:

Hi Mark!

Quoting (30.09.09 21:13):

>> I've already seen a lot of Beowulfs where the management network is so
>> rarely used that nobody detects errors. When an emergency arrives, the
>> operators struggle over the screwed-up switches and unplugged cables.
>
> I think that's insane (no offense intended), and the same goes for the
> comment about crash carts being hard to maneuver.

You're absolutely right. And the fact that you regularly post to this list is probably a sign that you keep your server rooms clean :-)

Over the last few years I have seen more than 100 datacenters, and I remember fewer than 10 that were a good place to work. Empty boxes, rails, and cables on the floor; PS/2 keyboards paired with USB-only servers, or even no monitor at all; broken raised-floor tiles; leaking cooling pipes; daisy-chained power strips without a free socket. You remember that cables cut to length always end up too short?
Once installed, they are rarely replaced with a longer one that wouldn't have to be strung through the air. Ah yes, and the clips on RJ-45 plugs can break, even on the uplinks of large servers.

As long as a single person or a motivated team runs a datacenter, everything looks good. But when spare-time operators, or even worse several groups, share a spare room in the basement, bad things are common. A coworker summed up the situation nicely: entropy is the natural state, and it takes effort to impose order. Many people out there are not willing or able to invest that effort.

Beat

--
\|/      Beat Rubischon
( 0^0 )  http://www.0x1b.ch/~beat/
oOO--(_)--OOo---------------------------------------------------
Meine Erlebnisse, Gedanken und Traeume: http://www.0x1b.ch/blog/

From andrew.robbie at gmail.com Wed Sep 30 06:22:14 2009
From: andrew.robbie at gmail.com (Andrew Robbie (Gmail))
Date: Wed, 30 Sep 2009 23:22:14 +1000
Subject: [Beowulf] recommendation on crash cart for a cluster room: full cluster KVM is not an option I suppose?
In-Reply-To:
References:
Message-ID: <9F2D83EA-0681-46E8-9DDE-BD48ACE9B7A8@gmail.com>

On 30/09/2009, at 3:23 PM, Rahul Nabar wrote:

> Any good recommendation on a crash cart for a cluster room? My last
> cluster was small and we had the luxury of having a KVM + SIP
> connecting to each compute node.
>
> I doubt that will be feasible this time around, now that I have 200+
> nodes. How do other sys admins handle this? Just a simple crash cart?
> Or are there any other options that make life easier in the long run?

I suggest getting some KVM-over-IP boxes. You plug them into the computers you need console access to, then go somewhere quieter. If you have IPMI that works (rare, in my experience), great.

For when you absolutely have to have local access (i.e. everything is broken), make sure you have it at the right height. There is nothing worse than trying to type standing up at a KVM drawer at the wrong height (and unless all your administrators are the same height, it will always be wrong), so make sure there is always a rolly chair. I actually have a UPS on my cart to power the monitor, which saves draping power across the floor; your brain filters out the UPS beeping pretty quickly.

Andrew
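P.S. Whichever way you go, it pays to walk the management interfaces on a schedule so the dead ones surface long before you actually need them. A rough, untested sketch of the idea, in Python wrapping ipmitool; the BMC hostnames (node001-ipmi and so on) and the shared ADMIN login are made up, so adjust for your own site:

import subprocess

# hypothetical naming convention: node001-ipmi ... node200-ipmi
HOSTS = ["node%03d-ipmi" % n for n in range(1, 201)]

def bmc_ok(host, user="ADMIN", password="ADMIN"):
    # a single cheap query; a non-zero exit code means the BMC, its NIC,
    # or the management network path to it is in trouble
    cmd = ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", user, "-P", password, "chassis", "power", "status"]
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    p.communicate()
    return p.returncode == 0

dead = [h for h in HOSTS if not bmc_ok(h)]
if dead:
    # mail it, page it, or feed it to your monitoring system; the point is
    # that the management net gets exercised daily, not only in emergencies
    print("unresponsive BMCs: " + ", ".join(dead))

Run it from cron once a day and the "so rarely used that nobody notices it is broken" failure mode discussed earlier in the thread mostly goes away.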