From john.hearns at streamline-computing.com Tue Feb 1 09:34:32 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To:
References:
Message-ID: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>

On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
>
> on that note, though - does anyone have comments about booting
> machines from flash?
>
I've booted a mini-ITX system from flash,
the distribution in question was a wireless access point.
All you need is a CF to IDE adapter.

It's common to have firewall distributions, such as ipcop,
to boot from flash.
http://www.ipcop.org/1.4.0/en/install/html/mkflash.html

I believe one wrinkle is to either log to a remote host,
or if you log locally to log to a ramdisk and only write to
the CF card at infrequent intervals.

John Hearns

ps.
>sounds like putting mudflaps and a cattle bar on a city-SUV.

Called Chelsea Tractors in the part of the world I live in.

From hahn at physics.mcmaster.ca Tue Feb 1 10:07:37 2005
From: hahn at physics.mcmaster.ca (Mark Hahn)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID:

> > on that note, though - does anyone have comments about booting
> > machines from flash?
> >
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.

I don't really see those much at all. perhaps I'm not using
the right search terms.

have you looked into booting from usb-flash? that would be very
much dependent on bios, of course, but far more accessible.

thanks, mark hahn.
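The wrinkle John mentions (log locally to RAM and touch the CF card only
at infrequent intervals) can be sketched in a few lines of Python. This
is only an illustration under assumptions, not something ipcop is known
to do: the function name make_buffered_logger, the logger name and the
capacity are made up for the example. logging.handlers.MemoryHandler is
the standard-library class that holds records in memory and hands them
to a target handler in batches.

```python
import logging
import logging.handlers


def make_buffered_logger(target, capacity=100, name="cf-node"):
    """Return (logger, memory_handler); records reach `target` in batches.

    `target` is whatever handler finally stores the records, for example
    a FileHandler on the CF card (hypothetical setup, see note below).
    """
    memory_handler = logging.handlers.MemoryHandler(
        capacity=capacity,            # flush once this many records are held in RAM
        flushLevel=logging.CRITICAL,  # a CRITICAL record forces an immediate flush
        target=target,
    )
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False          # keep records away from the root logger
    logger.addHandler(memory_handler)
    return logger, memory_handler
```

On a CF-booted node the target might be logging.FileHandler on a file
that lives on the card, or logging.handlers.SysLogHandler pointed at a
remote log host; both paths and hosts would be site-specific. Calling
memory_handler.flush() at shutdown writes out whatever is still held in
RAM.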
From James.P.Lux at jpl.nasa.gov Tue Feb 1 10:32:35 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
References:
Message-ID: <6.1.1.1.2.20050201103123.0417a670@mail.jpl.nasa.gov>

At 09:34 AM 2/1/2005, John Hearns wrote:
>On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
> >
> > on that note, though - does anyone have comments about booting
> > machines from flash?
> >
>I've booted a mini-ITX system from flash,
>the distribution in question was a wireless access point.
>All you need is a CF to IDE adapter.
>
>It's common to have firewall distributions, such as ipcop,
>to boot from flash.
>http://www.ipcop.org/1.4.0/en/install/html/mkflash.html
>
>I believe one wrinkle is to either log to a remote host,
>or if you log locally to log to a ramdisk and only write to
>the CF card at infrequent intervals.
>
>John Hearns
>
>ps.
> >sounds like putting mudflaps and a cattle bar on a city-SUV.
>
>Called Chelsea Tractors in the part of the world I live in.
>

I boot mini-ITX systems from flash, and also via PXE, both wireless and
wired. As John says, you need a CF to IDE adapter, which in my case is
combined with the unregulated 12VDC to ATX power supply, a watchdog
timer, and some other hardware.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875

From James.P.Lux at jpl.nasa.gov Tue Feb 1 10:50:28 2005
From: James.P.Lux at jpl.nasa.gov (Jim Lux)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID: <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>

At 10:07 AM 2/1/2005, Mark Hahn wrote:
> > > on that note, though - does anyone have comments about booting
> > > machines from flash?
> > >
> > I've booted a mini-ITX system from flash,
> > the distribution in question was a wireless access point.
> > All you need is a CF to IDE adapter.
>
>I don't really see those much at all. perhaps I'm not using
>the right search terms.

Try JKMicrodevices or ituner.com or www.mini-itx.com or
www.damnsmalllinux.org or www.logicsupply.com or www.epiacenter.com
(google for "compact flash mini-itx")

They run about $15-$20, depending on configuration, and there's nothing
special about them for MiniITX.. they should work on anything.

There ARE rumored to be "difficulties" with how the CF is formatted in
some contexts, but I don't know any details. Maybe it has to do with
whether the BIOS supports the "virtual" head, track, sector details?

I've also heard that one cannot boot "Win xx" from CF, but have no
reason to see why this would be so (it's a disk drive, after all...)
Maybe with a PCI<>CF adapter it's a problem?

>have you looked into booting from usb-flash? that would be very
>much dependent on bios, of course, but far more accessible.

Oooooh... that didn't work so well for me on the various machines I
tried it on. The IDE/CF is essentially bios independent (to the BIOS,
it just looks like another IDE drive). The USB drive has to have all
the USB stuff up and running first.

>thanks, mark hahn.

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875

From alvin at Mail.Linux-Consulting.com Tue Feb 1 13:58:51 2005
From: alvin at Mail.Linux-Consulting.com (Alvin Oga)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID:

On Tue, 1 Feb 2005, John Hearns wrote:

> > on that note, though - does anyone have comments about booting
> > machines from flash?
> >
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.

ANY system can be booted from CF ...

and for an AP, you'd probably want to boot off a usb stick
since those are presumably hotswappable whereas CF is not

there are lots of "CF - ide adaptors"

pcengine.ch makes um and resells to the list of folks
in the list jim posted ( ituner(mini-box), logicsupply, etc ... )

they also have those that plug the CF into the ide port on
the motherboard
	- but, i haven't seen any hotswap cf-ide adaptors yet though

> It's common to have firewall distributions, such as ipcop,
> to boot from flash.
> http://www.ipcop.org/1.4.0/en/install/html/mkflash.html

installing to 128MB or 256MB CF implies that you install the
minimum packages ( glibc + networking ) and have the rest of your
binaries on nfs-server:/usr/local/cluster-stuff which gets
automounted onto the CF-based nodes

it'd be good to keep a master CF node ( minimal system install )
so that it can be updated and patched as needed in one place,
and those patch files also make it to the next CF release
for the other nodes
	-- or don't patch the cf after it's made :-)

> I believe one wrinkle is to either log to a remote host,
> or if you log locally to log to a ramdisk and only write to
> the CF card at infrequent intervals.

writing to CF is good and bad ...
since it has limited write capabilities, but there's not much
writing that needs to be done, and even if there is, one can
write all the system data to /dev/ramdisk instead of CF

the CF can be mounted read-only

c ya
alvin

From john.hearns at streamline-computing.com Wed Feb 2 08:38:38 2005
From: john.hearns at streamline-computing.com (John Hearns)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
In-Reply-To:
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com>
Message-ID: <33904.143.167.3.70.1107362318.squirrel@webmail.streamline-computing.com>

>> > on that note, though - does anyone have comments about booting
>> > machines from flash?
>> >
> have you looked into booting from usb-flash? that would be very
> much dependent on bios, of course, but far more accessible.
>
Indeed, as Alvin says any system can be booted from a CF.
Some mini-ITX cases come with a little slot, which makes changing
the CF card easy.

I agree with the USB comment - I always travel with a USB stick
which has Stresslinux on it. www.stresslinux.org
This is a little distro which has lm_sensors, cpu_burn etc. on it,
plus memtest. Invaluable for the roaming engineer :-)

From list-beowulf at onerussian.com Thu Feb 3 19:20:05 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] NFS over TCP or smth else... WHAT I've done wrong?
Message-ID: <20050204032005.GB2444@washoe.rutgers.edu>

Dear Beowulfers,

Today is a sad day for our 25-node cluster: I decided to improve its
performance and as a result I crippled it quite a lot.

The story is that for some reason many nodes started losing their
connection with the NFS server node, so I started looking for a
solution and decided to try NFS over TCP.
After I adjusted the configs across the cluster (cfengine rulez), even
rebooted the nodes (besides the main one) for the sake of it, and put a
slight load on the cluster (occupied 6 nodes with intensive I/O which
reads and writes data from the NFS server), pretty much all of the 60
nfsd instances started occupying CPU on the main node, so the load
reached around 20 or 30, which is a star hitting number... The main
node (NFS server) started to behave unresponsively and started killing
applications for the reason of "running out of memory".

So what is wrong in the next config:

vana:/raid /raid nfs defaults,tcp,hard,rw,nosuid,wsize=8192,rsize=8192

? Later I adjusted it with bg,timeo=60,noatime to reduce the load, but
it didn't quite help.

details about cluster: 23 active nodes at the moment running 2.6.8.1
SMP, main node with 8GB, RPCNFSDCOUNT=70, nfs-kernel-server

What would be the best NFS config for it if we provide two directories
from the NFS server: /raid as rw,sync and /share/apps as ro,async

Thank you in advance

P.S. BTW - here is the dump from the "killing mess"

Fixed up OOM kill of mm-less task
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16

Free pages: 2969440kB (2966528kB HighMem)
Active:506964 inactive:611412 dirty:461 writeback:0 unstable:0 free:742360 slab:193835 mapped:269296 pagetables:2983
DMA free:1048kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
protections[]: 8 476 732
Normal free:1864kB min:936kB low:1872kB high:2808kB active:32632kB inactive:21288kB present:901120kB
protections[]: 0 468 724
HighMem free:2966528kB
min:512kB low:1024kB high:1536kB active:1995096kB inactive:2424488kB present:7471104kB
protections[]: 0 0 256
DMA: 0*4kB 15*8kB 10*16kB 8*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1048kB
Normal: 14*4kB 2*8kB 0*16kB 38*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1864kB
HighMem: 0*4kB 0*8kB 0*16kB 48126*32kB 16915*64kB 2081*128kB 109*256kB 57*512kB 20*1024kB 0*2048kB 0*4096kB = 2966528kB
Swap cache: add 538373, delete 522525, find 54148646/54172304, race 0+5
Out of Memory: Killed process 17465 (gnome-settings-).

--
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
Key  http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint   3BB6 E124 0643 A615 6F00  6854 8D11 4563 75C0 24C8
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: Digital signature
Url : http://www.scyld.com/pipermail/beowulf/attachments/20050203/a3da36ad/attachment.bin

From wytsang at clustertech.com Tue Feb 1 02:42:37 2005
From: wytsang at clustertech.com (Clotho)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] IntelMPITEST-1.0 compiled with icc in heterogeneous environment
Message-ID: <41FF5D1D.50903@clustertech.com>

Hi,

I would like to ask a question about using icc to compile
IntelMPITEST-1.0 and run the program in a heterogeneous environment.

I have an i386 node and an x86_64 node. I have configured and compiled
the IntelMPITEST-1.0 testsuite on the i386 node. I run the testsuite on
the i386 node, and use "mpirun -machinefile" to run the binary on both
nodes.

I have tried the test with the gcc and pgi compilers, and they work.
But for icc8, I have encountered an error in
c/blocking/functional/MPI_Ssend_ator

The error message is very long, but follows a pattern like:

MPITEST error (3): i=0, long double value= -0.0000000000, expected 0.0000000000
MPITEST error (3): 10 errors in buffer (3,0,13) len 8 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/8, Receiver(node/length)=3/-32766
MPITEST error (3): i=0, long double value= -0.0000000000, expected 0.0000000000
MPITEST error (3): 117 errors in buffer (4,0,13) len 83 commsize 4 commtype -10 data_type 13 root 3
MPITEST error (3): Send/Receive lengths differ - Sender(node/length)=0/83, Receiver(node/length)=3/-32766

All the errors are related to data_type 13 and 14. This error does not
happen when I run the tests on 2 i386 nodes.

Do you have any idea about the problem? Thank you.

PS. I find that the error message is produced from "libmpitest.c" line
2361. And I find that one of the many compilation warnings is related
to that line:

./libmpitest.c(2361): warning #181: argument is incompatible with corresponding format string conversion
    i, ((derived1 *)buffer)[i].LongDouble[k],

Maybe it's related, I am not sure.

From denis.che at gmail.com Tue Feb 1 07:18:22 2005
From: denis.che at gmail.com (Denis)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: Max common block size, global array size on ia32
Message-ID:

>A more involved fix is to change the location of the shared
>libraries in memory by changing kernel. Look for the variable
>__PAGE_OFFSET in the kernel header files.

How exactly do you go about doing this? I know how to
compile/recompile a kernel, but I have no idea as to how to implement
this fix... I have a similar machine... Dual Xeon 2.2-GHz with 2GB RAM
and exactly the same problem with mem limitations for a single
fixed-size array...
Thanks

From Kris.Boutilier at scrd.bc.ca Tue Feb 1 10:23:05 2005
From: Kris.Boutilier at scrd.bc.ca (Kris Boutilier)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
Message-ID:

There is quite an elegant set of scripts available at
http://gate-bunker.p6.msu.ru/~berk/router.html to tweak a standard
debian installation to boot from an IDE device and run entirely from
tmpfs from that point on, thereby avoiding the 'worn out' compact flash
problem. Targeted at router applications but certainly useful for other
semi-embedded applications.

> -----Original Message-----
> From:	John Hearns [SMTP:john.hearns@streamline-computing.com]
> Sent:	Tuesday, February 01, 2005 9:35 AM
> To:	beowulf@beowulf.org
> Subject:	Re: [Beowulf] Re: real hard drive failures
>
> On Mon, 2005-01-31 at 14:14 -0500, Mark Hahn wrote:
> >
> > on that note, though - does anyone have comments about booting
> > machines from flash?
> >
> I've booted a mini-ITX system from flash,
> the distribution in question was a wireless access point.
> All you need is a CF to IDE adapter.
>
> It's common to have firewall distributions, such as ipcop,
> to boot from flash.
> http://www.ipcop.org/1.4.0/en/install/html/mkflash.html
>
{clip}

From dwu at swales.com Tue Feb 1 11:18:05 2005
From: dwu at swales.com (Dominic Wu)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>
Message-ID: <003701c50892$c2876df0$69704e89@jpl.nasa.gov>

It is motherboard dependent, and if your BIOS supports USB boot (and
most newer ones do), there should be no problem in theory. That said,
booting from CF (or any solid-state/microdrive device) via an IDE
interface is still probably easier, with fewer drivers you have to
load.

>
> >have you looked into booting from usb-flash? that would be very
> >much dependent on bios, of course, but far more accessible.
>
> Oooooh... that didn't work so well for me on the various machines I tried
> it on. The IDE/CF is essentially bios independent (to the BIOS, it just
> looks like another IDE drive). The USB drive has to have all the USB stuff
> up and running first.
>
> >thanks, mark hahn.
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

From Glen.Gardner at verizon.net Tue Feb 1 17:20:00 2005
From: Glen.Gardner at verizon.net (Glen Gardner)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
References:
Message-ID: <42002AC0.3090806@verizon.net>

USB flash is really slow. Regular CF (@ 128 KB/s writes) on a cf to ide
adapter is a lot faster (particularly write speed) than USB flash
(@ 64 KB/s write speed) "thumb drives". I've had good luck with IBM
microdrives, but CF is getting cheaper than microdrives. Of course, the
microdrives are a lot faster (@ 1MB/s R/W) than CF on write. But CF is
pretty fast on read (10 MB/s ??).

CF has a limited number of writes before it fails, anywhere from 100K
to 1M write cycles. The time for write cycles is typically anywhere
from 300 milliseconds to 500 milliseconds for a 32 KB chunk for regular
CF. Typically you write a chunk of CF at once in each write cycle, and
32KB is a typical figure for that (but it varies with the particular
memory chips used). This is why CF is so awfully slow when writing.
Using serial CF makes it even worse, which is one reason why USB thumb
drives are even slower than regular CF cards.
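As a rough cross-check of the figures above, the arithmetic can be
written out in Python. This is a back-of-the-envelope sketch, not a
benchmark: the function names are made up for the example, the 512 MB
image size and the one-rewrite-per-minute pattern are assumptions, and
the endurance estimate assumes the same block is rewritten every time
with no wear levelling at all.

```python
def write_rate_kb_per_s(chunk_kb, cycle_ms):
    """Sustained write rate if each write cycle stores one chunk."""
    return chunk_kb / (cycle_ms / 1000.0)


def image_write_minutes(image_mb, rate_kb_per_s):
    """Minutes needed to write an image of image_mb at the given rate."""
    return image_mb * 1024 / rate_kb_per_s / 60.0


def days_until_worn(write_cycles, seconds_between_writes):
    """Days until one block uses up its write cycles (no wear levelling)."""
    return write_cycles * seconds_between_writes / 86400.0


# 32 KB per cycle at 300-500 ms per cycle
print(round(write_rate_kb_per_s(32, 300), 1))   # 106.7 KB/s
print(round(write_rate_kb_per_s(32, 500), 1))   # 64.0 KB/s

# assumed 512 MB image at the quoted CF and USB flash write speeds
print(round(image_write_minutes(512, 128), 1))  # 68.3 minutes
print(round(image_write_minutes(512, 64), 1))   # 136.5 minutes

# one block rewritten once a minute, 100K and 1M cycle parts
print(round(days_until_worn(100_000, 60), 1))   # 69.4 days
print(round(days_until_worn(1_000_000, 60), 1)) # 694.4 days
```

The per-cycle timing lands in the same range as the quoted 64-128 KB/s
write speeds, and the last two numbers show why a file that is
rewritten frequently is better kept in RAM than on the card.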
CF is okay for booting a system from, but things like /tmp and /var
are best mounted in a memory file system and only written to cf when
shutting down. Swap partitions and /home need to be mounted via NFS.

At present, I have two nodes of a 14 node cluster booting from CF, and
/home is mounted on another machine with a proper hard drive via NFS.
Ten of the nodes are booting from microdrives, and two nodes have ata
133 hard drives for /home, development and backups. /var /tmp and swap
are actually mounted on the cf card, and I'm waiting to see how long
before the cf actually expires. These nodes have been up 24/7 for over
a month now, with no problems. I have not tried to force the nodes to
swap.

For saving power and reducing heat, CF is going to be the best you can
get. Microdrives are almost as good, laptop drives are pretty good, and
a regular IDE drive is a pig in comparison.

I use a USB thumb drive with a bootable OS on it as an emergency boot
drive. It comes in handy when installing a node. Since I use
microdrives, all I do is shut down the node, plug the new microdrive
into the cf adapter and the thumb drive into the usb port, and turn the
node on. It boots from USB, so I can then install a system image stored
on the development node onto the new microdrive via an NFS mount. It
takes about 5 minutes to install and configure a new node in this
fashion. Writing the disk image to a 512 MB cf card is going to take up
to an hour, and plan on at least twice that to write a disk image to a
512 MB USB flash. (CF is just plain slow)

Glen

Mark Hahn wrote:

>>>on that note, though - does anyone have comments about booting
>>>machines from flash?
>>>
>>I've booted a mini-ITX system from flash,
>>the distribution in question was a wireless access point.
>>All you need is a CF to IDE adapter.
>>
>
>I don't really see those much at all. perhaps I'm not using
>the right search terms.
>
>have you looked into booting from usb-flash?
>that would be very
>much dependent on bios, of course, but far more accessible.
>
>thanks, mark hahn.
>

--
Glen E. Gardner, Jr.
AA8C
AMSAT MEMBER 10593
Glen.Gardner@verizon.net
http://members.bellatlantic.net/~vze24qhw/index.html

From award at andorra.ad Tue Feb 1 23:25:01 2005
From: award at andorra.ad (Alan Ward i Koeck)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: real hard drive failures
References: <1107279272.5470.10.camel@dhcp47.priv.wark.uk.streamline-computing.com> <6.1.1.1.2.20050201104425.041f1938@mail.jpl.nasa.gov>
Message-ID: <4200804D.66B07502@andorra.ad>

Jim Lux wrote:
>
> At 10:07 AM 2/1/2005, Mark Hahn wrote:
>
> >have you looked into booting from usb-flash? that would be very
> >much dependent on bios, of course, but far more accessible.
>
> Oooooh... that didn't work so well for me on the various machines I tried
> it on. The IDE/CF is essentially bios independent (to the BIOS, it just
> looks like another IDE drive). The USB drive has to have all the USB stuff
> up and running first.

Done that, though I had to use a kernel diskette with USB et al
compiled in. My BIOS could only boot from a USB external hard
drive/CD, not flash.

Alan Ward

> >thanks, mark hahn.
>
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875

From mtpratol at cs.sfu.ca Wed Feb 2 12:54:28 2005
From: mtpratol at cs.sfu.ca (Matthew Pratola)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] SGE web frontends
Message-ID:

Hi all,

Can anyone recommend a simple web frontend for submitting SGE jobs?

Thanks,

Matthew Pratola
M.Sc. Candidate
Dept. of Statistics and Actuarial Science
Simon Fraser University
Vancouver, BC, CANADA

From diep at xs4all.nl Wed Feb 2 19:53:27 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050203045323.01002100@pop.xs4all.nl>

Good morning!

With the intention to run my chess program on a beowulf to be
constructed here (starting with 2 dual-k7 machines) i had better get
some good advice on which network to buy. The only interesting thing is
how fast each node can read 64 bytes randomly from the RAM of some
remote cpu. All nodes do that simultaneously.

The faster this can be done, the better the algorithmic speedup for
parallel search in a chess program (property of YBW, see publications
in the journal of icga: www.icga.org). This speedup is exponential (or
better: you get punished exponentially compared to single cpu
performance).

Which network cards with the lowest latencies can be used, considering
my small budget?

quadrics/dolphin seems a bit out of my price range. Myrinet is like 684
euro per card when i altavista'ed online, and i wonder how to get more
than 2 nodes to work without a switch. Perhaps there are low cost
switches with reasonably low latency?
Please note MPI is probably what i'll use, though i keep finding online
information about 'gamma'. Does that have lower latency than MPI
implementations?

Note: normal 1Gbit cards are there for normal network traffic. Each
node is an SMP or NUMA node, and not only multiprocessor but also
multithreaded.

I welcome any advice,

Best regards,
Vincent

Vincent Diepeveen

From rhamann at uccs.edu Wed Feb 2 23:56:13 2005
From: rhamann at uccs.edu (R Hamann)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] MPICH2: Handle Limit?
Message-ID:

I've been having some strange problems with a program using the MPICH2
library. When I added some new datatypes for ghost cell exchange, the
program would hang. I figured out that any number of handles over 84
would cause this. Fortunately, I could delete some handles that I no
longer needed, but it still seemed strange. Are my calculations correct
that for each process there is an 84 handle limit? or am I seeing some
other problem?

Ron

From maurice at harddata.com Thu Feb 3 16:05:19 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Re: Booting from flash ( was Re: Re: real hard drive failures)
In-Reply-To: <200501311938.j0VJc3lt003632@bluewest.scyld.com>
References: <200501311938.j0VJc3lt003632@bluewest.scyld.com>
Message-ID: <4202BC3F.4070401@harddata.com>

>From: Mark Hahn
>Subject: Re: [Beowulf] Re: real hard drive failures
>...
>
>on that note, though - does anyone have comments about booting
>machines from flash?
>
>
Compact Flash (CF) IS an ATA device, and requires no specific drivers
other than the standard kernel ATA driver.

CF slot reader/writers are now under $25, and as a matter of fact we
offer this as an option both on our tower workstations and in our rack
chassis.

Recent prices on CF are at $50 or less for 512MB, so a "CD sized" boot
image flash device is now trivial.

If you look inside a Force10 network switch you will see the OS and
firmware are loaded on a flash card.
You can even buy CF packaged in a device that is a 40 pin female
"dongle" that plugs directly into the motherboard HD IDE slot. These go
for around $100 for 512MB.

Other flash types, like SD, XD, Memory Stick, &c do not have the ATA
interface built in, so a chip and driver are needed to use them, pretty
well ruling them out as useful for boot devices, unless you write the
driver into the BIOS, on, for example, LinuxBIOS.

With our best regards,

Maurice W. Hilarius        Telephone: 01-780-456-9771
Hard Data Ltd.             FAX:       01-780-456-9772
11060 - 166 Avenue         email:maurice@harddata.com
Edmonton, AB, Canada       http://www.harddata.com/
   T5X 1Y3

This email, message, and content, should be considered confidential,
and is the copyrighted property of Hard Data Ltd., unless stated
otherwise.

From rgb at phy.duke.edu Fri Feb 4 03:55:48 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] SGE web frontends
In-Reply-To:
References:
Message-ID:

On Wed, 2 Feb 2005, Matthew Pratola wrote:

> Hi all,
>
> Can anyone recommend a simple web frontend for submitting SGE jobs?

http://www.globus.org/

One stop shopping.

   rgb

>
> Thanks,
>
> Matthew Pratola
> M.Sc. Candidate
> Dept. of Statistics and Actuarial Science
> Simon Fraser University
> Vancouver, BC, CANADA
>

--
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu

From landman at scalableinformatics.com Fri Feb 4 05:20:48 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] SGE web frontends
In-Reply-To:
References:
Message-ID: <420376B0.7000107@scalableinformatics.com>

Robert G.
Brown wrote:
> On Wed, 2 Feb 2005, Matthew Pratola wrote:
>
>
>>Hi all,
>>
>>Can anyone recommend a simple web frontend for submitting SGE jobs?
>
>
> http://www.globus.org/
>
> One stop shopping.

Did I miss something? Was a tongue planted in cheek with this reply?

As far as I know there are very few web interfaces to running SGE (or
LSF, or ...) jobs. If I am wrong please do provide links/references.

Globus is not a web interface (last I checked), but a large group of
middleware to manage something that looks a lot closer to the
definition of a grid than SGE. SGE is a job scheduler (with a name
"engineered" to make you think it is a one-stop-shop as a
grid-in-a-box).

My company is interested in (and we are developing) web portals for end
user cluster work, so if you know of any, we would like to hear about
them. Good open-source platforms that are current/supported could be
worth looking at (and will save us time/development effort). There seem
to be lots of bits of abandonware in the grid portal/user-interface
area. We don't want to re-invent wheels, but at the same time, we don't
want to adopt abandoned ones either.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615

From rene at renestorm.de Fri Feb 4 03:12:39 2005
From: rene at renestorm.de (rene)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] SGE web frontends
In-Reply-To:
References:
Message-ID: <200502041212.39511.rene@renestorm.de>

Hi Matthew,

i was working on and thinking about that a year ago, and my opinion is:
"You don't want to do that."
But there are some questions:

Do you want to go public with that little webpage?
Do you want to execute common sge jobs or is it just one application?
Do these jobs have input data?
How complex is your authorization hierarchy?
What do you do with the next sge release?
How do you share the results and the status with the users?

There are web frontends for cluster applications out there, e.g. NCBI's
blast, but i have never heard of one for common jobs.

Cya

> Hi all,
>
> Can anyone recommend a simple web frontend for submitting SGE jobs?
>
> Thanks,
>
> Matthew Pratola
> M.Sc. Candidate
> Dept. of Statistics and Actuarial Science
> Simon Fraser University
> Vancouver, BC, CANADA
>

--
Rene Storm
@Cluster

From diep at xs4all.nl Fri Feb 4 04:35:22 2005
From: diep at xs4all.nl (Vincent Diepeveen)
Date: Sat Jul 4 01:03:49 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <3.0.32.20050204133518.01007860@pop.xs4all.nl>

At 00:29 4-2-2005 -0800, Bill Broadley wrote:
>On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote:
>> Good morning!
>>
>> With the intention to run my chessprogram on a beowulf to be constructed
>> here (starting with 2 dual-k7 machines here) i better get some good advice
>> on which network to buy. Only interesting thing is how fast each node can
>> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that
>> simultaneously.
>
>Is there any way to do this less often with a larger transfer? If you
>wrote a small benchmark that did only that (send 64 bytes randomly
>from a large array in memory) and make it easy to download, build, run,
>and report results, I suspect some people would.

One way pingpong with 64 bytes will do great.

Shared memory examples i have plenty of, but one way pingpong
approximates it excellently. Just multiply the time by 2 and one knows
the bound :)

>> The faster this can be done the better the algorithmic speedup for parallel
>> search in a chess program (property of YBW, see publications in journal of
>> icga: www.icga.org).
>> This speedup is exponential (or better you get
>> punished exponential compared to single cpu performance).
>>
>> Which network cards considering my small budget are having lowest latencies
>> can be used?
>
>Define small budget. For more than 2 nodes myrinet needs a switch.
>Do you expect to be totally network latency bound? How low is enough
>to keep the processors busy?

CPU's are 100% busy, and once i know how many requests a second the
network can handle in theory, i will do more probes per second to the
hashtable. The more probes i can do, the better for the game tree
search.

>> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per
>> card when i altavista'ed online and i wonder how to get more than 2 nodes
>> to work without switch. Perhaps there is low cost switches with reasonable
>> low latency?

>Do you know that gigabit is too high latency?

The few one way pingpong times i can find online for gigabit cards are
not exactly promising, to put it very politely. Something on the order
of 50 us one way pingpong time i don't even consider worth taking a
look at. Each year cpu's get faster. For small networks 10 us really is
the upper limit.

>Can't you send enough
>work, like say search 3 moves ahead on the head node, then for each legal
>move send that search tree to a different node? Each node would reply with
>the highest ranked moves when done.

Let's not discuss the parallel chess algorithm too much in depth. 100
different algorithms/enhancements get combined with each other. They
are not the biggest latency problem.

The latency problem is caused by the hashtable. The hashtable is a big
cache. The bigger the better. It avoids re-searching the same tree
again.

In games like chess, and in every search terrain (even simulated
flight), you can get back to the same spot by different means, causing
a transposition.

Suppose for example you start the game with 1.e4,e5 2.d4; that leads to
the same position as 1.d4,e5 2.e4.
So if we have already searched 1.e4,e5 2.d4, we store that position P in a large cache. Other cpu's first want to know whether we already searched that position. Those hashtable positions get created quite quickly. Deep Blue created them at 100 million positions a second and simply didn't store the vast majority in the hashtable (would be hard as it was in hardware). That's one of the reasons why it searched only 10-12 ply; already in 1999 that was no longer spectacular when 4 processor pc's showed up at world champs. On a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move, full move is when both sides make a move) in a few seconds, searching 100000 positions per second per cpu. So before we start searching every node (=position) we quickly want to find out whether other cpu's already searched it. On the origin3800 with 512 processors i used a 115 GB hashtable (i started search at 460 processors). Simply because the machine has 512GB ram. So in short you take everything you can get. The search works with internal iterative deepening, which means we first search 1 ply, then 2 ply, then 3 ply and so on. The time it takes to get to the next iteration i hereby define as the branching factor (Knuth has a different definition as he just took into account 1 algorithm, the 'todays' definition looks more appropriate). In order to search 1 ply deeper it is obviously important to maintain a good branching factor. I'm very bad at writing out mathematical proofs, but it's obvious that the more memory we use, the more we can reduce the number of legal moves in this position P as next few ply it might be in hashtable, which trivially makes the time needed to search 1 ply deeper shorter. Storing closer to the root (position where we started searching) is of course more important than near the leaves of the search tree.
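[Editor's illustration] The transposition idea above can be sketched in a few lines of Python. This is only a toy under loud assumptions, not the author's engine: the three-pawn START position, the play() and probe_or_store() helpers and the dict-based table are all hypothetical, and castling rights, en passant and real move legality are ignored. It shows only that 1.e4,e5 2.d4 and 1.d4,e5 2.e4 produce the same key, so the second line finds the entry stored by the first.

```python
# Toy transposition table (hypothetical names, not a chess engine).
# A position is a dict square -> piece; only the pieces that move are modelled.
START = {"e2": "P", "d2": "P", "e7": "p"}

def play(moves):
    """Apply (from, to) moves in order and return a hashable position key."""
    board = dict(START)
    for src, dst in moves:
        board[dst] = board.pop(src)
    side_to_move = "w" if len(moves) % 2 == 0 else "b"
    # The key depends on where the pieces stand, not on the order they got there.
    return (frozenset(board.items()), side_to_move)

table = {}  # position key -> (searched depth in ply, score)

def probe_or_store(key, depth, score):
    """Return True on a hit that is deep enough; otherwise store and return False."""
    entry = table.get(key)
    if entry is not None and entry[0] >= depth:
        return True  # already searched: the subtree can be skipped
    table[key] = (depth, score)
    return False

line_a = [("e2", "e4"), ("e7", "e5"), ("d2", "d4")]  # 1.e4,e5 2.d4
line_b = [("d2", "d4"), ("e7", "e5"), ("e2", "e4")]  # 1.d4,e5 2.e4

first = probe_or_store(play(line_a), depth=8, score=0)   # miss, position gets stored
second = probe_or_store(play(line_b), depth=8, score=0)  # hit via transposition
```

In a cluster setting the table would be spread over the RAM of all nodes, so each probe_or_store() call becomes one small remote read, which is why the 64-byte latency dominates.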
For example, when the last 10 ply near the leaves were not stored in the hashtable in an overnight experiment, the search depth at 460 processors dropped from 20 ply to 13 ply. Of course each processor of a supercomputer is dead slow for game tree search (it's branchy 100% integer work completely knocking down the caches), so compared to pc's you already start at a disadvantage of a factor 16 or so very quickly, before you start searching (in case of TERAS i had to fight with outdated 500MHz MIPS processors against opterons and high clocked quad Xeons), so upgrading my own network cards is more clever. Yet getting yourself a network even between a few nodes as quick as those supercomputers is not so easy... Additionally, your own beowulf network is something you can test on decently before playing at a tournament, and without good testing on the machine you play on in tournaments you have a hard 0% chance that it plays well. The only thing in software that matters is testing. >-- >Bill Broadley >Computational Science and Engineering >UC Davis From ilumb at platform.com Fri Feb 4 06:06:14 2005 From: ilumb at platform.com (Ian Lumb) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends Message-ID: <4AB0624F069DAD4E90F18B13A818EEFE016B50A6@catoexm04.noam.corp.platform.com> Open Source GridPort (www.gridport.net) merits consideration. It interfaces with SGE, LSF, PBS, etc., via Globus. And NICE EnginFrame (http://www.enginframe.com) is a commercial offering which already has customizations for the Life Sciences. For the record, we provide our own Web GUI with Platform LSF, and make use of these portals as required. -Ian -----Original Message----- From: beowulf-bounces@beowulf.org [mailto:beowulf-bounces@beowulf.org]On Behalf Of Joe Landman Sent: Friday, February 04, 2005 8:21 AM To: Robert G. Brown Cc: Matthew Pratola; beowulf@beowulf.org Subject: Re: [Beowulf] SGE web frontends Robert G.
Brown wrote: > On Wed, 2 Feb 2005, Matthew Pratola wrote: > > >>Hi all, >> >>Can anyone recommend a simple web frontend for submitting SGE jobs? > > > http://www.globus.org/ > > One stop shopping. Did I miss something? Was a tongue planted in cheek with this reply? As far as I know there are very few web interfaces to running SGE (or LSF, or ...) jobs. If I am wrong please do provide links/references. Globus is not a web interface (last I checked), but a large group of middleware to manage something that looks a lot closer to the definition of a grid than SGE. SGE is a job scheduler (with a name "engineered" to make you think it is a one-stop-shop as a grid-in-a-box). My company is interested in (and we are developing) web portals for end user cluster work, so if you know of any, we would like to hear about them. Good open-source platforms that are current/supported could be worth looking at (and will save us time/development effort). There seem to be lots of bits of abandonware in the grid portal/user-interface area. We don't want to re-invent wheels, but at the same time, we don't want to adopt abandoned ones either. Joe -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 _______________________________________________ Beowulf mailing list, Beowulf@beowulf.org To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From laurence at scalablesystems.com Fri Feb 4 06:08:09 2005 From: laurence at scalablesystems.com (Laurence Liew) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: <420381C9.808@scalablesystems.com> Hi all We have the SGE web interface. It integrates into our Rocks cluster management web interface. 
That is you NEED to use ROCKS (www.rockscluster.org) You can download RxC from www.scalablesystems.com It is free for non-commercial, academic use. It provides web based: - SGE management - SGE job submission - some basic reporting - and of course managing a Rocks cluster via a web interface. Have fun. Laurence Joe Landman wrote: > > > Robert G. Brown wrote: > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: >> >> >>> Hi all, >>> >>> Can anyone recommend a simple web frontend for submitting SGE jobs? >> >> >> >> http://www.globus.org/ >> >> One stop shopping. > > > Did I miss something? Was a tongue planted in cheek with this reply? > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > -- Laurence Liew, CTO Email: laurence@scalablesystems.com Scalable Systems Pte Ltd Web : http://www.scalablesystems.com (Reg. 
No: 200310328D) 7 Bedok South Road Tel : 65 6827 3953 Singapore 469272 Fax : 65 6827 3922 From brian at cmrl.wustl.edu Fri Feb 4 07:14:02 2005 From: brian at cmrl.wustl.edu (Brian Henerey) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: <4203913A.1030202@cmrl.wustl.edu> I don't mean to hijack this thread, but I'd also be interested to know if there are any open source web frontends for launching jobs on clusters. I've mostly written my own anyway, but if something's out there I'd like to know. Thanks, Brian Henerey Joe Landman wrote: > > > Robert G. Brown wrote: > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: >> >> >>> Hi all, >>> >>> Can anyone recommend a simple web frontend for submitting SGE jobs? >> >> >> >> http://www.globus.org/ >> >> One stop shopping. > > > Did I miss something? Was a tongue planted in cheek with this reply? > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > From rgb at phy.duke.edu Fri Feb 4 10:02:48 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <420376B0.7000107@scalableinformatics.com> References: <420376B0.7000107@scalableinformatics.com> Message-ID: On Fri, 4 Feb 2005, Joe Landman wrote: > > > Robert G. Brown wrote: > > On Wed, 2 Feb 2005, Matthew Pratola wrote: > > > > > >>Hi all, > >> > >>Can anyone recommend a simple web frontend for submitting SGE jobs? > > > > > > http://www.globus.org/ > > > > One stop shopping. > > Did I miss something? Was a tongue planted in cheek with this reply? Actually, it was a reply I snapped off on my way out the door on the edge of late for teaching. Let me reconsider my answer. You don't like yes/globus, how about "no". At least if you mean really really simple by simple. I would argue that a cluster designed to run primarily embarrassingly parallel jobs, fronted by a web portal/interface, is a not uncommon form of a grid, although perhaps the definition is large enough to include a union of such clusters or some more general structure (certainly access to other kinds of resources than strictly "a cluster"). So I read this question as "I want to make my local cluster into a grid, so users faraway with no direct LAN accounts or access can submit jobs into my local SGE queue after being properly authenticated". And, of course, be notified (with messages) when the jobs crash or terminate normally, facilitate data transfer and resource allocation requests, etc. Not exactly simple... Globus TK is, as I understand it, a toolkit from which one can build a web interface for generalized remote task submission to "a grid". It has to have lots of moving parts to do that well -- just AUTHENTICATING data transfer and job execution via a web interface isn't really terribly "simple", because to do it decently generally requires e.g. stuff like kerberos, ssl, ssh that aren't terribly simple either. So I definitely failed on the "simple" bit.
However, simple or not, I believe that Globus does contain the components to do what you want -- provide a very generic web interface for people far away who don't share any LAN components such as mounted filespace, authentication/userid mappings, etc to transfer data and job execution instructions to a system. That system, if it is a front end running SGE and/or stuff like condor (policy, load balance, batch job tools) can then put the job into a queue, run it, and let globus know when it is finished so it can tell the original user. If you look over just their security layer (GSI -- Grid Security Infrastructure) you rapidly come to realize that to run any sort of remote job execution service you NEED most of its components -- authentication (including a Certificate Authority CA), encryption (public/private key, managed with certificates), permissions, etc. Some grid designs I've seen use just this component of Globus and use other tools (like PBS or SGE or custom designed stuff) for other components. Ian Foster seems to have a list of at least some of the major grid projects around the world -- enough to be able to google on them by name -- here: http://www-fp.mcs.anl.gov/~foster/grid-projects/ Perhaps you can find a reusable interface at one of their project websites. You can also check out e.g. the Grid Portal Development Kit: http://doesciencegrid.org/projects/GPDK/ or The Grid Portal Kit: https://gridport.npaci.edu/ or the Open Grid Computing Environment: http://www.ogce.org/index.php all of which I believe use globus as at least part of their middleware for e.g. authentication etc. Some of these are (e.g. the DOE's GPDK) currently unsupported although still available and possibly still reasonably functional. I don't really know the status of the rest of them, and I doubt that this is all of them.
So you're right, I should have answered "no" because it isn't simple to offer a web interface to any active service, ESPECIALLY one that permits a remote user to upload arbitrary programs for execution on arbitrary data of arbitrary size where authentication, encryption, data transport, and remote job management become absolutely essential components of the solution. AFAIK, Globus is one of the if not the only middleware toolkits of choice for people who run the big grids -- they probably write their own actual web portal, but they use Globus to do at least some of the heavy lifting that goes on behind the scenes. Maybe one of the "portal projects" above (all open source) will be of use in setting up a "simple" portal to your cluster, but be aware that the problem itself is far from simple. However, I could be wrong and as always cherish being corrected. rgb > > As far as I know there are very few web interfaces to running SGE (or > LSF, or ...) jobs. If I am wrong please do provide links/references. > > Globus is not a web interface (last I checked), but a large group of > middleware to manage something that looks a lot closer to the definition > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > make you think it is a one-stop-shop as a grid-in-a-box). > > My company is interested in (and we are developing) web portals for end > user cluster work, so if you know of any, we would like to hear about > them. Good open-source platforms that are current/supported could be > worth looking at (and will save us time/development effort). There seem > to be lots of bits of abandonware in the grid portal/user-interface > area. We don't want to re-invent wheels, but at the same time, we don't > want to adopt abandoned ones either. > > Joe > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mwill at penguincomputing.com Fri Feb 4 10:36:51 2005 From: mwill at penguincomputing.com (Michael Will) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Information Reseach Lab In-Reply-To: <200502010848.19789.bvanhaer@sckcen.be> References: <200502010848.19789.bvanhaer@sckcen.be> Message-ID: <200502041036.52020.mwill@penguincomputing.com> It depends on the GIS software used. There was some work to mpi-enable GRASS modules a while back, no idea where it went. Here is something about a parallel version of s.surf.rst: http://skagit.meas.ncsu.edu/~helena/grasswork/grasscontrib/ And of course if you program against the GIS api's you might be able to take advantage of a cluster as well. There is a paper that mentions they used MPI for parallelizing their GIS/EM4 software on http://www.colorado.edu/research/cires/banff/pubpapers/104/ Michael On Monday 31 January 2005 11:48 pm, Ben Vanhaeren wrote: > On Monday 31 January 2005 11:46, Ziad Shaaban wrote: > > Dear All, > > > > I am planning to have an information lab in our faculty built of: Dell, > > Linux, Oracle and GIS. > > > > Can I use Beowulf to analyze GIS Data and display them on the web using > > ArcIMS, all three vendors said yes, but can I use Beowulf? > > > I think you should read the Beowulf FAQ: > http://www.beowulf.org/overview/faq.html#1 > Beowulf is a concept not a piece of software. > > I don't think you are going to need a beowulf cluster for the kind of > application you want to run (analyzing GIS data). If you want to guarantee > availability of your GIS data or do loadbalancing (distribute the load to > several servers) you should take a look at linux HA project: > http://www.linux-ha.org/ > Apache loadbalancing with mod_backhand: > http://www.backhand.org/ApacheCon2001/US/backhand_course_notes.pdf > and Oracle Real Application Clusters (RAC).
> > -- Michael Will, Linux Sales Engineer Tel: 415-954-2822 Toll Free: 888-PENGUIN Fax: 415-954-2899 www.penguincomputing.com Visit us at LinuxWorld 2005! Hynes Convention Center, Boston, MA February 15th-17th, 2005 Booth 609 From rgb at phy.duke.edu Fri Feb 4 10:37:02 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <4203913A.1030202@cmrl.wustl.edu> References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: On Fri, 4 Feb 2005, Brian Henerey wrote: > > I don't mean to hijack this thread, but I'd also be interested to know > if there are any open source web frontends for launching jobs on > clusters. I've mostly written my own anyway, but if something's out > there I'd like to know. Same topic. The issue is having a web "portal" that manages stuff like authentication, data transport, job submission/status etc. Running the submissions through SGE rather than something else is just a detail. rgb > > Thanks, > Brian Henerey > > > Joe Landman wrote: > > > > > > Robert G. Brown wrote: > > > >> On Wed, 2 Feb 2005, Matthew Pratola wrote: > >> > >> > >>> Hi all, > >>> > >>> Can anyone recommend a simple web frontend for submitting SGE jobs? > >> > >> > >> > >> http://www.globus.org/ > >> > >> One stop shopping. > > > > > > Did I miss something? Was a tongue planted in cheek with this reply? > > > > As far as I know there are very few web interfaces to running SGE (or > > LSF, or ...) jobs. If I am wrong please do provide links/references. > > > > Globus is not a web interface (last I checked), but a large group of > > middleware to manage something that looks a lot closer to the definition > > of a grid than SGE. SGE is a job scheduler (with a name "engineered" to > > make you think it is a one-stop-shop as a grid-in-a-box). 
> > > > My company is interested in (and we are developing) web portals for end > > user cluster work, so if you know of any, we would like to hear about > > them. Good open-source platforms that are current/supported could be > > worth looking at (and will save us time/development effort). There seem > > to be lots of bits of abandonware in the grid portal/user-interface > > area. We don't want to re-invent wheels, but at the same time, we don't > > want to adopt abandoned ones either. > > > > Joe > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From landman at scalableinformatics.com Fri Feb 4 11:33:30 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: <420376B0.7000107@scalableinformatics.com> Message-ID: There are web interfaces to SGE, and there are web interfaces to grids ... I think the important aspect of this is the marketing use of the term "Grid" in a name. Way back in high school, they used to teach us that what was in a name was exactly opposite of what it really was... A bit cynical, but amazingly effective at cutting through marketing. Globus is glue. Middleware. There are portals atop globus. SGE (despite its name) is a job scheduler. As is LSF. And others. The short version of things are that in order to get a web interface to SGE, one need not go through the joy of Globus, especially as Globus will not in and of itself get you where you want to go. GridPort I knew of. The other I did not. 
Joe From josip at lanl.gov Fri Feb 4 11:57:28 2005 From: josip at lanl.gov (Josip Loncaric) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl> References: <3.0.32.20050204133518.01007860@pop.xs4all.nl> Message-ID: <4203D3A8.4070702@lanl.gov> Vincent Diepeveen wrote: > At 00:29 4-2-2005 -0800, Bill Broadley wrote: >> >>Do you know that gigabit is too high latency? Gigabit Ethernet adapters often need tweaking to deliver reasonable latency, bandwidth, and CPU utilization. For example, if your system uses the e1000 driver (Intel's gigabit Ethernet), the default setting is "dynamic Interrupt Throttle Rate" -- which means that the card will delay interrupting the CPU by up to about 130 microseconds after receiving a packet. Moreover, the "dynamic" part causes the network chip microcode to vary this delay in multiples of about 16 microseconds, so that different packets will generally experience different receive delays. For the e1000 driver, https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html recommends using "options e1000 InterruptThrottleRate=80000" (add this line to /etc/modules.conf). Users of this driver may also want to check Intel's parameters for e1000 listed at http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just don't assume that the default values are appropriate for cluster use. Other gigabit Ethernet adapters have similar interrupt mitigation strategies, all designed to gracefully cope with high packet rates at high network speeds. For cluster use, adjustments are usually advisable. The basic Rx interrupt mitigation scheme is this: the receiver's CPU won't be interrupted until at least N packets have arrived or M microseconds have elapsed (whichever comes first). This clearly adds up to M microseconds to network latency. BTW, one often sees N=6 (otherwise NFS performance can seriously degrade) and M>=16. 
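[Editor's illustration] The basic Rx mitigation rule described above (interrupt after N packets or M microseconds, whichever comes first) can be modelled with a short Python sketch. This is a simplified model under stated assumptions, not how any particular driver or NIC firmware works: the function name rx_interrupt_delays is hypothetical, the M timer is assumed to start when the first pending packet arrives, and arrival times are assumed to be sorted.

```python
def rx_interrupt_delays(arrivals_us, n_packets, m_us):
    """Per-packet wait (in microseconds) before the CPU is interrupted.

    Assumed model: an interrupt fires as soon as n_packets are pending,
    or m_us after the first pending packet arrived, whichever comes first.
    arrivals_us must be sorted in increasing order.
    """
    delays = []
    pending = []  # arrival times of packets not yet signalled to the CPU
    for t in arrivals_us:
        # If the timer expired before this packet arrived, flush the batch first.
        if pending and t - pending[0] >= m_us:
            fire = pending[0] + m_us
            delays.extend(fire - p for p in pending)
            pending = []
        pending.append(t)
        # Packet-count threshold reached: interrupt immediately.
        if len(pending) >= n_packets:
            delays.extend(t - p for p in pending)
            pending = []
    # Whatever is left waits for the timer.
    if pending:
        fire = pending[0] + m_us
        delays.extend(fire - p for p in pending)
    return delays

# A lone small message (the chess hashtable probe case) always pays the full M.
lone = rx_interrupt_delays([0], n_packets=6, m_us=16)
# A burst of six packets 1 us apart is released by the packet count instead.
burst = rx_interrupt_delays([0, 1, 2, 3, 4, 5], n_packets=6, m_us=16)
```

Under this model an isolated 64-byte request never benefits from the packet-count threshold, which is one way to see why latency-bound applications want M lowered.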
Other variants of this basic scheme are possible; but they all mean increased latencies. Finally, don't forget the Tx side interrupt mitigation, or else the sending CPU might not be told promptly that it's OK to send more. The default Tx settings are probably fine for full size packets, but if your applications send lots of small packets, tweaking your network driver's Tx settings may help. Sincerely, Josip From atp at piskorski.com Fri Feb 4 12:20:23 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> Message-ID: <20050204202023.GA32459@piskorski.com> On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > Please note MPI is probably what i'll use, though i keep finding > online information about 'gamma'. Is that faster latency than MPI > implementations? http://www.disi.unige.it/project/gamma/ Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000 gigabit ethernet cards, for use with MPI. It offers much better latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us), but worse than the specialized HPC interconnects (maybe 3 us). The attraction of GAMMA, is that Intel Pro/1000 cards can be had for $11 to $60 or so each (depending on exact model, etc.), and gigabit switches are also pretty cheap, while SCI or Myrinet is somewhere in the $500 to $1500 per node range (I don't keep track). So if your application can benefit from lower latency, but you want something really cheap, GAMMA should be well worth trying. 
-- Andrew Piskorski http://www.piskorski.com/ From lindahl at pathscale.com Fri Feb 4 12:50:34 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204202023.GA32459@piskorski.com> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> <20050204202023.GA32459@piskorski.com> Message-ID: <20050204205034.GA18717@greglaptop.internal.keyresearch.com> On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote: > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > Please note MPI is probably what i'll use, though i keep finding > > online information about 'gamma'. Is that faster latency than MPI > > implementations? > > http://www.disi.unige.it/project/gamma/ In addition to gamma, there's also MVAPICH from LBL, and at least two commercial products, one from Scali, and one from the Cluster Competence Center. -- greg From ctierney at HPTI.com Fri Feb 4 14:13:19 2005 From: ctierney at HPTI.com (Craig Tierney) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204202023.GA32459@piskorski.com> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> <20050204202023.GA32459@piskorski.com> Message-ID: <1107555198.2916.4.camel@localhost.localdomain> On Fri, 2005-02-04 at 13:20, Andrew Piskorski wrote: > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > Please note MPI is probably what i'll use, though i keep finding > > online information about 'gamma'. Is that faster latency than MPI > > implementations? > > http://www.disi.unige.it/project/gamma/ > > Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000 > gigabit ethernet cards, for use with MPI. It offers much better > latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us), > but worse than the specialized HPC interconnects (maybe 3 us). 
See Josip's post on tweaking interrupts on gigE drivers, but I have a small system with Intel gigE cards and a Dell gigE switch. Latency between two nodes through the switch is 30 us. This is typical of what I see for other gigE cards. A latency of 60-100 is a bit high. Avoiding TCP/IP is still a big improvement. Craig > > The attraction of GAMMA, is that Intel Pro/1000 cards can be had for > $11 to $60 or so each (depending on exact model, etc.), and gigabit > switches are also pretty cheap, while SCI or Myrinet is somewhere in > the $500 to $1500 per node range (I don't keep track). > > So if your application can benefit from lower latency, but you want > something really cheap, GAMMA should be well worth trying. From john.hearns at streamline-computing.com Sat Feb 5 00:58:49 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: <1107593929.5504.1.camel@Vigor45> On Fri, 2005-02-04 at 13:37 -0500, Robert G. Brown wrote: > On Fri, 4 Feb 2005, Brian Henerey wrote: > > > > > I don't mean to hijack this thread, but I'd also be interested to know > > if there are any open source web frontends for launching jobs on > > clusters. I've mostly written my own anyway, but if something's out > > there I'd like to know. > > Same topic. The issue is having a web "portal" that manages stuff like > authentication, data transport, job submission/status etc. Running the > submissions through SGE rather than something else is just a detail. I know that the London E-science centre do work in that area. Have a look at GridSAM http://www.lesc.ic.ac.uk/gridsam/index.html Haven't used it myself, mind - it was only out in beta last week! And sadly: "The DRMConnector for launching to Grid Engine resource using DRMAA is currently in development and not yet released.
" Also, you could ask the same question on the SGE list. http://gridengine.sunsource.net/project/gridengine/maillist.html From kus at free.net Fri Feb 4 09:06:01 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home Beowulf - NIC latencies Message-ID: >Good morning! >With the intention to run my chessprogram on a beowulf to be >constructed here (starting with 2 dual-k7 machines here) i better get some good advice >on which network to buy. Only interesting thing is how fast each node >can read out 64 bytes randomly from RAM of some remote cpu. All nodes >do that simultaneously. I'm very glad that parallelised chessprograms are developed, but I'm regretted that chess programs don't have coarse-grained parallelizm ... :-( I thought that every processor can handle some big part of moves tree. Unfortunatelly I can't win Deep Fritz 8 also w/o parallelization :-) >The faster this can be done the better the algorithmic speedup for >parallel search in a chess program (property of YBW, see publications >in journal of >icga: www.icga.org). This speedup is exponential (or better you get >punished exponential compared to single cpu performance). >Which network cards considering my small budget are having lowest >latencies can be used? >quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro > per card when i altavista'ed online and i wonder how to get more than >2 nodes to work without switch. Perhaps there is low cost switches >with reasonable low latency? One idea for "low price & low latency interconnect infrastructure" may be ATOLL (//www.atoll-net.de), because it has no "external" switches. But I don't know about commercial availability of ATOLL hardware just now. >Please note MPI is probably what i'll use, though i keep finding online >information about 'gamma'. Is that faster latency than MPI >implementations? You can use MPI over GAMMA having more low latencies. 
Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow >Note normal 1Gbit cards for normal network traffic. >Each node is a SMP or NUMA node and not only multiprocessor also >multithreaded. >I welcome any advice, >Best regards, >Vincent Vincent Diepeveen From nj at hemeris.com Fri Feb 4 09:21:24 2005 From: nj at hemeris.com (Nicolas Jungers) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050203045323.01002100@pop.xs4all.nl> References: <3.0.32.20050203045323.01002100@pop.xs4all.nl> Message-ID: <1107537685.6224.12.camel@lcube.bxl.jungers.net> On Thu, 2005-02-03 at 04:53 +0100, Vincent Diepeveen wrote: > Good morning! > > With the intention to run my chessprogram on a beowulf to be > constructed > here (starting with 2 dual-k7 machines here) i better get some good > advice > on which network to buy. Only interesting thing is how fast each node > can > read out 64 bytes randomly from RAM of some remote cpu. All nodes do > that > simultaneously. > > The faster this can be done the better the algorithmic speedup for > parallel > search in a chess program (property of YBW, see publications in > journal of > icga: www.icga.org). This speedup is exponential (or better you get > punished exponential compared to single cpu performance). > > Which network cards considering my small budget are having lowest > latencies > can be used? > > quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro > per > card when i altavista'ed online and i wonder how to get more than 2 > nodes > to work without switch. Perhaps there is low cost switches with > reasonable > low latency? > > Please note MPI is probably what i'll use, though i keep finding > online > information about 'gamma'. Is that faster latency than MPI > implementations? gamma bypass tcp/ip, then shaving most of the latency. 
Unfortunately it's not very actively developed, though they "recently" (last year) updated their stack to the e1000 (intel Giga ethernet) NIC. I know that (some at) the CERN use they own communication stack on e1000 similar to gamma, with impressive results. I dunno if it's widely available. Nicolas From ashley at quadrics.com Fri Feb 4 09:31:01 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204133518.01007860@pop.xs4all.nl> References: <3.0.32.20050204133518.01007860@pop.xs4all.nl> Message-ID: <1107538261.13957.10.camel@localhost.localdomain> On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote: > At 00:29 4-2-2005 -0800, Bill Broadley wrote: > >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > >> Good morning! > >> > >> With the intention to run my chessprogram on a beowulf to be constructed > >> here (starting with 2 dual-k7 machines here) i better get some good advice > >> on which network to buy. Only interesting thing is how fast each node can > >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that > >> simultaneously. > > > >Is there any way to do this less often with a larger transfer? > >If you > >wrote a small benchmark that did only that (send 64 bytes randomly > >from a large array in memory) and make it easy to download, build, run, > >and report results, I suspect some people would. > > One way pingpong with 64 bytes will do great. pingpong is not really the same, adding a random element can slow down comms and ideally it sounds like you want a one-sided operation. Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a benchmark. > CPU's are 100% busy and after i know how many times a second the network > can handle in theory requests i will do more probes per second to the > hashtable. The more probes i can do the better for the game tree search. 
Are you overlapping comms and compute or doing blocking reads? If you are overlapping then the issue rate for reads is more important than the raw latency. > >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per > >> card when i altavista'ed online and i wonder how to get more than 2 nodes > >> to work without switch. Perhaps there is low cost switches with reasonable > >> low latency? > >Do you know that gigabit is too high latency? > > The few one way pingpong times i can find online from gigabit cards are not > exactly promising, to say it very polite. Something in the order or 50 us > one way pingpong time i don't even consider worth taking a look at at the > picture. > > Each years cpu's get faster. For small networks 10 us really is the upper > limit. 10us is easily achievable, I've just measured a read time of a little over 3us and a issue rate of 1.33us. > So before we start searching every node (=position) we quickly want to find > out whether other cpu's already searched it. > > At the origin3800 at 512 processors i used a 115 GB hashtable (i started > search at 460 processors). Simply because the machine has 512GB ram. > > So in short you take everything you can get. So is this a parallel algorithm or simply a big "memory farm" you are after? You don't hear much of clusters being used for the latter but in some cases it's a eminently sensible thing to do. Ashley, From monang at gmail.com Fri Feb 4 11:23:13 2005 From: monang at gmail.com (Monang Setyawan) Date: Sat Jul 4 01:03:49 2009 Subject: [Beowulf] Newbie Question Message-ID: <5dc04bbf050204112319fe7fbf@mail.gmail.com> Hi. I'm a newbie in this parallel computing thing. (sorry for my bad english, I'm Indonesian) My current project is a software that analyze DNA/Protein sequence data that needs high performance aspect on it. I plan to deploy this software on network of workstations (mm, may be just about 10 PCs on the network). Am I in wrong place now? 
I am going to use the message passing paradigm (MPI) to write the
software. I've read that there are several choices of MPI
implementation. The problem is, I'm bad at both C and Fortran (I usually
use Java as my favorite language). Some sources say that Java (or its
MPI wrapper or pure MPI implementation) isn't good enough to implement a
parallel computing solution. Is that right?

My third question is, is there any pdf/ps/one file version of
"Engineering a Beowulf-style Compute Cluster"?

Thanks in advance.

--
For the sake of time..

From rhamann at uccs.edu Fri Feb 4 11:31:01 2005
From: rhamann at uccs.edu (R Hamann)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] MPICH2: Handle Limit?
In-Reply-To:
References:
Message-ID:

Rob,

I thought any limit would be weird, let alone something like 84 (7 X
12?) Anyway, I thought it was based on the number of MPI variables
declared (data_types, windows, requests) because every time I added
new declarations, it would hang on Fedora Core 2, but run to
completion on Scyld (but with erroneous results). If I deleted unused
MPI declarations, it would start to work again. I counted all my
handles and came up with 84.

However, after deleting two 26 element arrays of handles, I thought it
would work. When I added more handles, it bombed again. I started to
try other things. I added 4 junk ints. I didn't use the variables I
declared, but it still bombed. When I converted them to chars, it
started working again. Very strange.

Have you ever encountered this before? I'm doing a 3D cellular
automaton, so I need a lot of datatypes for exchange of ghost cells.
It's obviously some strange error I've made that's manifesting itself
in MPI instead of a runtime or syntax error. I'm gonna try looking for
any buffer overruns now, but other than that I'm stumped.

GCC on Fedora Core 2 and on Scyld Beowulf
MPICH 2 1.0

Thanks,

R

On Fri, 4 Feb 2005 12:16:17 -0600 (CST) Rob Ross wrote:
> Hi Ron,
>
> There should not be an 84 handle limit.
> > Can you tell me what version of MPICH2 this is, and what >architecture and > OS you're running on? Do you have a simple test that exhibits the > problem? > > Thanks, > > Rob > --- > Rob Ross, Mathematics and Computer Science Division, Argonne >National Lab > > > On Thu, 3 Feb 2005, R Hamann wrote: > >> I've been having some strange problems with a program using the >>MPICH2 >> library. When I added some new datatypes for ghost cell exchange, >>the >> program would hang. I figured out that any number of handles over >>84 >> would cause this. Fortunately, I could delete some handles that I >>no >> longer needed, but it still seemed strange. Are my calculations >> correct that for each process there is an 84 handle limit? or am I >> seeing some other problem? >> >> Ron From rodmur at maybe.org Fri Feb 4 12:29:55 2005 From: rodmur at maybe.org (Dale Harris) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] scyld's beorun Message-ID: <20050204202955.GS32046@maybe.org> Hey, I was looking some web page talking about schedulers and Scyld's beowulf, and using the beorun command. I'm not able to find much of any documentation out there about what this command is, or does. Anyone familiar with it? -- Dale Harris rodmur@maybe.org /.-) From diep at xs4all.nl Fri Feb 4 11:39:12 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050204203911.01006630@pop.xs4all.nl> At 17:31 4-2-2005 +0000, Ashley Pittman wrote: >On Fri, 2005-02-04 at 13:35 +0100, Vincent Diepeveen wrote: >> At 00:29 4-2-2005 -0800, Bill Broadley wrote: >> >On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: >> >> Good morning! >> >> >> >> With the intention to run my chessprogram on a beowulf to be constructed >> >> here (starting with 2 dual-k7 machines here) i better get some good advice >> >> on which network to buy. 
Only interesting thing is how fast each node can >> >> read out 64 bytes randomly from RAM of some remote cpu. All nodes do that >> >> simultaneously. >> > >> >Is there any way to do this less often with a larger transfer? >> >If you >> >wrote a small benchmark that did only that (send 64 bytes randomly >> >from a large array in memory) and make it easy to download, build, run, >> >and report results, I suspect some people would. >> >> One way pingpong with 64 bytes will do great. >pingpong is not really the same, adding a random element can slow down >comms and ideally it sounds like you want a one-sided operation. >Perhaps you should look at tabletoy (cray shmem) or gups (MPI) as a >benchmark. Thank you for your answer, I indeed investigated quadrics cards intensively. Ask your college Daniel Kidger. The shmem is an ideal solution for what searching algorithms are doing. Regrettably seems no one is willing to sell old quadrics cards (QM400). >> CPU's are 100% busy and after i know how many times a second the network >> can handle in theory requests i will do more probes per second to the >> hashtable. The more probes i can do the better for the game tree search. > >Are you overlapping comms and compute or doing blocking reads? If you >are overlapping then the issue rate for reads is more important than the >raw latency. A node (chessposition in this case) eats on average 10 us. sometimes that's 50us other times it's 1us. That's the time a cpu is busy calculating a chesstechnical value how good the position is applying human chesspatterns. Called evaluation function in search world. Before applying evaluation function one is doing a lookup to the cache whether one already searched this position. In case of a 2 node beowulf that means you have 50% odds that this position is in local memory and 50% chance it's a remote lookup. 
The reason for this is very simple by explaining the hash function which in a lot of different software gets used too (not only search, also encryption and string matching and all types of caches). For each piece at each square take a random value ( long long randomvalue[12][64] ) XOR all values with each other and you have what is called a Zobrist hash from a position. Very effectively. Nothing beats the speed of Zobrist as you can do it incremental. Now suppose we use the lower 20 bits to lookup at 1 million entries. So we AND the hash number with 2^20 - 1 and lookup at that adress in the hashtable. Obviously such cache is distributed across the nodes. Each node having an equal share of the global transpositiontable as it is called officially. Trivially doing this each 10 us will put too much stress on the network. So usually one doesn't do it at the leaves itself (called quiescencesearch). That means only in 20% of the nodes such a thing gets tried. That's already on average once in each 100 us. The slower the network card, the less remote hashtable lookups one tries obviously. Finding for each cluster an optimum search depth when to try it is of course not so difficult to figure out. 1 lookup reads 64 bytes and that's 4 entries where the position could be stored. 1 entry is 16 bytes and stores quite some information. Apart from a lot of bits to identify a chessposition, the score is there (20 bits) and what the best move was in this position. >> >> quadrics/dolphin seems bit out of pricerange. Myrinet is like 684 euro per >> >> card when i altavista'ed online and i wonder how to get more than 2 nodes >> >> to work without switch. Perhaps there is low cost switches with reasonable >> >> low latency? >> >Do you know that gigabit is too high latency? >> >> The few one way pingpong times i can find online from gigabit cards are not >> exactly promising, to say it very polite. 
Something in the order or 50 us >> one way pingpong time i don't even consider worth taking a look at at the >> picture. >> >> Each years cpu's get faster. For small networks 10 us really is the upper >> limit. > >10us is easily achievable, I've just measured a read time of a little >over 3us and a issue rate of 1.33us. Suppose a 8 node quadrics setup so with a switch in the middle and all processors trying to read nonstop over the network to a random remote processor. Each processor reading out of the 64MB on card. So never in the physical memory of a processor, just at the remote network cards. What speed is achievable then to read 64 bytes? SGI with their supercomputers never get better than 5.8 us there (that's reading 8 bytes) on average (origin3800) when the numaflex routers kick in. Altix3000 is way worse there. More bandwidth optimized i guess. >> So before we start searching every node (=position) we quickly want to find >> out whether other cpu's already searched it. >> >> At the origin3800 at 512 processors i used a 115 GB hashtable (i started >> search at 460 processors). Simply because the machine has 512GB ram. >> >> So in short you take everything you can get. > >So is this a parallel algorithm or simply a big "memory farm" you are >after? You don't hear much of clusters being used for the latter but in >some cases it's a eminently sensible thing to do. I take care the cpu's get nearly 100% load and say am prepared to sacrafice 10% of the scaling at a network to read/write latency to the hashtable. So i just figure out how many reads i can do in 10% system time and fill that with reads. The other 90% system time it has to evaluate chesspositions and be busy with the real stuff. By putting the depth in the search at which it is allowed to read higher or lower, i can manual adjust the traffic over the network. 
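A minimal sketch of the Zobrist hashing and table lookup described above, written in Python for brevity. The table dimensions follow the `randomvalue[12][64]` declaration and the 20-bit index from the text. The fixed seed, the names `move_piece`, `bucket_index` and `owner_node`, and the rule that each node owns an equal contiguous slice of the table are assumptions for illustration, not DIEP's actual code. Side to move, castling and en passant are left out of the key.

```python
import random

PIECE_KINDS = 12   # 6 piece types x 2 colours
SQUARES = 64
INDEX_BITS = 20    # low 20 bits of the key select one of 2**20 buckets

# Fixed seed (an assumption) so that every node builds the same table.
_rng = random.Random(2005)
RANDOM_VALUE = [[_rng.getrandbits(64) for _ in range(SQUARES)]
                for _ in range(PIECE_KINDS)]


def zobrist(position):
    """Hash a position given as an iterable of (piece_kind, square) pairs."""
    key = 0
    for piece, square in position:
        key ^= RANDOM_VALUE[piece][square]
    return key


def move_piece(key, piece, from_sq, to_sq):
    """Incremental update: XOR the piece out of from_sq and into to_sq."""
    return key ^ RANDOM_VALUE[piece][from_sq] ^ RANDOM_VALUE[piece][to_sq]


def bucket_index(key):
    """AND the key with 2**20 - 1 to get the bucket address."""
    return key & ((1 << INDEX_BITS) - 1)


def owner_node(key, num_nodes):
    """Hypothetical placement rule: node i holds the i-th equal slice
    of the global table, so the top of the bucket index picks the node."""
    return (bucket_index(key) * num_nodes) >> INDEX_BITS


if __name__ == "__main__":
    # White pawns on e2 and d2, black pawn on e7 (squares numbered a1=0).
    start = [(0, 12), (0, 11), (6, 52)]
    key = zobrist(start)
    # 1.e4 e5 2.d4 and 1.d4 e5 2.e4 transpose into the same position.
    a = move_piece(move_piece(move_piece(key, 0, 12, 28), 6, 52, 36), 0, 11, 27)
    b = move_piece(move_piece(move_piece(key, 0, 11, 27), 6, 52, 36), 0, 12, 28)
    print(a == b, owner_node(a, 2))
```

Because XOR is its own inverse and order-independent, both move orders give the same key, which is what lets one cpu find a position another cpu already searched. With keys spread evenly and two nodes, `owner_node` points at the remote machine for about half of all lookups.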
Best regards, Vincent >Ashley, From diep at xs4all.nl Fri Feb 4 12:39:47 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> At 11:38 4-2-2005 -0800, Bill Broadley wrote: >> >> One way pingpong with 64 bytes will do great. >> > >A very similar number I build a circularly linked list and read a value, >add 1 to it, and send it to the next host, with a GigE network: > >compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local compute-0-8.local compute-0-7.local compute-0-2.local compute-0-4.local >size= 10, 131072 hops, 8 nodes in 5.30 sec ( 40.4 us/hop) 966 KB/sec > >Oh, you said 64 (I'm sending INTs, so 16): >size= 16, 131072 hops, 8 nodes in 5.35 sec ( 40.8 us/hop) 1531 KB/sec I'm amazed you get it to 40.8 us. Probably you tested at an idle network? How fast is it when the cpu's are 100% busy doing integer work? >> CPU's are 100% busy and after i know how many times a second the network >> can handle in theory requests i will do more probes per second to the >> hashtable. The more probes i can do the better for the game tree search. > >With a gigE network that sounds like 40us or so. With Myrinet or IB >it's in the 4-6us range. If you bought dual opterons with the special At the quadrics and dolphin homepage they both claim 12+ us for Myrinet. For example : http://www.dolphinics.com/pdf/datasheet/Dolphin_socket_4p.pdf >hypertransport slot you could get it down to 1.5us or so. SGI >altix machines can get that down again to around 1.0us. Of course >speed isn't cheap. Altix3000 has worse latency than origin3800 if interpret results well. Altix3000 is 3-4 us one way pingpong at 64 processors, which origin3800 gets at 512 processors. At 64 processors see extensive benchmarking by prof Aad v/d Steen for dutch government organisations. His results are at www.sara.nl in pdf format. Look for his presentation 1 july 2003. 
When i ran at limited number of cpu's my latency tests (using shared memory) the origin3800 really is a lot faster in latency than altix3000. A problem of altix3000 design is that of course scheduling is very hard thanks to the complex routing as each brick is connected to 2 routers which each connect to other parts of the machine. This causes for immense scheduling problems when there is a 150 users simultaneously on the machine normally spoken which are not there when you can benchmark an entire empty machine with just 1 user. >> The few one way pingpong times i can find online from gigabit cards are not >> exactly promising, to say it very polite. Something in the order or 50 us >> one way pingpong time i don't even consider worth taking a look at at the >> picture. >> >> Each years cpu's get faster. For small networks 10 us really is the upper >> limit. > >Okay, so dolphin, myrinet, or IB. Have URL's from where IB is buyable without needing to buy entire system? >> Let's not discuss parallel chess algorithm too much in depth. 100 different >> algorithms/enhancements get combined with each other. They are not the >> biggest latency problem. The latency problem is caused by the hashtable. >> Hashtable is a big cache. The bigger the better. It avoids researching the >> same tree again. > >Okay, so my question is, which would be better: >* 8 4GB caches that you could query 80 million times a second? This one by far. Actually for the top searches not such big caches are needed. Locally i may allocate 200-400MB a cpu for cache, but a shared cache can be easily as low as 4MB a cpu, no problem. Could get it even down to less than that if needed. 99% of all nodes (chesspositions) that get searched are near the leafs. So if i move up the variable where it also may lookup at remote cpu's from 0 to 2, then already 99% of all nodes don't get looked up remote. >* 1 64GB cache that you could query 200,000 times a second? 
>> In games like chess and every search terrain (even simulated flight) you >> can get back to the same spot by different means causing a transposition. >> Like suppose you start the game with 1.e4,e5 2.d4 that leads to the same >> position like 1.d4,e5 2.e4. So if we have searched already 1.e4,e5 2.d4 >> that position P we store into a large cache. Other cpu's first want to know >> whether we already searched that position. > >Right. But if you can calculate a few Billion operations per second >sometimes it is faster to recalculate then wait 10-20us for an answer. To look 1 ply deeper in search is exponential. At a 460 cpu search (origin3800) moving the variable from 1 (default so it was already not storing/looking up the leaves remote) to 10, lost me 7 ply search depth. That's about 3^7 = factor 2187 To answer the question, YES 1 fast pc processor would outsearch in such a case handsdown a 512 processor supercomputer. Supercomputers are of course notorious here. It takes a year or so to deliver them and the processor chosen at the time of buying already wasn't the fastest, so when they finally work fine for users the processors are at least 2 times slower than pc processors (for integer work). Clusters are far superior in that respect. >> Those hashtable positions get created quite quickly. Deep Blue created them >> at a 100 million positions a second and simply didn't store vaste majority >> in hashtable (would be hard as it was in hardware). That's one of the >> reasons why it searched only 10-12 ply, already in 1999 that was no longer >> spectacular when 4 processor pc's showed up at world champs. > >Indeed, better algorithms can allow a 4 cpu to compete with a 2000. The Sheikh (one of the princes of the united arab emirates, see www.hydrachess.com) plans on building a 1024 processor chess computer he told me over MSN. He's having bad advisors IMHO. He's using myrinet and a bad parallel search (speedup less than square root out of total number of cpu's). 
Objectivity and desert sand are a bad combination. >> At a PC with a shared hashtable nowadays i get 10-12 ply (ply = half move, >> full move is when both sides make a move) in a few seconds, searching a >> 100000 positions per second a cpu. >> >> So before we start searching every node (=position) we quickly want to find >> out whether other cpu's already searched it. > >So that operation will cost around 80us with GigE, and 10-16us with IB >or Myri. 80 us is what i read elsewhere too yes for GigE. Is it so hard to make a card with lower latency for a few dollar? I mean if i buy for 135 euro a cpu i can get myself an opteron 1.4Ghz or something. If i buy for 1000 euro i get myself say a 2.4Ghz opteron. Less than factor 2 faster. If you buy for 135 euro a network card it is 80 us. When you buy a highend netwerk card it's factor 10 faster from user viewpoint. That's quite a lot! >> At the origin3800 at 512 processors i used a 115 GB hashtable (i started >> search at 460 processors). Simply because the machine has 512GB ram. > >The origin 3800 has a very healthy interconnect, shared memory lookups >are in the few 100 ns range, and MPI with the newest libraries are >in the 1-2us range. If the interconnects (hubs) of the origin are fine, then they must use real slow routers. It's 5.8 us is a shared memory lookup on average at 460 processors origin3800, no one else at the system (looking up 8 bytes). 3-4 us one way pingpong. That machine is equipped with so called 35ns routers. Lookup to local memory is 280 ns by the way at both itanium2 as well as origin. Of course everything is randomized. It's complete TLB trashing. >> So in short you take everything you can get. > >Of course. > >> The search works with internal iterative deepending which means we first >> search 1 ply, then 2 ply, then 3 ply and so on. 
>> >> The time it takes to get to the next iteration i hereby define as the >> branching factor (Knuth has a different definition as he just took into >> account 1 algorithm, the 'todays' definition looks more appropriate). >> >> In order to search 1 ply deeper obvious it's important to maintain a good >> branching factor. I'm very bad in writing out mathematical proofs, but it's >> obvious that the more memory we use, the more we can reduce the number of >> legal moves in this position P as next few ply it might be in hashtable, >> which trivially makes the time needed to search 1 ply deeper shorter. >> >> Storing closer to the root (position where we started searching) is of >> course more important than near the leafs of the search tree. >> >> When for example not storing in hashtable last 10 ply near the leafs in an >> overnight experiment the search depth dropped at 460 processors from 20 ply >> to 13 ply. >> >> Of course each processor of supercomputers is deadslow for game tree search >> (it's branchy 100% integer work completely knocking down the caches), so >> compared to pc's you already start at a disadvantage of a factor 16 or so >> very quickly, before you start searching (in case of TERAS i had to fight >> with outdated 500Mhz MIPS processors against opterons and high clocked quad >> Xeons), so upgrading my own networkcards is more clever. > >Interesting. Of course the Origin 3800 is quite dated, not that the >Itanium is an opteron killer, but it is much more competitive, and has >much larger caches. Itanium 1.3Ghz using 24 hours of PGO and after i figured out all kind of options in the compiler to not take shortcuts by default, is same speed like a 1.3Ghz opteron for DIEP. I understand why governments buy them. They are good on paper and have no real weak spots. Horror & co to program for those itaniums. L3 cache sizes for diep are not important. See extensive benchmarking at the different hardware sites of my program. 
For example by Johan de Gelas or : Aceshardware : http://www.aceshardware.com/read.jsp?id=60000259 Sudhian : http://www.sudhian.com/showdocs.cfm?aid=635&pid=2403 Soon also tested at www.anandtech.com ! >> Yet getting yourself a network even between a few nodes as quick as those >> supercomputers is not so easy... > >Quadrics and Pathscale's infinipath have networks available that are in the >same ballpark as the SGI origin. Even dolphin although I'm not very >familar with them. I am very impressed by the quadrics and dolphin cards. Probably by infinipath too when i check them out. Will do. I'm not so impressed yet by myrinet actually, but if cluster builders can earn a couple of hundreds of dollars more on each node i'm sure they'll do it. >> Additional your own beowulf network you can first decently test at before >> playing at a tournament, and without good testing at the machine you play >> at in tournaments you have a hard 0% chance that it plays well. >> >> The only thing in software that matters is testing. > >Indeed, good luck, thanks for the overview. I'm planning on a cluster >with a very fast (sub 2.5us network), but I won't have it for a few months. > >I had some infiniband hardware on loan, but I had to return it. > >-- >Bill Broadley >Computational Science and Engineering >UC Davis > > From diep at xs4all.nl Fri Feb 4 13:33:47 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050204223345.01009170@pop.xs4all.nl> Thanks for your deep inside, this is very helpful! Vincent www.diep3d.com At 12:57 4-2-2005 -0700, Josip Loncaric wrote: >Vincent Diepeveen wrote: >> At 00:29 4-2-2005 -0800, Bill Broadley wrote: >>> >>>Do you know that gigabit is too high latency? > >Gigabit Ethernet adapters often need tweaking to deliver reasonable >latency, bandwidth, and CPU utilization. 
>
>For example, if your system uses the e1000 driver (Intel's gigabit
>Ethernet), the default setting is "dynamic Interrupt Throttle Rate" --
>which means that the card will delay interrupting the CPU by up to about
>130 microseconds after receiving a packet. Moreover, the "dynamic" part
>causes the network chip microcode to vary this delay in multiples of
>about 16 microseconds, so that different packets will generally
>experience different receive delays.
>
>For the e1000 driver,
>https://lists.dulug.duke.edu/pipermail/dulug/2004-August/015415.html
>recommends using "options e1000 InterruptThrottleRate=80000" (add this
>line to /etc/modules.conf). Users of this driver may also want to check
>Intel's parameters for e1000 listed at
>http://www.intel.com/support/network/sb/cs-009209.htm#parameters -- just
>don't assume that the default values are appropriate for cluster use.
>
>Other gigabit Ethernet adapters have similar interrupt mitigation
>strategies, all designed to gracefully cope with high packet rates at
>high network speeds. For cluster use, adjustments are usually advisable.
>
>The basic Rx interrupt mitigation scheme is this: the receiver's CPU
>won't be interrupted until at least N packets have arrived or M
>microseconds have elapsed (whichever comes first). This clearly adds up
>to M microseconds to network latency. BTW, one often sees N=6
>(otherwise NFS performance can seriously degrade) and M>=16. Other
>variants of this basic scheme are possible; but they all mean increased
>latencies.
>
>Finally, don't forget the Tx side interrupt mitigation, or else the
>sending CPU might not be told promptly that it's OK to send more. The
>default Tx settings are probably fine for full size packets, but if your
>applications send lots of small packets, tweaking your network driver's
>Tx settings may help.
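The basic Rx scheme quoted above can be put into a small model to see how much latency a given N and M add. This is only a sketch in Python: it assumes the M-microsecond timer starts when the first not-yet-reported packet arrives, which the description does not spell out, and the function names are made up for illustration. It does not model any particular driver or card. The last helper assumes InterruptThrottleRate is a cap on interrupts per second.

```python
def interrupt_times(arrivals_us, n_packets, m_us):
    """Time at which the CPU is interrupted for each packet.

    arrivals_us must be sorted. Rule modelled: interrupt as soon as
    n_packets are pending, or m_us after the first pending packet
    arrived, whichever comes first (timer start is an assumption).
    """
    reported = []
    pending = []
    for t in arrivals_us:
        # The timer for the current batch expired before this arrival.
        if pending and t >= pending[0] + m_us:
            reported.extend([pending[0] + m_us] * len(pending))
            pending = []
        pending.append(t)
        # Packet-count threshold reached: interrupt immediately.
        if len(pending) >= n_packets:
            reported.extend([t] * len(pending))
            pending = []
    if pending:
        reported.extend([pending[0] + m_us] * len(pending))
    return reported


def added_latency(arrivals_us, n_packets, m_us):
    """Extra receive delay per packet caused by interrupt mitigation."""
    times = interrupt_times(arrivals_us, n_packets, m_us)
    return [done - arrived for done, arrived in zip(times, arrivals_us)]


def throttle_interval_us(interrupts_per_second):
    """Minimum spacing between interrupts for a given rate cap."""
    return 1_000_000 / interrupts_per_second


if __name__ == "__main__":
    # Ping-pong style traffic: one small packet every 200 us.
    print(added_latency([0, 200, 400], n_packets=6, m_us=130))
    # A burst of six back-to-back packets trips the N threshold instead.
    print(added_latency([0, 1, 2, 3, 4, 5], n_packets=6, m_us=16))
    print(throttle_interval_us(80000))
```

In this model, traffic that arrives one small packet at a time never reaches the N threshold, so every packet waits the full M microseconds. That is the case that matters for a latency-bound lookup pattern rather than for bulk transfers.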
> >Sincerely, >Josip >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From rross at mcs.anl.gov Fri Feb 4 10:16:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] MPICH2: Handle Limit? In-Reply-To: References: Message-ID: Hi Ron, There should not be an 84 handle limit. Can you tell me what version of MPICH2 this is, and what architecture and OS you're running on? Do you have a simple test that exhibits the problem? Thanks, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Thu, 3 Feb 2005, R Hamann wrote: > I've been having some strange problems with a program using the MPICH2 > library. When I added some new datatypes for ghost cell exchange, the > program would hang. I figured out that any number of handles over 84 > would cause this. Fortunately, I could delete some handles that I no > longer needed, but it still seemed strange. Are my calculations > correct that for each process there is an 84 handle limit? or am I > seeing some other problem? > > Ron From rross at mcs.anl.gov Sat Feb 5 08:05:04 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] MPICH2: Handle Limit? In-Reply-To: References: Message-ID: Hi Ron, Well there *is* a limit, because the handles are represented by an integer, but from a practical perspective you should never have to worry about it. I have not ever encountered this before. I wrote most of that code, so I would very much like to figure out what is happening in your case. I tend to agree that it is probably some sort of buffer overrun. We test on IA32 with gcc as our primary environment. What exactly is happening when it "bombs"? Are you getting a segfault? Is this something where you could capture a core file and get a stack trace? Are there any errors reported? 
Will the problem manifest itself with a single-process run? If so, you could try valgrind. Actually, while we're discussing it, why do you need "lots" of datatypes to exchange ghost cells? There might be a way to simplify that too. Regards, Rob On Fri, 4 Feb 2005, R Hamann wrote: > I thought any limit would be wierd, let alone something like 84 (7 X > 12?) Anyway, I thought it was based on the number of MPI variables > declared (data_types, windows, requests) because every time I added > new declarations, it would hang on Fedora core 2, but run to > completion on Scyld (but with erroneous results). If I deleted unused > MPI declarations, it would start to work again. I counted all my > handles and came up with 84. > > However, after deleting two 26 element arrays of handles, I thought it > would work. When I added more handles, it bombed again. I started to > try other things. I added 4 junk ints. I didn't use the variables I > declared, but it still bombed. When I converted them to chars, it > started working again. Very strange. > > Have you ever encountered this before? I'm doing a 3d cellular > automata, so I need a lot of datatypes for exchange of ghost cells. > It's obviously some strange error I've made that's manifesting itself > in MPI instead of a runtime or sytax error. I'm gonna try looking for > any buffer overruns now, but other than that I'm stumped. > > GCC on Fedora Core 2 and on Scyld Beowulf > MPICH 2 1.0 > > Thanks, > > R From h.jasak at wikki.co.uk Fri Feb 4 11:13:04 2005 From: h.jasak at wikki.co.uk (Hrvoje Jasak) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] OpenFOAM Message-ID: <4203C940.1000402@wikki.co.uk> Hi Mike, I've just found your post on OpenFOAM. I am one of the (two) main authors/developers of FOAM and have been using it since 1993. Linux is these days the main and most important parallel platforms for FOAM and it is regularly used for large-scale simulations (especially LES). 
I am still developing the code and doing research/working with students etc. with it - if you've got any questions or would like to get involved in keeping FOAM alive, please feel free to contact me. Regards, Hrvoje Jasak -- Dr. Hrvoje Jasak Wikki Ltd. 10 Palmerston House, Tel: +44 (0)20 7221 9815 60 Kensington Place, E-mail: H.Jasak@wikki.co.uk London W8 7PU, United Kingdom From mprinkey at aeolusresearch.com Fri Feb 4 13:00:02 2005 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204205034.GA18717@greglaptop.internal.keyresearch.com> Message-ID: On Fri, 4 Feb 2005, Greg Lindahl wrote: > On Fri, Feb 04, 2005 at 03:20:23PM -0500, Andrew Piskorski wrote: > > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > > > Please note MPI is probably what i'll use, though i keep finding > > > online information about 'gamma'. Is that faster latency than MPI > > > implementations? > > > > http://www.disi.unige.it/project/gamma/ > > In addition to gamma, there's also MVAPICH from LBL, and at least two > commercial products, one from Scali, and one from the Cluster > Competence Center. > > -- greg Greg, I think you mean MVICH at LBL. It and MVIA are all but dead, AFAICT: http://old-www.nersc.gov/research/FTG/mvich/index.html Mike From fant at pobox.com Fri Feb 4 13:27:47 2005 From: fant at pobox.com (Andrew D. Fant) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] SGE web frontends In-Reply-To: <4203913A.1030202@cmrl.wustl.edu> References: <420376B0.7000107@scalableinformatics.com> <4203913A.1030202@cmrl.wustl.edu> Message-ID: <4203E8D3.9030509@pobox.com> Brian Henerey wrote: > > I don't mean to hijack this thread, but I'd also be interested to know > if there are any open source web frontends for launching jobs on > clusters. I've mostly written my own anyway, but if something's out > there I'd like to know. 
> > Thanks, > Brian Henerey Most of this is admittedly not open-source, but it is what I can think of off the top of my head for web/gui cluster front end tools. I think Platform explored a web front end for LSF after they killed off the xlsf tools. The tool I have seen lately that I would be more interested in seeing more of is Auger from the Jefferson Laboratory in Norfolk. Technically it's not a web front end, because it's a java front end tool, but it looks nice in any case. Most of the true web front ends for cluster jobs that I have seen are application specific portals. NCSA has some examples, and PNL has a nice distributed web front end for computation chemistry applications, as well. Andy From nix at petelancashire.com Sat Feb 5 10:07:56 2005 From: nix at petelancashire.com (Pete Lancashire) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] real hard drive failures In-Reply-To: References: Message-ID: <1107626876.3794.17.camel@l1.pdxeng.com> The nice thing and about the only nice thing about using a fan is in this case, the failure of a fan is not going to kill you. If your mother board has a 3-wire fan 'port' not used you can have it report failure. In the past I've built using a 8pin MicroChip a simple failure detector. I would think with some imagination you could take a 555 + transistor + pizo buzzer and create a simple alarm. Another thing to use but I've not seen as an individual item is a heat sink. The Sun SPUD brackets come with a plate that attaches to the bottom of the drive, the plate has been punched with hmmm .. louvers ?. -pete "ah the days of so many fans you could not hear yourself talk" On Tue, 2005-01-25 at 14:26, Mark Hahn wrote: > > > I'm only partially interested in the thread "Cooling vs HW replacement" but > > > the problem with drive failures is a real pain for me. So, I thought I'd > > > share some of my experience. 
> > > > i'd add 1 or 2 cooling fans per ide disk, esp if its 7200rpm or 10,000 rpm > > disks > > I'm pretty dubious of this: adding two 50Khour moving parts to > improve the airflow around a 1Mhour moving part which only dissipates > 10W in the first place? designing the chassis for proper airflow > with minimum fanage is obviously smarter and probably safer. > > > - if downtime is important, and should be avoidable, than raid > > is the worst thing, since it's 4x slower to bring back up than > > a single disk failure > > eh? you have a raid which is not operational while rebuilding? > > > - raid will NOT prevent your downtime, as that raid box > > will have to be shutdown sooner or later > > ( shutting down sooner ( asap ) prevents data loss ) > > huh? hotspares+hotplug=zero downtime. > > but yes, treating whole servers as your hotspare+hotplug element is > a nice optimization, since hotplug ethernet is pretty cheap vs > $50 hotplug caddies for each and every disk ;) > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From john.hearns at streamline-computing.com Sun Feb 6 00:55:21 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie Question In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com> References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> Message-ID: <1107680121.28574.5.camel@Vigor45> On Sat, 2005-02-05 at 02:23 +0700, Monang Setyawan wrote: > Hi. I'm a newbie in this parallel computing thing. > (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? 
> You could start by looking at the BioBrew Linux distribution. It probably has a lot of the tools you want for this work. http://bioinformatics.org/biobrew From john.hearns at streamline-computing.com Sun Feb 6 01:07:05 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> Message-ID: <1107680825.28574.14.camel@Vigor45> On Fri, 2005-02-04 at 21:39 +0100, Vincent Diepeveen wrote: > > > >So that operation will cost around 80us with GigE, and 10-16us with IB > >or Myri. > > 80 us is what i read elsewhere too yes for GigE. > > Is it so hard to make a card with lower latency for a few dollar? > > I mean if i buy for 135 euro a cpu i can get myself an opteron 1.4Ghz or > something. If i buy for 1000 euro i get myself say a 2.4Ghz opteron. We supply turnkey clusters with the SCore environment, which gives excellent latency figures using standard gigabit ethernet NICs. If you are looking for different hardware, Google for 'TOE' - TCP Offload Engine. These are claimed to offer lower latency than onboard adapters. But caveats apply: I've no idea how these work with MPI type applications, as they're probably aimed at high bandwidth applications, and it is probably more cost effective to go Myrinet/Quadrics/IB Actually, it would be worth having the list's opinions on TOE adapters. My guess is that they really don't do much for the latency, but would be very good on webservers and databases servers. From rgb at phy.duke.edu Sun Feb 6 06:15:56 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie Question In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com> References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> Message-ID: On Sat, 5 Feb 2005, Monang Setyawan wrote: > Hi. I'm a newbie in this parallel computing thing. 
> (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? > > I am going to use message passing paradigm (MPI) to write the > software. I've read that there are several choice of MPI > implementation. The problem is, I'm bad in both C or Fortran (I > usually use Java as my favorite language). Some source said that Java > (or it's MPI wrapper or pure MPI implementation) isn't good enough to > implement a parallel computing solution. Is that right? > > My third question is, is there any pdf/ps/one file version of > "Engineering a Beowulf-style Compute Cluster''? On my personal website, on brahma, both. Follow the links for beowulf and beowulf book on my personal page, or use google with "beowulf book pdf" to go right there. Also, there are images for both US letter and Euro A4 there, as you might have either kind of printer/paper. rgb > > Thanks in advance. > > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From patrick at myri.com Sat Feb 5 18:27:57 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> Message-ID: <420580AD.5050003@myri.com> Hi Vincent, Vincent Diepeveen wrote: >>>CPU's are 100% busy and after i know how many times a second the network >>>can handle in theory requests i will do more probes per second to the >>>hashtable. The more probes i can do the better for the game tree search. >> >>With a gigE network that sounds like 40us or so. With Myrinet or IB >>it's in the 4-6us range. 
If you bought dual opterons with the special > > > At the quadrics and dolphin homepage they both claim 12+ us for Myrinet. Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), that includes fibers and a switch in the middle:

  Length   Latency(us)   Bandwidth(MB/s)
       0         2.684             0.000
       1         2.874             0.336
       2         2.898             0.690
       4         2.978             1.343
       8         2.965             2.699
      16         2.993             5.347
      32         3.409             9.388
      64         3.563            17.960
     128         3.977            32.185
     256         5.699            44.916

Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I didn't hear about noticeable SCI clusters in a long time. > I am very impressed by the quadrics and dolphin cards. Probably by > infinipath too when i check them out. Will do. > > I'm not so impressed yet by myrinet actually, but if cluster builders can > earn a couple of hundreds of dollars more on each node i'm sure they'll do it. I don't think Myrinet would be the cheapest, I am sure you can get a better deal from desperate interconnect vendors. What does not impress you in Myrinet ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From diep at xs4all.nl Sat Feb 5 19:36:20 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> At 21:27 5-2-2005 -0500, Patrick Geoffray wrote: >Hi Vincent, > >Vincent Diepeveen wrote: >>>>CPU's are 100% busy and after i know how many times a second the network >>>>can handle in theory requests i will do more probes per second to the >>>>hashtable. The more probes i can do the better for the game tree search. >>> >>>With a gigE network that sounds like 40us or so. With Myrinet or IB >>>it's in the 4-6us range. If you bought dual opterons with the special >> >> >> At the quadrics and dolphin homepage they both claim 12+ us for Myrinet.
> >Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), >that includes fibers and a switch in the middle:
>
>   Length   Latency(us)   Bandwidth(MB/s)
>        0         2.684             0.000
>        1         2.874             0.336
>        2         2.898             0.690
>        4         2.978             1.343
>        8         2.965             2.699
>       16         2.993             5.347
>       32         3.409             9.388
>       64         3.563            17.960
>      128         3.977            32.185
>      256         5.699            44.916
>
>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I >didn't hear about noticeable SCI clusters in a long time. > >> I am very impressed by the quadrics and dolphin cards. Probably by >> infinipath too when i check them out. Will do. >> >> I'm not so impressed yet by myrinet actually, but if cluster builders can >> earn a couple of hundreds of dollars more on each node i'm sure they'll do it. > >I don't think Myrinet would be the cheapest, I am sure you can get a >better deal from desperate interconnect vendors. > >What does not impress you in Myrinet ? Thanks for your kind answer Patrick, Obviously i mentionned that number because i read it elsewhere. Well a number of points bother my mind from which majority is true for others as well. But first let me note that i'm not against myrinet in general. I am just trying to solve a very specific case. For that specific case i'm not so impressed. Note that so far i didn't find any desperate vendor. For sure quadrics doesn't look desperate to me, they aren't even selling old cards anymore though they must have still thousands of them lying at home from returned upgraded networks. Finding second hand highend cards seems to be very seldom. First of all i'm interested in how quick i can get 4-64 bytes from remote memory. So not from some kind of network card cache, as myrinet doesn't have some megabytes on chip, but just a few tens of kilobytes. The memory has to come therefore from the remote nodes main memory, at a random adress in the main memory. No streaming at all happens. that 400 ns extra that the TLB gives is definitely not the problem i guess.
The problem for me is to understand: "how do you get that memory at a cluster?" A latency on paper says of course nothing when you can't actually get it within that time. "Paper supports everything." Arturo Ochoa (Caracas, Venezuela) I hope everyone realizes that an important consequence from beowulf clusters is that you actually want to *use* all those cpu's you have to your avail. So every cpu has a program running that eats 100% system time. Because if it wouldn't use 100% system time, you wouldn't need a cluster! >From that 100% system time obviously you must be prepared to give away some to serve other nodes as quickly as possible doing a read. All latencies i see quoted at all hardware sites, it is very hard to figure for me out whether that's a latency that is supported by paper, or whether it's a practical latency i can take into account as a programmer with all software layers overhead when each cpu is 100% running a program. Secondly, but as i'm not a cluster expert i don't know how to avoid that, it's of course a big LOSS in sequential speed if my program each few instructions must check whether there is some MPI message to get handled. If i check a lot that will slow down my program 20 times. If i don't check a lot, other cpu's will have to wait longer and that defeats the purpose of a fast network card. Factor 20 is about the slowdown of the average 'old' supercomputer chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens), P.Conners (Paderborn-Siemens), cilkchess (MIT). I've been playing with my own eyes against those programs in world champs and despite that it has happened that i played at the same hardware with a similar amount of cpu's and a program having factor 100 more chessknowledge (which slows down the program *considerable*), the actual speed at which the program searches nodes was up to factor 5-10 faster. 
Now a few years ago this was not a major problem because for example Cilkchess which obviously ran factor 20-40 times slower than it could, used 1800 processors for example in world champs 1995 (Hong kong) and 512 processors in world champs 1999 (Paderborn). Of course because 1 processor was real real fast compared to the speed of 1 pc processor in those days, they practical were searching a lot deeper than pc programs (and both played excellent for its days, especially Don Dailey needs to get a big compliment for that). However if i show up with 2 pc's and 2 network cards, then it sure matters when i lose a lot of speed. Obviously for embarassingly parallel software this is no issue, but usually for embarrassingly parallel software all you need is gigabit ethernet. There is so many MPI applications which are not exactly embarassingly parallel from which you see that a decent programmer single cpu would be doing that 20 times faster. Or to quote someone who has been doing such rewriting work for some physical applications that run here and there: "I didn't blink my eyes when i managed to speedup an application factor 1000". So it is very interesting for us all and me especially to understand how *fast* you can get that memory under full load of all the logical cpu's. Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. Second problem i have is that i can get easily dual k7 pc's from chessplayers and they can get bought cheap still. Dual k7 is practical same speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a cheapo dual k7 with registered cas3 RAM. Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and also those who do have A64's or P4's usually don't have pci-x onboard either. 
Sure there is boards that have them and i'm sure that if you make a network Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX mainboards and claim somewhere a paper latency of 1.x us. What is the achieved read speed to remote memory myrinet gets at 64 bits / 66Mhz in software, so ready to use 4-64 bytes for applications? I'm not asking it to be accurate within 400ns, as that's the delay you'll have from TLB trashing the remote node. But accuracy within 1.5 us would be quite nice. First of all for integer intensive applications i'm doing fastest processor is opteron, k7 comes second and P4 comes third. Exception is a P4 machine equipped with the most expensive stuff (2-2-2 ram and all banks filled) good mainboard and northwoods and overclocked at the mainboard. However for that price a dual opteron can get bought and it just blows away that P4 bigtime. Every year that new software gets released of course that P4 gets slower, because newer software only gets more and more complex with more options and will fit less perfectly in P4's small tiny caches, let alone when we get a lot of 64 bits programs. They won't fit at all in those tiny slow caches. So until the dual core opterons arrive at low cost, obviously you can make dual k7 nodes for just a few hundreds of dollar a node. When adding new nodes which in the future no doubt are dual opteron, you still run further with those dual k7 nodes and want to mix them obviously with dual opterons. Is that possible? >Patrick >-- > >Patrick Geoffray >Myricom, Inc. >http://www.myri.com > > From diep at xs4all.nl Sun Feb 6 07:10:39 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie Question Message-ID: <3.0.32.20050206161035.01013bd0@pop.xs4all.nl> At 02:23 5-2-2005 +0700, Monang Setyawan wrote: >Hi. I'm a newbie in this parallel computing thing. 
>(sorry for my bad english, I'm Indonesian) > >My current project is a software that analyze DNA/Protein sequence >data that needs high performance aspect on it. I plan to deploy this >software on network of workstations (mm, may be just about 10 PCs on >the network). Am I in wrong place now? >I am going to use message passing paradigm (MPI) to write the >software. I've read that there are several choice of MPI >implementation. The problem is, I'm bad in both C or Fortran (I >usually use Java as my favorite language). Some source said that Java >(or it's MPI wrapper or pure MPI implementation) isn't good enough to >implement a parallel computing solution. Is that right? You definitely want to write it in C. Basically protein research, which might touch a field which is forbidden to research in EU countries, but not forbidden to research in USA, Israel and i must admit i'm amazed that's legal in Indonesia usually is heavily floating point oriented. Just calculating what i would classify as matrix invariants to determine origins and consequences of modifications. In C there is superb libraries you want to consider. Certain calculations can get speeded up bigtime by FFT, but not always, as sometimes you just want accurate results and not approximations. C is ideal because it's easier to use SSE2 for it which is what you need of course. Please note both P4 and A64/Opteron have that functionality and Opteron is 2 times faster than P4 there, but perhaps you can get the P4 hardware factor 2 cheaper, which would make it very attractive for such a cluster. In all cases such software is embarassingly parallel. gigabit ethernet is more than sufficient. Yet taking care the pc's have relative fast floating point possibilities is very relevant. Cheapest gflop per dollar might be probably surprising hardware. A beowulf definitely is ideal for this type of software. >My third question is, is there any pdf/ps/one file version of >"Engineering a Beowulf-style Compute Cluster''? 
> >Thanks in advance. > >-- >For the sake of time.. >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From wytsang at clustertech.com Sun Feb 6 19:24:43 2005 From: wytsang at clustertech.com (Clotho) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] ifort MPI_FILE_OPEN err with romio testsuite Message-ID: <4206DF7B.3040705@clustertech.com> In MPICH-1.2.6, romio directory, there is a test program called "fcoll_test.f". The test program runs successfully with the gcc compiler. However, with the ifort (8.0/8.1) compiler, the program fails. After debugging, I find that the function MPI_FILE_OPEN fails (ierr is non-zero). However, changing the size of the character array from 1024 to 200 solves the problem. I have found another person with a similar experience: (in Chinese) http://www.lasg.ac.cn/cgi-bin/forum/view.cgi?forum=4&topic=2519 Here is the full program : http://clustertech.com/~wytsang/fcoll_test.f Here is the simpler version of the program.

      program main
      implicit none

      include 'mpif.h'

      integer nprocs
      integer mynod
      integer fh, ierr
      character*1024 str    ! used to store the filename
c     character*200 str     ! this will work
      integer writebuf(1)

      call MPI_INIT(ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, mynod, ierr)

      str = 'test'
      writebuf(0) = 0

      call MPI_FILE_OPEN(MPI_COMM_WORLD, str,                           &
     &     MPI_MODE_CREATE+MPI_MODE_RDWR, MPI_INFO_NULL, fh, ierr)
      print *,ierr

      call MPI_FINALIZE(ierr)

      stop
      end

From patrick at myri.com Mon Feb 7 00:11:58 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> Message-ID: <420722CE.5010408@myri.com> Vincent, Vincent Diepeveen wrote: > Thanks for your kind answer Patrick, > > Obviously i mentionned that number because i read it elsewhere. I know, I have seen worse. > Note that so far i didn't find any desperate vendor. For sure quadrics > doesn't look desperate to me, they aren't even selling old cards anymore > though they must have still thousands of them lying at home from returned > upgraded networks. Finding second hand highend cards seems to be very seldom. Tip: desperate companies are usually young and spend a lot of VC money on marketing. Quadrics does not fit, I am afraid, they have been around too long :-) Furthermore, selling old hardware is not very cost effective for a vendor: compatibility troubles with newer machines, require to support old hardware in new drivers and new middlewares, tap in inventory reserved for replacement parts, etc. > First of all i'm interested in how quick i can get 4-64 bytes from remote > memory. So not from some kind of network card cache, as myrinet doesn't > have some megabytes on chip, but just a few tens of kilobytes. The memory > has to come therefore from the remote nodes main memory, at a random adress > in the main memory. No streaming at all happens. that 400 ns extra that the > TLB gives is definitely not the problem i guess.
Myrinet has 2 MB of SRAM in standard, used by firmware code, data and buffers. What you want to do basically is a Get. In practice, the origin of the Get will send a small packet with a virtual address or a RDMA handle and an offset, the NIC on the target side converts it in a physical address, fetches the data by DMA and sends it back to the origin side. > All latencies i see quoted at all hardware sites, it is very hard to figure > for me out whether that's a latency that is supported by paper, or whether > it's a practical latency i can take into account as a programmer with all > software layers overhead when each cpu is 100% running a program. No, it's not likely to fit your usage. Vendors quote MPI latency on pingpong. That's pretty much the cost of sending/receiving an MPI message from user space to user space. Often, this is also with only 2 nodes, optimal conditions and everybody holding their breath. You want RMA Get. The latency for a Get is larger than for a MPI send. For 64 bytes, it is basically the MPI latency for 0 bytes (for the Get request) + the latency for 64 bytes (for the reply). Assuming that you don't Get all over the host memory, the virtual/physical translation will be hot in the target NIC so the translation cost will be very small. You want less than 3us per Get of 64 Bytes ? I don't know if even Quadrics can do it. The good news is that you can pipeline it very well. So it may cost more than 3 us for one Get, but you may complete a Get every 0.5 us if you post a bunch of them. > Secondly, but as i'm not a cluster expert i don't know how to avoid that, > it's of course a big LOSS in sequential speed if my program each few > instructions must check whether there is some MPI message to get handled. If you want perfect overlap and if you are ready to go as low level as possible, one-sided communication are for you (no host CPU involved on the target side). 
All low level communication interfaces support one-sided communications (not yet released for MX on Myrinet, but GM has it). > However if i show up with 2 pc's and 2 network cards, then it sure matters > when i lose a lot of speed. > > Obviously for embarassingly parallel software this is no issue, but usually > for embarrassingly parallel software all you need is gigabit ethernet. If you can and know how to overlap, latency is irrelevant. It's hard to do on complex irregular codes, but you can usually do it if you can use one-sided communications. Don't put your communications in the critical path. Post them early and post many of them concurrently, pipelining will hide the latency of the critical path. That's why desperate vendors use pipelined pingpong to get better curves. > There is so many MPI applications which are not exactly embarassingly > parallel from which you see that a decent programmer single cpu would be > doing that 20 times faster. Or to quote someone who has been doing such Most of the times, you go parallel to go bigger, not faster. If the problem size fits in one node, don't use a cluster, use a multi-processor nodes. You will have more bangs for your bucks. > So it is very interesting for us all and me especially to understand how > *fast* you can get that memory under full load of all the logical cpu's. Using one-sided communications, there is little difference if the CPUs are loaded or not on the target side. > Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. IO bus is more important for the communications part. I don't know of cheapo k7 machines with a decent PCI bus. However, for 64 bytes, even a cheesy PCI will not slow things down that much. > Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX > mainboards and claim somewhere a paper latency of 1.x us. How long can you hold your breath ? 
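Patrick's estimate of a Get as "the MPI latency for 0 bytes (for the request) plus the latency for 64 bytes (for the reply)" can be worked through with the MX figures posted earlier in this thread. The sketch below is only an illustration: the two latencies come from that table, while the pipelining model (first Get pays the full latency, each later Get completes 0.5 us after the previous one) is an assumption built on Patrick's "a Get every 0.5 us" remark, not a measured result. The function names are hypothetical.

```python
# MPI ping-pong latencies in microseconds, keyed by message length in bytes.
# Values copied from the MX table posted earlier in the thread.
MX_LATENCY_US = {0: 2.684, 64: 3.563}


def get_latency_us(reply_bytes, latency_us=MX_LATENCY_US):
    """Estimate one blocking Get: a 0-byte request plus a reply of reply_bytes."""
    return latency_us[0] + latency_us[reply_bytes]


def pipelined_total_us(n_gets, first_get_us, issue_interval_us=0.5):
    """Assumed model: the first Get pays the full latency, and every
    further Get completes issue_interval_us after the previous one."""
    if n_gets <= 0:
        return 0.0
    return first_get_us + (n_gets - 1) * issue_interval_us


single = get_latency_us(64)
total = pipelined_total_us(32, single)
print(f"one 64-byte Get:   {single:.3f} us")
print(f"32 pipelined Gets: {total:.3f} us total, {total / 32:.3f} us each")
```

Under these assumptions a single 64-byte Get costs a little over 6 us, but the amortized cost per Get drops well below 1 us once a few dozen are in flight, which is the point of posting them early and concurrently.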
> What is the achieved read speed to remote memory myrinet gets at 64 bits / > 66Mhz in software, so ready to use 4-64 bytes for applications? I have no idea, I am not even sure that I have a 64 bits/66 Mhz machine around to measure it. With GM, I would say at least 10 us. Certainely more. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Mon Feb 7 01:48:22 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> Message-ID: <42073966.7090009@myri.com> Hi Duncan, duncan.roweth@quadrics.com wrote: > This example reports the average time for 1000 > blocking get calls. Patrick's description of the > mechanism is essentially correct, apart from the > detail that we have a fast path for short operations > that avoids the need to set up a DMA. How can you do one-sided operations without a DMA on the target side ?!? The only way that I can think of is to map the host virtual memory into the NIC memory space and let all memory writes generates PIO writes to actually modify the NIC memory. Surely, you must be talking about another DMA. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From john.hearns at streamline-computing.com Mon Feb 7 01:50:58 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <420722CE.5010408@myri.com> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <420722CE.5010408@myri.com> Message-ID: <1107769858.12606.57.camel@Vigor45> On Mon, 2005-02-07 at 03:11 -0500, Patrick Geoffray wrote: > Vincent, > > Tip: desperate companies are usually young and spend a lot of VC money > on marketing. 
Quadrics does not fit, I am afraid, they have been around > too long :-) Furthermore, selling old hardware is not very cost > effective for a vendor: compatibility troubles with newer machines, > require to support old hardware in new drivers and new middlewares, tap > in inventory reserved for replacement parts, etc. Why would Quadrics have old/second hand hardware to sell anyway? If they have older model cards unsold they would be holding them as spares for customers who are still running those models, as Patrick says. Clusters which have been upgraded or scrapped are unlikely to be returned to Quadrics/Myricom. Clusters are usually bought as completely integrated systems, from companies such as ourselves. We install and configure the Myrinet networking for customers - they don't buy direct from Myricom. And, like many companies on this list, we provide continuing support and advice. So I'd say there is no conspiracy against you - if you are seeking second hand high performance networking gear, look on eBay or ask nicely on this list. I was surprised recently to see small fibre channel switches go very cheaply on eBay - not so long ago you would pay $$$$ for them. From jcownie at etnus.com Mon Feb 7 08:26:29 2005 From: jcownie at etnus.com (James Cownie) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: Message from Patrick Geoffray of "Mon, 07 Feb 2005 04:48:22 EST." <42073966.7090009@myri.com> References: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> <42073966.7090009@myri.com> Message-ID: <20050207162629.C478D1C826@amd64.cownie.net> > Patrick Geoffray wrote: > duncan.roweth@quadrics.com wrote: > > This example reports the average time for 1000 > > blocking get calls. Patrick's description of the > > mechanism is essentially correct, apart from the > > detail that we have a fast path for short operations > > that avoids the need to set up a DMA. 
> > How can you do one-sided operations without a DMA on the target side ?!? > > The only way that I can think of is to map the host virtual memory > into the NIC memory space and let all memory writes generates PIO > writes to actually modify the NIC memory. Surely, you must be talking > about another DMA. I think you're talking at cross-purposes. Patrick is right that in the target machine there is a DMA operation initiated by the NIC. However Duncan is saying that Quadrics don't send a DMA request packet over their network, but have a more optimised less general request that they can issue without having to build a full DMA descriptor in the host machine and transfer it to the target. Therefore in Quadrics' terms no DMA operation is sent over the net, whereas from Patrick's viewpoint a DMA operation _does_ occur. -- -- Jim -- James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From duncan.roweth at quadrics.com Mon Feb 7 01:29:15 2005 From: duncan.roweth at quadrics.com (duncan.roweth@quadrics.com) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505C@exch01.quadrics.com> Patrick, Vincent Some input into your discussion. Here is the data on get latency for Elan4 in an Opteron cluster

quorumi: prun -N2 pgping -f get 0 64
  1:     4 bytes     2.36 uSec     1.69 MB/s
  1:     8 bytes     2.35 uSec     3.40 MB/s
  1:    16 bytes     2.38 uSec     6.73 MB/s
  1:    32 bytes     2.37 uSec    13.50 MB/s
  1:    64 bytes     2.43 uSec    26.30 MB/s

This example reports the average time for 1000 blocking get calls. Patrick's description of the mechanism is essentially correct, apart from the detail that we have a fast path for short operations that avoids the need to set up a DMA. You can probably do a bit better on the very fastest nodes, but this is what I see on the system we have in the office. > You want less than 3us per Get of 64 Bytes ? I don't > know if even Quadrics can do it. Yes we can!
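One way to read the Elan4 figures above is that the bandwidth column is roughly the payload divided by the latency, and that the latency also fixes how many blocking gets a single process can issue per second (the number that matters for remote hashtable probes). The following is a small sketch of that arithmetic; the rows are copied from the pgping output, while the assumption that 1 MB means 10**6 bytes and the function names are mine.

```python
# (bytes, average microseconds per blocking get, reported MB/s),
# copied from the pgping output quoted above.
ELAN4_GET = [
    (4, 2.36, 1.69),
    (8, 2.35, 3.40),
    (16, 2.38, 6.73),
    (32, 2.37, 13.50),
    (64, 2.43, 26.30),
]


def bandwidth_mb_s(nbytes, latency_us):
    """Bytes per microsecond equals MB/s if 1 MB is taken as 10**6 bytes."""
    return nbytes / latency_us


def blocking_gets_per_second(latency_us):
    """Rate for one process issuing one blocking get at a time."""
    return 1e6 / latency_us


for nbytes, latency_us, reported in ELAN4_GET:
    print(f"{nbytes:3d} bytes: "
          f"{bandwidth_mb_s(nbytes, latency_us):6.2f} MB/s computed, "
          f"{reported:6.2f} MB/s reported, "
          f"{blocking_gets_per_second(latency_us):9.0f} gets/s")
```

The computed and reported bandwidths agree to within rounding, and a blocking 64-byte get at 2.43 us corresponds to roughly 410 thousand gets per second from one process.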
> The good news is that you can pipeline it very well.

Indeed. There is lots of parallelism in the hardware
so you can be processing multiple requests at the same
time. In this sequence of short jobs I measure the
average time for 8 byte gets 2 at a time, 4 at a time
etc.

quorumi: prun -N2 pgping -f get -b2 8
  1:     8 bytes     1.32 uSec     6.07 MB/s
quorumi: prun -N2 pgping -f get -b4 8
  1:     8 bytes     1.04 uSec     7.66 MB/s
quorumi: prun -N2 pgping -f get -b8 8
  1:     8 bytes     0.84 uSec     9.47 MB/s
quorumi: prun -N2 pgping -f get -b16 8
  1:     8 bytes     0.82 uSec     9.79 MB/s
quorumi: prun -N2 pgping -f get -b32 8
  1:     8 bytes     0.79 uSec    10.18 MB/s

The limiting factor is the rate at which the remote
NIC can read data over the PCI bus.

Best Wishes
Duncan Roweth
Quadrics Limited

P.S. Clearly our sales people focus on the current product
(Elan4 NICs) but we will be supporting the installed base of
Elan3 systems for some years yet. Most of the big systems have
extended warranties, so we keep a stock of spares, but there are
a few hundred adapters and associated switches. Drop us some mail
if you are interested.

From duncan.roweth at quadrics.com Mon Feb 7 02:01:26 2005
From: duncan.roweth at quadrics.com (duncan.roweth@quadrics.com)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] Home beowulf - NIC latencies
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F627D505E@exch01.quadrics.com>

Patrick

Thanks for your mail.

> How can you do one-sided operations without a DMA on the
> target side ?!?

Gets are done by telling the remote adapter to perform a put
back to the source. This can be a request to start a DMA (for large
transfers) or it can be a request to the Short Transaction
ENgine (STEN). The STEN is a fast path for short puts that can be
used from either the main CPU or from the adapter. It can generate
network packets from a stream of commands and data written either
by the main CPU (as PIO writes) or directly by the adapter.
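
The get mechanism described above can be sketched roughly as follows.
This only illustrates the dispatch decision: the type and function
names are invented and are not the Quadrics/Elan API, and the size at
which the real system switches from the STEN fast path to a DMA is not
stated in this thread, so it is passed in as a parameter.

```c
#include <stddef.h>

/* Hypothetical names for illustration only; not the Elan library API. */
typedef enum {
    GET_PATH_STEN, /* short put sent back via the Short Transaction ENgine */
    GET_PATH_DMA   /* remote adapter starts a DMA back to the source */
} get_path_t;

/* A get is serviced by asking the remote adapter to put the data back
 * to the requester. Choose which kind of request to send for a transfer
 * of `nbytes`. `short_limit` is an assumed cut-off for "short"
 * operations; the real value is not given in the discussion. */
get_path_t choose_get_path(size_t nbytes, size_t short_limit)
{
    return (nbytes <= short_limit) ? GET_PATH_STEN : GET_PATH_DMA;
}
```

With an arbitrary cut-off of 64 bytes, choose_get_path(8, 64) selects
the STEN path and choose_get_path(4096, 64) selects the DMA path.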
There are more details are in the "Hot Chips" paper that we wrote with Fabrizio Petrini of Los Alamos. http://www.c3.lanl.gov/~fabrizio/papers/hot03.pdf Best Wishes Duncan Roweth Quadrics Limited From rcmanglekar at rediffmail.com Mon Feb 7 06:17:05 2005 From: rcmanglekar at rediffmail.com (Rahul Manglekar) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] How-TO Mysql on Lam-cluster? Message-ID: <20050207141705.26471.qmail@webmail29.rediffmail.com> hi all.., i have setup up LAM-MPI cluster on 3 machine for testing. i want do put mysql on cluster..,, such that if mysql need more processor power , it can use processor power of all nodes that are present in cluster. i am using MySQL-4.0. can u guide me please.. thank you in advance.. -- Rahul.. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20050207/7f5ae763/attachment.html From mark.westwood at ohmsurveys.com Mon Feb 7 06:39:42 2005 From: mark.westwood at ohmsurveys.com (Mark Westwood) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie Question In-Reply-To: <5dc04bbf050204112319fe7fbf@mail.gmail.com> References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> Message-ID: <42077DAE.8020806@ohmsurveys.com> Hi Monang Here's my contribution to your decision about which language you program in for your cluster: Suppose that you know Java well, but not C. Suppose that it will take you 6 months to learn C well enough to be able to write your programs in it. In those 6 months you can do an awful lot of computing in Java. If your project is intended to last, say, 9 months, then you might decide that you will program in Java because you will get more computing done that way than by learning a new language. If your project will last much longer then you might decide that learning C will be of benefit, because each program will be faster in C than in Java. 
If you're doing some calculations then I'd suggest that you allow C to be 5 times faster than Java on average for cluster-type computing. Some will tell you that it is more than 10 times as fast (and it is for some types of computation), others that it is no faster (which is true for some types of computation). Another issue (or problem if you look at things that way ) with Java is that the implementations of MPI for Java are non-standard and not as widely used as the implementations for C. You might find it difficult, therefore, to get good support from groups such as this one, for a Java / MPI program. To sum up: If you can write good Java programs to solve your problems on your cluster then you should prefer that to writing bad C (or Fortran) programs. If you find that your Java program is not fast enough then you might think about rewriting parts of it in C (or another compiled language) to achieve specific performance improvements. Hope this helps Mark Monang Setyawan wrote: > Hi. I'm a newbie in this parallel computing thing. > (sorry for my bad english, I'm Indonesian) > > My current project is a software that analyze DNA/Protein sequence > data that needs high performance aspect on it. I plan to deploy this > software on network of workstations (mm, may be just about 10 PCs on > the network). Am I in wrong place now? > > I am going to use message passing paradigm (MPI) to write the > software. I've read that there are several choice of MPI > implementation. The problem is, I'm bad in both C or Fortran (I > usually use Java as my favorite language). Some source said that Java > (or it's MPI wrapper or pure MPI implementation) isn't good enough to > implement a parallel computing solution. Is that right? > > My third question is, is there any pdf/ps/one file version of > "Engineering a Beowulf-style Compute Cluster''? > > Thanks in advance. 
> -- Mark Westwood Parallel Programmer OHM Ltd The Technology Centre Offshore Technology Park Claymore Drive Aberdeen AB23 8GD United Kingdom +44 (0)870 429 6586 www.ohmsurveys.com From deadline at clusterworld.com Mon Feb 7 07:34:44 2005 From: deadline at clusterworld.com (Douglas Eadline, Cluster World Magazine) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <20050204202023.GA32459@piskorski.com> Message-ID: On Fri, 4 Feb 2005, Andrew Piskorski wrote: > On Thu, Feb 03, 2005 at 04:53:27AM +0100, Vincent Diepeveen wrote: > > > Please note MPI is probably what i'll use, though i keep finding > > online information about 'gamma'. Is that faster latency than MPI > > implementations? > > http://www.disi.unige.it/project/gamma/ > > Gamma is a non-TCP/IP Linux 2.6.x network driver for Intel Pro/1000 > gigabit ethernet cards, for use with MPI. It offers much better > latency (11 us or so) than TCP/IP over ethernet (maybe 60 or 100 us), > but worse than the specialized HPC interconnects (maybe 3 us). The "60-100 us" is incorrect. With proper tuning an e1000 can get 25us latency (using netpipe). (see Jossip's post about tunning parameters) Oh, and by the way this was using a 32 PCI desk top card. A low latency number is not the whole story however, processor load is another issue. The point is that tuning can make a difference. Default values are usually set for maximum throughput and low CPU overhead. It all depends on what your application needs. If you need GAMMA, then that is a good choice, but many applications may work well with proper tuning of NIC parameters. As an aside, Netgear used to sell a low cost desktop NIC (GA302T-tigon3/Broadcom) which had very good numbers as well. I profiled this NIC in the first issue of ClusterWorld. 
Doug > > The attraction of GAMMA, is that Intel Pro/1000 cards can be had for > $11 to $60 or so each (depending on exact model, etc.), and gigabit > switches are also pretty cheap, while SCI or Myrinet is somewhere in > the $500 to $1500 per node range (I don't keep track). > > So if your application can benefit from lower latency, but you want > something really cheap, GAMMA should be well worth trying. > > -- ---------------------------------------------------------------- Editor-in-chief ClusterWorld Magazine Desk: 610.865.6061 Fax: 610.865.6618 www.clusterworld.com From rokrau at yahoo.com Mon Feb 7 09:01:38 2005 From: rokrau at yahoo.com (Roland Krause) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] memory allocation on x86_64 returning huge addresses Message-ID: <20050207170138.72473.qmail@web52907.mail.yahoo.com> I am trying to dynamically allocate memory for a Fortran-77 code that is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to be allocated in one large chunk. The problem is that malloc returns an address that is way beyond 8billion which is not what I had expected. Does anybody why Linux gives me an address that is outside the physical memory range? Does anybody whether there are any kernel parameter that affect this behavior? Any pointers to some good reading about the Linux VM would also be appreciated. Regards Roland __________________________________ Do you Yahoo!? Yahoo! Mail - now with 250MB free storage. Learn more. http://info.mail.yahoo.com/mail_250 From kus at free.net Mon Feb 7 09:37:17 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] How-TO Mysql on Lam-cluster? In-Reply-To: <20050207141705.26471.qmail@webmail29.rediffmail.com> Message-ID: In message from "Rahul Manglekar" (7 Feb 2005 14:17:05 -0000): > >hi all.., > >i have setup up LAM-MPI cluster on 3 machine for testing. 
> >i want do put mysql on cluster..,, >such that if mysql need more processor power , >it can use processor power of all nodes that are present in cluster. No. Usual MySQL isn't capable to use cluster nodes in parallel. But you may work w/special software which allows to split your database between cluster nodes. You may find the corresponding information at mysql site or also search Beowulf maillist archive (if I remember right, it was some discussion of databases in cluster here). Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > >i am using MySQL-4.0. > >can u guide me please.. > >thank you in advance.. > > >-- Rahul.. From James.P.Lux at jpl.nasa.gov Mon Feb 7 09:55:34 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie Question In-Reply-To: <42077DAE.8020806@ohmsurveys.com> References: <5dc04bbf050204112319fe7fbf@mail.gmail.com> <42077DAE.8020806@ohmsurveys.com> Message-ID: <6.1.1.1.2.20050207093914.027efd38@mail.jpl.nasa.gov> At 06:39 AM 2/7/2005, Mark Westwood wrote: >Hi Monang > >Here's my contribution to your decision about which language you program >in for your cluster: > >Suppose that you know Java well, but not C. Suppose that it will take you >6 months to learn C well enough to be able to write your programs in >it. In those 6 months you can do an awful lot of computing in Java. If >your project is intended to last, say, 9 months, then you might decide >that you will program in Java because you will get more computing done >that way than by learning a new language. > >If your project will last much longer then you might decide that learning >C will be of benefit, because each program will be faster in C than in >Java. If you're doing some calculations then I'd suggest that you allow C >to be 5 times faster than Java on average for cluster-type >computing. 
Some will tell you that it is more than 10 times as fast (and >it is for some types of computation), others that it is no faster (which >is true for some types of computation). I would agree with Mark. I've been faced by a similar decision.. do we do the calculations in Excel using Visual Basic for Applications (VBA), Visual Basic, or C++, or Matlab, or something else. Various pieces of the puzzle exist in all of these, so the problem is do we translate (for example) the VB into C, or, glue it all together with scripts, or rewrite from scratch. Complicating this is that the people available to work on it have various skill sets which don't map well to any of the approaches (how many people do YOU know who are equally facile in VBA, C++, and Matlab??). In our case, the goal was to demonstrate that a particular capability can exist at all, versus making it really fly, so we went with the cobbled together scripts. It might turn out, after all, that the speed of the software isn't the "rate determining" factor, but that availability of staff is. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From rene at renestorm.de Mon Feb 7 09:53:02 2005 From: rene at renestorm.de (rene) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] mpich future Message-ID: <200502071853.02126.rene@renestorm.de> Hi folks, there are many mpi implementations out there, but which one ist "the best"? As far as I know, there are commercial prodcuts which support different hardware in one library (eg myrinet + ethernet). Which is a nice feature. Is there a working mpich which unites the common channels? Score did that once, but it's a year ago, since I've worked with it. In addition to that I've ran into trouble with the different standarts (1.2, 2.0). It seems to me that Openmpi gets more influence. Is that right? 
I dont feel like put 20 different preprocessor variables on my applications,
like
#if MPI_VERSION > 1
for each of that implementation.

So my question is:
In which direction goes mpi tomorrow?

Cu

--
Rene Storm
@Cluster

From daniel.kidger at quadrics.com Mon Feb 7 10:18:33 2005
From: daniel.kidger at quadrics.com (daniel.kidger@quadrics.com)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>

Roland,

Sigh! :-)

malloc can return any address it so wishes. Don't forget that this is a
*virtual* address and so is not bounded by physical memory. A 64-bit O/S
with say 8GB RAM can easily have stack addresses in the window
1TB - 2TB, and heap addresses even higher (!)

I guess your real problem is that you are porting a (Fortran) program
whose authors did not understand that it might ever run on a 64-bit
machine. Your code does a malloc and then tries to store this in an I4
Fortran Integer. This would only be guaranteed to work on a 32-bit
architecture like say a Pentium.

So Solutions?

1. Since this is x86_64, simply compile your program with a 32-bit
   compiler. You can still run it under the 64-bit O/S.

2. Mend your application to store addresses in I8 variables, but keep
   I4 for other stuff if you wish.

3. (dubious) Only save the lower 32 bits of the addresses in your I4
   variables and then, when they are used, add the known offset to
   yield the original 64-bit address. The offset is likely to be
   constant for all variables in your programs but ymmv.

4. Port your code away from using malloc() altogether. Recently (well
   make that 15 years), Fortran has had its own dynamic memory
   allocation: the ALLOCATE statement.

Hope this helps,
Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd.
daniel.kidger@quadrics.com One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505 ----------------------- www.quadrics.com -------------------- > -----Original Message----- > From: Roland Krause [mailto:rokrau@yahoo.com] > Sent: 07 February 2005 17:02 > To: beowulf@beowulf.org > Subject: [Beowulf] memory allocation on x86_64 returning huge > addresses > > > I am trying to dynamically allocate memory for a Fortran-77 code that > is supposed to run in I4 R4 mode on an x86_64 running SuSE-9.2 with a > kernel.org 2.6.9 kernel. The machine has 8GB memory and memory has to > be allocated in one large chunk. > > The problem is that malloc returns an address that is way beyond > 8billion which is not what I had expected. > > Does anybody why Linux gives me an address that is outside > the physical > memory range? > > Does anybody whether there are any kernel parameter that affect this > behavior? > > Any pointers to some good reading about the Linux VM would also be > appreciated. > > > Regards > Roland > > > > > __________________________________ > Do you Yahoo!? > Yahoo! Mail - now with 250MB free storage. Learn more. > http://info.mail.yahoo.com/mail_250 > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) > visit http://www.beowulf.org/mailman/listinfo/beowulf > From daniel.kidger at quadrics.com Mon Feb 7 10:34:28 2005 From: daniel.kidger at quadrics.com (daniel.kidger@quadrics.com) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies Message-ID: <30062B7EA51A9045B9F605FAAC1B4F62812105@exch01.quadrics.com> Duncan wrote (in reply to Patrick) > > The good news is that you can pipeline it very well. > > Indeed. There is lots of parallelism in the hardware > so you can me processing multiple requests at the same > time. In this sequence of short jobs I measure the > average time for 8 byte gets 2 at a time, 4 at a time > etc. 
>
> quorumi: prun -N2 pgping -f get -b2 8
>   1:     8 bytes     1.32 uSec     6.07 MB/s
> quorumi: prun -N2 pgping -f get -b4 8
>   1:     8 bytes     1.04 uSec     7.66 MB/s
> quorumi: prun -N2 pgping -f get -b8 8
>   1:     8 bytes     0.84 uSec     9.47 MB/s
> quorumi: prun -N2 pgping -f get -b16 8
>   1:     8 bytes     0.82 uSec     9.79 MB/s
> quorumi: prun -N2 pgping -f get -b32 8
>   1:     8 bytes     0.79 uSec    10.18 MB/s

Or for those that distrust quoting pure powers of two in benchmarks
and/or know too much bash:

[dan@quorumi]$ for ((i=1,j=1;$i<999;i=$i+$j,j=$i)) ;do echo -ne "pipelining \t$i:\t"; prun -N2 pgping -f get -b$i 64|cut -c20-35; done
pipelining      1:      2.39 uSec
pipelining      2:      1.38 uSec
pipelining      3:      1.24 uSec
pipelining      5:      1.05 uSec
pipelining      8:      0.92 uSec
pipelining      13:     0.91 uSec
pipelining      21:     0.86 uSec
pipelining      34:     0.80 uSec
pipelining      55:     0.78 uSec
pipelining      89:     0.79 uSec
pipelining      144:    0.77 uSec
pipelining      233:    0.73 uSec
pipelining      377:    0.77 uSec
pipelining      610:    0.78 uSec
pipelining      987:    0.77 uSec

Note that the above is for 64 *byte* reads which iirc is what Vincent
was targetting.

Daniel.

--------------------------------------------------------------
Dr. Dan Kidger, Quadrics Ltd. daniel.kidger@quadrics.com
One Bridewell St., Bristol, BS1 2AA, UK 0117 915 5505
----------------------- www.quadrics.com --------------------

From lindahl at pathscale.com Mon Feb 7 10:41:08 2005
From: lindahl at pathscale.com (Greg Lindahl)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] memory allocation on x86_64 returning huge addresses
In-Reply-To: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
References: <30062B7EA51A9045B9F605FAAC1B4F62812104@exch01.quadrics.com>
Message-ID: <20050207184108.GA1364@greglaptop.internal.keyresearch.com>

> The problem is that malloc returns an address that is way beyond
> 8billion which is not what I had expected.

This e-vile hack makes it produce something lower in memory. What it
does is turns off glibc's malloc algorithm's feature that has it mmap()
large malloc()s.
Stuff into a .c, link the .o into your application.

-- greg

/* mallopt() and M_MMAP_MAX are declared in <malloc.h>; size_t in <stdlib.h>. */
#include <stdlib.h>
#include <malloc.h>

static void mem_init_hook(void);
static void *mem_malloc_hook(size_t, const void *);
static void *(*glibc_malloc)(size_t, const void *);
void (*__malloc_initialize_hook)(void) = mem_init_hook;

/* Run by glibc when malloc initialises: stop it using mmap() for large requests. */
static void mem_init_hook(void)
{
    mallopt (M_MMAP_MAX, 0);
}

From rross at mcs.anl.gov Mon Feb 7 11:07:43 2005
From: rross at mcs.anl.gov (Rob Ross)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] mpich future
In-Reply-To: <200502071853.02126.rene@renestorm.de>
References: <200502071853.02126.rene@renestorm.de>
Message-ID:

Hi Rene,

You are right that there are a decent number of MPI implementations out
there, all with their pros and cons. There is no "best" implementation,
and in fact I would say that the existence of multiple implementations
is helpful to the community by providing (a) multiple takes on how to
build these libraries, and (b) competition between the implementations
to be the "best" at what they think is most important.

I'm not sure what you mean by "trouble with the different standards"?
All implementations should at this point be striving for complete 2.0
compliance, and there are very few things from 1.x that won't work in a
2.0 compliant system (the group defining the standard went to great
pains, as do the developers, to maintain this compatibility). So you
shouldn't need those preprocessor variables. What functionality are you
finding that you need to test for?

I would say that at this time MPICH2 has as much influence as any
implementation, because it is being used as the basis for multiple Cray
platform implementations, the IBM BG/L implementation, the OSU IB
implementation, and of course as-is on Windows, OS X, and Linux
clusters. Of course I am part of the MPICH2 team, so I am biased :).

OpenMPI will undoubtedly be an influential member of the MPI community
once the software is made widely available.
That group also has a collection of developers with very good track records in this area, and I look forward to being able to compare and contrast the designs and resulting performance. The big buzz in the MPI world right now is fault tolerance. I think this topic is going to be a hot one for some time, and there are definitely differences of opinion on how the MPI implementation should deal with faults and to what degree and how users should be made aware of failures, both transient and catastrophic. Less visible, but at least as important, is figuring out how best to implement the one-sided (RMA) operations that are part of MPI 2.0. My colleague Rajeev Thakur has (in my opinion) done an excellent job of these, building in part on concepts from the BSP system of old. Figuring out how to make collectives as efficient as possible on new, very large machines is also extremely important for those that have access to these new machines. Gheorghe Almasi from IBM had an excellent paper discussing collectives on the BG/L machine in last year's EuroPVM/MPI conference. Rolf Rabenseifner and Jesper Traff both presented improvements to collective algorithms as well. These two were iterative improvements I'd say, so less exciting in some sense, but it is critical that we make these algorithms as efficient as possible, given the scale of upcoming systems. If you are really interested in what is happening in MPI, the best place by far to look is the EuroPVM/MPI series of conferences and their proceedings. This is where everyone that is serious about MPI implementations is publishing and going to talk with colleagues, and every year the conference attendee list is literally a list of the most knowledgable MPI developers in the world (and hangers-on such as myself). Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 7 Feb 2005, rene wrote: > there are many mpi implementations out there, but which one ist "the best"? 
> As far as I know, there are commercial prodcuts which support different > hardware in one library (eg myrinet + ethernet). Which is a nice feature. > > Is there a working mpich which unites the common channels? > Score did that once, but it's a year ago, since I've worked with it. > > In addition to that I've ran into trouble with the different standarts (1.2, > 2.0). > It seems to me that Openmpi gets more influence. Is that right? > > I dont feel like put 20 different preprocessor variables on my applications, > like > #if MPI_VERSION > 1 > for each of that implementation. > > So my question is: > In which direction goes mpi tomorrow? > > Cu > > -- > Rene Storm > @Cluster From mwill at penguincomputing.com Mon Feb 7 11:11:46 2005 From: mwill at penguincomputing.com (Michael Will) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] How-TO Mysql on Lam-cluster? In-Reply-To: References: Message-ID: <4207BD72.1070505@penguincomputing.com> MySQL-4.1 has cluster support according to http://dev.mysql.com/downloads/cluster/ but I have not checked out how and what. In any case I would expect it to NOT use MPI for anything. Michael Mikhail Kuzminsky wrote: > In message from "Rahul Manglekar" (7 Feb > 2005 14:17:05 -0000): > >> >> hi all.., >> >> i have setup up LAM-MPI cluster on 3 machine for testing. >> >> i want do put mysql on cluster..,, such that if mysql need more >> processor power , it can use processor power of all nodes that are >> present in cluster. > > No. Usual MySQL isn't capable to use cluster nodes in parallel. > But you may work w/special software which allows to split your > database between cluster nodes. You may find the corresponding > information at mysql site or also search Beowulf maillist archive > (if I remember right, it was some discussion of databases in cluster > here). > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow > >> >> i am using MySQL-4.0. >> >> can u guide me please.. >> >> thank you in advance.. 
>> >> >> -- Rahul.. > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From rokrau at yahoo.com Mon Feb 7 12:18:00 2005 From: rokrau at yahoo.com (Roland Krause) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] memory allocation on x86_64 returning huge addresses In-Reply-To: <20050207184108.GA1364@greglaptop.internal.keyresearch.com> Message-ID: <20050207201800.97502.qmail@web52909.mail.yahoo.com> Greg, thanks a lot for this hint. I will try it. Quick question: So this will let me sbrk all the available memory then? Is there a way to tell it to allocate all available memory with mmap? I used to hack the kernel and change TASK_UNMAPPED_BASE in the kernel in order to get all memory from the box in one large chunk. I guess I should have instead lowering it raised the value. I really would like to actually find some docs about this... Again thanks! Roland --- Greg Lindahl wrote: > > The problem is that malloc returns an address that is way beyond > > 8billion which is not what I had expected. > > This e-vile hack makes it produce something lower in memory. What it > does > is turns off glibc's malloc algorithm's feature that has it mmap() > large > malloc()s. Stuff into a .c, link the .o into your application. > > -- greg > > #include > #include > > static void mem_init_hook(void); > static void *mem_malloc_hook(size_t, const void *); > static void *(*glibc_malloc)(size_t, const void *); > void (*__malloc_initialize_hook)(void) = mem_init_hook; > > static void mem_init_hook(void) > { > mallopt (M_MMAP_MAX, 0); > } > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > __________________________________ Do you Yahoo!? Yahoo! 
Mail - Helps protect you from nasty viruses. http://promotions.yahoo.com/new_mail From rene at renestorm.de Mon Feb 7 13:56:27 2005 From: rene at renestorm.de (rene) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] How-TO Mysql on Lam-cluster? In-Reply-To: <4207BD72.1070505@penguincomputing.com> References: <4207BD72.1070505@penguincomputing.com> Message-ID: <200502072256.27614.rene@renestorm.de> HI, > MySQL-4.1 has cluster support according to > http://dev.mysql.com/downloads/cluster/ As far as I know they used the nbd daemon to generate the db-nodes, but every node has the full database access. It isnt shared over the disks. Just in case you have a really huge db. http://www.emicnetworks.com/ has an own implementation too -- Rene Storm @Cluster From rene at renestorm.de Mon Feb 7 15:37:05 2005 From: rene at renestorm.de (rene) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] mpich future In-Reply-To: References: <200502071853.02126.rene@renestorm.de> <200502072246.02830.rene@renestorm.de> Message-ID: <200502080037.05409.rene@renestorm.de> Hi Rob, > The MQbench project does look interesting. Sort of a GUI version of > SkaMPI? It's something like the Pallas benchmark. But there aren't all Mpi Calls implemented yet. But its nice to choose a bunch of nodes and then a second one in the same application and see the differences. Its ordinary C mpi surrounded by a C++ Qt gui. > If you write an MPI program, it should work with all MPI implementations > (modulo missing MPI-2 features). It will not necessarily cleanly link > with any arbitrary library, in the same way that a C program will not > dynamically link with any arbitrary C library. > > So there is always going to be an issue of recompilation; is that your > second concern? Yes it is. Its probably possible to make software packages available for common linux distributions. But if you have to consider several mpi implementations extra, that could be lot of packages. 
I've you have written a major application like ls-dyna you can say: Take these linux, these compiler and this mpi and you get our compiled version, but nobody will alter their cluster for an add-on program like a mpi-copy tool. So the only choice is to go opensource. But in some areas it is important that (small or large) companies make professional, supported software available. But this isn't easy with mpi. Regards, Rene From hasan at grant.phys.subr.edu Mon Feb 7 18:15:25 2005 From: hasan at grant.phys.subr.edu (Saleem Hasan) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Newbie question on mpich2 installation Message-ID: Hello all, I apologise for what may be a very simple issue but is giving me trouble. I would really appreciate some advice. For learning the setup of a cluster, I have installed mpich2 on a linux machine with Red Hat 8.0. I have a second machine RH 8.0. w2 is the master and w1 is the slave. I have installed mpich2 on w2 (/home/mpi) and used nfs to share /home with w1. I have also setup passwordless ssh between w1 and w2. I am able to bring up mpd on the local machine (w2) and do mpdtrace and mpdallexit. I am following the installation procedure from the MPICH2 home. I am unable to boot mpd on the slave. The first time I ran mpdboot -n 2 -f /home/mpi/mpd.hosts, I got the message that there was no mpd.conf file in w1 and that could be a reason for the mpd not coming up the slave. I added an mpd.conf (secretword) to /etc in the slave also. 
Now I get a different message

[root@w2 mpich2-1.0]# mpdboot -n 2 -f /home/mpi/mpd.hosts
mpdboot_w2.maverick.net_0 (mpdboot 357): error trying to start mpd(boot) at 1 w1.maverick.net; output:
mpdboot_w1_1 (err_exit 379): mpd failed to start correctly on w1
reason: 1: invalid msg from mpd :{}:
mpdboot_w1_1 (err_exit 385): contents of mpd logfile in /tmp:
logfile for mpd with pid 1654
mpdboot_w2.maverick.net_0 (err_exit 379): mpd failed to start correctly on w2.maverick.net

Even though the message says mpd failed to start correctly on w2 (last
line), mpdtrace gives w2.

The log file in w1 (slave) states the following

logfile for mpd with pid 1654
w1_1060 failed ; cause: unable to obtain socket for rhs in ring
traceback: [('/home/mpi/mpich2-install/bin/mpd.py', '1192', '_enter_existing_ring'), ('/home/mpi/mpich2-install/bin/mpd.py', '173', '_mpd_init'), ('/home/mpi/mpich2-install/bin/mpd.py', '1374', '?')]

Thank you very much.

Saleem Hasan

From list-beowulf at onerussian.com Tue Feb 8 06:16:19 2005
From: list-beowulf at onerussian.com (Yaroslav Halchenko)
Date: Sat Jul 4 01:03:50 2009
Subject: [Beowulf] cheap 48 port gigabit ethernet switch w/ jumbo frames?
In-Reply-To: <41EFF701.60905@pa.msu.edu>
References: <200501201700.j0KH0PfQ032360@bluewest.scyld.com> <41EFF701.60905@pa.msu.edu>
Message-ID: <20050208141619.GM2996@washoe.rutgers.edu>

In my latest research on switches I found the SMC8648T (48 ports),
which supports 9K jumbo frames, costs $2400, and is managed.

Does anyone have experience with it, or should I also check out
Nortel switches, which are approx. 50% more expensive?

--
Yarik

On Thu, Jan 20, 2005 at 01:22:57PM -0500, Tom Rockwell wrote:
> Hi,

> I'm looking for a switch that will be used for NFS traffic on a cluster
> of about 40 nodes. The nodes will have Broadcom 5704 ethernet. From
> what I've read, jumbo frames is important for getting the best NFS
> performance over gigabit ethernet.
> D-link and Netgear have newer 48 port switches priced below managed > switches. The D-link is model DGS-1248T > http://dlink.com/products/?sec=2&pid=367 and the Netgear is model GS748T > http://netgear.com/products/details/GS748T.php. Each are about $1200 or > so. I'm unable to find info on their websites specifying whether these > switches support jumbo frames. Anyone know? > Thanks, > Tom Rockwell > Michigan State University > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf -- .-. =------------------------------ /v\ ----------------------------= Keep in touch // \\ (yoh@|www.)onerussian.com Yaroslav Halchenko /( )\ ICQ#: 60653192 Linux User ^^-^^ [175555] Key http://www.onerussian.com/gpg-yoh.asc GPG fingerprint 3BB6 E124 0643 A615 6F00 6854 8D11 4563 75C0 24C8 -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: Digital signature Url : http://www.scyld.com/pipermail/beowulf/attachments/20050208/759c457c/attachment.bin From rossen at VerariSoft.Com Tue Feb 8 06:23:57 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> Message-ID: <4208CB7D.6070309@verarisoft.com> Vincent, Your questions related to the actual cost (in terms of processor overhead) of achieving the latency numbers that are posted by the network vendors are very interesting and have important aspects, which are often overlooked or paid little attention to. Warning: This posting is long and may be boring. 
The ping-pong tests that are often used for measuring the communication latency (from user level) are an extreme and often unrealistic mode of operation of the parallel system. Sending bytes across the software layers and over the network is a fundamental factor for contributing to fast computation but without looking at the cost and the likelihood (as Patrick mentioned "crossing the fingers") of producing the best quoted latencies, you don't usually get the whole picture. Besides the network hardware/firmware, the implementation (and use) of the low-level network messaging layer (GM, ELAN, VERBS, etc) and the MPI library are also of a big importance. The design space of parallel applications is quite large (size of messages, frequency of messages, regularity in space and time, synchrony, communication pattern, etc) in order to hope that any single mode of the entire system would be always optimal. In this regard, the ping-pong latency test, exercising only one of these modes, obviously gives you insufficient information on how to predict the behavior of the communication sub-system in realistic scenarios. In order to address this issue, our MPI/Pro implementation (plug!) has long had different modes of using the network and the low-level messaging layer for all major high-speed networks as well as for TCP/IP communication. We usually support at least 2 modes - one that optimizes short message latency (as many of the other MPI implementations do), at the expense of increased CPU overhead, and one that trades some latency (communication overhead) for low CPU overhead, higher predictability, and much better opportunity for overlapping and pipelining. We have carried out studies for quantifying the degree of overlapping that these different modes can achieve (using only our MPI implementation, e.g., comparing apples to apples) and we have obtained some interesting results. 
When you combine all of the complexities of the communication sub-system (network hardware/firmware, messaging layer, MPI library), the application, and the OS (let's only take the virtual memory system, process/thread scheduling, and interrupt/signal handling) you get a highly probabilistic system, which is hard to quantify and predict by a single ping-pong latency number. Our experiments have shown that using a different MPI/Pro mode on the same application code, executed on the same parallel system, can yield sometimes substantially different performance results. This shows that the implementation and the use of the middleware alone can have a substantial impact on your performance and scalability. Further, the application code can be written (not always but often) to take advantage of asynchrony, pipelining, and overlapping. Implementing these mechanisms in your code (using MPI) often doesn't cost much, but can speed up your application quite a bit on many parallel systems (running middleware with the right design) and in the worst case give you no benefit (on systems that don't provide adequate support for these mechanisms). So, if you really want to optimize the use of your cluster resources, in addition to the network and compute nodes, you will need to also consider the communication middleware and the design of your application and how they all work together. -- Rossen Dimitrov Verari Systems Software, Inc. http://www.verarisoft.com Vincent Diepeveen wrote: > At 21:27 5-2-2005 -0500, Patrick Geoffray wrote: > >>Hi Vincent, >> >>Vincent Diepeveen wrote: >> >>>>>CPU's are 100% busy and after i know how many times a second the network >>>>>can handle in theory requests i will do more probes per second to the >>>>>hashtable. The more probes i can do the better for the game tree search. >>>> >>>>With a gigE network that sounds like 40us or so. With Myrinet or IB >>>>it's in the 4-6us range. 
If you bought dual opterons with the special >>> >>> >>>At the quadrics and dolphin homepage they both claim 12+ us for Myrinet. >> >>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), >>that includes fibers and a switch in the middle: >> >> Length Latency(us) Bandwidth(MB/s) >> 0 2.684 0.000 >> 1 2.874 0.336 >> 2 2.898 0.690 >> 4 2.978 1.343 >> 8 2.965 2.699 >> 16 2.993 5.347 >> 32 3.409 9.388 >> 64 3.563 17.960 >> 128 3.977 32.185 >> 256 5.699 44.916 >> >>Quadrics would be lower by a 1.5 us, I don't know about Dolphin, I >>didn't hear about noticeable SCI clusters in a long time. >> >> >>>I am very impressed by the quadrics and dolphin cards. Probably by >>>infinipath too when i check them out. Will do. >>> >>>I'm not so impressed yet by myrinet actually, but if cluster builders can >>>earn a couple of hundreds of dollars more on each node i'm sure they'll > > do it. > >>I don't think Myrinet would be the cheapest, I am sure you can get a >>better deal from desperate interconnect vendors. >> >>What does not impress you in Myrinet ? > > > Thanks for your kind answer Patrick, > > Obviously i mentionned that number because i read it elsewhere. > > Well a number of points bother my mind from which majority is true for > others as well. But first let me note that i'm not against myrinet in > general. I am just trying to solve a very specific case. For that specific > case i'm not so impressed. > > Note that so far i didn't find any desperate vendor. For sure quadrics > doesn't look desperate to me, they aren't even selling old cards anymore > though they must have still thousands of them lying at home from returned > upgraded networks. Finding second hand highend cards seems to be very seldom. > > First of all i'm interested in how quick i can get 4-64 bytes from remote > memory. So not from some kind of network card cache, as myrinet doesn't > have some megabytes on chip, but just a few tens of kilobytes. 
The memory > has to come therefore from the remote nodes main memory, at a random adress > in the main memory. No streaming at all happens. that 400 ns extra that the > TLB gives is definitely not the problem i guess. > > The problem for me is to understand: "how do you get that memory at a > cluster?" > > A latency on paper says of course nothing when you can't actually get it > within that time. > > "Paper supports everything." > Arturo Ochoa (Caracas, Venezuela) > > I hope everyone realizes that an important consequence from beowulf > clusters is that you actually want to *use* all those cpu's you have to > your avail. > > So every cpu has a program running that eats 100% system time. Because if > it wouldn't use 100% system time, you wouldn't need a cluster! > >>From that 100% system time obviously you must be prepared to give away some > to serve other nodes as quickly as possible doing a read. > > All latencies i see quoted at all hardware sites, it is very hard to figure > for me out whether that's a latency that is supported by paper, or whether > it's a practical latency i can take into account as a programmer with all > software layers overhead when each cpu is 100% running a program. > > Secondly, but as i'm not a cluster expert i don't know how to avoid that, > it's of course a big LOSS in sequential speed if my program each few > instructions must check whether there is some MPI message to get handled. > If i check a lot that will slow down my program 20 times. If i don't check > a lot, other cpu's will have to wait longer and that defeats the purpose of > a fast network card. > > Factor 20 is about the slowdown of the average 'old' supercomputer > chessprograms which use MPI type solutions. Zugzwang (Paderborn-Siemens), > P.Conners (Paderborn-Siemens), cilkchess (MIT). 
I've been playing with my > own eyes against those programs in world champs and despite that it has > happened that i played at the same hardware with a similar amount of cpu's > and a program having factor 100 more chessknowledge (which slows down the > program *considerable*), the actual speed at which the program searches > nodes was up to factor 5-10 faster. > > Now a few years ago this was not a major problem because for example > Cilkchess which obviously ran factor 20-40 times slower than it could, used > 1800 processors for example in world champs 1995 (Hong kong) and 512 > processors in world champs 1999 (Paderborn). Of course because 1 processor > was real real fast compared to the speed of 1 pc processor in those days, > they practical were searching a lot deeper than pc programs (and both > played excellent for its days, especially Don Dailey needs to get a big > compliment for that). > > However if i show up with 2 pc's and 2 network cards, then it sure matters > when i lose a lot of speed. > > Obviously for embarassingly parallel software this is no issue, but usually > for embarrassingly parallel software all you need is gigabit ethernet. > > There is so many MPI applications which are not exactly embarassingly > parallel from which you see that a decent programmer single cpu would be > doing that 20 times faster. Or to quote someone who has been doing such > rewriting work for some physical applications that run here and there: "I > didn't blink my eyes when i managed to speedup an application factor 1000". > > So it is very interesting for us all and me especially to understand how > *fast* you can get that memory under full load of all the logical cpu's. > > Third each pc has 2 cheapo k7 processors which are a lot slower than opterons. > > Second problem i have is that i can get easily dual k7 pc's from > chessplayers and they can get bought cheap still. 
Dual k7 is practical same > speed like a dual xeon 3.06Ghz Northwood with all memory slots filled with > 2-2-2 DIMMS for DIEP. So just compare the price of such a system with a > cheapo dual k7 with registered cas3 RAM. > > Those dual k7's have 64 bits 66Mhz slots, not pci-x as far as i know and > also those who do have A64's or P4's usually don't have pci-x onboard > either. Sure there is boards that have them and i'm sure that if you make a > network > > Dolphin can deliver 'bytes' they say at their homepage in 3.3 us at MPX > mainboards and claim somewhere a paper latency of 1.x us. > > What is the achieved read speed to remote memory myrinet gets at 64 bits / > 66Mhz in software, so ready to use 4-64 bytes for applications? > > I'm not asking it to be accurate within 400ns, as that's the delay you'll > have from TLB trashing the remote node. But accuracy within 1.5 us would be > quite nice. > > First of all for integer intensive applications i'm doing fastest processor > is opteron, k7 comes second and P4 comes third. Exception is a P4 machine > equipped with the most expensive stuff (2-2-2 ram and all banks filled) > good mainboard and northwoods and overclocked at the mainboard. However for > that price a dual opteron can get bought and it just blows away that P4 > bigtime. > > Every year that new software gets released of course that P4 gets slower, > because newer software only gets more and more complex with more options > and will fit less perfectly in P4's small tiny caches, let alone when we > get a lot of 64 bits programs. They won't fit at all in those tiny slow > caches. > > So until the dual core opterons arrive at low cost, obviously you can make > dual k7 nodes for just a few hundreds of dollar a node. > > When adding new nodes which in the future no doubt are dual opteron, you > still run further with those dual k7 nodes and want to mix them obviously > with dual opterons. Is that possible? 
> > > > >>Patrick >>-- >> >>Patrick Geoffray >>Myricom, Inc. >>http://www.myri.com >> >> > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From josip at lanl.gov Tue Feb 8 08:54:07 2005 From: josip at lanl.gov (Josip Loncaric) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <4208CB7D.6070309@verarisoft.com> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <4208CB7D.6070309@verarisoft.com> Message-ID: <4208EEAF.105@lanl.gov> Rossen Dimitrov wrote: > > So, if you really want to optimize the use of your cluster resources, in > addition to the network and compute nodes, you will need to also > consider the communication middleware and the design of your application > and how they all work together. Are there any projects that would expand the ability of MPI application programmers to provide performance hints to the MPI library? For example, hints indicating that certain messages are latency sensitive whereas others need optimal bandwidth and low CPU overhead? One can already obtain some MPI performance data through the PMPI mechanism, and Rossen is helping develop MPI PERUSE (http://www.mpi-peruse.org/) intended to provide even more detail. I'm asking about the other direction of information flow, i.e. performance hints from the application to the MPI layer... Ideally, such hints would be propagated fairly close to the actual hardware, e.g. application hints would guide the MPI library in selecting improved interrupt mitigation strategies used by the network interfaces (assuming that a suitable API exists for the underlaying hardware). 
Sincerely, Josip From twilcox at terrascale.com Tue Feb 8 09:35:27 2005 From: twilcox at terrascale.com (Tim Wilcox) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Call for participation StorCloud at SC2005 Message-ID: <011001c50e04$9757f830$a201a8c0@deepthoughthp> Hi all, The StorCloud applications and Challenge submission form is now online at http://www.sc-submissions.org and we are currently accepting submissions the instructions are posted at http://www.vtksolutions.com/StorCloud/2005/StorCloudAppFormHelp.html. The deadline is March 31st. Tim Wilcox Applications Challenge Committee From natorro at fisica.unam.mx Tue Feb 8 10:27:39 2005 From: natorro at fisica.unam.mx (Carlos Lopez Nataren) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] G5 beowulf cluster Message-ID: <1107887259.2124.16.camel@natorro> Hello, we, at the physics institute in Mexico have got four Xserve G5 and we would like to use them as a beowulf, my first doubt is about what operating system to use, we've been using linux for our other clusters, even a G4 one, but I haven't seen anything about G5, is there a linux distribution that runs well on this type of machines? or do I better use the operating system they came with? or are there any documentation out there outlining the way they should be configured to be used as a beowulf? Thank you very much for any help. natorro -- Carlos Lopez Nataren Instituto de Fisica, UNAM From dag at sonsorol.org Tue Feb 8 10:51:25 2005 From: dag at sonsorol.org (Chris Dagdigian) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <1107887259.2124.16.camel@natorro> References: <1107887259.2124.16.camel@natorro> Message-ID: <42090A2D.4090204@sonsorol.org> The Mac OS X server OS that came with your Xserve G5s is quite good and you'll find all the developer tools, compilers, cluster scheduler systems like Grid Engine or LSF are all working and well supported. The scientific community of G5/Xserve users is growing quite rapidly. 
If you are familiar with Linux the learning curve for OS X is not all that bad. http://www.apple.com/science/ -- may help http://www.apple.com/server/macosx/ -- also You may want to make the OS decision based on what physics apps you need to run and how *they* are supported on OS X vs Linux. I can't help you there as I'm a life sciences / biology person. In our lab we've installed both Gentoo Linux as well as Yellow Dog on Xserve G5s. Both seemed to install and run smoothly but for production clustering work we still use the OS X Server OS. -Chris Carlos Lopez Nataren wrote: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, > even a G4 one, but I haven't seen anything about G5, is there a linux > distribution that runs well on this type of machines? or do I better use > the operating system they came with? or are there any documentation out > there outlining the way they should be configured to be used as a > beowulf? > > Thank you very much for any help. > natorro > -- Chris Dagdigian, BioTeam - Independent life science IT & informatics consulting Office: 617-665-6088, Mobile: 617-877-5498, Fax: 425-699-0193 PGP KeyID: 83D4310E iChat/AIM: bioteamdag Web: http://bioteam.net From hahn at physics.mcmaster.ca Tue Feb 8 11:26:32 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <1107887259.2124.16.camel@natorro> Message-ID: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, I would strongly encourage you to try both Linux and Mac OS/X because doing so would permit a VERY interesting and useful comparison. 
the comparison is interesting because Mac OS/X is tuned very differently from Linux - even more than the differences it inherits from its *BSD heritage. for instance, if you run LMBench on the two machines, you'll see that certain syscalls are drastically different in speed. obviously, Macophiliacs and Apple sales reps would be scandalized at this idea. but the truth is that the Xserve hardware is reasonably competive with dual-xeon alternatives, but in a cluster, no one really cares about PDF imaging models or other traditional Apple qualities. what matters is things like TCP stack efficiency, syscall overheads, etc. > even a G4 one, but I haven't seen anything about G5, is there a linux > distribution that runs well on this type of machines? or do I better use I've heard of yellowdog linux; there are probably many other flavors (perhaps even a fedora version?). ultimately, the distro is almost irrelevant to a cluster, since it's the kernel, booting and FS that matter, not .999 of userspace. incidentally, I've measured the power consumption of a ppc970fx (90nm, 2.0 GHz) system, under load, and found it to be marginally cooler than, say a similar-speed HP DL145 (dual-opteron). we're talking 200 vs 220W. this is old news; what's new is that up-coming 90nm Opterons appear to change the picture fairly dramatically, since the drop the TDP from 89 to 65W. and of course, for those of you who are cache-friendly, dual-core opterons at 95W TDP is rather attractive. (ie, dual ppc970/2.0's with a 3.2 GB/s apiece vs four DC opteron/2.2's with 3.2 GB/s apiece. 100W/p vs maybe 60W/p, hmmm.) regards, mark hahn. 
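The per-processor power figures above can be reproduced with a little arithmetic. This is only a back-of-envelope sketch: it assumes (my assumption, not something measured) that the rest of the Opteron node draws the same power after the CPUs are swapped, and that a change in TDP shows up one-for-one in system power. The function and variable names are made up for illustration.

```python
# Back-of-envelope watts-per-processor figures from the numbers quoted above.
# Assumption: non-CPU power in the Opteron node stays the same when the CPUs
# are swapped, and TDP differences translate one-for-one into system power.

def watts_per_processor(system_watts, processors):
    """Average system power attributed to each processor (or core)."""
    return system_watts / processors

# Dual ppc970fx node, measured at about 200 W under load.
ppc970_per_p = watts_per_processor(200.0, 2)

# Dual-socket Opteron node, measured at about 220 W with 89 W TDP parts.
# Replace both parts with dual-core chips at 95 W TDP: four cores per node.
opteron_dc_system = 220.0 - 2 * 89.0 + 2 * 95.0
opteron_dc_per_p = watts_per_processor(opteron_dc_system, 4)

print(f"dual ppc970 node:       {ppc970_per_p:.0f} W/p")
print(f"dual-core Opteron node: {opteron_dc_per_p:.0f} W/p")
```

Under those assumptions the dual-core node lands a little under the "maybe 60W/p" estimate, while the ppc970 node sits at 100 W/p.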
From dtj at uberh4x0r.org Tue Feb 8 11:35:49 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <1107887259.2124.16.camel@natorro> References: <1107887259.2124.16.camel@natorro> Message-ID: <1107891349.5042.10.camel@terra> On Tue, 2005-02-08 at 12:27 -0600, Carlos Lopez Nataren wrote: > Hello, we, at the physics institute in Mexico have got four Xserve G5 > and we would like to use them as a beowulf, my first doubt is about what > operating system to use, we've been using linux for our other clusters, > even a G4 one, but I haven't seen anything about G5, is there a linux > distribution that runs well on this type of machines? or do I better use > the operating system they came with? or are there any documentation out > there outlining the way they should be configured to be used as a > beowulf? > The native OSX should be fine. It sort depends on the applications that you intend on using. Lots of the major apps seem to have efforts to make them work well on the altivec machines. There was a problem, something about semaphores I believe, that caused problems with MPI apps. I ran into it trying to get benchmark numbers for Amber and Gromacs on 4 G5 towers that I was playing with. -Dean From idooley at isaacdooley.com Tue Feb 8 13:55:12 2005 From: idooley at isaacdooley.com (Isaac Dooley) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] G5 beowulf cluster In-Reply-To: <200502082000.j18K09iS022564@bluewest.scyld.com> References: <200502082000.j18K09iS022564@bluewest.scyld.com> Message-ID: <42093540.6010801@isaacdooley.com> I've used the new ~600 node G5 Xserve cluster named turing: http://www.cse.uiuc.edu/turing/ It works, and is using OSX 10.3. I've used YellowDog Linux personally on a few G3 and G4 machines, and have had good experiences. If you want to do very fine grained parallel computation, one important thing to do is to disable all unneeded system daemons. 
There are a bunch of these in YDL and OSX. Also, depending on your needs for 64-bit addressing, you may need YDL until OSX 10.4 is released (it is available to developers now if you really want it). I'm not sure if you can disable the GUI for OSX, which may be a minor resource waster. Also you may want to consider Darwin without OSX. Darwin is the open source kernel used by OSX. One thing we've noticed with our OSX is that connect() sometimes takes too long to complete. Hopefully I can figure out why this is. Isaac Dooley >Hello, we, at the physics institute in Mexico have got four Xserve G5 >and we would like to use them as a beowulf, my first doubt is about what >operating system to use, we've been using linux for our other clusters, >even a G4 one, but I haven't seen anything about G5, is there a linux >distribution that runs well on this type of machines? or do I better use >the operating system they came with? or are there any documentation out >there outlining the way they should be configured to be used as a >beowulf? > > From ole at scali.com Wed Feb 9 04:24:15 2005 From: ole at scali.com (Ole W. Saastad) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <200502070951.j179oSDB010742@bluewest.scyld.com> References: <200502070951.j179oSDB010742@bluewest.scyld.com> Message-ID: <1107951855.5682.31.camel@pc-2.office.scali.no> Dear all, this thread reminded us that we promised to post HPCC numbers depicting differences between interconnects, not interconnects and software stacks in combination. The numbers below stem from a fairly old system (400MHz FSB, PCI-X, etc.) and do not reflect the absolute performance achievable on modern hardware. Similarly, the NICs used are _not_ the latest and greatest. The intent is simply to show the effect of different interconnects on the four simple (excluding PTRANS etc) communication metrics measured by HPCC (see web page http://icl.cs.utk.edu/hpcc/).

                          Gigabit Eth.   SCI       Myrinet   InfiniBand
Max Ping Pong Latency :   36.32          4.44      8.65      7.36
Min Ping Pong Bandw.  :   117.01         121.31    245.31    359.21
Random Ring Bandw.    :   37.59          47.70     69.30     18.02
Random Ring Latency   :   42.17          8.91      19.02     9.94

Latency in microseconds and bandwidth in MBytes/s (1e6 bytes/s). The HPCC version is 0.8 and the very same binary (and Scali MPI Connect library) is used for all interconnects (change of interconnect is done by -net tcp|sci|gm0|ib0 on the command line).

Cluster information :
16 x Dell PowerEdge 2650 2.4 GHz
Dell PowerConnect 5224 GBE switch.
Mellanox HCA
Infinicon InfiniIO 3000
Myrinet 2000
Dolphin SCI 4x4 Torus

Scali MPI Connect version : scampi-3.3.7-2.rhel3
Mellanox IB driver version : thca-linux-3.2-build-024
GM version : 2.0.14

-- Ole W. Saastad, Dr.Scient. Manager Cluster Expert Center dir. +47 22 62 89 68 fax. +47 22 62 89 51 mob. +47 93 05 74 87 ole@scali.com Scali - www.scali.com High Performance Clustering From joachim at ccrl-nece.de Thu Feb 10 01:15:54 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <420580AD.5050003@myri.com> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com> Message-ID: <420B264A.7050004@ccrl-nece.de> Patrick Geoffray wrote: > Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), > that includes fibers and a switch in the middle: > > Length Latency(us) Bandwidth(MB/s) > 0 2.684 0.000 [...] Nice work, Patrick - but such numbers are of little value if the benchmark used to get them is not stated. I'd recommend mpptest (from MPICH). Plus, the compiler etc. is also of interest when it comes to latencies.
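To make "state the benchmark" concrete, here is a minimal sketch of what a ping-pong latency test does. It is not mpptest and it does not use MPI: as a stand-in it bounces messages between two threads over a local socket pair from the Python standard library, so the numbers it prints say nothing about any interconnect. The names recv_exact, echo and ping_pong_latency_us are invented for this illustration.

```python
import socket
import threading
import time

def recv_exact(sock, size):
    """Read exactly `size` bytes from a stream socket."""
    chunks = []
    remaining = size
    while remaining:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def echo(sock, size, iterations):
    """Pong side: send every received message straight back."""
    for _ in range(iterations):
        sock.sendall(recv_exact(sock, size))

def ping_pong_latency_us(size=8, iterations=200):
    """One-way latency estimate: half the mean round-trip time, in microseconds."""
    ping, pong = socket.socketpair()
    worker = threading.Thread(target=echo, args=(pong, size, iterations))
    worker.start()
    payload = b"x" * size
    start = time.perf_counter()
    for _ in range(iterations):
        ping.sendall(payload)
        recv_exact(ping, size)
    elapsed = time.perf_counter() - start
    worker.join()
    ping.close()
    pong.close()
    return elapsed / iterations / 2 * 1e6

# Report the message length next to each figure, as in the tables in this thread.
for length in (1, 8, 64, 256):
    print(f"{length:6d} bytes  {ping_pong_latency_us(size=length):10.2f} us")
```

Even in this toy the result depends on the iteration count, the message size, and what else the machine is doing, which is why the benchmark and its settings need to be stated alongside the numbers.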
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From joachim at ccrl-nece.de Thu Feb 10 01:28:27 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <4208EEAF.105@lanl.gov> References: <3.0.32.20050206043617.0100dd80@pop.xs4all.nl> <4208CB7D.6070309@verarisoft.com> <4208EEAF.105@lanl.gov> Message-ID: <420B293B.9060604@ccrl-nece.de> Josip Loncaric wrote: > Are there any projects that would expand the ability of MPI application > programmers to provide performance hints to the MPI library? For > example, hints indicating that certain messages are latency sensitive > whereas others need optimal bandwidth and low CPU overhead? MPI offers a lot of different send modes already. If you use a ready send, the MPI library can assume that you are interested in low-latency delivery; if you use a non-blocking send, it should be o.k. for the library to assume that you are interested in overlapping computation and communication and so on. On the receiving side, a hybrid polling-blocking approach for receiving can be applied. I do not think that there is serious demand for more explicit "steering" of the MPI library. Users make much too little use of the existing ways (that I described above). But, if you really want to do such stuff, you could use (implementation-specific) attributes which you assign to different communicators, one for "low-latency" delivery and one for "low-cpu", or whatever. But this has more effect on the sending side than on the receiving side. I wouldn't invest work into this unless you have very good reasons. Esp. as this would be non-portable, few users would ever take notice.
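The hybrid polling-blocking receive mentioned above can be sketched as follows. This is an illustration of the idea only, not how any particular MPI library implements it: a standard-library queue stands in for the incoming message channel, and the name hybrid_receive and the spin interval are assumptions chosen for the example.

```python
import queue
import threading
import time

def hybrid_receive(inbox, spin_seconds=0.0005, timeout=None):
    """Poll briefly for low latency, then block so the CPU is released.

    Returns (message, how), where how is "polled" or "blocked".
    """
    deadline = time.perf_counter() + spin_seconds
    while time.perf_counter() < deadline:
        try:
            # Polling phase: burns CPU, but sees a message almost immediately.
            return inbox.get_nowait(), "polled"
        except queue.Empty:
            pass
    # Blocking phase: the thread sleeps until a message arrives.
    return inbox.get(timeout=timeout), "blocked"

inbox = queue.Queue()

# A message that is already waiting is picked up in the polling phase.
inbox.put("early")
print(hybrid_receive(inbox))

# A message that arrives later than the spin interval is received after blocking.
threading.Timer(0.2, inbox.put, args=("late",)).start()
print(hybrid_receive(inbox, timeout=5.0))
```

The spin interval is the tuning knob: a longer spin favours latency at the cost of CPU time, a shorter one does the opposite.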
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From landman at scalableinformatics.com Thu Feb 10 13:40:52 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] A thread-safe PRNG for an OpenMP progra Message-ID: <420BD4E4.4080401@scalableinformatics.com> Hi folks: I need to get a thread-safe pseudo-random number generator. All I have found online was SPRNG which is set up for MPI. Anyone have a quick pointer to their favorite thread safe PRNG that works well in OpenMP? Thanks. Joe -- Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com From maurice at harddata.com Thu Feb 10 12:35:06 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> Message-ID: <420BC57A.5060007@harddata.com> ---------------------------------------------------------------------- >Message: 1 >Date: Thu, 10 Feb 2005 10:15:54 +0100 >From: Joachim Worringen >Subject: Re: [Beowulf] Home beowulf - NIC latencies > >Patrick Geoffray wrote: > > >>Seriously, here are MPI latencies with MX on F cards on Opteron (PCI-X), >>that includes fibers and a switch in the middle: >> >> Length Latency(us) Bandwidth(MB/s) >> 0 2.684 0.000 >> >> >[...] > >Nice work, Patrick - but such numbers are of little value if the >benchmark used to get them is not stated. I'd recommend mpptest (from >MPICH). Plus, the compiler etc. is also of interest when it comes to >latencies. > > Joachim > > > True, but it does not change the facts. Further, all of these lovely benchmarks lack one really important detail: Comparisons between different interfaces and drivers MUST show CPU usage while running them. 
If I have a fantastic device that uses infinitely small time (latency) and moves huge amounts of data (bandwidth) but in doing so it takes 80% of a CPU, we do not have a useful solution. That is where Myrinet and Quadrics shine, and also this is the detail that the various IB vendors carefully dance around. All the communications performance in the world does not matter if it consumes a large amount of CPU cycles. A further test that some vendors artfully avoid is the actual latency of all nodes in a cluster across the switching device. I have seen a number of "benchmarks" showing great numbers, but on looking closer a great number of them are either on two computers, directly connected, or are on switching networks that use a number of small switches, and they do not show the worst case latency across all the switches, on the greater number of hops. So, your points are excellent, Joachim, but I have to say that even greater degrees of information are needed before any meaningful conclusions may be drawn. What we all need is some form of useful standardized benchmarks that looks like real world code from a number of different disciplines, that we can use to test the hardware, so we may compare results in a meaningful manner. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 From lindahl at pathscale.com Thu Feb 10 18:36:20 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:50 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420BC57A.5060007@harddata.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> Message-ID: <20050211023619.GB5174@greglaptop.internal.keyresearch.com> On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: > Further, all of these lovely benchmarks lack one really important detail: > Comparisons between different interfaces and drivers MUST show CPU usage > while running them. No. If you want to look at that, run a real application and watch the wall time. It's extremely hard to get a good estimate of cpu usage out of a microbenchmark, and running "top" or /bin/time to do it is definitely bogus. > If I have a fantastic device that uses infinitely small time (latency) > and moves huge amounts of data (bandwidth) but in doing so it takes 80% > of a CPU, we do not have a useful solution.. If large cpu usage is a problem, it will show up nicely in real application benchmarks. > What we all need is some form of useful standardized benchmarks that > looks like real world code from a number of different disciplines, that > we can use to test the hardware, so we may compare results in a > meaningful manner. Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel Benchmarks are also good, they are much closer to real apps than microbenchmarks.
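A small sketch of "watch the wall time": wrap the whole application run in a wall-clock timer and compare that number across interconnects. The toy_application below is purely a stand-in (a compute loop with a sleep in place of waiting on the network), and the names timed and toy_application are hypothetical. CPU time is printed next to wall time only to show that the two can differ a lot.

```python
import time

def timed(workload, *args):
    """Run workload(*args); return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = workload(*args)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds

def toy_application(steps, wait_seconds):
    """Stand-in for a real code: alternate a compute phase and a wait."""
    total = 0
    for _ in range(steps):
        total += sum(i * i for i in range(20000))  # compute phase
        time.sleep(wait_seconds)                   # placeholder for communication
    return total

result, wall, cpu = timed(toy_application, 5, 0.01)
print(f"wall time: {wall:.3f} s   cpu time: {cpu:.3f} s")
```

Whatever the communication layer costs, whether in waiting or in CPU cycles taken from the computation, ends up in the wall time of the full run, which is the number to compare.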
-- greg From eugen at leitl.org Fri Feb 11 06:06:43 2005 From: eugen at leitl.org (Eugen Leitl) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] more details on Cell emerge Message-ID: <20050211140642.GV1404@leitl.org> http://www.realworldtech.com/includes/templates/articles.cfm?ArticleID=RWT021005084318&mode=print By: David T. Wang (dwang@realworldtech.com) Updated: 02-10-2005 Back to Basics The fundamental task of a processor is to manage the flow of data through its computational units. However in the past two decades, each successive generation of processors for personal computers has added more transistors dedicated to increasing the performance of spaghetti-like integer code. For example, it is well known that typical integer codes are branchy and that branch mispredict penalties are expensive; in an effort to minimize the impact of branch instructions, transistors were used to develop highly accurate branch predictors. Aside from branch predictors, sophisticated cache hierarchies with large tag arrays and predictive cache prefetch units attempt to hide the complexity of data movement from the software, and further increase the performance of single threaded applications. The pursuit of single threaded performance can be observed in recent years in the proposal of extraordinarily deeply pipelined processors designed primarily to increase the performance of single threaded applications, at the cost of higher power consumption and larger transistor budgets. The fundamental idea of the CELL processor project is to reverse this trend and give up the pursuit of single threaded performance, in favor of allocating additional hardware resources to perform parallel computations. That is, minimal resources are devoted toward the execution of single threaded workloads, so that multiple DSP-like processing elements can be added to perform more parallelizable multimedia-type computations. 
In the examination of the first implementation of the CELL processor, the theme of the shift in focus from the pursuit of single threaded integer performance to the pursuit of multiply threaded, easily parallelizable multimedia-type performance is repeated throughout. CELL Basics The CELL processor is a collaboration between IBM, Sony and Toshiba. The CELL processor is expected by this consortium to provide computing power an order of magnitude above and beyond what is currently available to its competitors. The International Solid-State Circuits Conference (ISSCC) 2005 was chosen by the group as the location to describe the basic hardware architecture of the processor and announce the first incarnation of the CELL processor family. Members of the CELL processor family share basic building blocks, and depending on the requirement of the application, specific versions of the CELL processor can be quickly configured and manufactured to meet that need. The basic building blocks shared by members of the CELL family of processors are the following: * The PowerPC Processing Element (PPE) * The Synergistic Processing Element (SPE) * The L2 Cache * The internal Element Interconnect Bus (EIB) * The shared Memory Interface Controller (MIC) and * The FlexIO interface Each SPE is in essence a private system-on-chip (SoC), with the processing unit connected directly to 256KB of private Load Store (LS) memory. The PPE is a dual threaded (SMT) PowerPC processor connected to the SPE's through the EIB. The PPE and SPE processing elements access system memory through the MIC, which is connected to two independent channels of Rambus XDR memory, providing 25 GB/s of memory bandwidth. The connection to I/O is done through the FlexIO interface, also provided by Rambus, providing 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound bandwidth for total I/O bandwidth of 76.8 GB/s.
At ISSCC 2005, IBM announced that the first implementation of the CELL processor has been tested to operate at frequencies above 4 GHz. In the CELL processor, each SPE is capable of sustaining 4 FMADD operations per cycle. At an operating frequency of 4 GHz, the CELL processor is thus capable of achieving a peak throughput rate of 256 GFlops from the 8 SPE's. Moreover, the PPE can contribute some amount of additional compute power with its own FP and VMX units. Processor Overview Figure 1 - Die photo of CELL processor with block diagram overlay Figure 1 shows the die photo of the first CELL processor implementation with 8 SPE's. The sample processor tested was able to operate at a frequency of 4 GHz with Vdd of 1.1V. The power consumption characteristics of the processor were not disclosed by IBM. However, estimates in the range of 50 to 80 Watts @ 4 GHz and 1.1 V were given. One unconfirmed report claims that at the extreme end of the frequency/voltage/power spectrum, one sample CELL processor was observed to operate at 5.6 GHz with 1.4 V Vdd and consumed 180 W of power. As described previously, the CELL processor with 8 SPE's operating at 4 GHz has a peak throughput rate of over 256 GFlops. To provide the proper balance between processing power and data bandwidth, an enormously capable system interconnect and memory system interface are required for the CELL processor. For that task, the CELL processor was designed as a Rambus Sandwich, with Redwood Rambus ASIC Cell (RRAC) acting as the system interface on one end of the CELL processor, and the XDR (formerly Yellowstone) high bandwidth DRAM memory system interface on the other end of the CELL processor. Finally, the CELL processor has 2954 C4 contacts to the 3-2-3 organic package, and the BGA package is 42.5 mm by 42.5 mm in size. The BGA package contains 1236 contacts, 506 of which are signal interconnects and the remainder are devoted to power and ground interconnects.
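The 256 GFlops figure quoted above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name is made up for this example, and it assumes that each FMADD counts as two floating point operations (one multiply plus one add), which the article does not spell out.

```python
def peak_gflops(n_spe, fmadd_per_cycle, clock_ghz, flops_per_fmadd=2):
    """Peak single precision throughput of the SPE's in GFlops.

    flops_per_fmadd=2 is an assumption: a fused multiply-add is
    counted as one multiply and one add.
    """
    return n_spe * fmadd_per_cycle * flops_per_fmadd * clock_ghz


if __name__ == "__main__":
    # Figures taken from the article: 8 SPE's, 4 FMADD per cycle, 4 GHz.
    print(peak_gflops(n_spe=8, fmadd_per_cycle=4, clock_ghz=4.0))
```

With the same assumption, the 4-SPE variant speculated about later in the article would have half that peak at the same clock.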
Logic Depth, Circuit Design, Die Size and Process Shrink Figure 2 - Per stage circuit delay depth of 11 FO4 often left only 5~8 FO4 for logic flow The first incarnation of the CELL processor is implemented in a 90nm SOI process. IBM claims that while the logic complexity of each pipeline stage is roughly comparable to other processors with a per stage logic depth of 20 FO4, aggressive circuit design, efficient layout and logic simplification enabled the circuit designers of the CELL processor to reduce the per stage circuit delay to 11 FO4 throughout the entire design. The design methodology deployed for the CELL processor project provides an interesting contrast to that of other IBM processor projects in that the first incarnation of the CELL processor makes use of fully custom design. Moreover, the full custom design includes the use of dynamic logic circuits in critical data paths. In the first implementation of the CELL processor, dynamic logic was deployed for both area minimization as well as performance enhancement to reach the aggressive goal of 11 FO4 circuit delay per stage. Figure 2 shows that with the circuit delay depth of 11 FO4, oftentimes only 5~8 FO4 are left for inter-latch logic flow. The use of dynamic logic presents itself as an interesting issue in that dynamic logic circuits rely on the capability of logic transistors to retain a capacitive load as temporary storage. The decreasing capacitance and increasing leakage of each successive process generation means that dynamic logic design becomes more challenging with each successive process generation. In addition, dynamic circuits are reportedly even more challenging on SOI based process technologies. However, circuit design engineers from IBM believe that the use of dynamic logic will not present itself as an issue in the scalability of the CELL processor down to 65 nm and below.
The argument was put forth that since the CELL processor is a full custom design, the task of process porting with dynamic circuits is no more and no less challenging than the task of process porting on a design without dynamic circuits. That is, since the full custom design requires the re-examination and re-optimization of transistor and circuit characteristics for each process generation, if a given set of dynamic logic circuits become impractical for specific functions at a given process node, that set of circuits can be replaced with static circuits as needed. The process portability of the CELL processor design is an interesting topic due to the fact that the prototype CELL processor is a large device that occupies 221 mm^2 of silicon area on the 90 nm process. Comparatively, the IBM PPC970FX processor has a die size of 62 mm^2 on the 90 nm process. The natural question then arises as to whether Sony will choose to reduce the number of SPE's to 4 for the version of the CELL processor to appear in the next generation Playstation, or keep the 8 SPE's and wait for the 65 nm process before it ramps up the production of the next generation Playstation. Although no announcements or hints have been given, IBM's belief in regards to the process portability of the CEL Figure 6 - SPE pipeline diagram Table 1 - Unit latencies for SPE instructions. Figure 6 shows the pipeline diagram of the SPE and Table 1 shows the unit latency of the SPE. Figure 6 shows that the SPE pipeline makes heavy use of the forward-and-delay concept to avoid the access latency of a register file access in the case of dependent instructions that flow through the pipeline in rapid succession. One interesting aspect of the floating point pipeline is that the same arrays are used for floating point computation as well as integer multiplication.
As a result, integer multiplies are sent to the floating point pipeline, and the floating point pipeline bypasses the FP handling and computes the integer multiply. SPE Schmoo Plot Figure 7 - Schmoo plot for the SPE Figure 7 shows the schmoo plot for the SPE. The schmoo plot shows that the SPE can comfortably operate at a frequency of 4 GHz with Vdd of 1.1 V, consuming approximately 4 W. The schmoo plot also reveals that due to the careful segmentation of signal path lengths, the design is far from being wire delay limited. Frequency scaling relative to voltage continues past 1.3 V. This schmoo plot also contributes to the plausibility of the unconfirmed report that the CELL processor could operate at upwards of 5.6 GHz. "Unknown" Functional Units: ATO and RTB Oftentimes when a paper relating to a complex project is written collaboratively by a group of people, details are lost. Still, it appeared rather humorous that of the six design engineers and architects from the CELL processor project present at Tuesday evening's chat session, no one could recall what the acronyms ATO and RTB stood for. ATO and RTB are functional blocks labeled in the floorplan of the SPE. However, the functionality of these functional blocks or the meaning of the acronym were neither noted on the floorplan, nor explained in the paper, nor mentioned in the technical presentation. In an effort to cover all the corners, this author placed the question on a list of questions to be asked of the CELL project team members. Hilarity thus ensued as slightly embarrassed CELL project members stared blankly at each other in an attempt to recall the functionality or definition of the acronyms. In all fairness, since the SPE was presented on Monday and the CELL processor itself was presented on Tuesday, CELL project members responsible for the SPE were not present for Tuesday evening's chat sessions.
As a result, the team members responsible for the overall CELL processor and internal system interconnects were asked to recall the meaning of acronyms of internal functional units within the SPE. Hence, the task was unnecessarily complicated by the absence of key personnel that would have been able to provide the answer faster than the CELL processor can rotate a million triangles by 12 degrees about the Z axis. After some discussion (and more wine), it was determined that the ATO unit is most likely the Atomic (memory) unit responsible for coherency observation/interaction with dataflow on the EIB. Then, after the injection of more liquid refreshments (CH3CH2OH), it was theorized that the RTB most likely stood for some sort of Register Translation Block whose precise functionality was unknown to those outside of the SPE. However, this theory would turn out to be incorrect. Finally, after sufficient numbers of hydrocarbon bonds have been broken down into H-OH on Wednesday, a member of the CELL processor team tracked down the relevant information, and he writes: The R in RTB is an internal 1 character identifier that denotes that the RTB block is a unit in the SPE. The TB in RTB stands for "Test Block". It contains the ABIST (Array Built In Self Test) engines for the Local Store and other arrays in the SPE, as well as other test related control functions for the SPE. Element Interconnect Bus The element interconnect bus is the on chip interconnect that ties together all of the processing, memory, and I/O elements on the CELL processor. The EIB is implemented as a set of four concentric rings that are routed through portions of the SPE, where each ring is a 128 bit wide interconnect. To reduce coupling noise, the wires are arranged in groups of four and interleaved with ground and power shields. To further reduce coupling noise, the direction of data flow alternates between each adjacent ring pair.
Data travels on the EIB through staged buffer/repeaters at the boundaries of each SPE. That is, data is driven by one set of staged buffers and latched by the buffer at the next stage every clock cycle. Data moving from one SPE through other SPE's requires the use of repeaters in the intermediary SPE's for the duration of the transfer. Independently from the buffer/repeater elements, separate data on/off ramps exist in the BIU of the SPE, as data targeted for the LS unit of a given SPE can be off-loaded at the BIU. Similarly, outgoing data can be placed onto the EIB by the BIU. Figure 8 - Counter rotational rings of the EIB - 4 SPE's shown The design of the EIB is specifically geared toward the scalability of the CELL processor. That is, signal path lengths on the EIB do not change regardless of the number of SPE's in a given CELL processor configuration. Since the data travels no more than the width of one SPE, more SPE's on a given CELL processor simply means that the data transport latency increases by the number of additional hops through those SPE's. Data transfer through the EIB is controlled by the EIB controller, and the EIB controller works with the DMA engine and the channel controllers to reserve the buffer drivers for a certain number of cycles for each data transfer request. The data transfer algorithm works by reserving channel capacity for each data transfer, thus providing support for real time applications. Finally, the design and implementation of the EIB has a curious side effect in that it limits the current version of the CELL processor to expand only along the horizontal axis. Thus, the EIB enables the CELL processor to be highly configurable and SPE's can be quickly and easily added or removed along the horizontal axis, and the maximum number of SPE's that can be added is set by the maximum width of the chip allowable by the reticle size of the fabrication equipment.
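The hop-per-cycle behaviour described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual EIB arbitration: it assumes the stops are numbered 0..n-1 around a closed ring, that each hop costs exactly one bus clock, that a transfer may use whichever ring direction is shorter, and that the bus clock equals the 4 GHz core clock (the article does not state the bus clock). All names are hypothetical.

```python
def ring_hops(src, dst, n_stops, clockwise):
    """Hops from src to dst on a one-way ring with n_stops stops."""
    if clockwise:
        return (dst - src) % n_stops
    return (src - dst) % n_stops


def best_ring_latency(src, dst, n_stops, clock_ghz=4.0):
    """Return (hops, nanoseconds) using the shorter ring direction.

    Assumes one hop per clock cycle; clock_ghz is an assumed bus clock.
    """
    hops = min(ring_hops(src, dst, n_stops, True),
               ring_hops(src, dst, n_stops, False))
    return hops, hops / clock_ghz


if __name__ == "__main__":
    # With 8 stops, going from stop 1 to stop 7 is 6 hops one way
    # and 2 hops the other way, so the shorter direction is used.
    print(best_ring_latency(1, 7, 8))
```

In this model, adding stops leaves the per-hop cost unchanged and only increases the worst-case hop count, which is the scaling property the article attributes to the EIB.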
The POWERPC Processing Element Neither microarchitectural details nor the performance characteristics of the POWERPC Processing Element were disclosed by IBM during ISSCC 2005. However, what is known is that the PPE processor core is a new core that is fully compliant with the POWERPC instruction set, the VMX instruction set extension inclusive. Additionally, the PPE core is described as a two issue, in-order, 64 bit processor that supports 2 way SMT. The L1 caches of the PPE are reported to be 32 KB each, and the unified L2 cache is 512 KB in size. Furthermore, the lineage of the PPE can be traced to a research project commissioned by IBM to examine high speed processor design with aggressive circuit implementations. The results of this research project were published by IBM first in the Journal of Solid State Circuits (JSSC) in 1998, then again in ISSCC 2000. The paper published in JSSC in 1998 described a processor implementation that supported a subset of the POWERPC instruction set, and the paper published in ISSCC 2000 described a processor that supported the complete POWERPC instruction set and operated at 1 GHz on a 0.25 µm process technology. The microarchitecture of the research processor was disclosed in some detail in the ISSCC 2000 paper. However, that processor was a single issue processor whose design goal was to reach high operating frequency by limiting pipestage delay to 13 FO4, and power consumption limitations were not considered. For the PPE, several major changes in the design goal dictated changes in the microarchitecture from the research processor disclosed at ISSCC in 2000. Firstly, to further increase frequency, the per stage circuit delay design target was lowered from 13 FO4 to 11 FO4. Secondly, limiting power consumption and minimizing leakage current were added as high priority design goals for the PPE. Collectively, these changes limited the per stage logic depth, and the pipeline was lengthened as a result.
The addition of SMT and the two issue design goal completed the metamorphosis of the research processor to the PPE. The result is a processing core that operates at a high frequency with relatively low power consumption, and perhaps relatively poorer scalar performance compared to the beefy POWER5 processor core. Rambus XDR Memory System Figure 9 - The two channel XDR Memory System To provide machine balance and support the peak rating of more than 256 SP GFlops (or 25-30 DP GFlops), the CELL processor requires an enormously capable memory system. For that reason, two channels of Rambus XDR memory are used to obtain 25.6 GB/s of memory bandwidth. In the XDR memory system, each channel can support a maximum of thirty-six devices connected to the same command and address bus. The data bus for each device connects to the memory controller through a set of bi-directional point-to-point connections. In the XDR memory system, addresses and commands are sent on the address and command bus at a rate of 800 Mbits per second (Mbps), and the point to point interface operates at a datarate of 3.2 Gbps. Using DRAM devices with 16 bit wide data busses, each channel of XDR memory can sustain a maximum bandwidth of 102.4 Gbps (2 x 16 x 3.2), or 12.8 GB/s. The CELL processor can thus achieve a maximum bandwidth of 25.6 GB/s with a 2 channel, 4 device configuration. The obvious advantage of the XDR memory system is the bandwidth that it provides to the CELL processor. However, in the configuration illustrated in figure 9, the maximum of 4 DRAM devices means that the CELL processor is limited to 256 MB of memory, given that the highest capacity XDR DRAM device is currently 512 Mbits. Fortunately, XDR DRAM devices could in theory be reconfigured in such a way so that more than 36 XDR devices can be connected to the same 36 bit wide channel and provide 1 bit wide data bus each to the 36 bit wide point-to-point interconnect.
In such a configuration, a two channel XDR memory can support upwards of 16 GB of ECC protected memory with 256 Mbit DRAM devices or 32 GB of ECC protected memory with 512 Mbit DRAM devices. As a result, the CELL processor could in theory address a large amount of memory if the price premium of XDR DRAM devices could be minimized. IBM did not release detailed information about the configuration of the XDR memory system. One feature to watch for in the future is ECC support in the DRAM memory system. Since ECC support is clearly not a requirement of a processor to be used in a game machine, the presence of ECC support would likely indicate IBM's ambition to promote the use of CELL processors in applications that require superior reliability, availability and serviceability, such as HPC, workstation or server systems. Incidentally, Toshiba is a manufacturer of XDR DRAM devices. Presumably it brought the XDR memory controller and memory system design expertise to the table, and could ramp up production of XDR DRAM devices as needed. FlexIO System Interface At ISSCC 2005, Rambus presented a paper on the FlexIO interface used on the CELL processor. However, the presentation was limited to describing the physical layer interconnect. Specifically, the difficulties of implementing the Redwood Rambus ASIC Cell on IBM's 90nm SOI process were examined in some detail. While circuit level issues regarding the challenges of designing high speed I/O interfaces on an SOI based process are in their own right extremely intriguing topics, the focus of this article is geared toward the architectural implications of the high bandwidth interface. As a result, the circuit level details will not be covered here. Interested readers are encouraged to seek out details on Rambus's Redwood technology separately. What is known about the system interface of the CELL processor is that the FlexIO consists of 12 byte lanes.
Each byte lane is a set of 8 bit wide, source synchronous, unidirectional, point-to-point interconnects. The FlexIO makes use of differential signaling to achieve the data rate of 6.4 Gb per second per signal pair, and that data rate in turn translates to 6.4 GB/s per byte lane. The 12 byte lanes are asymmetric in configuration. That is, 7 byte lanes are outbound from the CELL processor, while 5 byte lanes are inbound to the CELL processor. The 12 byte lanes thus provide 44.8 GB/s of raw outbound bandwidth and 32 GB/s of raw inbound bandwidth for total I/O bandwidth of 76.8 GB/s. Furthermore, the byte lanes are arranged into two groups of ports: one group of ports is dedicated to non-coherent off-chip traffic, while the other group of ports is usable for coherent off-chip traffic. It seems clear that Sony itself is unlikely to make use of a coherent, multiple CELL processor configuration for Playstation 3. However, the fact that the PPE and the SPE's can snoop traffic transported through the EIB, and that coherency traffic can be sent to other CELL processors via a coherent interface, means that the CELL processor can indeed be an interesting processor. If nothing else, the CELL processor should enable startups that propose to build FlexIO based coherency switches to garner immediate interest from venture capitalists. Summary The CELL processor presents an intriguing alternative in its pursuit of performance. It seems to be a foregone conclusion that the CELL processor will be an enormously successful product, and that millions of CELL processors will be sold as the processors that power the next generation Sony Playstation. However, IBM has designed some features into the CELL processor that clearly reveal its ambition in seeking new applications for the CELL processor. At ISSCC 2005, much fanfare has been generated by the rating of 256 GFlops @ 4 GHz for the CELL processor.
However, it is the little mentioned double precision capability and the yet undisclosed system level coherency mechanism that appear to be the most intriguing aspects that could enable the CELL processor to find success not just inside the Playstation, but outside of it as well. References [1] J. Silberman et al., "A 1.0-GHz Single-Issue 64-Bit PowerPC Integer Processor", IEEE Journal of Solid-State Circuits, Vol. 33, No. 11, Nov. 1998. [2] P. Hofstee et al., "A 1 GHz Single-Issue 64b PowerPC Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2000. [3] N. Rohrer et al., "PowerPC in 130nm and 90nm Technologies", International Solid-State Circuits Conference Technical Digest, Feb. 2004. [4] B. Flachs et al., "A Streaming Processing Unit for A CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005. [5] D. Pham et al., "The Design and Implementation of a First-Generation CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005. [6] J. Kuang et al., "A Double-Precision Multiplier with Fine-Grained Clock-Gating Support for a First-Generation CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005. [7] S. Dhong et al., "A 4.8 GHz Fully Pipelined Embedded SRAM in the Streaming Processor of a CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005. [8] K. Chang et al., "Clocking and Circuit Design for a Parallel I/O on a First-Generation CELL Processor", International Solid-State Circuits Conference Technical Digest, Feb. 2005. -- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net
From mathog at mendel.bio.caltech.edu Fri Feb 11 08:17:34 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? Message-ID: In designing a computer room two key factors are: 1. Power in (electricity) 2. Power out (A/C) The second term really has two parts: A. the amount of air moved B. the reduction in temperature of that air across the A/C unit The latter part is specified in tons. The A/C guys I've spoken with recently utilize some more or less standard relationship between cubic feet per minute (cfm) and A/C tons for the units they maintain. These run off the campus cold water supply, so it makes sense that heat out is proportional to flow across, assuming that the cold water has a very large heat capacity. However, in terms of cooling the units themselves, the amount of air flow through the racks is also important. That flow is also in cfm. Ideally cfm through the racks would be equal to cfm through the A/C, ie, all air goes once through the racks and then directly through the A/C. Even more ideally cfm through _each_ rack could be modulated somehow, since some racks move much more air than others and putting a low flow rack next to a high flow rack might drive the air the wrong way through the low flow unit. How does one calculate an optimal cfm through a rack? For a specific example with round numbers, let's say it's a 25U rack, dissipates 10kW, and has a single 50 cfm output fan per 1U node. (Ie, all air out must go through that path.) There seem to be a bunch of variables that are hard to deal with. For instance, adding the exhaust fans would be 50*25 = 1250 cfm. Is that all there is to it? But that type of fan only runs at the stated flow rate if the pressures are exactly as specified.
Without incredibly careful balancing of the pressure across the rack it won't generally run at 50 cfm. Is cfm the key unit here or should one think in terms of pressure at various points in the room? Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From joachim at ccrl-nece.de Fri Feb 11 10:49:48 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050211023619.GB5174@greglaptop.internal.keyresearch.com> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> Message-ID: <420CFE4C.6050003@ccrl-nece.de> Greg Lindahl wrote: > On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: >>If I have a fantastic device that uses infinitely small time (latency) >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% >>of a CPU, we do not have a useful solution.. > > If large cpu usage is a problem, it will show up nicely in real > application benchmarks. True. I always wonder what the low-CPU-usage-advocates want the MPI process to do while i.e. an MPI_Send() is executed. For small messages (which are critical for many applications), it's somewhat like requesting that a local memory-write has to show low CPU usage. Of course, I can think of scenarios in which data transfers w/o CPU usage do promise advantages, and I have implemented and evaluated such techniques myself. But in the end (for the application), it always boiled down to latency and bandwidth as most applications don't honor "true" asynchronous communication. The latest unsuccessful case of uncoupling computation and MPI communication I read about was BG/L when using the second CPU as a message processor. 
Maybe Myrinet MX will behave differently by making the MPI itself more concurrent on hardware level (is this a correct description, Patrick?) - but it will need matching applications, too. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rgb at phy.duke.edu Fri Feb 11 11:02:23 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: On Fri, 11 Feb 2005, David Mathog wrote: > In designing a computer room two key factors are: > > 1. Power in (electricity) > 2. Power out (A/C) > > The second term really has two parts: > > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > > The latter part is specified in tons. The A/C guys I've spoken > with recently utilize some more or less standard relationship > between cubic feet per minute (cfm) and A/C tons for the units they > maintain. These run off the campus cold water supply, so > it makes sense that heat out is proportional to flow across, assuming > that the cold water has a very large heat capacity. > > However, in terms of cooling the units themselves, the amount of > air flow through the racks is also important. That flow is > also in cfm. Ideally cfm through the racks would be equal to cfm > through the A/C, ie, all air goes once through the racks and then > directly through the A/C. Even more ideally cfm through _each_ rack > could be modulated somehow, since some racks move much more > air than others and putting a low flow rack next to a high flow rack > might drive the air the wrong way through the low flow unit. > > How does one calculate an optimal cfm through a rack? > > For a specific example with round numbers, let's say it's a > 25U rack, dissipates 10kW, and has a single 50 cfm per minute output > fan per 1U node. (Ie, all air out must go through that path.) 
> > There seem to be a bunch of variables that are hard to deal with. > For instance, adding the exhaust fans would be 50*25 = 1250 cfm. > Is that all there is to it? But that type of fan only runs at > the stated flow rate if the pressures are exactly as specified. > Without incredibly careful balancing of the pressure across the > rack it won't generally run at 50 cfm. > > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? I can't answer all your questions here, but you've pointed out a lot of the problems. You have to arrange for the blower to deliver chilled air to the right places in the room, and you ALSO have to arrange for a warm air return that picks up the warmed air (after it has passed through the systems and cooled them) and returns it to be cooled and cycled again. The overall airflow is determined by those two things -- cool air being delivered at an overpressure, warm air being returned at an underpressure, and the intermediate pressure gradient (interacting with intervening obstacles such as the racks full of equipment) determining the flow pattern. That flow pattern needs to avoid things like "hot spots" that are isolated from the overall cooling flow, especially hot spots that ultimately feed rack intake, and flow that feeds the warmed exhaust from one or more units back into the cool air intake of others. Ultimately, this is a nonlinear problem with turbulence and other factors and hence difficult to make pronouncements on without knowing the geometry of your rack layout and other stuff. One reason that raised floor designs are popular is that it makes establishing a clean circulation pattern a bit simpler -- feed cold air from the bottom right into the rack intakes, vent the warmed air from their outflow directly into a warm air return. The cooling air mixes minimally with ambient room air and is relatively easy to balance.
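As a rough first-order check on the 10 kW, 1250 cfm example in the question, the sensible-heat relation P = rho * c_p * V * dT links power, volumetric airflow and air temperature rise. The sketch below is only that: it assumes sea-level air properties (the density and specific heat constants are textbook approximations), assumes all the heat leaves in the air stream, and ignores the pressure, recirculation and turbulence issues discussed above. The function names are made up for this example.

```python
AIR_DENSITY = 1.2      # kg/m^3, approximate value near sea level (assumption)
AIR_CP = 1005.0        # J/(kg K), specific heat of air (assumption)
M3S_TO_CFM = 2118.88   # 1 m^3/s expressed in cubic feet per minute


def required_cfm(power_watts, delta_t_kelvin):
    """Airflow needed to carry power_watts with a given air temperature rise."""
    m3_per_s = power_watts / (AIR_DENSITY * AIR_CP * delta_t_kelvin)
    return m3_per_s * M3S_TO_CFM


def delta_t_for_cfm(power_watts, cfm):
    """Air temperature rise (K) when power_watts is removed by cfm of air."""
    m3_per_s = cfm / M3S_TO_CFM
    return power_watts / (AIR_DENSITY * AIR_CP * m3_per_s)


if __name__ == "__main__":
    # 25 fans at a nominal 50 cfm each, 10 kW rack: roughly a 14 K rise.
    print(round(delta_t_for_cfm(10_000, 25 * 50), 1))
    # Airflow needed to hold the rise to 10 K: roughly 1750 cfm.
    print(round(required_cfm(10_000, 10.0)))
```

If the fans deliver less than their rated flow because of back pressure, the temperature rise scales up in inverse proportion, which is why the pressure balance matters as much as the nominal cfm.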
In a simpler overhead cold air delivery, warm air return system you'll need to be able to balance the cold air delivery at several points in the room, perhaps blowing it down directly into the front (intake) faces of opposing racks, while letting the warm air get pulled along the ceiling to one or more major return vents. That way you can get a delivered cold-air-down, in-through-rack, out-from-rack, warm-air-up, warm-air-along-ceiling and returned sort of pattern established that is consistent and balanceable among the delivery registers throughout the room. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From idooley at isaacdooley.com Fri Feb 11 12:39:29 2005 From: idooley at isaacdooley.com (Isaac Dooley) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <200502112000.j1BK0DNm021457@bluewest.scyld.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> Message-ID: <420D1801.9090206@isaacdooley.com> >True. I always wonder what the low-CPU-usage-advocates want the MPI >process to do while i.e. an MPI_Send() is executed. > They don't want the process to do anything when they call MPI_Send(), however carefully using asynchronous or non-blocking messaging ideally would not use the CPU. Using MPI_Isend() allows programs to not waste CPU cycles waiting on the completion of a message transaction. This is critical for some tightly coupled fine grained applications. Also it allows for overlapping computation and communication, which is beneficial.
Isaac Dooley From lindahl at pathscale.com Fri Feb 11 13:03:35 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420CFE4C.6050003@ccrl-nece.de> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> <420CFE4C.6050003@ccrl-nece.de> Message-ID: <20050211210335.GE1256@greglaptop.internal.keyresearch.com> On Fri, Feb 11, 2005 at 07:49:48PM +0100, Joachim Worringen wrote: > The latest unsuccessful case of uncoupling computation and MPI > communication I read about was BG/L when using the second CPU as a > message processor. Yep, "offload" that improves performance is more complicated than it seems. The new InfiniPath adapter aims at raw latency and bandwidth excellence, because this is always helpful. It's also frequently helpful to be able to send directly out of cache, for medium-sized packets, instead of using send dma, which has to flush cache to main memory. Memory bandwidth isn't free. Getting more concurrency, by the way, is as much a hardware issue as a software issue. InfiniPath's hardware is dumb, but highly pipelined. Most offload engines seem to have less pipelining. And cpu software overhead generally scales nicely with additional cpus... -- greg From lindahl at pathscale.com Fri Feb 11 13:21:38 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D1801.9090206@isaacdooley.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> Message-ID: <20050211212137.GA2278@greglaptop.internal.keyresearch.com> On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote: > Using MPI_ISend() allows programs to not waste > CPU cycles waiting on the completion of a message transaction. 
This is > critical for some tightly coupled fine grained applications. We do pretty much the same thing for MPI_Send and MPI_ISend for small packets: they're nearly on the wire when the routine returns, and the subsequent MPI_Wait is a no-op. This is actually pretty common among MPI implementations. The problem with trying to generalize about what MPI calls do is that different implementations do different things with them. Reading the standard won't teach you much about implementations. -- greg From rross at mcs.anl.gov Fri Feb 11 13:47:39 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D1801.9090206@isaacdooley.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> Message-ID: On Fri, 11 Feb 2005, Isaac Dooley wrote: > >True. I always wonder what the low-CPU-usage-advocates want the MPI > >process to do while i.e. an MPI_Send() is executed. > > > They don't want the process to do anything when the call MPI_Send, > however carefully using asynchronous or non-blocking messaging ideally > would not use the CPU. Unless your code is multi-threaded, why do you care what the CPU utilization is during MPI_Send()? Saving on the power bill? When you call MPI_Send() semantically you've said "Hey, send this, and btw I can't do anything else until you are done." Likewise for MPI_Recv(). So the implementation will be built to get things done as quickly as possible. Often the path to lowest latency leads to polling, which leads to the high CPU utilization. Same issue with interrupt mitigation, as mentioned earlier in the thread; you can save CPU by coalescing, or you can get better performance. > Using MPI_ISend() allows programs to not waste CPU cycles waiting on the > completion of a message transaction. No, it allows the programmer to express that it wants to send a message but not wait for it to complete right now. 
The API doesn't specify the semantics of CPU utilization. It cannot, because the API doesn't have knowledge of the hardware that will be used in the implementation. > This is critical for some tightly coupled fine grained applications. What exactly is critical for tightly coupled, fine grained applications? I would think that extremely low latency communication would be the most important factor, not whether or not we crank on the CPU to get that. > Also it allows for overlapping computation and communication, which is > beneficial. Sure! Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab From rbw at ahpcrc.org Fri Feb 11 14:11:14 2005 From: rbw at ahpcrc.org (Richard Walsh) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050211212137.GA2278@greglaptop.internal.keyresearch.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <20050211212137.GA2278@greglaptop.internal.keyresearch.com> Message-ID: <420D2D82.5050609@ahpcrc.org> Greg Lindahl wrote: >On Fri, Feb 11, 2005 at 02:39:29PM -0600, Isaac Dooley wrote: > > > >>Using MPI_ISend() allows programs to not waste >>CPU cycles waiting on the completion of a message transaction. This is >>critical for some tightly coupled fine grained applications. >> >> > >We do pretty much the same thing for MPI_Send and MPI_ISend for small >packets: they're nearly on the wire when the routine returns, and >the subsequent MPI_Wait is a no-op. This is actually pretty common >among MPI implementations. > >The problem with trying to generalize about what MPI calls do is that >different implementations do different things with them. Reading >the standard won't teach you much about implementations. 
> >-- greg >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > > Right. Small messages are where latency matters anyway. As the message size dwindles, the remaining overhead is mostly intrinsic to the subroutine call and unavoidable. What is to be done? The only choice is to squeeze out the subroutine call itself with a different programming model (say UPC) and a memory and instruction set architecture that supports single instruction (preferably pipeline with a block/vector length and stride option to hide latency) remote memory addressing. Additions like the STEN on the Quadrics Elan4 and Hypertransport directly from remote processor cache are cluster hardware morphs taking things the direction of GAS systems like the Cray X1 and SGI Altix. rbw From mathog at mendel.bio.caltech.edu Fri Feb 11 14:59:55 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:51 2009 Subject: Fw: Re: [Beowulf] cooling question: cfm per rack? Message-ID: Mike, I've been trying to pick the brains of other folks on the beowulf list who have computer rooms with modern equipment. One problem with the existing air, with regards to future expansion, is apparently the total amount of air that the current A/C can move. This is all horrendously complicated and needs to be looked at carefully by a HVAC consultant. Pretty sure we have enough tons and flow for now, meaning my rack and Deshaies and everything else I know is going in there in a couple of months. More and more convinced that we don't have enough to handle multiple full racks of the next generation of computers. Jim Lux from JPL answered my questions as attached after my signature. His back of the envelope calculations for a 5kW rack (roughly equal to what I have now) give a requirement for 1800 cfm flow through the rack. 
The current A/C is, according to the A/C guy who was here, good for only 5500 cfm. However, since I don't know what the inlet or outlet temperatures on the rack are going to be (ie, the temperature of the air the A/C returns to the room and how hot the air is coming out the back) the required cfm may be quite different than this. Hmm, let me go borrow a thermometer and measure it, 22 C in, 32 C out the back, on the node in the middle of the rack. So there's 10 degrees across my rack and he assumes 15. Anyway, a safer estimate for the next generation is 10kW, and there are people who predict 20kW, so total airflow through the A/C seems unlikely to be sufficient a few years down the road. Assuming that people put this equipment in the room. Sorry, to be vague, there are just so many unknowns. I also talked to Darryl Willick, who runs a bunch of machine rooms on campus for Chemistry and some of Rees, Bjorkman and Mayo's stuff. His main room is about at capacity now with 6 full racks and a few odds and ends. He has 2 x 250A panels in there and apparently only a 45kW A/C unit. That second number is really odd because they aren't usually rated that way, but that's the number he remembered. If he's right that's 45000/3500=12 tons, roughly the same as the unit currently in the Rees area. He said his had to be serviced recently because they were having overheating problems, but only a belt was changed. Unknown how many cfm it is. He has a small workstation area that is somehow or other connected to his machine room ventilation wise, and apparently when they prop the door open in the workstation area it causes problems in the machine room. So maybe it would make sense to put a small separate A/C unit in the proposed classroom to avoid those sorts of complications in the future. Or maybe it can tap off building air. Darryl did say something interesting though, he said that for some units the A/C people can increase the capacity by changing the pulleys around. 
Apparently this blows more air, and the cold water isn't limiting, so it effectively upgrades the unit without changing very much. Darryl said that this was done at some point for Mayo's computer room in the subbasement of the BI. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech ------------- Forwarded message follows ------------- At 08:17 AM 2/11/2005, you wrote: >In designing a computer room two key factors are: > >1. Power in (electricity) >2. Power out (A/C) > >The second term really has two parts: > > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > >The latter part is specified in tons. The A/C guys I've spoken >with recently utilize some more or less standard relationship >between cubic feet per minute (cfm) and A/C tons for the units they >maintain. These run off the campus cold water supply, so >it makes sense that heat out is proportional to flow across, assuming >that the cold water has a very large heat capacity. > >However, in terms of cooling the units themselves, the amount of >air flow through the racks is also important. That flow is >also in cfm. Ideally cfm through the racks would be equal to cfm >through the A/C, ie, all air goes once through the racks and then >directly through the A/C. Even more ideally cfm through _each_ rack >could be modulated somehow, since some racks move much more >air than others and putting a low flow rack next to a high flow rack >might drive the air the wrong way through the low flow unit. > >How does one calculate an optimal cfm through a rack? Decide on a maximum outlet temperature (say, 30C) Find your inlet air temperature (say, 15C) You know your dissipation.. (say, 5kW) Calculate how much air you need to move using the specific heat of air. (about 1 kJ/(kg K)) 5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15 degree rise, you can get by with .33 kg/sec. Turn the kg/sec into cfm... 
.33 kg * 1.3 m3/kg = .43 cubic meters/sec. There's about 35 cubic feet in a cubic meter, so we need about 15 cubic feet per second. Multiply by 60 and you get a bit more than 900 cfm. Now.. that's idealized, so double it. 1800 cfm or so. Step 2: How big is the duct? Generally, you don't want to go any faster than 1000 linear feet per minute, so your duct will need to be about 2 square feet. (you begin to see why you don't want some little 6" diameter blower...) >For a specific example with round numbers, let's say it's a >25U rack, dissipates 10kW, and has a single 50 cfm per minute output >fan per 1U node. (Ie, all air out must go through that path.) > >There seem to be a bunch of variables that are hard to deal with. >For instance, adding the exhaust fans would be 50*25 = 1250 cfm. >Is that all there is to it? But that type of fan only runs at >the stated flow rate if the pressures are exactly as specified. >Without incredibly careful balancing of the pressure across the >rack it won't generally run at 50 cfm. This is precisely the case. And, of course, the actual circumstances will be nothing like what the design specs are. >Is cfm the key unit here or should one think in terms of pressure >at various points in the room? Trying to come up with an accurate aerodynamic model is a worthy challenge for a very large cluster (computational challenge, not thermal). It's all done by rules of thumb and adding lots of margin. Use the rough sizing technique to get an approximate air flow. Use reasonable sized ducts and air speeds. Measure the actual outlet temperatures. Actually, what most people do is a rough sizing, then call in someone who actually does this for a living (a HVAC contractor) and use their rough sizing to validate what the contractor tells you you should have. 
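As a rough illustration of the sizing arithmetic above, here is a minimal Python sketch. It assumes dry air with a specific heat of about 1 kJ/(kg K) and a density of about 1.2 kg/m^3 (an assumed room-temperature value, not a figure from the message), and the function names are made up for this example. Because it divides the mass flow by density, its ideal figure for 5 kW and a 15 degree rise comes out nearer 590 cfm than 900, so the numbers above can be read as carrying extra margin rather than as exact.

```python
# Rough rack airflow sizing from heat load and allowed temperature rise.
# All constants are approximate, assumed values for dry air near room temperature.

CP_AIR_J_PER_KG_K = 1000.0    # specific heat of air, about 1 kJ/(kg K)
RHO_AIR_KG_PER_M3 = 1.2       # assumed air density
CUBIC_FEET_PER_M3 = 35.31     # about 35 cubic feet in a cubic meter


def ideal_cfm(power_w, delta_t_c):
    """Idealized airflow (cfm) needed to carry power_w with a delta_t_c rise."""
    mass_flow_kg_s = power_w / (CP_AIR_J_PER_KG_K * delta_t_c)
    volume_flow_m3_s = mass_flow_kg_s / RHO_AIR_KG_PER_M3
    return volume_flow_m3_s * CUBIC_FEET_PER_M3 * 60.0


def duct_area_sqft(cfm, max_speed_ft_per_min=1000.0):
    """Minimum duct cross-section (square feet) to stay under the speed limit."""
    return cfm / max_speed_ft_per_min


if __name__ == "__main__":
    ideal = ideal_cfm(5000.0, 15.0)   # 5 kW rack, 15 C in, 30 C out
    design = 2.0 * ideal              # doubled as margin, per the rule of thumb
    print(f"ideal  : {ideal:.0f} cfm")
    print(f"design : {design:.0f} cfm")
    print(f"duct   : {duct_area_sqft(design):.2f} sq ft")
```

Since the required flow is inversely proportional to the temperature rise, a 10 degree rise across the rack needs half again as much air as a 15 degree rise for the same load.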
>Thanks, > >David Mathog >mathog@caltech.edu >Manager, Sequence Analysis Facility, Biology Division, Caltech >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From mathog at mendel.bio.caltech.edu Fri Feb 11 15:06:49 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Oops Message-ID: Sorry about that message addressed to "Mike", it wasn't supposed to go to the list. Please cancel it if that's possible. Apologies otherwise. David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rross at mcs.anl.gov Fri Feb 11 18:47:22 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420D54DA.8000904@uiuc.edu> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: Hi Isaac, On Fri, 11 Feb 2005, Isaac Dooley wrote: > >>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the > >>completion of a message transaction. > > > >No, it allows the programmer to express that it wants to send a message > >but not wait for it to complete right now. The API doesn't specify the > >semantics of CPU utilization. It cannot, because the API doesn't have > >knowledge of the hardware that will be used in the implementation. > > > That is partially true. The context for my comment was under your > assumption that everyone uses MPI_Send(). These people, as I stated > before, do not care about what the CPU does during their blocking calls. 
I think that it is completely true. I made no assumption about everyone using MPI_Send(); I'm a late-comer to the conversation. I was not trying to say anything about what people making the calls care about; I was trying to clarify what the standard does and does not say. However, I agree with you that it is unlikely that someone calling MPI_Send() is too worried about what the CPU utilization is during the call. > I was trying to point out that programs utilizing non-blocking IO may > have work that will be adversely impacted by CPU utilization for > messaging. These are the people who care about CPU utilization for > messaging. This I hopes answers your prior question, at least partially. I agree that people using MPI_Isend() and related non-blocking operations are sometimes doing so because they would like to perform some computation while the communication progresses. People also use these calls to initiate a collection of point-to-point operations before waiting, so that multiple communications may proceed in parallel. The implementation has no way of really knowing which of these is the case. Greg just pointed out that for small messages most implementations will do the exact same thing as in the MPI_Send() case anyway. For large messages I suppose that something different could be done. In our implementation (MPICH2), to my knowledge we do not differentiate. You should understand that the way MPI implementations are measured is by their performance, not CPU utilization, so there is pressure to push the former as much as possible at the expense of the latter. > Perhaps your applications demand low latency with no concern for the CPU > during the time spent blocking. That is fine. But some applications > benefit from overlapping computation and communication, and the cycles > not wasted by the CPU on communication can be used productively. 
I wouldn't categorize the cycles spent on communication as "wasted"; it's not like we code in extraneous math just to keep the CPU pegged :). Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab From hahn at physics.mcmaster.ca Fri Feb 11 21:41:41 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: Message-ID: > The second term really has two parts: > A. the amount of air moved > B. the reduction in temperature of that air across the A/C unit > > The latter part is specified in tons. The A/C guys I've spoken well, I usually think of temperature as a side-effect of the more direct measure, movement of energy. hence, I always think of the tidy relation of 3.517 KW = 1 ton. I usually skip any BTUs... > with recently utilize some more or less standard relationship > between cubic feet per minute (cfm) and A/C tons for the units they > maintain. CFM and delta-t across the machine-to-be-cooled are convolved to give you how much heat you're extracting. no doubt both pressure and humidity are involved to some degree as well, and I don't have a good equation for this. the good thing is that turning down the temperature can partly mitigate minor airflow problems. some reasonable discussion from Intel, (a bit axe-grinding, though): http://www.7x24nw.org/Presentations-folder/Air%20Cooling%20in%20Servers%20and%20IT%20Facilities.pdf a dell 1855 blade chassis spec's 400 CFM for ~4KW. they're talking 6 of those chassis in a rack (24 KW!). 
then again, that's assuming an unrealistic power-per blade (>400W), which sounds like corporate CYA to me: http://www.dell.com/downloads/global/products/pedge/en/PowerEdge%201855%20DC%20Whitepaper.pdf this is a good overall discussion, though perhaps a bit pessimistic about "typical" machinerooms: http://www.chatsworth.com/uploads/pdf/best_practices_cooling_wp.pdf http://www.chatsworth.com/uploads/pdf/increase_computerrm_cooling_wp.pdf sun recommends 21-23C, 45-50%. 35% min, ESD critical at 30%: http://www.sun.com/products-n-solutions/hardware/docs/html/817-4137-10/2__EnvReq.html to complicate matters, HVAC folk always bring up the issue of "sensible load". as near as I can tell, this is just a way of saying that if you try to impose too much delta-T on humid air, you wind up wasting a lot of energy dehumidifying it... tiles between 500-2000 CFM: http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724.pdf that also gives: CFM = btu/hr / (1.08 * dT) so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling. HVAC folk also tend to say 1 tile/ton, which seems about right. > These run off the campus cold water supply, so > it makes sense that heat out is proportional to flow across, assuming > that the cold water has a very large heat capacity. our experience with CW has been disasterous, but we made the huge mistake of not using precision/machineroom chillers (fancoils, actually). our old/existing machineroom, for instance, is supposed to have 2x8ton fancoils, but combined they never moved more than about 20 KW (should be 56). unless you have pretty extreme assurances about WC quality (flow, temp), I would only consider using dual-cool machineroom chillers (DX + CW, usually adds about 15% to price.) > directly through the A/C. 
Even more ideally cfm through _each_ rack > could be modulated somehow, since some racks move much more > air than others and putting a low flow rack next to a high flow rack > might drive the air the wrong way through the low flow unit. well, the stuff in racks does probably have quite a few fans, which could ideally modulate themselves. my current-gen clusters certainly don't do that, but I'd be quite happy if next-gen did... > How does one calculate an optimal cfm through a rack? > > For a specific example with round numbers, let's say it's a > 25U rack, dissipates 10kW, and has a single 50 cfm per minute output > fan per 1U node. (Ie, all air out must go through that path.) that sounds reasonable to me - 10KW is ~3 tons, and the formula above relates your 1250 CFM total to about 3 tons as well. for my 10KW racks, I'm hoping to push the temperature down a bit (60-65), keep the humidity low to avoid "sensible" wastage, and hope for the best with our tiles. > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? I think the answer is yes. with a good raised floor, you seem to be able to expect fairly even pressure distribution. we turned on our new machineroom yesterday, and the pressure feels similar everywhere (16" raised floor, though with some conduits down there, and 3x30T Liebert deluxe system 3's.) if your pressure is reasonably even, the same tiles should flow the same CFM. I'd LOVE to find some way to measure airflow, since I'd actually consider doing things like adding patches of duct tape to the underside of too-high-flow tiles. I suppose that the empiricist approach is just to sample all your system temperatures, and if some are too high, reduce the airflow to racks which are "too cool". From james.p.lux at jpl.nasa.gov Sat Feb 12 05:45:02 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? 
References: Message-ID: <000401c51109$0df9a6d0$19f29580@LAPTOP152422> > CFM and delta-t across the machine-to-be-cooled are convolved to give > you how much heat you're extracting. no doubt both pressure and humidity > are involved to some degree as well, and I don't have a good equation > for this. Indeed.. there is no "nice simple" equation for the general case, because of the problem with humidity. You really need to be worrying about enthalpy, etc., and with any sort of significant temperature change, it's neither constant pressure, nor constant volume, not to mention mechanical turbulence, etc.. All that icky thermodynamics stuff. I once spend several weeks trying to figure out if one could make theatrical fog without using liquid nitrogen. They do it by having a big tank of water about half full at around 160-180F, and then they inject liquid nitrogen into the headspace above the water. Turns out that the heat of vaporization of the LN2 is almost exactly balanced by the heat of condensation of the saturated water vapor, and that the volume of nitrogen gas produced, etc, works out to the outlet stream being around 38F, with the water droplets at the same temperature. Very, very tough to do this with mechanical refrigeration for a variety of reasons. So, as you say, unless you're airconditioning a huge building (where the cost of excess capacity is significant, and where there all those hot, water exhaling people inside), you can just do some quasi-worst case approximating. the good thing is that turning down the temperature can partly > mitigate minor airflow problems. > > to complicate matters, HVAC folk always bring up the issue of "sensible > load". as near as I can tell, this is just a way of saying that if you try > to impose too much delta-T on humid air, you wind up wasting a lot of energy > dehumidifying it... Yes.. this is especially true if you're not recirculating, but chilling fresh air from "outside". 
If you've got a reasonably closed system and there's no people inside, it's less of an issue. > > tiles between 500-2000 CFM: > http://h200005.www2.hp.com/bc/docs/support/SupportManual/c00064724/c00064724 .pdf > that also gives: > CFM = btu/hr / (1.08 * dT) > so for 1 ton = 12000 BTU/hr and 70->90, 555 CFM per ton of cooling. > HVAC folk also tend to say 1 tile/ton, which seems about right. > > > These run off the campus cold water supply, so > > it makes sense that heat out is proportional to flow across, assuming > > that the cold water has a very large heat capacity. Yes, in a theoretical sense. However, there are two factors to be aware of: 1) run the air too fast past the coils and it doesn't have time to exchange the heat; 2) run the air too fast and you consume power (and make heat) in compressing it to overcome the pressure drop. There's also a practical limit on just how much delta T you can get in one pass through the chiller coils. > > our experience with CW has been disasterous, but we made the huge mistake > of not using precision/machineroom chillers (fancoils, actually). > our old/existing machineroom, for instance, is supposed to have 2x8ton > fancoils, but combined they never moved more than about 20 KW (should be 56). > > unless you have pretty extreme assurances about WC quality (flow, temp), > I would only consider using dual-cool machineroom chillers (DX + CW, usually > adds about 15% to price.) > > > directly through the A/C. Even more ideally cfm through _each_ rack > > could be modulated somehow, since some racks move much more > > air than others and putting a low flow rack next to a high flow rack > > might drive the air the wrong way through the low flow unit. > > > > Is cfm the key unit here or should one think in terms of pressure > > at various points in the room? > > > if your pressure is reasonably even, the same tiles should flow the > same CFM. 
I'd LOVE to find some way to measure airflow, since I'd > actually consider doing things like adding patches of duct tape to > the underside of too-high-flow tiles. I suppose that the empiricist > approach is just to sample all your system temperatures, and if some > are too high, reduce the airflow to racks which are "too cool". Hie thee to a company called Dwyer, who make equipment specifically designed to measure airflow. There are several approaches.. One is using a pitot tube with a Magnehelic differential pressure gauge. Another is to measure the pressure drop across a calibrated orifice (again, using a sensitive pressure gauge). http://www.dwyer-inst.com/ Another is to use an airspeed probe (looks like a wand with a little fan in a hole on the end). The fancy ones will average a bunch of readings over an opening and do the calculation to turn area*average speed into CFM. You can find Magnehelic gauges surplus all the time.. keep your eyes open and when one turns up for $15-20, grab it. They're handy devices that can measure fairly small pressures (few inches of water column), and come with all sorts of weird scales (including some already calibrated in feet per minute or m/sec, all ready for use with a pitot tube). Interesting to measure the pressure in a room (or your house) and see what happens when the heater turns on, or the kids open and close the doors, etc. It's worth spending some time with the McMaster-Carr catalog (http://www.mcmaster.com/) or the Grainger catalog (http://www.grainger.com/) (both are large suppliers of stuff mechanical, materials, etc.. everyone should have a copy of the several thousand page yellow McMaster-Carr catalog on their desk...). Omega (usually associated with temperature measuring) has a fair number of airspeed and volume measuring devices. http://www.omega.com However, your empirical approach of reducing the flow through the coldest racks is probably as good as anything.
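To make the area*average speed calculation concrete, here is a small Python sketch. The readings and the opening size are invented for illustration, and the function names are hypothetical. The second function uses the rule of thumb quoted earlier in the thread, CFM = BTU/hr / (1.08 * dT) with dT in degrees F, so a measured tile flow can be compared with what a given heat load needs.

```python
# Turn airspeed probe readings into CFM and compare with the rule-of-thumb need.

BTU_PER_HR_PER_TON = 12000.0   # 1 ton of cooling


def measured_cfm(speeds_ft_per_min, opening_area_sqft):
    """Average several airspeed readings over an opening and scale by its area."""
    average_speed = sum(speeds_ft_per_min) / len(speeds_ft_per_min)
    return average_speed * opening_area_sqft


def required_cfm(load_btu_per_hr, delta_t_f):
    """Rule of thumb: CFM = BTU/hr / (1.08 * dT), dT in degrees F."""
    return load_btu_per_hr / (1.08 * delta_t_f)


if __name__ == "__main__":
    # Hypothetical traverse of a 2 ft x 2 ft opening, readings in feet per minute.
    readings = [130.0, 150.0, 140.0, 160.0]
    flow = measured_cfm(readings, 4.0)
    need = required_cfm(BTU_PER_HR_PER_TON, 20.0)   # 1 ton, 70 F -> 90 F
    print(f"measured : {flow:.0f} cfm")
    print(f"needed   : {need:.0f} cfm per ton")
```

A single reading at the center of an opening tends to overstate the flow, which is why the sketch averages several points before multiplying by the area.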
Jim Lux From rgb at phy.duke.edu Sat Feb 12 06:36:17 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:51 2009 Subject: Fw: Re: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: On Fri, 11 Feb 2005, David Mathog wrote: > Sorry, to be vague, there are just so many unknowns. Always.:-) > > I also talked to Darryl Willick, who runs a bunch of machine rooms > on campus for Chemistry and some of Rees, Bjorkman and Mayo's > stuff. His main room is about at capacity now with > 6 full racks and a few odds and ends. He has 2 x 250A panels > in there and apparently only a 45kW A/C unit. That second > number is really odd because they aren't usually rated that > way, but that's the number he remembered. If he's right that's > 45000/3500=12 tons, roughly the same as the unit currently > in the Rees area. He said his had to be serviced > recently because they were having overheating problems, but only > a belt was changed. Unknown how many cfm it is. He has a small > workstation area that is somehow or other connected to his machine > room ventilation wise, and apparently when they prop the door open > in the workstation area it causes problems in the machine room. > So maybe it would make sense to put a small separate A/C unit > in the proposed classroom to avoid those sorts of complications > in the future. Or maybe it can tap off building air. > > Darryl did say something interesting though, he said that for > some units the A/C people can increase the capacity by changing > the pulleys around. Apparently this blows more air, and the > cold water isn't limiting, so it effectively upgrades the unit > without changing very much. Darryl said that this was done > at some point for Mayo's computer room in the subbasement > of the BI. I'm sure you probably remember this from my posts on this topic before, but there are lots of bad experiences we and others on the list have had with AC that you can profit from. 
Don't forget things like: * Kill switch for room for the day the AC fails altogether at 2:30 a.m. * Automated monitoring and (if you've got one) a call cycle so that maybe somebody can get there in time to shut things down before the kill switch kicks in EVEN at 2:30 a.m. * The fact that at many places, the physical plant people have this annoying tendency to try to save energy by throttling down the A/C to a standby mode (where the chilled water is allowed to warm up to maybe 18C) in the winter because hey, it's cold outside, right? Often this is done automatically, without human thought or control. Often this triggers events for which the first two interventions are required when it does. This may not apply to you in your generally warm clime (compared to here, anyway) but is worth checking, for sure. * When computing the cost/benefit of power vs AC, be aware (to put into words what you're working toward anyway) that the true optimum is going to be biased towards an excess of AC capacity. This is for several reasons, once you think about it. The most important one is that adding new/additional power is relatively cheap whenever you do it; adding new/additional AC capacity later can be VERY expensive -- as expensive as adding AC at all in the first place. * Surplus capacity can also keep room ambient colder (generally better) while operating in the normal load range and may be cheaper in terms of operating efficiency, as AC COP depends on temperature differentials between delivery and returned chiller water (although the blowers and pumps draw too -- don't know how this all works out in the wash). * Redundancy is good, if you've got the space. If one blower out of three goes, the remaining two may be able to keep the space operational while service is performed, or at least keep it cool enough to avoid an involuntary kill or midnight call. 
* As you note -- it really helps to get professional advice on this from an engineer or architect who specializes in server room infrastructure design and support. Not that you shouldn't educate yourself in it too -- it's just that they SHOULD have a broad base of personal professional experience to draw on as well as some classroom education on the issues to be faced. Worth paying for. As you note, it is very difficult to know exactly where future power requirements and node densities will go per rack. Maybe blades will take over the universe, and racks will suddenly become very hot indeed. Some non-blade racks can achieve close to double the standard node/CPU densities in terms of floorspace footprint (e.g. Rackable, IIRC). Multiple core CPUs are at the threshold of appearing, and although they also look like they might be power/clock limited BECAUSE of the heat problem, there is still going to be some sort of scaling of power per compute capacity per cubic foot of rack space as the latter goes up. Alternatively, some room designs might install the DUCTWORK now that can support a (say) doubling of future AC capacity in the future and reserve space for the local units to drive this capacity in the facility but leave that space empty. Then you can (eventually) add the units without having to necessarily rip everything apart. This probably works best with raised floor designs (where you just duct per rack location) but one would expect that they could manage it for other kinds of ducted delivery and return if they try. In any infrastructure project, it really pays to think about this stuff ahead of time, as you are. rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > > ------------- Forwarded message follows ------------- > > At 08:17 AM 2/11/2005, you wrote: > >In designing a computer room two key factors are: > > > >1. Power in (electricity) > >2. 
Power out (A/C)
> >
> >The second term really has two parts:
> >
> > A. the amount of air moved
> > B. the reduction in temperature of that air across the A/C unit
> >
> >The latter part is specified in tons. The A/C guys I've spoken
> >with recently utilize some more or less standard relationship
> >between cubic feet per minute (cfm) and A/C tons for the units they
> >maintain. These run off the campus cold water supply, so
> >it makes sense that heat out is proportional to flow across, assuming
> >that the cold water has a very large heat capacity.
> >
> >However, in terms of cooling the units themselves, the amount of
> >air flow through the racks is also important. That flow is
> >also in cfm. Ideally cfm through the racks would be equal to cfm
> >through the A/C, ie, all air goes once through the racks and then
> >directly through the A/C. Even more ideally cfm through _each_ rack
> >could be modulated somehow, since some racks move much more
> >air than others and putting a low flow rack next to a high flow rack
> >might drive the air the wrong way through the low flow unit.
> >
> >How does one calculate an optimal cfm through a rack?
>
> Decide on a maximum outlet temperature (say, 30C)
> Find your inlet air temperature (say, 15C)
> You know your dissipation.. (say, 5kW)
>
> Calculate how much air you need to move using the specific heat of air.
> (about 1 kJ/(kg K))
>
> 5 kJ/sec means you'd need 5 kg/sec for a 1 degree rise, but here, with a 15
> degree rise, you can get by with .33 kg/sec. Turn the kg/sec into cfm...
> Air is about 1.2 kg/m3, so .33 kg/sec / 1.2 kg/m3 = .28
> cubic meters/sec. There's about 35 cubic feet in a cubic meter, so we need
> about 10 cubic feet per second. Multiply by 60 and you get a bit less than
> 600 cfm.
>
> Now.. that's idealized, so double it. 1200 cfm or so.
>
>
> Step 2: How big is the duct? Generally, you don't want to go any faster
> than 1000 linear feet per minute, so your duct will need to be about 1.2
> square feet.
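
The rough sizing steps above can be written down as a small C sketch. The function names are invented for this note, and the inputs are assumptions you supply: air density (about 1.2 kg/m^3 near room temperature is used in the example) and specific heat (about 1000 J/(kg K), as in the message). The result scales inversely with whatever density you plug in, so treat it as a cross-check on a hand calculation rather than a design figure.

```c
/* Sketch only: names are illustrative; rho and cp are caller-supplied. */

#define CFM_PER_M3S 2118.88   /* 1 m^3/s expressed in cubic feet per minute */

/* Air flow (CFM) needed to carry away `watts` of heat with a temperature
   rise of delta_t_k kelvin. rho in kg/m^3, cp in J/(kg K). */
double required_cfm(double watts, double delta_t_k, double rho, double cp)
{
    double mass_flow = watts / (cp * delta_t_k);   /* kg/s  */
    double vol_flow  = mass_flow / rho;            /* m^3/s */
    return vol_flow * CFM_PER_M3S;
}

/* The inverse question: temperature rise of the air for a given flow. */
double temp_rise_k(double watts, double cfm, double rho, double cp)
{
    double mass_flow = (cfm / CFM_PER_M3S) * rho;  /* kg/s */
    return watts / (mass_flow * cp);
}

/* Duct cross-section in square feet for a chosen maximum air speed
   in linear feet per minute. */
double duct_area_sqft(double cfm, double max_fpm)
{
    return cfm / max_fpm;
}
```

With these assumptions, required_cfm(5000, 15, 1.2, 1000) comes out a little under 600 cfm before any margin is added. For the 10 kW rack with 25 fans of 50 cfm each discussed in this thread, temp_rise_k(10000, 1250, 1.2, 1000) gives a rise of about 14 K, provided the fans really deliver their rated flow.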
(you begin to see why you don't want some little 6" diameter > blower...) > > > > >For a specific example with round numbers, let's say it's a > >25U rack, dissipates 10kW, and has a single 50 cfm per minute output > >fan per 1U node. (Ie, all air out must go through that path.) > > > >There seem to be a bunch of variables that are hard to deal with. > >For instance, adding the exhaust fans would be 50*25 = 1250 cfm. > >Is that all there is to it? But that type of fan only runs at > >the stated flow rate if the pressures are exactly as specified. > >Without incredibly careful balancing of the pressure across the > >rack it won't generally run at 50 cfm. > > > This is precisely the case. And, of course, the actual circumstances will > be nothing like what the design specs are. > > > >Is cfm the key unit here or should one think in terms of pressure > >at various points in the room? > > Trying to come up with an accurate aerodynamic model is a worthy challenge > for a very large cluster (computational challenge, not thermal). > > It's all done by rules of thumb and adding lots of margin. > > Use the rough sizing technique to get an approximate air flow. Use > reasonable sized ducts and air speeds. Measure the actual outlet > temperatures. > > Actually, what most people do is a rough sizing, then call in someone who > actually does this for a living (a HVAC contractor) and use their rough > sizing to validate what the contractor tells you you should have. > > > > >Thanks, > > > >David Mathog > >mathog@caltech.edu > >Manager, Sequence Analysis Facility, Biology Division, Caltech > >_______________________________________________ > >Beowulf mailing list, Beowulf@beowulf.org > >To change your subscription (digest mode or unsubscribe) visit > >http://www.beowulf.org/mailman/listinfo/beowulf > > James Lux, P.E. 
> Spacecraft Radio Frequency Subsystems Group > Flight Communications Systems Section > Jet Propulsion Laboratory, Mail Stop 161-213 > 4800 Oak Grove Drive > Pasadena CA 91109 > tel: (818)354-2075 > fax: (818)393-6875 > > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Sat Feb 12 06:49:03 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: > if your pressure is reasonably even, the same tiles should flow the > same CFM. I'd LOVE to find some way to measure airflow, since I'd > actually consider doing things like adding patches of duct tape to > the underside of too-high-flow tiles. I suppose that the empiricist > approach is just to sample all your system temperatures, and if some > are too high, reduce the airflow to racks which are "too cool". Relative airflow can probably be measured with a kid's toy -- one of the little pinwheels -- and counting revolutions with a stopwatch. Normalizing that to absolute airflow in CFM is a bit tricky (since the result depends to some extent on the resistance imposed by the measuring apparatus) but somebody out there may have designed a version of this with a real fan and magnets set so that the counting is done electronically. In fact, I could build something to do this out of OTC parts if I had any way to normalize the count. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu

From james.p.lux at jpl.nasa.gov Sat Feb 12 07:47:17 2005
From: james.p.lux at jpl.nasa.gov (Jim Lux)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] cooling question: cfm per rack?
References:
Message-ID: <001401c5111a$3c447b30$32a8a8c0@LAPTOP152422>

Sure, one could build it, but one can probably buy it cheaper/easier.

Omega: http://www.omega.com/ppt/pptsc.asp?ref=HHF82&Nav=grec06 $89
They have others.

Similar devices abound:
http://www.nkhome.com/ww/1000/1000.html
http://www.windandweather.com/store/Weather_Instruments___Wind_Gauges?Args=&page_number=1
(check out the first one.. $49)

Your local sporting goods place (REI, Sport Chalet, Big 5) might have
something like this too. So might Sharper Image or Brookstone, or one of
those gadget stores.

Heck, Harbor Freight Salvage, a big retailer of inexpensive moderate
quality imported stuff, might have them. Next time you're down buying
cheap imported Chinese machine tools, check that bargain bin next to the
register.

Other approaches:
Small propeller on a small DC motor run as a generator (only works for
fairly fast flows, >several m/sec) run to a DVM.
Small propeller and magnet/reed switch driving a counter (as in your
inexpensive DMM). (This is what the commercial units are.)

The challenge in home fabrication of such devices is getting it to be
reasonably orientation insensitive, which implies pretty good balance,
and to work in very low flows (<1 m/sec), which implies fairly low
friction.

I imagine, if you had a LOT of time on your hands, you could probably
modify the heated film/wire sensor from an automotive mass air flow
sensor for this purpose.

(I spent the better part of a year trying to come up with a low budget
way to measure velocity profiles across large (decameter scale)
artificial tornadoes. We eventually settled on a pitot tube rake with
water manometers using video to do data logging.)

----- Original Message -----
From: "Robert G.
Brown" To: "Mark Hahn" Cc: "David Mathog" ; Sent: Saturday, February 12, 2005 6:49 AM Subject: Re: [Beowulf] cooling question: cfm per rack? > > if your pressure is reasonably even, the same tiles should flow the > > same CFM. I'd LOVE to find some way to measure airflow, since I'd > > actually consider doing things like adding patches of duct tape to > > the underside of too-high-flow tiles. I suppose that the empiricist > > approach is just to sample all your system temperatures, and if some > > are too high, reduce the airflow to racks which are "too cool". > > Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count. > > rgb From Toufeeq_Hussain at infosys.com Thu Feb 10 20:01:55 2005 From: Toufeeq_Hussain at infosys.com (Toufeeq Hussain) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Porting lam-7.1 to Cygwin (Win 2K) Message-ID: <557E17BE74D22143B7BE70EB60E33E9915BD81F8@shlmsg01.ad.infosys.com> Hi, Trying to compile lam-7.1 on Cygwin. 
Make fails at this point: make[2]: Entering directory `/lam-7.1.1/otb/lamgrow' /bin/bash ../../libtool --mode=link gcc -O3 -o lamgrow.exe lamgrow.o ../../share/liblam/liblam.la ../../share/libltdl/libltdlc.la -lutil gcc -O3 -o lamgrow.exe lamgrow.o ../../share/liblam/.libs/liblam.a ../../share/libltdl/.libs/libltdlc.a -lutil ../../share/liblam/.libs/liblam.a(ssi_boot_slurm.o)(.text+0x3c8):ssi_boo t_slurm.c: undefined reference to `_inet_ntop' collect2: ld returned 1 exit status make[2]: *** [lamgrow.exe] Error 1 make[2]: Leaving directory `/lam-7.1.1/otb/lamgrow' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/lam-7.1.1/otb' make: *** [all-recursive] Error 1 Is there a cygwin port available ? Any suggestions to the above problem. Regards, Toufeeq Hussain From rwm at absoft.com Fri Feb 11 07:00:00 2005 From: rwm at absoft.com (Rodney Mach) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: thread safe PRNG In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com> References: <200502111409.j1BE8vAY013737@bluewest.scyld.com> Message-ID: <420CC870.6020307@absoft.com> > Hi folks: > > I need to get a thread-safe pseudo-random number generator. All I > have found online was SPRNG which is set up for MPI. Anyone have a > quick pointer to their favorite thread safe PRNG that works well in > OpenMP? > > Thanks. > > Joe > Hey Joe, Intel MKL has various thread-safe prng that will work with OpenMP. IMSL also has thread-safe prng, as does IBM ESSL, ditto for AMD ACML. -Rod From henry.gabb at intel.com Fri Feb 11 07:04:31 2005 From: henry.gabb at intel.com (Gabb, Henry) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program Message-ID: Hi Joe, The Intel Math Kernel Library (specifically the Vector Statistical Library within MKL) contains threadsafe random number functions. The following web site has a full description: http://www.intel.com/software/products/mkl/features/vsl.htm. 
There's an article "Making the Monte Carlo Approach Even Easier and Faster" on Intel Developer Services that describes how to use VSL functions with OpenMP. It's available here: http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm. Best regards, Henry Gabb Intel Parallel Applications Center > Hi folks: > > I need to get a thread-safe pseudo-random number generator. All I > have found online was SPRNG which is set up for MPI. Anyone have a > quick pointer to their favorite thread safe PRNG that works well in OpenMP? > > Thanks. > > Joe > > -- > Scalable Informatics LLC, > email: landman@scalableinformatics.com > web : http://www.scalableinformatics.com From diep at xs4all.nl Fri Feb 11 08:59:56 2005 From: diep at xs4all.nl (Vincent Diepeveen) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] A thread-safe PRNG for an OpenMP progra Message-ID: <3.0.32.20050211175956.0102c6c0@pop.xs4all.nl> Perhaps use a local PRNG as that can serve roughly at 2 nanoseconds a number to each cpu. Here is what i modified to 64 bits it's real fast at processors that are 64 bits and have rotating instruction (itanium doesn't have it, but still is faster than k7 here as it's 64 bits). Even at itanium you can consider this a fast PRNG. 
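
The generator posted in this message keeps its history buffer and indexes in file-scope variables, which is fine when every process has its own copy but not when OpenMP threads share them. A minimal sketch of the same RANROT-B recurrence with the state moved into a struct, so that each thread can own one, might look like this. The struct, the function names and the SplitMix64-based seeding are assumptions of this sketch, not part of the posted code, and no claim is made about statistical independence between the per-thread streams.

```c
#include <stdint.h>

#define RANROT_KK 17   /* history length, as in the posted code */
#define RANROT_JJ 10   /* lag */
#define RANROT_R1 5    /* rotation counts */
#define RANROT_R2 3

/* Hypothetical per-thread generator state (not part of the posted code). */
typedef struct {
    uint64_t buf[RANROT_KK];
    int p1, p2;
} ranrot_state;

/* Rotate a 64-bit value left by r bits (0 < r < 64). */
uint64_t ranrot_rotl(uint64_t x, int r)
{
    return (x << r) | (x >> (64 - r));
}

/* One SplitMix64 step; used here only to spread a seed over the buffer.
   This seeding scheme is an assumption of the sketch. */
uint64_t ranrot_splitmix(uint64_t *s)
{
    uint64_t z = (*s += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Next 64-bit value; reads and writes only the caller's own state. */
uint64_t ranrot_next(ranrot_state *st)
{
    uint64_t x = st->buf[st->p1] =
        ranrot_rotl(st->buf[st->p2], RANROT_R1) +
        ranrot_rotl(st->buf[st->p1], RANROT_R2);
    if (--st->p1 < 0) st->p1 = RANROT_KK - 1;
    if (--st->p2 < 0) st->p2 = RANROT_KK - 1;
    return x;
}

/* Fill the buffer from (seed, thread_id) and discard the first outputs. */
void ranrot_init(ranrot_state *st, uint64_t seed, unsigned thread_id)
{
    uint64_t s = seed ^ (0x9E3779B97F4A7C15ULL * (uint64_t)(thread_id + 1u));
    int i;
    for (i = 0; i < RANROT_KK; i++)
        st->buf[i] = ranrot_splitmix(&s);
    st->p1 = 0;
    st->p2 = RANROT_JJ;
    for (i = 0; i < 3000; i++)
        (void)ranrot_next(st);
}
```

In an OpenMP program the idea would be to declare a ranrot_state inside the parallel region, call ranrot_init() with omp_get_thread_num() as the thread_id, and then call ranrot_next() on that local state only, so no locking is needed.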
/* define parameters (R1 and R2 must be smaller than the integer size): */
#define UNIX 1 // otherwise windows

#if UNIX
/* (the header named on the #include line here was lost in the archive;
   nothing below depends on it) */
#define FORCEINLINE __inline
/* UNIX and such this is 64 bits unsigned variable: */
#define BITBOARD unsigned long long
#else
#define FORCEINLINE __forceinline
/* in WINDOWS we also want to be 64 bits: */
#define BITBOARD unsigned _int64
#endif

#define KK 17
#define JJ 10
#define R1 5
#define R2 3

/* global variables Ranrot */
BITBOARD randbuffer[KK+3] = { /* history buffer filled with some random numbers */
  0x92930cb295f24dab,0x0d2f2c860b685215,0x4ef7b8f8e76ccae7,0x03519154af3ec239,
  0x195e36fe715fad23,
  0x86f2729c24a590ad,0x9ff2414a69e4b5ef,0x631205a6bf456141,0x6de386f196bc1b7b,
  0x5db2d651a7bdf825,
  0x0d2f2c86c1de75b7,0x5f72ed908858a9c9,0xfb2629812da87693,0xf3088fedb657f9dd,
  0x00d47d10ffdc8a9f,
  0xd9e323088121da71,0x801600328b823ecb,0x93c300e4885d05f5,0x096d1f3b4e20cd47,
  0x43d64ed75a9ad5d9
};
int r_p1, r_p2; /* indexes into history buffer */

/* ProcessNumber is used by RanrotAInit but not defined in this snippet;
   it is assumed to be a per-process index defined elsewhere by the caller. */
extern int ProcessNumber;

/******************************************************** AgF 1999-03-03 *
 *  Random Number generator 'RANROT' type B
 *  by Agner Fog
 *
 *  This is a lagged-Fibonacci type of random number generator with
 *  rotation of bits. The algorithm is:
 *  X[n] = ((X[n-j] rotl r1) + (X[n-k] rotl r2)) modulo 2^b
 *
 *  The last k values of X are stored in a circular buffer named
 *  randbuffer.
 *
 *  This version works with any integer size: 16, 32, 64 bits etc.
 *  The integers must be unsigned. The resolution depends on the integer
 *  size.
 *
 *  Note that the function RanrotAInit must be called before the first
 *  call to RanrotA or iRanrotA
 *
 *  The theory of the RANROT type of generators is described at
 *  www.agner.org/random/ranrot.htm
 *
 *  Optimized for 64 bits usage by Vincent Diepeveen
 *  diep@xs4all.nl
 *************************************************************************/

/* rotate x left by r bits (0 < r < 64) */
FORCEINLINE BITBOARD rotl(BITBOARD x,int r) {return((x<<r)|(x>>(64-r)));}

/* returns a random number of 64 bits unsigned */
FORCEINLINE BITBOARD RanrotA(void) {
  /* generate next random number */
  BITBOARD x = randbuffer[r_p1] = rotl(randbuffer[r_p2],R1) + rotl(randbuffer[r_p1], R2);
  /* rotate list pointers */
  if( --r_p1 < 0 )
    r_p1 = KK - 1;
  if( --r_p2 < 0 )
    r_p2 = KK - 1;
  return x;
}

/* this function initializes the random number generator. */
void RanrotAInit(void) {
  int i;

  /* one can fill the randbuffer here with possible other values here */
  randbuffer[0] = 0x92930cb295f24000 | (BITBOARD)ProcessNumber;
  randbuffer[1] = 0x0d2f2c860b000215 | ((BITBOARD)ProcessNumber<<12);

  /* initialize pointers to circular buffer */
  r_p1 = 0;
  r_p2 = JJ;

  /* randomize */
  for( i = 0; i < 3000; i++ )
    (void)RanrotA();
}

At 16:40 10-2-2005 -0500, Joe Landman wrote:
>Hi folks:
>
> I need to get a thread-safe pseudo-random number generator. All I
>have found online was SPRNG which is set up for MPI. Anyone have a
>quick pointer to their favorite thread safe PRNG that works well in OpenMP?
>
> Thanks.
>
>Joe
>
>--
>Scalable Informatics LLC,
>email: landman@scalableinformatics.com
>web : http://www.scalableinformatics.com

From wrankin at ee.duke.edu Fri Feb 11 09:51:42 2005
From: wrankin at ee.duke.edu (Bill Rankin)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: References: Message-ID: <1108144302.3042.27.camel@localhost.localdomain> > Is cfm the key unit here or should one think in terms of pressure > at various points in the room? The other factor in heat removal (both within the rack as well as within the air chiller) is the intake air temps. The larger the temperature difference the more efficient the heat transfer becomes. Essentially, 50cfm of 20C air cools a lot better than 50cfm of 30C air. Also (as we are currently experiencing) the air handlers are much more efficient at cooling really HOT air, versus warm air. -bill -- bill rankin, ph.d. ........ director, cluster and grid technology group wrankin@ee.duke.edu .......................... center for computational duke university ...................... science engineering and medicine http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu From maurice at harddata.com Fri Feb 11 11:15:05 2005 From: maurice at harddata.com (Maurice Hilarius) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Re: Re: Home beowulf - NIC latencies (Greg Lindahl) In-Reply-To: <200502111409.j1BE8vAY013737@bluewest.scyld.com> References: <200502111409.j1BE8vAY013737@bluewest.scyld.com> Message-ID: <420D0439.3000304@harddata.com> Greg Lindahl wrote: >Amen. So use the MM5 t3a benchmark, maybe even SPEC HPC, the canned >benchmarks for Amber, Charmm, DL_POLY, etc. The NAS Parallel >Benchmarks are also good, they are much closer to real apps than >microbenchmarks. > >-- greg > Double Amen. ( is that a long Amen??) ;-) Now if we could only get all those benchmarks to agree with each other a bit! It's classic. Pick your arch, chipset, amount of RAM, clockspeed, NIC, switch, and so on, and you can make a selective case for almost anything.. Although on SMP the Opterons are mainly kicking butt lately due to the fact that their SMP performance is so superior.. And that brings up another can 'o worms: SMP or uni ? 
One can make a great performance case for either/both depending on your goals. With our best regards, Maurice W. Hilarius Telephone: 01-780-456-9771 Hard Data Ltd. FAX: 01-780-456-9772 11060 - 166 Avenue email:maurice@harddata.com Edmonton, AB, Canada http://www.harddata.com/ T5X 1Y3 This email, message, and content, should be considered confidential, and is the copyrighted property of Hard Data Ltd., unless stated otherwise. From rbbrigh at valeria.mp.sandia.gov Fri Feb 11 12:14:11 2005 From: rbbrigh at valeria.mp.sandia.gov (Ron Brightwell) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420CFE4C.6050003@ccrl-nece.de> References: <200502102000.j1AK0Eb7016772@bluewest.scyld.com> <420BC57A.5060007@harddata.com> <20050211023619.GB5174@greglaptop.internal.keyresearch.com> <420CFE4C.6050003@ccrl-nece.de> Message-ID: <20050211201411.GA10732@ratbert.mp.sandia.gov> On Fri Feb 11, 2005 11:49:48... Joachim Worringen wrote > Greg Lindahl wrote: > >On Thu, Feb 10, 2005 at 01:35:06PM -0700, Maurice Hilarius wrote: > >>If I have a fantastic device that uses infinitely small time (latency) > >>and moves huge amounts of data (bandwidth) but in doing so it takes 80% > >>of a CPU, we do not have a useful solution.. > > > >If large cpu usage is a problem, it will show up nicely in real > >application benchmarks. > > True. I always wonder what the low-CPU-usage-advocates want the MPI > process to do while i.e. an MPI_Send() is executed. For small messages > (which are critical for many applications), it's somewhat like > requesting that a local memory-write has to show low CPU usage. For blocking operations with short messages, low CPU usage shouldn't be the main concern. Measuring latency relative to CPU usage doesn't make much sense. > > Of course, I can think of scenarios in which data transfers w/o CPU > usage do promise advantages, and I have implemented and evaluated such > techniques myself. 
But in the end (for the application), it always
> boiled down to latency and bandwidth as most applications don't honor
> "true" asynchronous communication.

Yep. We seem to have several micro-benchmarks that determine what the
overlap potential of the network is, but I've never seen anything that
determines what the overlap potential of an application is. It would be
interesting to see what the overlap potential of real applications is.

>
> The latest unsuccessful case of uncoupling computation and MPI
> communication I read about was BG/L when using the second CPU as a
> message processor. Maybe Myrinet MX will behave differently by making
> the MPI itself more concurrent on hardware level (is this a correct
> description, Patrick?) - but it will need matching applications, too.
>

BG/L is unique in many ways. For example, using the second processor for
communications doesn't actually help with progress -- the application
still has to make MPI library calls to make progress on outstanding
posted operations. So, even if the application was coded to take
advantage of overlap, it probably wouldn't gain much by using the second
processor.

MX should be able to provide overlap and progress, like Quadrics and a
few other technologies do.

-Ron

From bushnell at ultra.chem.ucsb.edu Fri Feb 11 15:57:27 2005
From: bushnell at ultra.chem.ucsb.edu (John Bushnell)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To:
Message-ID:

A few comments below...

On Fri, 11 Feb 2005, David Mathog wrote:

> Mike,
>
> I've been trying to pick the brains of other folks on the
> beowulf list who have computer rooms with modern equipment.
>
> One problem with the existing air, with regards to future
> expansion, is apparently the total amount of air that the
> current A/C can move. This is all horrendously complicated
> and needs to be looked at carefully by a HVAC consultant.
> Pretty sure we have enough tons and flow for now, meaning
> my rack and Deshaies and everything else I know is going in there
> in a couple of months. More and more convinced that we don't
> have enough to handle multiple full racks of the next generation
> of computers.

We learned about this after putting in a big new A/C (adding to an old
but still functioning one) in our server room. The problem was mitigated
by having the vents on the old AC replaced with flanges attached to
large flexible vents. They hang near the top/front of two racks, and
this has helped quite a bit. Air flow is important!

> Jim Lux from JPL answered my questions as attached after
> my signature. Thanks go out to Jim for the useful numbers.

> Darryl did say something interesting though, he said that for
> some units the A/C people can increase the capacity by changing
> the pulleys around. Apparently this blows more air, and the
> cold water isn't limiting, so it effectively upgrades the unit
> without changing very much. Darryl said that this was done
> at some point for Mayo's computer room in the subbasement
> of the BI.

Sounds like a pretty cheap upgrade. It would certainly be nice if we
could do that here, as we've been running on the edge in terms of
cooling for some time now. Our industrial chilled water loop runs at
around 16C, so obviously the chilled water is simply acting as a
reservoir for dumping heat from a compressor rather than being the
direct source of cooling. So the limiting factor is likely the
compressor/fluid/heat exchanger with the chilled water, rather than the
chilled water itself. I wonder what "changing pulleys around" is really
doing?
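
On the pulley question: on a belt-driven blower, swapping pulleys changes the ratio between motor speed and blower speed. The textbook fan affinity laws (a general rule of thumb, not something stated in this thread, and whether this is what was actually done in the rooms mentioned is not known) say that for a given fan in a given system, flow scales linearly with speed, static pressure with the square, and shaft power with the cube. A small C sketch, with names made up for this note:

```c
/* Sketch only: names are illustrative. Units are whatever the caller
   uses consistently (e.g. cfm, inches of water, horsepower). */
typedef struct {
    double flow_cfm;
    double static_pressure;
    double shaft_power;
} fan_point;

/* Blower shaft speed for a belt drive: motor speed times the ratio of
   motor pulley diameter to blower pulley diameter. */
double blower_rpm(double motor_rpm, double motor_pulley_dia,
                  double blower_pulley_dia)
{
    return motor_rpm * motor_pulley_dia / blower_pulley_dia;
}

/* Fan affinity laws: rescale a known operating point to a new speed,
   where speed_ratio = new_rpm / old_rpm. */
fan_point fan_rescale(fan_point base, double speed_ratio)
{
    fan_point out;
    out.flow_cfm        = base.flow_cfm * speed_ratio;
    out.static_pressure = base.static_pressure * speed_ratio * speed_ratio;
    out.shaft_power     = base.shaft_power * speed_ratio * speed_ratio
                          * speed_ratio;
    return out;
}
```

Under these rules a 20% speed increase gives 20% more air but roughly 73% more shaft power, so the motor rating tends to be what limits how far such an upgrade can go.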
Stay cool - John

From idooley2 at uiuc.edu Fri Feb 11 16:59:06 2005
From: idooley2 at uiuc.edu (Isaac Dooley)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To:
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com>
Message-ID: <420D54DA.8000904@uiuc.edu>

>>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the
>>completion of a message transaction.
>
>No, it allows the programmer to express that it wants to send a message
>but not wait for it to complete right now. The API doesn't specify the
>semantics of CPU utilization. It cannot, because the API doesn't have
>knowledge of the hardware that will be used in the implementation.

That is partially true. The context for my comment was under your
assumption that everyone uses MPI_Send(). These people, as I stated
before, do not care about what the CPU does during their blocking calls.
I was trying to point out that programs utilizing non-blocking IO may
have work that will be adversely impacted by CPU utilization for
messaging. These are the people who care about CPU utilization for
messaging. This, I hope, answers your prior question, at least
partially.

Perhaps your applications demand low latency with no concern for the CPU
during the time spent blocking. That is fine. But some applications
benefit from overlapping computation and communication, and the cycles
not wasted by the CPU on communication can be used productively.
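
A toy model of the trade-off being argued here, written as C. It is not from the thread and the names are made up. It assumes a transfer that takes t_comm seconds of wall time no matter what else the CPU does, and a messaging layer that consumes a fraction cpu_frac of one CPU while the transfer is in flight. A polling library is roughly cpu_frac = 1 and a fully offloaded NIC is roughly cpu_frac = 0.

```c
/* Sketch only: a deliberately simple model with illustrative names. */

/* Wall time when the code computes first and then blocks on the send. */
double blocking_time(double t_comp, double t_comm)
{
    return t_comp + t_comm;
}

/* Wall time when the send is posted first and computation overlaps it.
   While the transfer runs, only (1 - cpu_frac) of the CPU is left for
   the application, so the compute phase is stretched by cpu_frac * t_comm;
   the total can never be shorter than the transfer itself. */
double overlapped_time(double t_comp, double t_comm, double cpu_frac)
{
    double t = t_comp + cpu_frac * t_comm;
    return (t > t_comm) ? t : t_comm;
}
```

With 10 s of computation and 4 s of communication, the model gives 14 s blocking, 10 s with zero CPU cost for messaging, 13.2 s at 80% CPU cost, and 14 s again at 100%, which is one way to see why a non-blocking call gains nothing when the library polls.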
Isaac Dooley From rossen at VerariSoft.Com Fri Feb 11 22:52:03 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: <420DA793.4000909@verarisoft.com> I think that the mere definition of the term "MPI performance" and focusing too much on it can potentially have a negative impact on the overall discussion of parallel performance. Accepting the premise that all MPI can do is push individual messages between user processes as fast as possible, (as measured by ping pong) regardless of how this is achieved, unnecessarily and, I'd say, unjustifiably restricts the field of discussion. I agree that today MPI libraries are commonly measured by their ping-pong "performance" and not by their CPU utilization or other factors, but it does not necessarily make this form of performance evaluation right. I would support the idea of discussing isolated "MPI performance" but only in the context of a broader performance parameter space, at least including, communication overhead, communication bandwidth, processor overhead, and ability to perform asynchronous communication (i.e., compliance to the MPI Progress Rule). Only in such a broader evaluation space one can hope to fit the large number of combinations of processor/memory/peripheral_fabric architectures, network interconnects, system software/middleware, and application algorithms. Of course, there is always the case of running the actual application code and then evaluating the MPI performance by seeing which MPI library (or library mode) makes the application run faster. 
Unfortunately, this method for evaluating MPI often suffers from various
inefficiencies, some of which originate from the parallel algorithm
developers, who throughout the years have sometimes adopted the most
trivial ways of using MPI.

Here are a couple of arguments for why it is important to look at MPI
(and the whole communication system) from different angles.

If certain MPI optimizations are achieved at the cost of excessive use
of resources that otherwise could be used for computation or enabling
the overall "application_progress", the actual application performance
may be below its potential or even degrade. Here are some "application
progress" activities that can benefit from having these resources at
their disposal: OS/kernel processing, other communication, I/O
operations, memory operations (prefetching, etc.), peripheral bus/fabric
operations. All of these in one way or another depend on CPU processing.

Also, today's processor architectures have many independent processing
units and complex memory hierarchies. When the MPI library polls for
completion of a communication request, most of this specialized hardware
is virtually unused (wasted). The processor architecture trends indicate
that this kind of internal CPU concurrency will continue to increase,
thus making the cost of MPI polling even higher.

In this regard, a parallel application developer might actually very
much care what is actually happening in the MPI library even when he
makes a call to MPI_Send. If he doesn't, he probably should.

Some related topics (not covered here because of bloviating) are:

- How an MPI library that maximizes MPI's ping-pong performance alone
can cause unexpected behavior and a fully functional parallel system to
work far below its realistic efficiency.
- What application algorithm developers experience when they attempt to use the ever so nebulous "overlapping" with a polling MPI library and how this experience has contributed to the overwhelming use of MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or (even better) persistent MPI calls, thus killing any hope that these codes can run faster on systems that actually facilitate overlapping. Rossen Rob Ross wrote: > Hi Isaac, > > On Fri, 11 Feb 2005, Isaac Dooley wrote: > > >>>>Using MPI_ISend() allows programs to not waste CPU cycles waiting on the >>>>completion of a message transaction. >>> >>>No, it allows the programmer to express that it wants to send a message >>>but not wait for it to complete right now. The API doesn't specify the >>>semantics of CPU utilization. It cannot, because the API doesn't have >>>knowledge of the hardware that will be used in the implementation. >>> >> >>That is partially true. The context for my comment was under your >>assumption that everyone uses MPI_Send(). These people, as I stated >>before, do not care about what the CPU does during their blocking calls. > > > I think that it is completely true. I made no assumption about everyone > using MPI_Send(); I'm a late-comer to the conversation. > > I was not trying to say anything about what people making the calls care > about; I was trying to clarify what the standard does and does not say. > However, I agree with you that it is unlikely that someone calling > MPI_Send() is too worried about what the CPU utilization is during the > call. > > >>I was trying to point out that programs utilizing non-blocking IO may >>have work that will be adversely impacted by CPU utilization for >>messaging. These are the people who care about CPU utilization for >>messaging. This I hopes answers your prior question, at least partially. 
> > > I agree that people using MPI_Isend() and related non-blocking operations > are sometimes doing so because they would like to perform some > computation while the communication progresses. People also use these > calls to initiate a collection of point-to-point operations before > waiting, so that multiple communications may proceed in parallel. The > implementation has no way of really knowing which of these is the case. > > Greg just pointed out that for small messages most implementations will do > the exact same thing as in the MPI_Send() case anyway. For large messages > I suppose that something different could be done. In our implementation > (MPICH2), to my knowledge we do not differentiate. > > You should understand that the way MPI implementations are measured is by > their performance, not CPU utilization, so there is pressure to push the > former as much as possible at the expense of the latter. > > >>Perhaps your applications demand low latency with no concern for the CPU >>during the time spent blocking. That is fine. But some applications >>benefit from overlapping computation and communication, and the cycles >>not wasted by the CPU on communication can be used productively. > > > I wouldn't categorize the cycles spent on communication as "wasted"; it's > not like we code in extraneous math just to keep the CPU pegged :). 
>
> Regards,
>
> Rob
> ---
> Rob Ross, Mathematics and Computer Science Division, Argonne National Lab
>
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf

From sadat_vit at yahoo.co.in Fri Feb 11 23:38:32 2005
From: sadat_vit at yahoo.co.in (sadat khan)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] BEOWULF vs NORMAL CLUSTER
Message-ID: <20050212073832.28306.qmail@web8310.mail.in.yahoo.com>

I am a new addition to this mailing list. I recently got interested in the field of high performance computing. We recently had Mr. Anand Babu at our college (the creator of THUNDER, the 5th fastest supercomputer in the world), and he gave a really good talk on clustering.

First up, I would like to enquire whether there is any difference between a Beowulf and a normal cluster, or is it just another name for a cluster?

Another thing: what exactly do packages like MPI and PVM do?

I would be highly grateful for the help.

From topa_007 at yahoo.com Sat Feb 12 05:40:49 2005
From: topa_007 at yahoo.com (Toufeeq Hussain)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] Problem executing programs on lam-mpi
Message-ID: <20050212134049.70098.qmail@web30209.mail.mud.yahoo.com>

Hi,

I get the following message while running an MPI program on a 2-node cluster:

* mpirun: cannot start ./a.out on n0: No such file or directory

I'm running mpirun as such:

$ mpirun C ./a.out

I compiled LAM as such:

./configure --without-romio --with-rsh="ssh -x"

* recon/lamboot execute successfully.
topa@debian:~$ lamboot -v hosts

LAM 7.1.1/MPI 2 C++ - Indiana University

n-1<32615> ssi:boot:base:linear: booting n0 (devian)
n-1<32615> ssi:boot:base:linear: booting n1 (debian)
n-1<32615> ssi:boot:base:linear: finished

* lamnodes gives the following output:

topa@debian:~/mpi_progs$ lamnodes
n0 devian:1:
n1 debian:1:origin,this_node

The MPI program is a simple one.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello there");
    printf("Hello world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

Please help,
Toufeeq

=====
############################################
# ring me @ 98401-96690 #
# mail me @ toufeeq at computer dot org #
# Debian Sarge \w 2.6.10-ck5 #
############################################

From landman at scalableinformatics.com Sat Feb 12 08:52:16 2005
From: landman at scalableinformatics.com (Joe Landman)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] RE: A thread-safe PRNG for an OpenMP program
In-Reply-To:
References:
Message-ID: <420E3440.3080108@scalableinformatics.com>

Hi Henry:

This is for two platforms that are not targets for Intel compilers. I have solved the problem by reworking tt800 a bit, and have that working nicely in OpenMP. Thanks though.

Joe

Gabb, Henry wrote:
> Hi Joe,
> The Intel Math Kernel Library (specifically the Vector Statistical
> Library within MKL) contains threadsafe random number functions. The
> following web site has a full description:
> http://www.intel.com/software/products/mkl/features/vsl.htm. There's an
> article "Making the Monte Carlo Approach Even Easier and Faster" on
> Intel Developer Services that describes how to use VSL functions with
> OpenMP. It's available here:
> http://www.intel.com/cd/ids/developer/asmo-na/eng/95573.htm.
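One common way to get a thread-safe generator under OpenMP, sketched below, is to give every thread its own private state instead of sharing one. This is neither Joe's reworked tt800 nor Intel's VSL: it uses a small xorshift64* generator seeded through splitmix64 purely as an illustration, and the names rng_t, rng_seed, rng_next, rng_uniform and fill_uniform are hypothetical. Seeding by thread id gives distinct streams, but it does not guarantee the statistical independence that a library such as SPRNG is designed for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef _OPENMP
#include <omp.h>
#endif

typedef struct { uint64_t s; } rng_t;   /* one instance per thread */

/* splitmix64: spreads nearby seeds into well-mixed 64-bit values. */
static uint64_t splitmix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Derive a per-thread state from a base seed and a thread id. */
void rng_seed(rng_t *r, uint64_t base_seed, int thread_id)
{
    assert(r != NULL);
    r->s = splitmix64(base_seed + (uint64_t)thread_id);
    if (r->s == 0)
        r->s = 1;                       /* xorshift state must be nonzero */
}

/* xorshift64*: touches only the caller's state, so no locking is needed. */
uint64_t rng_next(rng_t *r)
{
    uint64_t x = r->s;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    r->s = x;
    return x * 0x2545F4914F6CDD1DULL;
}

/* Uniform double in [0, 1) built from the top 53 bits. */
double rng_uniform(rng_t *r)
{
    return (double)(rng_next(r) >> 11) * (1.0 / 9007199254740992.0);
}

/* Fill out[0..n) in parallel; each thread seeds its own private state. */
void fill_uniform(double *out, size_t n, uint64_t base_seed)
{
    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        rng_t r;
        rng_seed(&r, base_seed, tid);
        #pragma omp for
        for (long i = 0; i < (long)n; i++)
            out[i] = rng_uniform(&r);
    }
}
```

Built without OpenMP the pragmas are ignored and the code runs serially with thread id 0. Note that the values produced by fill_uniform depend on how many threads run, so results are reproducible only for a fixed thread count and schedule.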
> > Best regards, > > Henry Gabb > Intel Parallel Applications Center > > > >>Hi folks: >> >> I need to get a thread-safe pseudo-random number generator. All I > > >>have found online was SPRNG which is set up for MPI. Anyone have a >>quick pointer to their favorite thread safe PRNG that works well in > > OpenMP? > >> Thanks. >> >>Joe >> >>-- >>Scalable Informatics LLC, >>email: landman@scalableinformatics.com >>web : http://www.scalableinformatics.com > > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From dtj at uberh4x0r.org Sat Feb 12 08:53:47 2005 From: dtj at uberh4x0r.org (Dean Johnson) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: References: Message-ID: <1108227228.3853.8.camel@terra> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote: > > Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count. > Could you not use one of those cheapish wind speed devices that amateur weather folks use? That would give you a rating, presumably in miles per hour, and then figure backward based upon the area of the little fan thingy. 
That would likely be not too expensive and a great deal easier, and more accurate, to deal with than counting a pinwheel. ;-) -Dean From atp at piskorski.com Sat Feb 12 13:02:54 2005 From: atp at piskorski.com (Andrew Piskorski) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack? In-Reply-To: <1108227228.3853.8.camel@terra> References: <1108227228.3853.8.camel@terra> Message-ID: <20050212210254.GA66503@piskorski.com> On Sat, Feb 12, 2005 at 10:53:47AM -0600, Dean Johnson wrote: > On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote: > > > > Relative airflow can probably be measured with a kid's toy -- one of the > Could you not use one of those cheapish wind speed devices that amateur > Could you not use one of those cheapish wind speed devices that amateur > weather folks use? That would give you a rating, presumably in miles per When I asked Jack Wathey (architect of the Ammonite cluster) about the small hand-held anemometers intended for hikers and such, what he said was: On Wed, Nov 10, 2004 at 10:58:16AM -0800, Jack Wathey wrote: > What I found most useful was the Kestrel 2000, which measures wind speed > and temperature. The Kestrel 1000 is cheaper ($80 vs $100) and just > measures windspeed. The Kestrel was the only windmeter I could find that > was sensitive and accurate enough for measuring the flowrate at ammonite's > filters (typically in the 120 to 200 feet per minute range). They are > EXTREMELY DELICATE though! You can wreck the sapphire bearing just by > blowing on it hard (yes, I discovered this the hard way). But the bearing > and impeller are replaceable for about $15, so it's not a disaster. 
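Turning a wind-meter reading into CFM is just velocity times the area the air passes through. The sketch below shows that arithmetic in C; the function names are made up for illustration, and it assumes the velocity is uniform across the opening, which real filters and fan grilles only approximate.

```c
#include <assert.h>

#define FT_PER_MILE 5280.0

/* Convert a wind-meter reading in miles per hour to feet per minute. */
double mph_to_fpm(double mph)
{
    return mph * FT_PER_MILE / 60.0;    /* 1 mph = 88 ft/min */
}

/* Volumetric flow in cubic feet per minute through an opening of
 * area_sqft square feet, given the average face velocity in ft/min. */
double cfm_from_velocity(double velocity_fpm, double area_sqft)
{
    assert(velocity_fpm >= 0.0 && area_sqft >= 0.0);
    return velocity_fpm * area_sqft;
}
```

For a hypothetical 2 ft by 2 ft filter (4 square feet), face velocities of 120 and 200 feet per minute correspond to 480 and 800 CFM. In practice one would average several readings taken across the opening.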
>
> http://www.kestrelmeters.com

A while back, I purchased one here:

http://store.botachtactical.com/ke20pothwime.html

--
Andrew Piskorski
http://www.piskorski.com/

From emac at cybergps.net Sat Feb 12 11:30:17 2005
From: emac at cybergps.net (Eric Machala)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] Home Beowulf Intial Startup Question
References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com>
Message-ID: <003501c51139$496fa9f0$6e45a8c0@masstivy>

Hi all,

I'm fairly new to Beowulfs but very knowledgeable about computer and network technologies. I am building a cluster of 20 Dell OptiPlex nodes (1.9 GHz, 256 MB RAM each). First off, is it recommended that the master node be the same as or better than the compute nodes? And which Linux OS is recommended: Red Hat, Mandrake, etc.? Anyone's recommendations are welcome.

I'm also looking for links or resources for tools, i.e. software such as parallel kernel upgrades and monitoring tools; anything for setting up a Linux Beowulf to make this go smoothly.

Eric M
Network Admin/CF
Emac@cybergps.net

From steve_heaton at ozemail.com.au Sat Feb 12 15:41:22 2005
From: steve_heaton at ozemail.com.au (Fringe Dweller)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] cooling question - dedicated infrastructure
In-Reply-To: <200502122000.j1CK096k019160@bluewest.scyld.com>
References: <200502122000.j1CK096k019160@bluewest.scyld.com>
Message-ID: <420E9422.7080002@ozemail.com.au>

An enlightening discussion re aircon, people. Thanks.

A couple of "war stories" :)

I think RGB touched on problems with "defaults" on aircon behaviour in cooler climes. We have similar problems in the warmer part of this blue marble. The typical default behaviour is to put the aircon into standby overnight, even in the middle of summer. I mean, nobody's there and it's cooler overnight anyway, right? Well yes, but if you're pushing your IT hard overnight... you can see the consequences.
Make sure you "own" your aircon :) Another reason to ensure independence from anything related to the "building" is power. I had a customer in a very large building who's UPS would always trip every weekday morning at 6am and 6:30. Why? 6am => aircon up! 6:30 => lift motors up! The current draw for those two events is staggering. That's why you spend big bucks on the supporting infrastructure =) Stevo From hahn at physics.mcmaster.ca Sun Feb 13 12:41:51 2005 From: hahn at physics.mcmaster.ca (Mark Hahn) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Home Beowulf Intial Startup Question In-Reply-To: <003501c51139$496fa9f0$6e45a8c0@masstivy> Message-ID: > netowrk technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram kinda low on ram there, but for a learning cluster, that's plenty. (actually 20 is kinda big for such a cluster...) > blah blah nodes wondering first off if master control is recommened to be > same or better than nodes and what is recommened Linux O/s redhat or > mandrake etc... or anyones recommendations distros don't matter - none of them are significantly different, and they all work. people who care about distros are more interested in desktop decor than getting work done ;) admittedly, I am not a never-reinvent-the-wheel person. NRTW is worse than NIH, IMO. (some wheels desperately need reinvention, all progress comes from reinvention, etc). > Im also looking for some links or resources for tools aka software like > parallel kernel upgrades moniter tools anything for setting up Linux > beowulf to make this go smoothly to me, "smooth" means "no extra load per node". I strongly prefer net-booting, or at least net-root setups. people will tell you that using NFS for this is horribly inefficient, dangerous and causes warts. but it works extremely well, at least for clusters of <= 96 nodes, based on my experience so far. things might be different if you're doing retrocomputing based on a half-duplex 10mbps network or have large IO loads. 
the benefit is that your cluster acts like you have just one slave node. the cost is that you have to do a pretty minor amount of work to hack something like Fedora to boot diskless (small changes to the initrd.) and of course, it does mean that "incidental" file IO will cause network traffic. it's not clear to me that this is a problem, though, since: - nodes are normally configured to be fairly minimal - you don't have 30 user logins on each one, with people running ls/bash/netscape/gcc all the time. - NFS is not that bad at caching, and you can help this out by upping the per-mount cache parameters a bit.` - it's awefully nice to have a nearly fully functional node even after its disk dies. - my "diskless" nodes actually do have local swap and /tmp. disks are cheap and handy, just don't *depend* on them. - you can easily imagine a hybrid system that boots somehow (PXE or from disk), and does an rsync or rpm/yum/systemimager equivalent. I don't really see the point though. - having your root FS exported read-only is also kind of nice: good security is layered security... From mathog at mendel.bio.caltech.edu Sun Feb 13 13:50:34 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? Message-ID: There are a series of white papers by APC here: http://www.apc.com/tools/mytools/index.cfm?action=wp where they discuss various power and cooling factors. They note a disconnect between the higher densities achieved by blades and similar high density racks and the practicality of actually cooling these beasts. Basically it comes down to you save space on the rack and then give it all back on the cooling system. Think of it minimally in these terms - to move enough cfm at less than 30 feet per minute starts to require a duct larger than the rack itself! 
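To put rough numbers on the duct argument, here is a minimal sketch in C. It assumes air near sea level and room temperature (density about 1.2 kg/m^3, specific heat about 1005 J/(kg K)) and an inlet-to-outlet temperature rise chosen by the caller; those constants and the function names are illustrative assumptions, not figures taken from the APC papers.

```c
#include <assert.h>

/* Assumed properties of air near sea level and room temperature. */
#define AIR_DENSITY_KG_M3 1.2       /* kg/m^3 */
#define AIR_CP_J_KG_K     1005.0    /* J/(kg*K) */
#define CFM_PER_M3S       2118.88   /* 1 m^3/s in cubic feet per minute */
#define M_PER_FT          0.3048

/* Volumetric airflow (m^3/s) needed to carry away heat_w watts with a
 * temperature rise of delta_t_k kelvin: Q = P / (rho * cp * dT). */
double airflow_m3s(double heat_w, double delta_t_k)
{
    assert(delta_t_k > 0.0);
    return heat_w / (AIR_DENSITY_KG_M3 * AIR_CP_J_KG_K * delta_t_k);
}

/* Convert m^3/s to cubic feet per minute. */
double m3s_to_cfm(double flow_m3s)
{
    return flow_m3s * CFM_PER_M3S;
}

/* Duct cross-section (m^2) that carries flow_m3s at speed_ft_per_min. */
double duct_area_m2(double flow_m3s, double speed_ft_per_min)
{
    assert(speed_ft_per_min > 0.0);
    double speed_m_s = speed_ft_per_min * M_PER_FT / 60.0;
    return flow_m3s / speed_m_s;
}
```

Under these assumptions a 10 kW rack with a 10 K air temperature rise needs airflow_m3s(10000.0, 10.0), about 0.83 m^3/s or roughly 1760 CFM. Moving that at 30 feet per minute would take about 5.4 m^2 of duct, far more than the face of a rack, whereas at 1800 feet per minute it fits in under 0.1 m^2.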
In terms of TCO, at the moment, APC rejects the notion that these ultra high density machines are cost effective because they are so very difficult to cool. It seems to me that at a certain power point the racks are going to have to resort to water cooling. Long ago the ECL mainframes were cooled this way, but it's been a long time since most of us have seen water pipes running into the computers in a machine room. Cooling a 10 kW rack well looks to be extremely tough with air, and going much above that would seem to require something approaching a dedicated wind tunnel. Any opinions on how high the power dissipation in racks will go before the manufacturers throw in the air cooling towel and start shipping them with water connections? If you were designing a computer room today (which I am) what would you allow for the maximum power dissipation per rack _to_be_handled_ by_the_room_A/C. The assumption being that in 8 years if somebody buys a 40kW (heaven forbid) rack it will dump its heat through a separate water cooling system. Thanks, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Sun Feb 13 15:47:22 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: References: Message-ID: On Sun, 13 Feb 2005, David Mathog wrote: > There are a series of white papers by APC here: > > http://www.apc.com/tools/mytools/index.cfm?action=wp That link doesn't work for me (apc's website barfs on it) but I googled and worked through their gatekeeper to get access. After "logging in" (yuck) I'm going to try to download: WP-5 Cooling Imperatives for Data Centers and Network Rooms Effective next generation data centers and network rooms must address the known needs and problems relating to current and past designs. 
This paper presents a categorized and prioritized collection of cooling needs and problems as obtained through systematic user interviews. which I'm hoping is the one you are referring to above. > where they discuss various power and cooling factors. They note > a disconnect between the higher densities achieved by blades and > similar high density racks and the practicality of actually > cooling these beasts. Basically it comes down to you save space > on the rack and then give it all back on the cooling system. Think > of it minimally in these terms - to move enough cfm at less than 30 > feet per minute starts to require a duct larger than the rack itself! > > In terms of TCO, at the moment, APC rejects the notion that > these ultra high density machines are cost effective because they > are so very difficult to cool. >From what I learned of bladed systems back when I reviewed them for my own purposes, this isn't terribly surprising, but it is really valuable to have a well-researched document that explains how and why. 10 KW (think 100 100W light bulbs) in what, 2 m^3 -- that's a lot of energy to get rid of, and almost by definition you're removing it from components that are packed as tightly as possible. > It seems to me that at a certain power point the racks are going to > have to resort to water cooling. Long ago the ECL mainframes were > cooled this way, but it's been a long time since most of us have > seen water pipes running into the computers in a machine room. > > Cooling a 10 kW rack well looks to be extremely tough with air, > and going much above that would seem to require something approaching > a dedicated wind tunnel. Any opinions on how high the power > dissipation in racks will go before the manufacturers throw > in the air cooling towel and start shipping them with water > connections? I think you're within a factor of 2 or so of the SANE threshold at 10KW. A rack full of 220 W Opterons is there already (~40 1U enclosures). 
I'd "believe" that you could double that with a clever rack design, e.g. Rackable's, but somewhere in this ballpark...it stops being sane. > If you were designing a computer room today (which I am) what would > you allow for the maximum power dissipation per rack _to_be_handled_ > by_the_room_A/C. The assumption being that in 8 years if somebody > buys a 40kW (heaven forbid) rack it will dump its heat through > a separate water cooling system. This is a tough one. For a standard rack, ballpark of 10 KW is accessible today. For a Rackable rack, I think that they can not quite double this (but this is strictly from memory -- something like 4 CPUs per U, but they use a custom power distribution which cuts power and a specially designed airflow which avoids recycling used cooling air). I don't know what bladed racks achieve in power density -- the earlier blades I looked at had throttled back CPUs but I imagine that they've cranked them up at this point (and cranked up the heat along with them). Ya pays your money and ya takes your choice. An absolute limit of 25 (or even 30) KW/rack seems more than reasonable to me, but then, I'd "just say no" to rack/serverroom designs that pack more power than I think can sanely be dissipated in any given volume. Note that I consider water cooled systems to be insane a priori for all but a small fraction of server room or cluster operations, "space" generally being cheaper than the expense associated with achieving the highest possible spatial density of heat dissipating CPUs. I mean, why stop at water? Liquid Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when money matters, at some point it (usually) gets to be cheaper to just build another cluster/server room, right? rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 
27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From james.p.lux at jpl.nasa.gov Sun Feb 13 16:06:06 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? References: Message-ID: <001a01c51228$fb48ef20$1af69580@LAPTOP152422> ----- Original Message ----- From: "David Mathog" To: Sent: Sunday, February 13, 2005 1:50 PM Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? > There are a series of white papers by APC here: > > http://www.apc.com/tools/mytools/index.cfm?action=wp > > where they discuss various power and cooling factors. They note > a disconnect between the higher densities achieved by blades and > similar high density racks and the practicality of actually > cooling these beasts. Basically it comes down to you save space > on the rack and then give it all back on the cooling system. Think > of it minimally in these terms - to move enough cfm at less than 30 > feet per minute starts to require a duct larger than the rack itself! I think that's 30 ft/second.. 1800 lfpm would be a reasonable duct speed... 30 lfpm is really really slow (that's 1/2 ft/sec, which is a pretty darn gentle breeze) > > In terms of TCO, at the moment, APC rejects the notion that > these ultra high density machines are cost effective because they > are so very difficult to cool. > > It seems to me that at a certain power point the racks are going to > have to resort to water cooling. Long ago the ECL mainframes were > cooled this way, but it's been a long time since most of us have > seen water pipes running into the computers in a machine room. High power density devices (like power electronics or high power vacuum tubes) have always resorted to liquid cooling. It's so much more efficient than trying to cool with air. For a variety of reasons, but primarily because it separates the problem of physical device and radiator surface. 
Consider liquid vs air cooled internal combustion engines. Really high power density often uses some sort of phase change (ebullient) cooling, although the design challenges are significant. Even some laptops have used liquid or phase change cooling (heat pipes) to move the heat from the CPU to the case. An interesting exception to liquid cooling for high power devices is big generators, which are cooled with hydrogen gas (low viscosity and density, so low aerodynamic drag) But liquid cooling, per se, isn't a crippling thing to work with. And, it actually allows certain design economies: no more do you have to constrain the design for air flow, or conduction through the boards, nor do you have to fool with an array of CPU fans, video card fans, etc. > > Cooling a 10 kW rack well looks to be extremely tough with air, > and going much above that would seem to require something approaching > a dedicated wind tunnel. Any opinions on how high the power > dissipation in racks will go before the manufacturers throw > in the air cooling towel and start shipping them with water > connections? Consider that 10kW is 5-10 times the power dissipation of a hair dryer. Other solutions that might turn up are an internal cooling loop to move heat from inside to a big heatsink on the surface. Modern rack mounted PCs aren't particularly designed for efficient thermal transfer with minimal air flow. (there's no economic incentive for it) There are economies of scale to a common chiller, though, because when you get to large HVAC, cold water is what you get, rather than cold air, because moving cold air is a LOT more expensive than moving cold water. > > If you were designing a computer room today (which I am) what would > you allow for the maximum power dissipation per rack _to_be_handled_ > by_the_room_A/C. The assumption being that in 8 years if somebody > buys a 40kW (heaven forbid) rack it will dump its heat through > a separate water cooling system. 
There are such things as individual rack chillers, which you would bolt to a rack and then hook up to a centralized cold water source. > > Thanks, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Sun Feb 13 19:33:50 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? References: Message-ID: <000401c51246$00751150$26f49580@LAPTOP152422> > > > > I think you're within a factor of 2 or so of the SANE threshold at 10KW. > A rack full of 220 W Opterons is there already (~40 1U enclosures). I'd > "believe" that you could double that with a clever rack design, e.g. > Rackable's, but somewhere in this ballpark...it stops being sane. > > > If you were designing a computer room today (which I am) what would > > you allow for the maximum power dissipation per rack _to_be_handled_ > > by_the_room_A/C. The assumption being that in 8 years if somebody > > buys a 40kW (heaven forbid) rack it will dump its heat through > > a separate water cooling system. > > This is a tough one. For a standard rack, ballpark of 10 KW is > accessible today. For a Rackable rack, I think that they can not quite > double this (but this is strictly from memory -- something like 4 CPUs > per U, but they use a custom power distribution which cuts power and a > specially designed airflow which avoids recycling used cooling air). I > don't know what bladed racks achieve in power density -- the earlier > blades I looked at had throttled back CPUs but I imagine that they've > cranked them up at this point (and cranked up the heat along with them). > > Ya pays your money and ya takes your choice. 
An absolute limit of 25 > (or even 30) KW/rack seems more than reasonable to me, but then, I'd > "just say no" to rack/serverroom designs that pack more power than I > think can sanely be dissipated in any given volume. Note that I consider > water cooled systems to be insane a priori for all but a small fraction > of server room or cluster operations, "space" generally being cheaper > than the expense associated with achieving the highest possible spatial > density of heat dissipating CPUs. I mean, why stop at water? Liquid > Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when > money matters, at some point it (usually) gets to be cheaper to just > build another cluster/server room, right? The speed of light starts to set another limit for the physical size, if you want real speed. There's a reason why the old Crays are compact and liquid cooled. It's that several nanoseconds per foot propagation delay. Once you get past a certain threshold, you're actually better off going to very dense form factors and liquid cooling, in many areas. I think that most clusters haven't reached the performance point where it's worth liquid cooling the processors, but it's probably pretty close to the threshold. Adding machine room space is expensive for other reasons. You've already got to have the water chillers for any sort of major sized cluster (to cool the air), so the incremental cost to providing an appropriate interface to the racks and starting to build racks in liquid cooled configurations can't be far away. Liquid cooling is MUCH more efficient than air cooling: better heat transfer, better life (more even temperatures), less real estate required, etc. The hangup now is that nobody makes liquid cooled PCs as a commodity, mass production item. What you'll find is liquid cooling retrofits that don't take advantage of what liquid cooling can get you. 
If you look at high performance radar or sonar processors and such that use liquid cooling, the layout and physical configuration is MUCH different (partly driven by the fact that the viscosity of liquid is higher than that of air).

Wouldn't YOU like to have, say, 1000 processors in one rack, with a 2-3" flexible pipe to somewhere else? Especially if it was perfectly quiet? And could sit next to your desk? (1000 processors*100W each is 100kW).

From rene at renestorm.de Sat Feb 12 20:29:58 2005
From: rene at renestorm.de (rene)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] Block send mpi
Message-ID: <200502130529.58915.rene@renestorm.de>

Hi folks,

I know this isn't an MPI forum; even so, allow me a question about block sending.

I sometimes get nice SIGSEGVs with this code (C++ implementation). Did I code something totally wrong? I really don't understand this function.

// int MPI_Buffer_attach( void *buffer, int size )

int packsize;
MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
int bufsize = packsize + (MPI_BSEND_OVERHEAD);
void *buf = new (void (*[packsize]) ());
MPI_Buffer_attach (buf, bufsize);
ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm);
MPI_Buffer_detach (&buf, &bufsize);

Thanks,
--
Rene Storm
@Cluster

From maurice at harddata.com Sun Feb 13 11:30:43 2005
From: maurice at harddata.com (Maurice Hilarius)
Date: Sat Jul 4 01:03:51 2009
Subject: [Beowulf] cooling question: cfm per rack?
In-Reply-To: <200502122000.j1CK096j019160@bluewest.scyld.com>
References: <200502122000.j1CK096j019160@bluewest.scyld.com>
Message-ID: <420FAAE3.9070108@harddata.com>

Dean Johnson wrote:

> On Sat, 2005-02-12 at 09:49 -0500, Robert G. Brown wrote:
>
>>>
>>> Relative airflow can probably be measured with a kid's toy -- one of the
>>> little pinwheels -- and counting revolutions with a stopwatch.
>>> Normalizing that to absolute airflow in CFM is a bit tricky (since the
>>> result depends to some extent on the resistance imposed by the measuring
>>> apparatus) but somebody out there may have designed a version of this
>>> with a real fan and magnets set so that the counting is done
>>> electronically. In fact, I could build something to do this out of OTC
>>> parts if I had any way to normalize the count.
>>>
>
>
>Could you not use one of those cheapish wind speed devices that amateur
>weather folks use? That would give you a rating, presumably in miles per
>hour, and then figure backward based upon the area of the little fan
>thingy. That would likely be not too expensive and a great deal easier,
>and more accurate, to deal with than counting a pinwheel. ;-)
>
> -Dean

One can also go to an auto wrecker's, and from many newer models of cars get a Mass Air Flow (MAF) sensor from the throttle body. Modern cars use these, in conjunction with an O2 sensor on the exhaust, to manage fuel injection. The MAF returns a variable DC voltage, usually in the range of 0 to 5 V (depending on air speed).

Make a tube, mount the MAF with the probe end in the tube, and attach it to the back of the device being measured. Supply 12 V DC and connect to the output for measurement. Obviously this would have to be calibrated. It is cheap, very accurate, and very reliable.

If you want to make it more useful, a lot of modern cars also use a barometric pressure sensor, and the calculations can be done using both outputs. This helps a lot, as things like current weather conditions and altitude have a large bearing on air pressure. Measuring flow by speed only, and ignoring pressure, is a fairly inaccurate method.

Lastly, one can measure the humidity, as this also has a pretty large influence on the cooling capacity of the air being moved.

For around $25 one can cannibalize the parts and cabling from a modern car wreck.
All that is left is to provide a DC 12V source, a computer with a 4 channel A/D chip on a proto board, and some calibration. The calibration will be the toughest challenge as you will need accurate precalibrated instruments for a test session, but at least this is one time, and may be borrowed.. From landman at scalableinformatics.com Sun Feb 13 21:42:21 2005 From: landman at scalableinformatics.com (Joe Landman) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Block send mpi In-Reply-To: <200502130529.58915.rene@renestorm.de> References: <200502130529.58915.rene@renestorm.de> Message-ID: <42103A3D.8020605@scalableinformatics.com> Rene: More data. Where exactly does it SEGV? At the void *buf line? at the Pack? or the Bsend? Did you compile with -g? Do you have a core dump? Joe rene wrote: > Hi folks, > > i know, this isn't a mpi forum, even so allow me a question about block > sending. > > i got some(times) nice SIGSEGVs with that code (C++ implementation). > Did I code something totally wrong? > I really don't understand this function. > // int MPI_Buffer_attach( void *buffer, int size ) > > int packsize; > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); > int bufsize = packsize + (MPI_BSEND_OVERHEAD); > void *buf = new (void (*[packsize]) ()); > MPI_Buffer_attach (buf, bufsize); > ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); > MPI_Buffer_detach (&buf, &bufsize); > > Thanks, -- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com phone: +1 734 786 8423 fax : +1 734 786 8452 cell : +1 734 612 4615 From james.p.lux at jpl.nasa.gov Sun Feb 13 22:31:26 2005 From: james.p.lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] cooling question: cfm per rack?
References: <200502122000.j1CK096j019160@bluewest.scyld.com> <420FAAE3.9070108@harddata.com> Message-ID: <002301c5125e$ede1ef40$32a8a8c0@LAPTOP152422> ----- Original Message ----- From: Maurice Hilarius

Dean Johnson wrote: > Relative airflow can probably be measured with a kid's toy -- one of the > little pinwheels -- and counting revolutions with a stopwatch. > Normalizing that to absolute airflow in CFM is a bit tricky (since the > result depends to some extent on the resistance imposed by the measuring > apparatus) but somebody out there may have designed a version of this > with a real fan and magnets set so that the counting is done > electronically. In fact, I could build something to do this out of OTC > parts if I had any way to normalize the count. > Could you not use one of those cheapish wind speed devices that amateur weather folks use? That would give you a rating, presumably in miles per hour, and then figure backward based upon the area of the little fan thingy. That would likely be not too expensive and a great deal easier, and more accurate, to deal with than counting a pinwheel. ;-) -Dean

One can also go to an auto wreckers, and from many newer models of cars get a Mass Air Flow sensor (MAF) from the throttle body. Modern cars use these, in conjunction with an O2 sensor on the exhaust, to manage fuel injection. The MAF returns a variable DC voltage, usually in the range of 0 to 5V (depending on air speed). Make a tube, mount the MAF with the probe end in the tube, attach to back of device being measured. Supply 12V DC and connect to the output for measurement. Obviously this would have to be calibrated. It is cheap, and very accurate and very reliable. If you want to make it more useful, a lot of modern cars also use a barometric pressure sensor, and the calcs can be done using both outputs. This helps a lot as things like current weather conditions and altitude have a large bearing on air pressure.
Measuring flow by speed only, and ignoring pressure is a fairly inaccurate method. Lastly, one can measure the humidity, as this also has a pretty large influence on the cooling capacity of the air being moved. For around $25 one can cannibalize the parts and cabling from a modern car wreck. All that is left is to provide a DC 12V source, a computer with a 4 channel A/D chip on a proto board, and some calibration. The calibration will be the toughest challenge as you will need accurate precalibrated instruments for a test session, but at least this is one time, and may be borrowed.. ---- The problem with automotive mass air flow sensors is sensitivity at low flows. Consider, for a moment, a 1.8 liter engine turning over at 1800 rpm (call it 30 rev/sec..) That's 1.8*15 liters/sec of air (27 liters/sec), being drawn through a tube some 5-10 cm in diameter (call it 60 cm2).. that's 450 cm/sec or 4.5 m/sec... 885 linear ft/minute a fairly fast airflow in HVAC terms.... And that's the bottom of the range for the automotive sensor. From rgb at phy.duke.edu Mon Feb 14 03:18:27 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: <000401c51246$00751150$26f49580@LAPTOP152422> References: <000401c51246$00751150$26f49580@LAPTOP152422> Message-ID: On Sun, 13 Feb 2005, Jim Lux wrote: > > I think you're within a factor of 2 or so of the SANE threshold at 10KW. > > A rack full of 220 W Opterons is there already (~40 1U enclosures). I'd > > "believe" that you could double that with a clever rack design, e.g. > > Rackable's, but somewhere in this ballpark...it stops being sane. > > > > > If you were designing a computer room today (which I am) what would > > > you allow for the maximum power dissipation per rack _to_be_handled_ > > > by_the_room_A/C. 
The assumption being that in 8 years if somebody > > > buys a 40kW (heaven forbid) rack it will dump its heat through > > > a separate water cooling system. > > > > This is a tough one. For a standard rack, ballpark of 10 KW is > > accessible today. For a Rackable rack, I think that they can not quite > > double this (but this is strictly from memory -- something like 4 CPUs > > per U, but they use a custom power distribution which cuts power and a > > specially designed airflow which avoids recycling used cooling air). I > > don't know what bladed racks achieve in power density -- the earlier > > blades I looked at had throttled back CPUs but I imagine that they've > > cranked them up at this point (and cranked up the heat along with them). > > > > Ya pays your money and ya takes your choice. An absolute limit of 25 > > (or even 30) KW/rack seems more than reasonable to me, but then, I'd > > "just say no" to rack/serverroom designs that pack more power than I > > think can sanely be dissipated in any given volume. Note that I consider > > water cooled systems to be insane a priori for all but a small fraction > > of server room or cluster operations, "space" generally being cheaper > > than the expense associated with achieving the highest possible spatial > > density of heat dissipating CPUs. I mean, why stop at water? Liquid > > Nitrogen. Liquid Helium. If money is no option, why not? OTOH, when > > money matters, at some point it (usually) gets to be cheaper to just Keyword: ^^^^^^^ > > build another cluster/server room, right? Sure, I agree with everything below, for bleeding edge work. Or if you're building a cluster in your Manhattan office, where for whatever reason you have to work with a space the size of a broom closet (but where you miraculously have access to a stream of chilled water, or liquid nitrogen, or liquid helium). 
This just (IMO) pushes you over some sort of magic threshold that (while arbitrary and existing perhaps only in my fevered imagination) separates "COTS clusters" from a "big iron supercomputer". I have a hard time seeing liquid cooled clusters as being a beowulf in the sense I have grown to know and love. COTS clusters have always been about being ABLE to DIY, and while I can (if my life depends on it) do plumbing, it just seems like there would be some highly nonlinear cost and hassle thresholds in there. Also, I just cannot see COTS systems being built with copper pipes and coupling valves where you hook them into your household or office chilled water supply at your desk. I suspect that COTS desktops and even server mobos will continue to be engineered to be air cooled in the foreseeable future. Now your observation that racks themselves may start coming with a pair of copper pipes and couplings for a built-in blower and heat exchanger -- so the rack itself is in some sense "liquid cooled", while the actual nodes within are still COTS mobos cooled by air -- I don't know what the cost and volume trade-offs are of this solution. Cooling the air in the rack bases (more likely at the top of each rack and ducting the cold air down to the base) vs cooling the air in a big Liebert and piping the cool air around to the bases in a raised floor -- hmmm. One thing to remember (that I think was brought up one of the last times this issue was raised on list) -- I know from bitter experience that water couplings are a PITA to reliably get, and keep, tight under pressure. When they leak ("when" because of Murphy), they're going to make God's Own mess and potentially ruin many tens of thousands of dollars worth of hardware.
Heat exchangers at the tops of racks also increase the probability that humidity will be a problem -- I also know from bitter experience that overhead cold air ducting has a tendency to sweat unless carefully insulated, and the sweat in a humid climate like NC will inevitably drip into whatever is below. Heat exchangers at the bottom make it harder to move the warm air exhausted at the rack tops back to the bottom for recooling as you're working against an air pressure/density convective flow differential and not with it. Finally, there are likely to be Human Resources and state regulatory issues with liquid cooled electronics -- systems and network engineers somehow are viewed as being competent to manage end-stage electronics from the plug point on even by the unions in all but the most rabid of union shops (although I have heard of places where you have to call a union employee in to do any major plugging or unplugging of certain kinds of hardware). That simply won't be the case with liquid cooled hardware. I may be able to work on my household plumbing (and wiring), but if I set my hand to plumbing at Duke the HR Gods and the State would get Angry, and if anything went wrong (like a leak causing a short and a fire) I would be Held Liable. This adds another project-staffing human notch to the TCO -- likely a fairly significant one as the heat exchanger/blowers in EACH rack might well need servicing and inspection 1-2x a year (as the room unit does now). None of these things are insurmountable difficulties, and as you note there are certain big, expensive pieces of hot hardware (big lasers, giant magnets, automobile engines) that one DOES plug right into a chilled water loop. With the exception of car engines they tend to be components with 6-8 figure price tags, though, where tacking on a full or part time FTE for managing the plumbing etc is a small fraction of the total marginal cost of operation.
I'd expect this to make sense only for clusters in this same category -- really large, already expensive clusters shooting for bleeding edge performance (top 10 of top 500) at very high density someplace where a) physical space is very "expensive" (justifying the trade off economically); or b) speed of light and/or interconnect lengths are indeed an issue. Note that fixing the latter will likely rely as much on moving out of the COTS arena for the cluster interconnect as it does on cooling alone. High end cluster interconnects are again almost by definition engineered on the assumption of air-cooled node densities and internode latencies that are specified by worst-case assumptions and protocol, not speed of light in the sense that interconnect length is an important parameter in the overall latency. As in 1 usec is pretty good latency for a modern interconnect IIRC, and a light-usecond is 3x10^8 m/s x 10^-6 s = 300 meters. I'd guess that very little of the internode latency over fiber is due to speed of light delays per se and nearly all of it is in the interconnects themselves, the switches, and the node bus interface. > The speed of light starts to set another limit for the physical size, if you > want real speed. There's a reason why the old Crays are compact and liquid > cooled. It's that several nanoseconds per foot propagation delay. Once you There's also a reason why old Crays are currently used primarily as lobby art, wherever they haven't been disassembled and bathed in mercury to recover all that gold. Several reasons, actually, but liquid cooling and the hassle and expense it entailed are a big one. Many a Cray was finally decommissioned when one could build and operate a true COTS cluster with as much or more raw horsepower for what it cost for just the infrastructure support for the Cray it supplanted.
Like it or not, Moore's Law biases cost-benefit solutions heavily towards the COTS and disposable, and wet-cooling requires a significant and sustained investment in a particular technology that is likely to remain non-mainstream, human-resource intensive, and hence nonlinearly costly in a TCO CBA. One needs significant benefit in order to make it worthwhile. > get past a certain threshold, you're actually better off going to very dense > form factors and liquid cooling, in many areas. I think that most clusters > haven't reached the performance point where it's worth liquid cooling the > processors, but it's probably pretty close to the threshold. Adding machine > room space is expensive for other reasons. You've already got to have the > water chillers for any sort of major sized cluster (to cool the air), so the > incremental cost to providing an appropriate interface to the racks and > starting to build racks in liquid cooled configurations can't be far away. > > Liquid cooling is MUCH more efficient than air cooling: better heat > transfer, better life (more even temperatures), less real estate required, > etc. The hangup now is that nobody makes liquid cooled PCs as a commodity, > mass production item. What you'll find is liquid cooling retrofits that > don't take advantage of what liquid cooling can get you. If you look at high > performance radar or sonar processors and such that use liquid cooling, the > layout and physical configuration is MUCH different (partly driven by the > fact that the viscosity of liquid is higher than air). > > Wouldn't YOU like to have, say, 1000 processors in one rack, with a 2-3" > flexible pipe to somewhere else? Especially if it was perfectly quiet? And > could sit next to your desk? (1000 processors*100W each is 100kW). If somebody else paid for and fed the whole thing, you could multiply the capacity by an order of magnitude and use liquid nitrogen for cooling instead of water and I'd simply love it. 
And as Austin Powers might add, I'd like a gold-plated potty as well -- but I'm not going to get it...;-) Alas, in the real world it isn't about what I'd "like", it is about what I can afford, about what I can convince a grant agency to pay for. High infrastructure costs come out of node count, and node count matters -- in many projects, it is the PRIMARY thing that matters. High density increases infrastructure costs, often nonlinearly, and hence decreases node count at any fixed budget. In order for liquid cooling to ever make sense for COTS clusters, it would have to BECOME COTS -- basically, to become cheap in both hardware and human terms. Might happen, might happen, but I'm not holding my breath... rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From john.hearns at streamline-computing.com Mon Feb 14 03:32:33 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Home Beowulf Intial Startup Question In-Reply-To: References: Message-ID: <1108380753.5708.0.camel@localhost.localdomain> On Sun, 2005-02-13 at 15:41 -0500, Mark Hahn wrote: > > network technologies i am building a 20 node dell optiplex 1.9 ghz 256 ram Have a look at the new O'Reilly book 'High Performance Linux Cluster with Rocks, Oscar and Mosix'. Should be of help to you. I'm doing a review for the UKUUG newsletter.
From ashley at quadrics.com Mon Feb 14 08:23:02 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> Message-ID: <1108398183.8243.54.camel@localhost.localdomain> On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote: > I agree that people using MPI_Isend() and related non-blocking operations > are sometimes doing so because they would like to perform some > computation while the communication progresses. People also use these > calls to initiate a collection of point-to-point operations before > waiting, so that multiple communications may proceed in parallel. The > implementation has no way of really knowing which of these is the case. Either of these reasons for using non-blocking sends is valid and both will benefit from low CPU use in the Send call. Why would the implementation want to know the reason for using non-blocking sends? > You should understand that the way MPI implementations are measured is by > their performance, not CPU utilization, so there is pressure to push the > former as much as possible at the expense of the latter. It's relatively difficult to measure the CPU overhead of calls, some benchmarks work out the "issue rate" of sends (operations/second) and some measure how much compute (spinning) can be achieved before having a measurable effect on the latency. Both these are valid however the results are harder for the non-technical person to comprehend. Headline latency/bandwidth are just that, Headline figures that don't tell the whole story. 
Ashley, From rross at mcs.anl.gov Mon Feb 14 09:04:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: Hi Mikhail, I don't know all the implementations well enough to comment on them one-by-one. I'm sure that Rossen can talk about their implementation with regards to (a) below, and others will fill in other gaps. In general, to support (a) the implementation must either spawn a thread or have support from the NIC to make progress (this is related to the "Progress Rule" that people occasionally bring up). The standard *does not* specify that progress must be made when not in an MPI_ call. MPICH/MPICH2 do not use an extra thread (for portability one cannot assume that threads are available!). Thus the only overlap that occurs in MPICH2 over TCP is through the socket buffers. Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a sequence of MPI_Sends isn't hard, particularly if the messages are to different ranks. I would guess that every implementation will provide better performance in the case where the user tells the implementation about all these concurrent operations and then MPI_Waits on the bunch. Hope this helps some, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, Mikhail Kuzminsky wrote: > Let me ask some stupid's question: which MPI implementations allow > really > > a) to overlap MPI_Isend w/computations > and/or > b) to perform a set of subsequent MPI_Isend calls faster than "the > same" set of MPI_Send calls ? > > I say only about sending of large messages. 
> > I'm interesting (1st of all) in > - Gigabit Ethernet w/LAM MPI or MPICH > - Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI > > Yours > Mikhail Kuzminsky > Zelinsky Institute of Organic Chemistry > Moscow From rross at mcs.anl.gov Mon Feb 14 09:11:31 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108398183.8243.54.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> Message-ID: On Mon, 14 Feb 2005, Ashley Pittman wrote: > On Fri, 2005-02-11 at 20:47 -0600, Rob Ross wrote: > > I agree that people using MPI_Isend() and related non-blocking operations > > are sometimes doing so because they would like to perform some > > computation while the communication progresses. People also use these > > calls to initiate a collection of point-to-point operations before > > waiting, so that multiple communications may proceed in parallel. The > > implementation has no way of really knowing which of these is the case. > > Either of these reasons for using non-blocking sends is valid and both > will benefit from low CPU use in the Send call. Why would the > implementation want to know the reason for using non-blocking sends? If you used the non-blocking send to allow for overlapped communication, then you would like the implementation to play nicely. In this case the user will compute and eventually call MPI_Test or MPI_Wait (or a flavor thereof). If you used the non-blocking sends to post a bunch of communications that you are going to then wait to complete, you probably don't care about the CPU -- you just want the messaging done. In this case the user will call MPI_Wait after posting everything it wants done. One way the implementation *could* behave is to assume the user is trying to overlap comm. and comp. 
until it sees an MPI_Wait, at which point it could go into this theoretical "burn CPU to make things go faster" mode. That mode could, for example, tweak the interrupt coalescing on an ethernet NIC to process packets more quickly (I don't know off the top of my head if that would work or not; it's just an example). All of this is moot of course unless the implementation actually has more than one algorithm that it could employ... Rob From James.P.Lux at jpl.nasa.gov Mon Feb 14 09:17:00 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] some thoughts on thermal design, liquid cooling, etc. Message-ID: <6.1.1.1.2.20050214090712.0416fd68@mail.jpl.nasa.gov> It occurs to me that the real limiting factor in producing "cluster oriented thermal design" is the volume of sales. Say you want to design a custom motherboard/package for use in clusters. This is, at a guess, probably a 3-5 million dollar project (maybe down around a million if it's real close to an existing design). Say the cost of a node is around a kilobuck or 2 (in plain, non-custom, commodity trim). If you had a cluster with 1000 of those custom mobos, you're looking at adding $3K/node to the cluster. That's a bit punitive... You could buy a lot of machine room and cooling for that $3 mil. Now, on the other hand, if you had 100 people willing to each buy a cluster of this scale, then it's only adding $30-50/node, which is a lot more reasonable. Compare this to the consumer motherboard market (which, after all, is what we are really using here...) A production run of several million mobos isn't all that huge, so a Dell or HP can and do create customized motherboard designs to meet some peculiar requirement (on-board peripherals, etc.). Such customization only adds a buck to the mobo cost, and presumably, that buck is made up in cheaper packaging, shorter cables, one less manufacturing step, or somewhere. 
Somehow, I doubt that the total sales of ALL motherboards for clusters, of a given instance of motherboard design, exceeds a million units. Cluster buyers tend to want different processors, different peripherals, etc., and each configuration change would drive a whole new design cycle. There is hope on the horizon. The increasing drive to "media computers" is creating a demand for PCs that have high performance, but are quiet and have good cooling. I have a Motorola Moxi BMC9012 "set top box" at home from the cable company, and it is basically a Linux computer with an 80GB drive and a some custom video hardware. It's also hideously noisy (for something designed to sit in your living room) and dissipates >100W (all the time.. there's no on-off button). There WILL be consumer pressure to make it silent and to do better thermal management. James Lux, P.E. Spacecraft Radio Frequency Subsystems Group Flight Communications Systems Section Jet Propulsion Laboratory, Mail Stop 161-213 4800 Oak Grove Drive Pasadena CA 91109 tel: (818)354-2075 fax: (818)393-6875 From steve_heaton at ozemail.com.au Sun Feb 13 21:52:27 2005 From: steve_heaton at ozemail.com.au (steve_heaton@ozemail.com.au) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] A home cluster of mobos Message-ID: <20050214055227.YEMC24369.swebmail02.mail.ozemail.net@localhost> Dear collective of great minds I'd like to humbly introduce my little Beowulf "BORG" (Boring and Old but Real Grunt). http://members.ozemail.com.au/~sheaton/lss/ -> Computing The next performance consideration will be to start and work over TCP. Maybe a jump into GAMMA for a quick squizz? We'll see how it goes. 
Cheers Stevo This message was sent through MyMail http://www.mymail.com.au From rene at renestorm.de Mon Feb 14 03:04:45 2005 From: rene at renestorm.de (rene) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Block send mpi In-Reply-To: <42103A3D.8020605@scalableinformatics.com> References: <200502130529.58915.rene@renestorm.de> <42103A3D.8020605@scalableinformatics.com> Message-ID: <200502141204.45263.rene@renestorm.de>

Hi Joe,

here are the changes that solve the problem, plus some output from the old code. I don't know why I created a void buffer and sent an int array. After creating an int buffer I was also able to delete it ;o)

Thanks anyway
Rene

  int packsize;
  MPI_Pack_size (bit, MPI_INT, newcomm, &packsize);
  // MPI_Buffer_attach takes a size in bytes. Leave room for one buffered
  // message per destination rank, since the inner loop posts rankcount - 1 sends.
  int bufsize = (rankcount - 1) * (packsize + MPI_BSEND_OVERHEAD);
  // void *buf = new (void (*[packsize]) ());   // old: array of function pointers
  int *buf = new int[bufsize / sizeof (int) + 1];   // at least bufsize bytes
  for (int az = 0; az < repeat + 1; az++)
  {
    MPI_Buffer_attach (buf, bufsize);
    for (int node = 1; node < rankcount; node++)
    {
      bsend->ierr = MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm);
    }
    // Detach blocks until all buffered messages have been transmitted.
    MPI_Buffer_detach (&buf, &bufsize);
  }
  delete[] buf;   // array form, to match new[]

output for the old code:

  Program received signal SIGSEGV, Segmentation fault.
  0: 0x40ad3860 in malloc_consolidate () from /lib/libc.so.6
  0: (gdb) kill
  rank 1 in job 4 xtrem_32898 caused collective abort of all ranks
    exit status of rank 1: killed by signal 9
  rank 0 in job 4 xtrem_32898 caused collective abort of all ranks
    exit status of rank 0: killed by signal 9
  1: aborting job:
  1: Fatal error in MPI_Recv: Other MPI error, error stack:
  1: MPI_Recv(207): MPI_Recv(buf=0x8186388, count=32, MPI_INT, src=0, tag=0, comm=0x84000002, status=0xbfffee30) failed
  1: MPIDI_CH3_Progress_wait(207): an error occurred while handling an event returned by MPIDU_Sock_Wait()
  1: MPIDI_CH3I_Progress_handle_sock_event(492):
  1: connection_recv_fail(1728):
  1: MPIDU_Socki_handle_read(590): connection closed by peer (set=0,sock=1)

On Monday 14 February 2005 06:42, Joe Landman wrote: > Rene: > > More data. Where exactly does it SEGV?
At the void *buf line? at > the Pack? or the Bsend? Did you compile with -g? Do you have a core > dump? > > Joe > > rene wrote: > > Hi folks, > > > > i know, this isn't a mpi forum, even so allow me a question about block > > sending. > > > > i got some(times) nice SIGSEGVs with that code (C++ implementation). > > Did I code something totally wrong? > > I really don't understand this function. > > // int MPI_Buffer_attach( void *buffer, int size ) > > > > int packsize; > > MPI_Pack_size (bit, MPI_INT, newcomm, &packsize); > > int bufsize = packsize + (MPI_BSEND_OVERHEAD); > > void *buf = new (void (*[packsize]) ()); > > MPI_Buffer_attach (buf, bufsize); > > ierr =MPI_Bsend (&testdata[0], bit, MPI_INT, node, 0, newcomm); > > MPI_Buffer_detach (&buf, &bufsize); > > > > Thanks, -- Rene Storm @Cluster Linux Cluster Consultant Hamburgerstr. 42e D-22952 Luetjensee mailto:Rene@ReneStorm.de Voice-IP: Skype.com, Rene_Storm From kus at free.net Mon Feb 14 07:47:15 2005 From: kus at free.net (Mikhail Kuzminsky) Date: Sat Jul 4 01:03:51 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: Message-ID: In message from Rob Ross (Fri, 11 Feb 2005 20:47:22 -0600 (CST)): >Hi Isaac, >On Fri, 11 Feb 2005, Isaac Dooley wrote: >> >>Using MPI_ISend() allows programs to not waste CPU cycles waiting >>on the >> >>completion of a message transaction. >> >No, it allows the programmer to express that it wants to send a >>message >> >but not wait for it to complete right now. The API doesn't specify >>the >> >semantics of CPU utilization. It cannot, because the API doesn't >>have >> >knowledge of the hardware that will be used in the implementation. >> That is partially true. The context for my comment was under your >> assumption that everyone uses MPI_Send(). These people, as I stated >> before, do not care about what the CPU does during their blocking >>calls. >I think that it is completely true. 
I made no assumption about >everyone >using MPI_Send(); I'm a late-comer to the conversation. >I was not trying to say anything about what people making the calls >care >about; I was trying to clarify what the standard does and does not >say. >However, I agree with you that it is unlikely that someone calling >MPI_Send() is too worried about what the CPU utilization is during >the >call. >> I was trying to point out that programs utilizing non-blocking IO >>may >> have work that will be adversely impacted by CPU utilization for >> messaging. These are the people who care about CPU utilization for >> messaging. This I hopes answers your prior question, at least >>partially. >I agree that people using MPI_Isend() and related non-blocking >operations >are sometimes doing so because they would like to perform some >computation while the communication progresses. People also use >these >calls to initiate a collection of point-to-point operations before >waiting, so that multiple communications may proceed in parallel. Let me ask some stupid's question: which MPI implementations allow really a) to overlap MPI_Isend w/computations and/or b) to perform a set of subsequent MPI_Isend calls faster than "the same" set of MPI_Send calls ? I say only about sending of large messages. I'm interesting (1st of all) in - Gigabit Ethernet w/LAM MPI or MPICH - Infiniband (Mellanox equipment) w/NCSA MPI or OSU MPI Yours Mikhail Kuzminsky Zelinsky Institute of Organic Chemistry Moscow > The >implementation has no way of really knowing which of these is the >case. > >Greg just pointed out that for small messages most implementations >will do >the exact same thing as in the MPI_Send() case anyway. For large >messages >I suppose that something different could be done. In our >implementation >(MPICH2), to my knowledge we do not differentiate. 
> >You should understand that the way MPI implementations are measured >is by >their performance, not CPU utilization, so there is pressure to push >the >former as much as possible at the expense of the latter. > >> Perhaps your applications demand low latency with no concern for the >>CPU >> during the time spent blocking. That is fine. But some applications >> benefit from overlapping computation and communication, and the >>cycles >> not wasted by the CPU on communication can be used productively. > >I wouldn't categorize the cycles spent on communication as "wasted"; >it's >not like we code in extraneous math just to keep the CPU pegged :). > >Regards, > >Rob >--- >Rob Ross, Mathematics and Computer Science Division, Argonne National >Lab > >_______________________________________________ >Beowulf mailing list, Beowulf@beowulf.org >To change your subscription (digest mode or unsubscribe) visit >http://www.beowulf.org/mailman/listinfo/beowulf From mathog at mendel.bio.caltech.edu Mon Feb 14 09:24:29 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? Message-ID: Robert G. Brown wrote: > In order to for liquid cooling to ever > make sense for COTS clusters, it would have to BECOME COTS -- basically, > to become cheap in both hardware and human terms. Shuttle's itty bitty computers have a heat pipe that goes out to a radiator on the back of the case. It isn't much of a step from there to replacing the back radiator with a copper block. That block could in turn mate with another copper block which itself was on a cold water line. Ie, move the radiator even further from the CPU and other heat generating parts of the computer. So a company like shuttle could relatively easily start selling liquid cooled nodes using only minor modifications to its existing hardware. 
In this sort of a system you might have to pay to have the pros install (plumb) the rack itself, but you could still work on the nodes of the rack, as is true now. It would seem to be relatively straightforward to have the nodes mate up copper block to copper block when fully inserted, so that each node is not itself part of the rack circulation system. The tricky part is that something else would have to be attached to the copper block on the back when the node was serviced on the bench. On the plus side your racks could replace the building's current hot water supply! Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From ashley at quadrics.com Mon Feb 14 09:42:42 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> Message-ID: <1108402962.8265.25.camel@localhost.localdomain> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: > If you used the non-blocking send to allow for overlapped communication, > then you would like the implementation to play nicely. In this case the > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor > thereof). > > If you used the non-blocking sends to post a bunch of communications that > you are going to then wait to complete, you probably don't care about the > CPU -- you just want the messaging done. In this case the user will call > MPI_Wait after posting everything it wants done. > > One way the implementation *could* behave is to assume the user is trying > to overlap comm. and comp. until it sees an MPI_Wait, at which point it > could go into this theoretical "burn CPU to make things go faster" mode. 
> That mode could, for example, tweak the interrupt coalescing on an
> ethernet NIC to process packets more quickly (I don't know off the top of
> my head if that would work or not; it's just an example).

Maybe if you were using a channel interface (sockets) and all messages
were to the same remote process then it might make sense to coalesce all
the sends into a single transaction and just send this in the MPI_Wait
call. The latency for a bigger network transaction *might* be lower
than the sum of the issue rates for smaller ones.

I'd hope that a well-written application would bunch all its sends into
a single larger block whenever this optimisation was possible, though.

Given any reasonably fast network, however, not doing anything until the
MPI_Wait call would destroy your latency. It strikes me that this isn't
overlapping comms and compute but rather artificially delaying comms
to allow compute to finish, which seems rather pointless?

If you had a bunch of sends to do to N remote processes then I'd expect
you to post them in order (non-blocking) and wait for them all at the
end; the time taken to do this should be (base_latency + ( (N-1) * M ))
where M is the reciprocal of the "issue rate". You can clearly see
here that even for a small number of batched sends (even a 2d/3d nearest
neighbour matrix) the issue rate (that is, how little CPU the send call
consumes) is at least as important as the raw latency.

> All of this is moot of course unless the implementation actually has more
> than one algorithm that it could employ...

In my experience there are often dozens of different algorithms for
every situation and each has its trade-offs. Choosing the right one
based on the parameters given is the tricky bit.

Ashley,

From rgb at phy.duke.edu Mon Feb 14 10:12:09 2005
From: rgb at phy.duke.edu (Robert G. Brown)
Date: Sat Jul 4 01:03:52 2009
Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling?
In-Reply-To: References: Message-ID: On Mon, 14 Feb 2005, David Mathog wrote: > Robert G. Brown wrote: > > > In order to for liquid cooling to ever > > make sense for COTS clusters, it would have to BECOME COTS -- basically, > > to become cheap in both hardware and human terms. > > Shuttle's itty bitty computers have a heat pipe that goes out to > a radiator on the back of the case. It isn't much of a step from > there to replacing the back radiator with a copper block. That block > could in turn mate with another copper block which itself was > on a cold water line. Ie, move the radiator even further from > the CPU and other heat generating parts of the computer. So > a company like shuttle could relatively easily start selling > liquid cooled nodes using only minor modifications to its existing > hardware. > > In this sort of a system you might have to pay to have the pros > install (plumb) the rack itself, but you could still work on the > nodes of the rack, as is true now. It would seem to be relatively > straightforward to have the nodes mate up copper block to copper > block when fully inserted, so that each node is not itself part > of the rack circulation system. The tricky part is > that something else would have to be attached to the copper block > on the back when the node was serviced on the bench. I think there are lots of tricky parts, but I agree that it can be done. In face, Eugen found this from Rittal: http://www.enclosureinfo.com/tech/rittal/lit/pdf/LV_lcs_01_01.pdf where it IS being done, in the sense that one can get liquid cooling adjuncts for racks that accept standard ported lq heat sinks for CPUs and maybe a couple of other parts (disks, power supplies?). 
Their "mini-chiller" per rack is only around 1.3 tons (4500 "cooling watts") which seems small, and running all the supply hoses around in and out of the systems (especially MP motherboard or blade systems) inside enclosures not really designed for them seems like it would be "interesting". I just don't think of this is being mainstream. I didn't get a price from anybody on this, but I'll bet it is an option on your newborn child per rack. The external heat exchanger idea is also "interesting". I agree that better thermal management in motherboards themselves would be desirable, but it takes a biggish chunk of copper to make a heat pipe capable of moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end temperature differentials in the 20-30 K range. Maybe what, 0.5 cm in radius? > > On the plus side your racks could replace the building's current > hot water supply! Not unless you permit the max T on the sink in contact with the water to get dangerously high... (taking this as a serious, rather than a wry, remark). Ditto for numerous discussions of using server room waste heat to help heat buildings -- good idea on paper, pretty difficult in practice, and then there is summer. rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From rgb at phy.duke.edu Mon Feb 14 10:41:01 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Reasonable upper limit in kW per rack for air cooling? In-Reply-To: References: Message-ID: On Mon, 14 Feb 2005, Robert G. 
Brown wrote: > moving 100 W 20-30 cm at \kappa_Cu = 385 W/(m-K) and keep the end > temperature differentials in the 20-30 K range. Maybe what, 0.5 cm in > radius? I meant diameter. Sorry. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From lindahl at pathscale.com Mon Feb 14 10:58:43 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: <20050214185843.GA1359@greglaptop.internal.keyresearch.com> On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote: > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the recpipiocal of the "issue rate". You can clearly see > here that even for small number of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is how little CPU the send call > consumes) is at least as important that the raw latency. Unless I completely misunderstand your formula, M is not only the CPU the send call consumes. It's easy to find situations (fast cpu, slow network) where the cpu consumed isn't a part of M at all. Even for a modern 1 GByte/sec network, cpu consumed might not be a part of M. Reducing CPU consumed can't hurt. But reasoning about it seems to be less useful than testing actual applications. 
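
The estimate quoted above, base_latency + (N-1) * M, is easy to play with
numerically. What follows is only a minimal sketch in Python: the function
names and the microsecond figures in the usage note are illustrative
assumptions, not measurements from any network discussed in this thread.
M is treated simply as the interval between successive sends being issued,
whatever dominates it (CPU overhead, the NIC, or the wire).

```python
def batched_send_time(n_sends, base_latency_us, issue_interval_us):
    """Estimated time for N non-blocking sends posted back to back and then
    waited on together: base_latency + (N - 1) * M, as in the quoted
    formula. M (issue_interval_us) is the reciprocal of the issue rate."""
    if n_sends < 1:
        raise ValueError("need at least one send")
    return base_latency_us + (n_sends - 1) * issue_interval_us


def serial_send_time(n_sends, base_latency_us):
    """Assumed comparison case (not from the thread): every send pays the
    full base latency before the next one is started."""
    if n_sends < 1:
        raise ValueError("need at least one send")
    return n_sends * base_latency_us
```

With hypothetical values of 3 us base latency and M = 1 us, six sends (a 3d
nearest-neighbour exchange) give 3 + 5 * 1 = 8 us batched against 18 us
serial, and the M term already outweighs the base latency. Whether M is
really CPU-bound on a given system is exactly the point under discussion.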
-- greg From lindahl at pathscale.com Mon Feb 14 11:07:37 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > Let me ask some stupid's question: which MPI implementations allow > really > > a) to overlap MPI_Isend w/computations > and/or > b) to perform a set of subsequent MPI_Isend calls faster than "the > same" set of MPI_Send calls ? > > I say only about sending of large messages. For large messages, everyone does (b) at least partly right. (a) is pretty rare. It's difficult to get (a) right without hurting short message performance. One of the commercial MPIs, at first release, had very slow short message performance because they thought getting (a) right was more important. They've improved their short message performance since, but I still haven't seen any real application benchmarks that show benefit from their approach. -- greg From joachim at ccrl-nece.de Mon Feb 14 11:18:51 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: Message-ID: <4210F99B.3040202@ccrl-nece.de> Rob Ross wrote: > Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a > sequence of MPI_Sends isn't hard, particularly if the messages are to > different ranks. I would guess that every implementation will provide > better performance in the case where the user tells the implementation > about all these concurrent operations and then MPI_Waits on the bunch. In this case, the user should think about MPI_Alltoall(v) - there are MPI implementations which do this in a smarter way than Isend/Irecv/Waitall to achieve much better performance than using the naive approach. 
Especially if you go to large process numbers, some coordination can help a lot, even for a full bisection network like a single-stage full crossbar... Generally, collectives are there to let the library know what kind of communication is coming next. All speculations in the library based on monitoring and predicting non-collective communication will probably only do good in the matching micro-benchmark (my personal experience). Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From rross at mcs.anl.gov Mon Feb 14 11:49:49 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: On Mon, 14 Feb 2005, Ashley Pittman wrote: > On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: > > If you used the non-blocking send to allow for overlapped communication, > > then you would like the implementation to play nicely. In this case the > > user will compute and eventually call MPI_Test or MPI_Wait (or a flavor > > thereof). > > > > If you used the non-blocking sends to post a bunch of communications that > > you are going to then wait to complete, you probably don't care about the > > CPU -- you just want the messaging done. In this case the user will call > > MPI_Wait after posting everything it wants done. > > > > One way the implementation *could* behave is to assume the user is trying > > to overlap comm. and comp. until it sees an MPI_Wait, at which point it > > could go into this theoretical "burn CPU to make things go faster" mode. 
> > That mode could, for example, tweak the interrupt coalescing on an > > ethernet NIC to process packets more quickly (I don't know off the top of > > my head if that would work or not; it's just an example). > > Maybe if you were using a channel interface (sockets) and all messages > were to the same remote process then it might make sense to coalesce all > the sends into a single transaction and just send this in the MPI_Wait > call. The latency for a bigger network transaction *might* be lower > than the sum of the issue rates for smaller ones. This is exactly what MPICH2 does for the one-sided calls; see Thakur et. al in EuroPVM/MPI 2004. It can be a very big win in some situations. > I'd hope that a well written application would bunch all it's sends into > a single larger block when possible though if this optimisation was > possible though. We would hope that too, but applications do not always adhere to best practice. > Given any reasonably fast network not doing anything until the MPI_Wait > call however would destroy your latency. It strikes me as this isn't > overlapping comms and compute though rather artificially delaying comms > to allow compute to finish, seems rather pointless? I agree that postponing progress until MPI_Wait for the purposes of providing lower CPU utilization would be pointless. It can be useful for coalescing purposes, as mentioned above. But certainly there will be a latency cost. > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the recpipiocal of the "issue rate". You can clearly see > here that even for small number of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is how little CPU the send call > consumes) is at least as important that the raw latency. 
Well I wasn't trying to start an argument about the importance of CPU utilization as it relates to issue rate :). The original question simply asked if there was generally an advantage to doing what you expect people to do anyway! And I think that we agree the answer is yes. > > All of this is moot of course unless the implementation actually has more > > than one algorithm that it could employ... > > In my experience there are often dozens of different algorithms for > every situation and each has their trade offs. Choosing the right one > based on the parameters given is the tricky bit. Absolutely! And which few of those dozens are applicable to a wide-enough range of situations that you want to actually implement/debug them? Rob From rross at mcs.anl.gov Mon Feb 14 13:09:30 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4210F99B.3040202@ccrl-nece.de> References: <4210F99B.3040202@ccrl-nece.de> Message-ID: On Mon, 14 Feb 2005, Joachim Worringen wrote: > Rob Ross wrote: > > Making a sequence of MPI_Isends followed by a MPI_Wait go faster than a > > sequence of MPI_Sends isn't hard, particularly if the messages are to > > different ranks. I would guess that every implementation will provide > > better performance in the case where the user tells the implementation > > about all these concurrent operations and then MPI_Waits on the bunch. > > In this case, the user should think about MPI_Alltoall(v) - there are > MPI implementations which do this in a smarter way than > Isend/Irecv/Waitall to achieve much better performance than using the > naive approach. Especially if you go to large process numbers, some > coordination can help a lot, even for a full bisection network like a > single-stage full crossbar... Yes! We don't see nearly enough of this I think. > Generally, collectives are there to let the library know what kind of > communication is coming next. 
All speculations in the library based on > monitoring and predicting non-collective communication will probably > only do good in the matching micro-benchmark (my personal experience). I agree. Which is why we don't tend to try to figure out what the user is trying to do, and instead just implement an algorithm to get things done as quickly as we can. Rob From ashley at quadrics.com Mon Feb 14 13:22:19 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: On 14 Feb 2005, at 19:49, Rob Ross wrote: > On Mon, 14 Feb 2005, Ashley Pittman wrote: >> On Mon, 2005-02-14 at 11:11 -0600, Rob Ross wrote: >> >> Maybe if you were using a channel interface (sockets) and all messages >> were to the same remote process then it might make sense to coalesce >> all >> the sends into a single transaction and just send this in the MPI_Wait >> call. The latency for a bigger network transaction *might* be lower >> than the sum of the issue rates for smaller ones. > > This is exactly what MPICH2 does for the one-sided calls; see Thakur > et. > al in EuroPVM/MPI 2004. It can be a very big win in some situations. I'll look it up. Presumably the win is because of higher bandwidth achieved by larger messages over a stream. I guess the MPI_Fence call copies data out of a receive buffer. >> I'd hope that a well written application would bunch all it's sends >> into >> a single larger block when possible though if this optimisation was >> possible though. > > We would hope that too, but applications do not always adhere to best > practice. 
As someone who maintains an MPI library I hope people do this; it's up to
us to provide the functionality and up to application writers to actually
make use of it. There are often times when it may well not be worth doing
this, either because of time-to-market demands or simply when
experimenting with differing algorithms.

>> Given any reasonably fast network not doing anything until the MPI_Wait
>> call however would destroy your latency. It strikes me as this isn't
>> overlapping comms and compute though rather artificially delaying comms
>> to allow compute to finish, seems rather pointless?
>
> I agree that postponing progress until MPI_Wait for the purposes of
> providing lower CPU utilization would be pointless. It can be useful for
> coalescing purposes, as mentioned above. But certainly there will be a
> latency cost.

So potentially there is an optimization choice to be made: do you make
the "noddy" application run faster at the cost of real performance for
applications tuned to the particular library? That sounds like a whole
can of worms.

>>> All of this is moot of course unless the implementation actually has
>>> more than one algorithm that it could employ...
>>
>> In my experience there are often dozens of different algorithms for
>> every situation and each has their trade offs. Choosing the right one
>> based on the parameters given is the tricky bit.
>
> Absolutely! And which few of those dozens are applicable to a wide-enough
> range of situations that you want to actually implement/debug them?

Implement? Most of them. Debug/support? No more than two or three seems
optimal. There are some algorithms that just don't work on a given
network and some that will only be best in corner cases. Then it's just
a case of choosing the correct thresholds between the remaining few.
For a given call *best* is absolute; for a given application, however,
tradeoffs have to be made.
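
The "choosing the correct thresholds" step can be pictured as a lookup on
message size. This is a hedged sketch only: the two thresholds and the
three path names below are made up for illustration and do not describe
any real MPI library; in practice the cut-over points would be found by
measurement on the network in question.

```python
import bisect

# Hypothetical cut-over points in bytes; real values would be tuned by
# benchmarking, not taken from this sketch.
THRESHOLDS = [1024, 65536]

# Hypothetical names for the "remaining few" algorithms, ordered by the
# message-size range each one is assumed to serve.
ALGORITHMS = ["small_message_path", "medium_message_path", "large_message_path"]


def choose_algorithm(message_bytes):
    """Return the algorithm whose size range contains message_bytes.

    bisect_right counts how many thresholds are <= message_bytes, which is
    the index of the matching range."""
    if message_bytes < 0:
        raise ValueError("message size cannot be negative")
    return ALGORITHMS[bisect.bisect_right(THRESHOLDS, message_bytes)]
```

Keeping the thresholds in one sorted table means that re-tuning for a
different network only changes data, not the selection logic.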
Ashley, From ashley at quadrics.com Mon Feb 14 13:29:09 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214185843.GA1359@greglaptop.internal.keyresearch.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <20050214185843.GA1359@greglaptop.internal.keyresearch.com> Message-ID: <2799056564ab963d97483d0d1d926351@quadrics.com> On 14 Feb 2005, at 18:58, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 05:42:42PM +0000, Ashley Pittman wrote: >> If you had a bunch of sends to do to N remote processes then I'd >> expect >> you to post them in order (non-blocking) and wait for them all at the >> end, the time taken to do this should be (base_latency + ( (N-1) * M >> )) >> where M is the recpipiocal of the "issue rate". You can clearly see >> here that even for small number of batched sends (even a 2d/3d nearest >> neighbour matrix) the issue rate (that is how little CPU the send call >> consumes) is at least as important that the raw latency. > > Unless I completely misunderstand your formula, M is not only the CPU > the send call consumes. It's easy to find situations (fast cpu, slow > network) where the cpu consumed isn't a part of M at all. Even for a > modern 1 GByte/sec network, cpu consumed might not be a part of M. I'm talking about our (Quadrics) network here which has a CPU offload, a Wait call is simply a few function calls, a memory read (to test completion), a mutex lock/unlock cycle and a linked list insertion, nothing more. Some CPU is used in the send call as I said but outside the two calls there is zero CPU usage although potentially reduced memory to CPU bandwidth. > Reducing CPU consumed can't hurt. But reasoning about it seems to be > less useful than testing actual applications. 
I do that as well. Ashley, From eugen at leitl.org Mon Feb 14 13:31:18 2005 From: eugen at leitl.org (Eugen Leitl) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] new company, looking for people (fwd from treese@acm.org) Message-ID: <20050214213117.GQ1404@leitl.org> (I presume a single job announcement in all these years is tolerable). ----- Forwarded message from Win Treese ----- [snip] Last fall I joined a startup called SiCortex, where we're building a new Linux cluster computer. We're ramping up the software team, so if you know anyone who is really good and looking for something new, let me know. Here's the short blurb and some job descriptions; feel free to get in touch with me for more details. You can pass this along (minus headers, of course). - Win SiCortex is a new computer company developing a line of Linux cluster computers for demanding scientific and technical applications. The company is based in Maynard, Massachusetts. Senior Software Developers Software developers to work on designing, porting, and qualifying a Linux-based software stack for technical computing clusters. 
Responsibilities include:

* Analyzing one or more sub-projects
* Recommending overall approach (buy, port, build) for sub-project(s)
* Preparing design and/or implementation plans and schedules for
  sub-project(s)
* Executing implementation plan for sub-project(s)
* Executing test and verification strategy for sub-project(s)

Areas of expertise being sought include:

* Linux porting, drivers, network stack
* Parallel file systems
* Cluster middleware (job scheduling, single system image)
* Compilers and tools
* Math and communications libraries
* Firmware, diagnostics, and system bring-up
* Technical application analysis and tuning

Desired skills and experience:

* 5+ years industry experience in software development
* Deep knowledge of Linux (preferred) or general Unix
* Exposure to technical computing
* Expertise in multiple areas of software development
* Track record of successful results in small teams

Software Director

Team leader for group of 8-10 software developers designing, porting, and
qualifying a Linux-based software stack for technical computing clusters.
Responsibilities include:

* Analyzing required work and resources
* Recruiting and managing team members
* Qualifying, recommending, and managing potential third-party vendors
  (companies and/or consultants)
* Preparing and monitoring team schedules
* Reviewing and managing team results
* Interfacing with potential and actual customers
* Technical design and implementation in selected areas

Software design task encompasses:

* Linux operating system, including kernel, drivers, and networking
* Parallel file systems
* Cluster middleware
* Compilers and tools
* Libraries
* Applications analysis
* Firmware, diagnostics, and other hardware-related software

Desired skills and experience:

* 10+ years software development, with strong background in Linux
  (preferred) or Unix
* Broad exposure rather than specialist experience in software development
* Prior experience in software team leadership or management
* Track record of successful results with constrained resources

Ideal candidate will have prior exposure to startup environments,
pragmatic approach to make vs buy decisions, good understanding of Open
Source environments, and excellent people and leadership skills.

----- End forwarded message -----
--
Eugen* Leitl leitl
______________________________________________________________
ICBM: 48.07078, 11.61144 http://www.leitl.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
http://moleculardevices.org http://nanomachines.net

From patrick at myri.com Mon Feb 14 13:57:03 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Sat Jul 4 01:03:52 2009
Subject: [Beowulf] Home beowulf - NIC latencies
In-Reply-To: <420B264A.7050004@ccrl-nece.de>
References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com> <420B264A.7050004@ccrl-nece.de>
Message-ID: <42111EAF.5050709@myri.com>

Hi Joachim,

Joachim Worringen wrote:
> Patrick Geoffray wrote:
>
>> Seriously, here are MPI latencies with MX on F cards on Opteron
>> (PCI-X), that includes fibers and a switch in the middle:
>>
>> Length   Latency(us)   Bandwidth(MB/s)
>> 0        2.684         0.000
>
> [...]
>
> Nice work, Patrick - but such numbers are of little value if the
> benchmark used to get them is not stated. I'd recommend mpptest (from
> MPICH). Plus, the compiler etc. is also of interest when it comes to
> latencies.

Thanks. Such numbers always have little value coming from a vendor. My
point was simply that >10us was not really today's ballpark.

For your curiosity, it was using an in-house MPI Pingpong (one message at
a time, not a bogus pipelined pingpong used to confuse people and make big
pipes look good). For very small messages, most Pingpong codes are
similar and the compiler has no impact (it was using the gcc that was
installed on the machine at that time). For asymptotic bandwidth, the
major difference is the way you compute 1 MB, either 1024*1024 Bytes, or
1000000 Bytes. In the networking world, it tends to be 1000000 Bytes.

Patrick
--
Patrick Geoffray
Myricom, Inc.
http://www.myri.com

From patrick at myri.com Mon Feb 14 15:09:27 2005
From: patrick at myri.com (Patrick Geoffray)
Date: Sat Jul 4 01:03:52 2009
Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies
In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>
References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com>
Message-ID: <42112FA7.7010900@myri.com>

Hi Greg,

Greg Lindahl wrote:
> On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote:
>
>
>>Let me ask some stupid's question: which MPI implementations allow
>>really
>>
>>a) to overlap MPI_Isend w/computations
>>and/or
>>b) to perform a set of subsequent MPI_Isend calls faster than "the
>>same" set of MPI_Send calls ?
>>
>>I say only about sending of large messages.
>
>
> For large messages, everyone does (b) at least partly right. (a) is
> pretty rare. It's difficult to get (a) right without hurting short
> message performance. One of the commercial MPIs, at first release, had

Many believe you just need RDMA support to overlap comm and comp, but
it's not enough. Zero-copy is needed because the copy is obviously a
waste of host CPU (along with cache trashing), but the real problem is
matching. Ron did a lot of work in Portals to offload the matching,
because it is a big synchronization point: if you send a message and you
need the CPU on the receive side to find the appropriate receive buffer,
you cannot tell the user that he can have the CPU between the time he
posts the MPI_Irecv() and the time he checks on it with MPI_Wait(). What
will happen is that the matching occurs in the MPI_Wait() and the overlap
goes down the toilet.

There are several ways to work around it:

1) You can have a thread on the receive side and wake it up with an
interrupt. If you do that for all receives, then you add ~10 us in the
critical path and the small message latency goes to the same place the
overlap went before. This is what I believe the commercial MPI was doing
at first.
2) If you can make decisions at the NIC level, you can receive small
messages eagerly (with a copy) and fire an interrupt only for large
messages (you want to steal some CPU cycles for matching). This is not
bad: you steal (~5 us + cost of matching) worth of CPU cycles for large
messages, which is not much for most people.

3) You can have the NIC do the matching. Obviously the NIC is not as
fast as the host CPU, so it's more expensive: you don't want to do that
for small messages, as it will hurt your latency. But you still have to
do it for all messages to keep the matching order. One solution is to
still receive small messages eagerly but match them in the shadow of the
NIC->host DMA, just to keep the list of posted receives consistent. For
large messages, you match in the NIC in the critical path and you don't
need the host CPU (assuming that the matched receive is in the small
number that is kept on the NIC).

It's still not obvious if 3) is worth it; it's much more complex to
implement and 5us per large receive is not that big. And you can reduce
that overhead with MSIs (on PCIe; only the Alpha Marvel provided MSI on
PCI-X, AFAIK).

There are more exotic work-arounds, like using 1) and polling at the
same time, and hiding the interrupt overhead with some black magic on
another processor. The one with the best potential would be to use
HyperThreading on Intel chips to have a polling thread burning cycles
continuously; it will run in-cache, won't use the FP unit or waste
memory cycles. A perfect use for the otherwise useless HT feature. I
wonder why nobody went that way...

> right was more important. They've improved their short message
> performance since, but I still haven't seen any real application
> benchmarks that show benefit from their approach.

That's the classical chicken-egg problem: are people not trying to
overlap in MPI because it is not implemented, or do MPI implementations
not implement it because applications don't try to overlap?
I think it's the latter: too complicated for most. Do you know the story/joke about the Physicist and unexpected messages ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From mprinkey at aeolusresearch.com Mon Feb 14 12:39:06 2005 From: mprinkey at aeolusresearch.com (Michael T. Prinkey) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> Message-ID: Greg, based on your evaluation of the available MPI libraries, does this imply that overlapping communication and computation can really only be done by explicitly building two separate threads? Mike On Mon, 14 Feb 2005, Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > > Let me ask some stupid's question: which MPI implementations allow > > really > > > > a) to overlap MPI_Isend w/computations > > and/or > > b) to perform a set of subsequent MPI_Isend calls faster than "the > > same" set of MPI_Send calls ? > > > > I say only about sending of large messages. > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had > very slow short message performance because they thought getting (a) > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach.
> > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > > From mhyoung at valdosta.edu Mon Feb 14 12:50:18 2005 From: mhyoung at valdosta.edu (michael young) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS Message-ID: <42110F0A.10408@valdosta.edu> Hi, Can I use beowulf or some other Linux cluster or HA Linux solution to pool harddrive space together from differrent computers to make a kinda "poor man's SANS"? thank you Michael From rossen at VerariSoft.Com Mon Feb 14 14:32:57 2005 From: rossen at VerariSoft.Com (Rossen Dimitrov) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> Message-ID: <42112719.4060500@verarisoft.com> Greg Lindahl wrote: > On Mon, Feb 14, 2005 at 06:47:15PM +0300, Mikhail Kuzminsky wrote: > > >>Let me ask some stupid's question: which MPI implementations allow >>really >> >>a) to overlap MPI_Isend w/computations >>and/or >>b) to perform a set of subsequent MPI_Isend calls faster than "the >>same" set of MPI_Send calls ? >> >>I say only about sending of large messages. > > > For large messages, everyone does (b) at least partly right. (a) is > pretty rare. It's difficult to get (a) right without hurting short > message performance. One of the commercial MPIs, at first release, had > very slow short message performance because they thought getting (a) > right was more important. They've improved their short message > performance since, but I still haven't seen any real application > benchmarks that show benefit from their approach. 
There is quite a bit of published data showing that, for a number of real application codes, a modest increase of MPI latency for very short messages has no impact on the application performance. This can also be seen by doing traffic characterization, weighing the relative impact of the increased latency, and taking into account the computation/communication ratio. On the other hand, what you give the application developers with an interrupt-driven MPI library is a higher potential for effective overlapping, which they could choose to utilize or not, but unless they send only very short messages, they will not see a negative performance impact from using this library. There is evidence that re-coding the MPI part of an application to take advantage of overlapping and asynchrony when the MPI library (and network) supports these well actually leads to real performance benefit. There is evidence that even without changing anything in the code, just running the same code with an MPI library that plays nicer with the system leads to better application performance by improving the overall "application progress" - a loose term I use to describe all of the complex system activities that need to occur during the life-cycle of a parallel application not only on a single node, but on all nodes collectively. The question of short message latency is connected to system scalability in at least one important scenario - running the same problem size as fast as possible by adding more processors. This will lead to smaller messages, much more sensitive to overhead, thus negatively impacting scalability. In other practical scenarios though, users increase the problem size as the cluster size grows, or they solve multiple instances of the same problem concurrently, thus keeping the message sizes away from the extremely small sizes resulting from maximum scale runs, thus limiting the impact of shortest message latency.
I have seen many large clusters whose only job run across all nodes is HPL for the top500 number. After that, the system is either controlled by a job scheduler, which limits the size of jobs to about 30% of all processors (an empirically derived number that supposedly improves the overall job throughput), or it is physically or logically divided into smaller sub-clusters. All this being said, there is obviously a large group of codes that use small messages no matter what size problem they solve or what the cluster size is. For these, the lowest latency will be the most important (if not the only) optimization parameter. For these cases, users can just run the MPI library in polling mode. With regard to the assessment that every MPI library does (a) partly right I'd like to mention that I have seen behavior where attempting to overlap computation and communication can lead to no performance improvement at all, or even worse, to performance degradation. This is one example of how a particular implementation of a standard API can affect the way users code against it. I use a metric called "degree of overlapping" which for "good" systems approaches 1, for "bad" systems approaches 0, and for terrible systems becomes negative... Here goodness is measured as how well the system facilitates overlapping. Rossen > > -- greg > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf From rross at mcs.anl.gov Mon Feb 14 20:52:51 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: <42110F0A.10408@valdosta.edu> References: <42110F0A.10408@valdosta.edu> Message-ID: Yes! PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My group at ANL along with Clemson University and Ohio Supercomputer Center and others are developing this. 
It's entirely open source and open development, and is in production use at ANL, OSC, and the University of Utah CHPC, among other places. GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that RPMs are available for it now through one source or another. This used to be Sistina's product, who was subsequently bought by RedHat. I'm sure this is used in production in many business environments, and we use it at ANL also. Can someone provide a URL for this one? Lustre (www.lustre.org) is another option. This one is heavily funded by the DOE ASC laboratories and is in use on some very large parallel machines. But unless you have a relationship with CFS you can only get a crippled version of the source, so it's probably not a good option for average joe. If they change their policy on releasing source code, this would be worth reconsidering. Regards, Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, michael young wrote: > Hi, > Can I use beowulf or some other Linux cluster or HA Linux solution > to pool harddrive space together from differrent computers to make a > kinda "poor man's SANS"? > > thank you > Michael From rross at mcs.anl.gov Mon Feb 14 21:12:36 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <42112719.4060500@verarisoft.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> Message-ID: Rossen, It would be good to mention that you work for a company that sells an implementation specifically designed for facilitating overlapping, in case people don't know that. Clearly you guys have thought a lot about this. The last two Scalable OS workshops (the only two I've had a chance to attend), there was a contingent of people that are certain that MPI isn't going to last too much longer as a programming model for very large systems. 
The issue, as they see it, is that MPI simply imposes too much latency on communication, and because we (as MPI implementors) cannot decrease that latency fast enough to keep up with processor improvements, MPI will soon become too expensive to be of use on these systems. Now, I don't personally think that this is going to happen as quickly as some predict, but it is certainly an argument that we should be paying very careful attention to the latency issue, because as MPI implementors this is an argument that never seems to end. Also, there is additional overhead in the Isend()/Wait() pair over the simple Send() (two function calls rather than one, allocation of a Request structure at the least) that means that a naive attempt at overlapping communication and computation will result in a slower application. So that doesn't surprise me at all. I think that the theme from this thread should be that "it's a good thing that we have more than one MPI implementation, because they all do different things best." Rob --- Rob Ross, Mathematics and Computer Science Division, Argonne National Lab On Mon, 14 Feb 2005, Rossen Dimitrov wrote: > There is quite a bit of published data that for a number of real > application codes modest increase of MPI latency for very short messages > has no impact on the application performance. This can also be seen by > doing traffic characterization, weighing the relative impact of the > increased latency, and taking into account the computation/communication > ratio. On the other hand, what you give the application developers with > an interrupt-driven MPI library is a higher potential for effective > overlapping, which they could chose to utilize or not, but unless they > send only very short messages, they will not see a negative performance > impact from using this library. 
> > There is evidence that re-coding the MPI part of an application to take > advantage of overlapping and asynchrony when the MPI library (and > network) supports these well actually leads to real performance benefit. > > There is evidence that even without changing anything in the code, but > by just running the same code with an MPI library that plays nicer to > the system leads to better application performance by improving the > overall "application progress" - a loose term I used to describe all of > the complex system activities that need to occur during the life-cycle > of a parallel application not only on a single node, but on all nodes > collectively. > > The question of short message latency is connected to system scalability > in at least one important scenario - running the same problem size as > fast as possible by adding more processors. This will lead to smaller > messages, much more sensitive to overhead, thus negatively impacting > scalability. > > In other practical scenarios though, users increase the problem size as > the cluster size grows, or they solve multiple instances of the same > problem concurrently, thus keeping the message sizes away from the > extremely small sizes resulting from maximum scale runs, thus limiting > the impact of shortest message latency. I have seen many large clusters > whose only job run across all nodes is HPL for the top500 number. After > that, the system is either controlled by a job scheduler, which limits > the size of jobs to about 30% of all processors (an empirically derived > number that supposedly improves the overall job throughput), or it is > physically or logically divided into smaller sub-clusters. > > All this being said, there is obviously a large group of codes that use > small messages no matter what size problem they solve or what the > cluster size is. For these, the lowest latency will be the most > important (if not the only) optimization parameter. 
For these cases, > users can just run the MPI library in polling mode. > > With regard to the assessment that every MPI library does (a) partly > right I'd like to mention that I have seen behavior where attempting to > overlap computation and communication can lead to no performance > improvement at all, or even worse, to performance degradation. This is > one example of how a particular implementation of a standard API can > affect the way users code against it. I use a metric called "degree of > overlapping" which for "good" systems approaches 1, for "bad" systems > approaches 0, and for terrible systems becomes negative... Here goodness > is measured as how well the system facilitates overlapping. > > Rossen From patrick at myri.com Mon Feb 14 22:20:52 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <420DA793.4000909@verarisoft.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> Message-ID: <421194C4.5050808@myri.com> Hi Rossen, Rossen Dimitrov wrote: > Of course, there is always the case of running the actual application > code and then evaluating the MPI performance by seeing which MPI library > (or library mode) makes the application run faster. Unfortunately, this > method for evaluating MPI often suffers from various efficiencies some > of which originate from the parallel algorithm developers, who thoughout > the years have sometimes adopted the most trivial ways of using MPI. So if you run an MPI application and it sucks, this is because the application is poorly written ? You don't want to benchmark an application to evaluate MPI, you want to benchmark an application to find the best set of resources to get the job done. If the code stinks, it's not an excuse. 
Good MPI implementations are good with poorly written applications, but still let smart people do smart things if they want. > these in one way or another depend on CPU processing. Also, today's > processor architectures have many independent processing units and > complex memory hierarchies. When the MPI library polls for completion of > a communication request, most of this specialized hardware is virtually > unused (wasted). The processor architecture trends indicate that this > kind of internal CPU concurrency will continue to increase, thus making > the cost of MPI polling even higher. When you poll, you have nothing else to do: you are stuck in a Wait or in a blocking call (collectives for example). Why do you care about the lost cycles ? The only way to rescue them would be to oversubscribe your processor, and hope that the cycles you recycle (no pun intended) are worth the context switches and the associated cache trashing. I would argue that polling should be the cheapest MPI operation ever (if nothing is found). This is the case for most half-decent MPI implementations. > In this regard, a parallel application developer might actually very > much care what is actually happening in the MPI library even when he > makes a call to MPI_Send. If he doesn't, he probably should. He absolutely should not. It's one thing to work around clueless developers, but it's way more difficult to work around someone who assumes wrong things about the MPI implementation. > - What application algorithm developers experience when they attempt to > use the ever so nebulous "overlapping" with a polling MPI library and Overlapping is completely orthogonal to polling. Overlapping means that you split the communication initiation from the communication completion. Polling means that you test for completion instead of waiting for completion. You can perfectly well overlap and check for completion of the asynchronous requests by polling; nothing wrong with that.
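
A minimal sketch of that distinction, in C, for readers who want to see the control flow. It deliberately does not call MPI: `fake_request`, `fake_start`, `fake_test` and `overlap_by_polling` are hypothetical stand-ins invented for this illustration, and the "transfer" simply completes after a fixed number of polls. In a real code the corresponding calls would presumably be MPI_Isend()/MPI_Irecv() for initiation, MPI_Test() for polling and MPI_Wait() for the final blocking completion.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-blocking request (think MPI_Request).
 * It "completes" after a fixed number of progress polls. */
typedef struct {
    int polls_left;
} fake_request;

/* Initiation: returns immediately, as MPI_Isend()/MPI_Irecv() would. */
static void fake_start(fake_request *req, int polls_needed)
{
    req->polls_left = polls_needed;
}

/* Polling: test for completion without blocking, as MPI_Test() would. */
static bool fake_test(fake_request *req)
{
    if (req->polls_left > 0)
        req->polls_left--;
    return req->polls_left == 0;
}

/* Overlap + polling: initiate, compute in chunks, test between chunks.
 * Returns the number of chunks computed up to the point where the
 * completion was observed; the computed value is stored in *result. */
static int overlap_by_polling(int total_chunks, int polls_needed, long *result)
{
    fake_request req;
    int chunks_before_done = 0;
    bool done = false;
    long acc = 0;

    fake_start(&req, polls_needed);        /* communication initiation */

    for (int c = 0; c < total_chunks; c++) {
        acc += c;                          /* one chunk of "computation" */
        if (!done) {
            chunks_before_done++;
            done = fake_test(&req);        /* poll, never block */
        }
    }

    while (!done)                          /* nothing left to compute: */
        done = fake_test(&req);            /* plays the role of MPI_Wait() */

    *result = acc;
    return chunks_before_done;
}
```

The point of the sketch is only that initiation and completion are separate steps, and that the completion check between compute chunks is a test, not a wait. How much real overlap you get still depends on the MPI library and the NIC, as discussed earlier in the thread.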
> how this experience has contributed to the overwhelming use of > MPI_Send/MPI_Recv even for codes that can benefit from non-blocking or > (even better) persistent MPI calls, thus killing any hope that these > codes can run faster on systems that actually facilitate overlapping. There are two reasons why developers use blocking operations rather than non-blocking ones: 1) they don't know about non-blocking operations. 2) MPI_Send is shorter than MPI_Isend(). Looking for overlapping is actually not that hard: a) look for medium/large messages, don't waste time on small ones. b) replace every MPI_Send() with an MPI_Isend() + MPI_Wait() pair. c) move the MPI_Isend() as early as possible (as soon as the data is ready). d) move the MPI_Wait() as late as possible (just before the buffer is needed). e) do the same for receives. Most of the time, that will either speed things up quite a bit or not change anything. I am still looking for some tuning tool to do that automatically though. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From john.hearns at streamline-computing.com Mon Feb 14 23:20:21 2005 From: john.hearns at streamline-computing.com (John Hearns) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> > Yes! > > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My > group at ANL along with Clemson University and Ohio Supercomputer Center > and others are developing this. It's entirely open source and open > development, and is in production use at ANL, OSC, and the University of > Utah CHPC, among other places. > > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > RPMs are available for it now through one source or another. This used to > be Sistina's product, who was subsequently bought by RedHat.
I'm sure > this is used in production in many business environments, and we use it at > ANL also. Can someone provide a URL for this one? Source RPMs are of course available from RedHat, and you can get support for their version. The Scientific Linux distribution has prebuilt RPMs: ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/ From patrick at myri.com Mon Feb 14 23:48:47 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> Message-ID: <4211A95F.2010709@myri.com> Hi Rob, Rob Ross wrote: > The last two Scalable OS workshops (the only two I've had a chance to > attend), there was a contingent of people that are certain that MPI isn't > going to last too much longer as a programming model for very large Were they advocating shared memory paradigms, one sided operations, something more "natural" to program with ? I heard that before :-) > systems. The issue, as they see it, is that MPI simply imposes too much > latency on communication, and because we (as MPI implementors) cannot > decrease that latency fast enough to keep up with processor improvements, > MPI will soon become too expensive to be of use on these systems. This is just wrong. How much of the latency in high speed interconnect is due to MPI ? Very very little. The core of it is in the hardware (IO bus, NICs, crossbars and wires). Doing pure RDMA in hardware is easy for the chip designers, but it's hell for irregular applications when you actually don't know where to remotely read or write. > Also, there is additional overhead in the Isend()/Wait() pair over the > simple Send() (two function calls rather than one, allocation of a Request > structure at the least) that means that a naive attempt at overlapping > communication and computation will result in a slower application.
So > that doesn't surprise me at all. What is the cost of one function call and an allocation in a slab ? At several GHz, 50 ns ? And most of the time, blocking calls are implemented on top of non-blocking routines, so the CPU overhead is the same. > I think that the theme from this thread should be that "it's a good thing > that we have more than one MPI implementation, because they all do > different things best." I would say having more than one MPI implementation is a bad thing as long as you cannot easily replace one with another. Let's define a standard MPI header and a standard API for spawning and such, and then having more than one implementation will actually be manageable. That would also remove the need for swiss-army-knife MPI implementations that want to support all interconnects with the same binary. These implementations are, IMHO, a bad thing as they work at the lowest common denominator and are in essence inefficient for all devices. While we are at it, here is my wish list for the next MPI specs: a) only non-blocking calls. If there are no blocking calls, nobody will use them. b) non-blocking calls for collectives too, there is no excuse. Yes, even an asynchronous barrier. c) ban of the ANY_SENDER wildcard: a world of optimization goes away with this convenience. d) throw away the user defined datatypes, or at least restrict them to regular strides. e) get rid of one-sided communications: if someone is serious about it, they use something like ARMCI or UPC or even low level vendor interfaces. Rob, you are politically connected, could you make it happen, please ? :-) Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From joachim at ccrl-nece.de Tue Feb 15 00:20:48 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211A95F.2010709@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> Message-ID: <4211B0E0.6030007@ccrl-nece.de> Patrick Geoffray wrote: > While we are at it, here is my wish list for the next MPI specs: > > a) only non-blocking calls. If there are no blocking calls, nobody will > use them. While this makes sense technically, probably nobody will offer an MPI implementation without MPI_Send for the next 20 years for compatibility reasons, so we can just forget about it. > b) non-blocking calls for collectives too, there is no excuse. Yes, even > an asynchronous barrier. No problem here - barrier_enter() and barrier_leaver() are not new. > c) ban of the ANY_SENDER wildcard: a world of optimization goes away > with this convenience. I think this could best be achieved with an assertion like those for one-sided and I/O. There are situations where ANY_SENDER is needed, or at least avoids large programming overheads. > d) throw away the user defined datatypes, or at least restrict it to > regular strides. This is nonsense: user-defined datatypes do not cause any overhead if you don't use them, there are ways to implement them very efficiently, and you can't do without them in many situations (like MPI-IO). > e) get rid of one-sided communications: if someone is serious about it, > it uses something like ARMCI or UPC or even low level vendor interfaces. Instead, I propose to rework the MPI one-sided communications to give them simpler and more flexible semantics. The current definition does not match today's network capabilities, but was designed to allow a simple implementation for slow/non-RDMA networks.
> Rob, you are politically connected, could you make it happen, please ? > :-) One person alone can't do this. The best place to discuss such things is the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy). Also, adding mpi.h to the standard to define an ABI is a good thing. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From joachim at ccrl-nece.de Tue Feb 15 00:53:37 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108402962.8265.25.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> Message-ID: <4211B891.6020406@ccrl-nece.de> Ashley Pittman wrote: > If you had a bunch of sends to do to N remote processes then I'd expect > you to post them in order (non-blocking) and wait for them all at the > end, the time taken to do this should be (base_latency + ( (N-1) * M )) > where M is the reciprocal of the "issue rate". You can clearly see > here that even for small number of batched sends (even a 2d/3d nearest > neighbour matrix) the issue rate (that is how little CPU the send call > consumes) is at least as important as the raw latency. This is an interesting issue. If you look at what Greg mentioned about dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted for ScaMPI on different interconnects (all(?) accessed through uDAPL), you see that the dumb interface SCI has the lowest latency for both pingpong and random, with random being about twice that of pingpong. In contrast, the "smart" NIC Myrinet, which has much less CPU utilization, has twice the pingpong latency, and a slightly worse random-to-pingpong ratio. Why this?
Maybe better pipelining in SCI, because it's write-and-forget for the CPU, with 16 outstanding transactions on the network level, while Myrinet obviously behaves differently here (although GM should also be PIO-write to the NIC memory for small messages). Then there is Infiniband, which has a much better random-to-pingpong ratio, which is striking. Would be nice to see Quadrics or InfiniPath in this context. Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From gmpc at sanger.ac.uk Tue Feb 15 01:22:21 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). My > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > Lustre (www.lustre.org) is another option. This one is heavily funded by You missed out GPFS from IBM. It is no-cost free for academic institutions. You can use it with or without SAN hardware. http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From joachim at ccrl-nece.de Tue Feb 15 01:47:32 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Home beowulf - NIC latencies In-Reply-To: <42111EAF.5050709@myri.com> References: <3.0.32.20050204213943.010127d0@pop.xs4all.nl> <420580AD.5050003@myri.com> <420B264A.7050004@ccrl-nece.de> <42111EAF.5050709@myri.com> Message-ID: <4211C534.7070608@ccrl-nece.de> Patrick Geoffray wrote: > For your curiosity, it was using an in-house MPI Pingpong (one message > at a time, not a bogus pipelined pingpong used to confuse people and > make big pipes look good). 
For very small messages, most of Pingpong > codes are similar, ...but not equal and give different results. Just compare PMB and mpptest. > compiler has no impact (it was using the gcc that was > installed on the machine at that time). I experienced differences of more than 2 us depending on whether using shared or static libraries, compiler version/options etc. on both scalar and vector machines. > For asymptotic bandwidth, the > major difference is the way you compute 1 MB, either 1024*1024 Bytes, or > 1000000 Bytes. In the networking world, it tends to be 1000000 Bytes. I tend to use MB for 10^6 and MiB for 2^20. This is a somewhat official no Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From patrick at myri.com Tue Feb 15 01:48:09 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B0E0.6030007@ccrl-nece.de> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> Message-ID: <4211C559.8070100@myri.com> Joachim, Joachim Worringen wrote: > Patrick Geoffray wrote: > >> While we are at it, here is my wish list for the next MPI specs: >> >> a) only non-blocking calls. If there are no blocking calls, nobody >> will use them. > > > While this makes sense technically, nobody will probably offer an MPI > implementation without MPI_Send for the next 20 years for compatibility > reasons, so we can just forget about it. Throw away compatibility. If you keep the legacy API, you have no incentive for change. I don't want MPI-3, I want MPI-light. We are against a wall because the MPI spec was too rich and developers took the lazy path. The weight of legacy will make shared memory paradigms the only proposal for the next step.
If you believe we have to support the whole MPI semantics in the next message passing standards, then we are doomed. >> c) ban of the ANY_SENDER wildcard: a world of optimization goes away >> with this convenience. > > > I think this could best be achieved with an assertion like those for > one-sided and I/O. There are situations where ANY_SENDER is needed, or > at least avoids large programming overheads. It's used because it's there, there is no other reason. If you don't know who sends you what in a message passing application, then you cannot get either performance or robustness. If you really cannot do otherwise (and I don't believe that), you can always use unexpected messages (post the receive after Probe()ing). That's ugly, but you get what you deserve :-) >> d) throw away the user defined datatypes, or at least restrict it to >> regular strides. > > > This is nonsense: user-defined datatypes do not cause any overhead if > you don't use them, there are ways to implement them very efficiently, > and you can't do without in many situations (like MPI-IO). I knew this item would itch, you spent a lot of time working on that. If you don't use user-defined datatypes, then you don't need them and they should not be there in the first place. It's a temptation, it's too easy. No, there are no ways to implement them efficiently unless they are regular, and this is what I am willing to keep: strided types with long segments. Everything else leads to memory copies. The developer should wipe his own bottom instead of asking the message passing interface to work around bad data layout. Sending a column of blocks, yes, that's a regular stride and it makes a lot of sense. Sending a non-contiguous irregular structure ? As we used to say in France, $100 and a chocolate bar with that ? Oh, BTW, I would gut MPI-IO and make it a separate interface. Only a small subset of applications use it and the core semantics are quite different from pure message passing.
Man, it's not MPI, it's emacs... >> e) get rid of one-sided communications: if someone is serious about >> it, it uses something like ARMCI or UPC or even low level vendor >> interfaces. > > > Instead, I propose to rework the MPI one-sided communications for a more > simple and flexible semantic. The current definition does not match > today's network capabilities, but was designed to allow a simple > implementation for slow/non-RDMA networks. I don't know about that. I would just take it out of the Message Passing Interface because it's not message passing. There would certainly be a need for a pure RMA interface, and there is already a lot of existing work and experience to build upon. >> Rob, you are politically connected, could you make it happen, please ? >> :-) > > > One person alone can't do this. The best place to discuss such things is > the MPI users group meeting (EuroPVM/MPI, this year in Capri/Italy). Nothing that radical would ever come out of EuroPVM/MPI (I heard that Capri is a really nice place, I will definitely beg my boss) or any other users group. > Also, adding mpi.h to the standard to define an ABI is a good thing. Just achieving that would be beyond my greatest expectations. It would certainly be fun to watch. We could organize fist fights on the beach in Capri... Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Tue Feb 15 02:12:35 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B891.6020406@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> Message-ID: <4211CB13.3050902@myri.com> Joachim Worringen wrote: > This is an interesting issue.
If you look at what Greg mentioned about > dumb NICs (like InfiniPath, or SCI) and the latency numbers Ole posted > for ScaMPI on different interconnects (all(?) accessed through uDAPL), > you see that the dumb interface SCI has the lowest latency for both, Which is the original hardware Scali built its MPI upon, btw. > pingpong and random, with random being about twice of pingpong. In > contrast, the "smart" NIC Myrinet, which has much less CPU utilization, > has twice the pingpong latency, and a slightly worse random-to-pingpong > ratio. No, it's not Myrinet, it's GM/Myrinet. There are many things that come from the GM side of the equation, believe me. > Why this? Maybe better pipelining in SCI, because it's write-and-forget > for the CPU, with 16 outstanding transactions on the network level, > while Myrinet obviously behaves differently here (although GM should > also be PIO-write to the NIC memory for small messages). Nope, no PIO for small messages with GM, DMA for everything. A last remark. I really think that the argument of using the same Swiss-army-knife MPI implementation such as ScaMPI or Intel MPI or even MPI/Pro to infer interconnect characteristics is even worse than looking at latency and bandwidth alone. These implementations are never going to be designed to use all hardware efficiently, their design is either historic (Scali used to provide software for SCI alone) or politically motivated (Intel is using uDapl, hummm, wonder why), or both. They are by-products of the MPI forum's failure to make the Standard practical (compatible ABI). Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From jcownie at etnus.com Tue Feb 15 05:06:56 2005 From: jcownie at etnus.com (James Cownie) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: Message from Patrick Geoffray of "Tue, 15 Feb 2005 05:12:35 EST."
<4211CB13.3050902@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> Message-ID: <20050215130656.8572F1C818@amd64.cownie.net> > A last remark. I really think that the argument of using the same > swiss-army-knive MPI implementation such as ScaMPI or Intel MPI or > even MPI/Pro to infere interconnect characteristics is even worse that > looking at latency and bandwidth alone. These implementations are > never going to be designed to use all hardware efficiently, their > design is either historic (Scali used to provided software for SCI > alone) or politicaly motivated (Intel is using uDapl, hummm, wonder > why), or both. They are by-products of the MPI forum's failure to make > the Standard practical (compatible ABI). As someone who was on the MPI Forum, and sat through an awful lot of meetings, I'd like to provide some justification for _why_ we didn't try to make a binary standard. 1) At the time (over ten years ago), we would have been happy to have _one_ MPI implementation on a given machine, and we weren't expecting to have multiple MPIs on the same hardware. (It was by no means a foregone conclusion that MPI would succeeed). 2) We didn't expect MPI to move into a commercial environment in which the people running the code wouldn't have the sources, and wouldn't be optimising for _their_ machine, which obviously requires recompilation, making an ABI irrelevant. 3) Not having a binary interface allows optimisations in the C MPI interface (such as using macros rather than functions in some places). 4) A binary interface based on no MPI implementation experience would likely be worse than no binary interface. 
5) MPI is supposed to be machine and architecture independent, specifying a binary interface under those circumstances is hard. Maybe you can do it if you leverage the C ABI, however it's not clear that that is ideal, since that either changes with time, or suffers from poor vision of the future too (e.g. look at the required alignment of double in the x86 ABI). 6) It was a hard enough job to agree on the source level specification. If we'd tried to add an ABI we'd probably still be stuck in the Bristol Suites :-) You seem to think (maybe subconsciously) that the MPI forum added features the standard just to make life hard for implementors and to kill performance ;-) I can assure you that that was not the case, and that the standard was a compromise between features which users really wanted and what the implementors felt they could reasonably provide. If the standard had not provided things the users wanted (like wildcard receive), then it's quite possible that his whole discussion would be moot because MPI would by now be of only historical interest since the user community would have ignored it. If you _really_ believe that there is so much performance benefit for your customers in having an MPI-light with the restrictions you outlined which only runs on your hardware, then no-one's stopping you from providing it. The market will decide... -- -- Jim -- James Cownie Etnus, LLC. +44 117 9071438 http://www.etnus.com From rross at mcs.anl.gov Tue Feb 15 07:47:11 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> References: <42110F0A.10408@valdosta.edu> <33007.212.159.87.168.1108452021.squirrel@webmail.streamline-computing.com> Message-ID: Thanks John! Rob On Tue, 15 Feb 2005, John Hearns wrote: > > Yes! > > > > PVFS2 (http://www.pvfs.org/pvfs2) is my favorite option for this :). 
My > > group at ANL along with Clemson University and Ohio Supercomputer Center > > and others are developing this. It's entirely open source and open > > development, and is in production use at ANL, OSC, and the University of > > Utah CHPC, among other places. > > > > GFS (http://www.redhat.com/software/rha/gfs/) is another; I believe that > > RPMs are available for it now through one source or another. This used to > > be Sistina's product, who was subsequently bought by RedHat. I'm sure > > this is used in production in many business environments, and we use it at > > ANL also. Can someone provide a URL for this one? > Source RPMs are of course available from RedHat, > and ou can get support for their version. > > The Scientific Linux distribution has prebuilt RPMs > ftp://ftp.scientificlinux.org/linux/scientific/304/i386/SL/RPMS/ > > From rross at mcs.anl.gov Tue Feb 15 07:48:17 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: Hello Guy, I wasn't aware that IBM would give that out for use on existing systems. Does anyone know the constraints under which they will provide such a copy? Thanks, Rob On Tue, 15 Feb 2005, Guy Coates wrote: > You missed out GPFS from IBM. It is no-cost free for academic > institutions. You can use it with or without SAN hardware. > > http://publib.boulder.ibm.com/clresctr/windows/public/gpfsbooks.html > > Guy From rross at mcs.anl.gov Tue Feb 15 08:42:56 2005 From: rross at mcs.anl.gov (Rob Ross) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211A95F.2010709@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> Message-ID: On Tue, 15 Feb 2005, Patrick Geoffray wrote: > Rob, you are politically connected, could you make it happen, please ? 
> :-) If I had that level of connections, I'd be a DC lobbyist :). Maybe sell off some national parks to the oil industry or something. Rob From gmpc at sanger.ac.uk Tue Feb 15 08:56:10 2005 From: gmpc at sanger.ac.uk (Guy Coates) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Poor man's SANS In-Reply-To: References: <42110F0A.10408@valdosta.edu> Message-ID: On Tue, 15 Feb 2005, Rob Ross wrote: > Hello Guy, > > I wasn't aware that IBM would give that out for use on existing systems. > Does anyone know the constraints under which they will provide such a > copy? As an academic, you sign up for it under the IBM "scholars program". It comes at no cost but unsupported (well, best-efforts support via the GPFS mailing list). http://www-306.ibm.com/software/info/university/members/faq.html If you want support or want a commercial license, then you have to pay money. The "official" GPFS hardware support matrix is pretty tight, but if you don't care about support, you should find that it will run on pretty much any sort of disk hardware. Guy -- Dr. Guy Coates, Informatics System Group The Wellcome Trust Sanger Institute, Hinxton, Cambridge, CB10 1SA, UK Tel: +44 (0)1223 834244 ex 7199 From joachim at ccrl-nece.de Tue Feb 15 10:43:18 2005 From: joachim at ccrl-nece.de (Joachim Worringen) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211CB13.3050902@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> Message-ID: <421242C6.2050800@ccrl-nece.de> Patrick Geoffray wrote: > A last remark. 
I really think that the argument of using the same > swiss-army-knive MPI implementation such as ScaMPI or Intel MPI or even > MPI/Pro to infere interconnect characteristics is even worse that > looking at latency and bandwidth alone. These implementations are never > going to be designed to use all hardware efficiently, their design is > either historic (Scali used to provided software for SCI alone) or > politicaly motivated (Intel is using uDapl, hummm, wonder why), or both. The two most important things done to optimise performance of an MPI implementation for a hardware platform are: - low-level pt-2-pt communication - collective operations AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard (partly naive) collective operations of MPICH. Considering this, plus the fact - that it's not all that hard to use GM for pt-2-pt efficiently. We have done this in our MPI, too, with the same level of performance. - that you probably do not know anything on ScaMPI's current internal design (Intel is MPICH2 plus some Intel-propietary device hacking) and little about it's performance (if this is wrong, let us know) - that all code apart from the device, and also the device architecture of MPICH-GM are more or less 10-year-old swiss-army-knive MPICH code (which is not a bad thing per se) you should maybe think again before judging on the efficiency of other MPI implementations. 
Joachim -- Joachim Worringen - NEC C&C research lab St.Augustin fon +49-2241-9252.20 - fax .99 - http://www.ccrl-nece.de From lindahl at pathscale.com Wed Feb 16 00:05:25 2005 From: lindahl at pathscale.com (Greg Lindahl) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4211B891.6020406@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> Message-ID: <20050216080525.GA3122@greglaptop.attbi.com> On Tue, Feb 15, 2005 at 09:53:37AM +0100, Joachim Worringen wrote: > This is an interesting issue. If you look at what Greg mentioned about > dump NICs (like InfiniPath, or SCI) and the latency numbers Ole posted > for ScaMPI on different interconnects (all(?) accessed through uDAPL), > you see that the dumb interface SCI has the lowest latency for both, > pingpong and random, with random being about twice of pingpong. In > contrast, the "smart" NIC Myrinet, which has much less CPU utilization, > has twice the pingpong latency, and a slightly worse random-to-pingpong > ratio. I would make 2 comments about this: First, you should be using the best MPI for each piece of hardware. Hardware architects pick their interface with a software implementation in mind. I don't expect any 3rd party MPI to get close to PathScale's MPI latency on PathScale's hardware, unless the 3rd party is flexible enough to change a lot of code. Second, you really can't generalize about dumb NICs by looking at SCI. SCI has a unique situation: its raw latency is much lower than the MPI latency of all MPI implementations for it. I suspect no hardware designer would be out to imitate that property! Both InfiniPath and the Quadrics STEN (forgive me for classing this as dumb, I happen to think dumb is a compliment...) get this right. 
Third (you knew I couldn't keep to my promise of 2), I wouldn't make any scaling generalizations based on a test with 16 nodes. Even at 128-256 nodes the picture is quite different, and that's the sweet spot that lots of today's clusters are at. So, if you want to make a scaling generalization, you should be quoting 256-512 node results. -- greg From patrick at myri.com Wed Feb 16 00:17:02 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108478089.4587.118.camel@s861954.sandia.gov> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <1108478089.4587.118.camel@s861954.sandia.gov> Message-ID: <4213017E.7060302@myri.com> Keith D. Underwood wrote: >>c) ban of the ANY_SENDER wildcard: a world of optimization goes away >>with this convenience. > > > Um, our apps guys say this is more than a convenience. Apparently, > sometimes you don't exactly know who you are going to receive from. > Would you rather them post receives from 4000 nodes and cancel the ones > that don't send to that node after a while? No, I would not post any receives and let them come unexpected, using MPI_Probe() to post a matching receive when something shows up. It leaves the MPI implementation a way to move most of the matching to the send side for most of the messages and, if the receive is posted early enough, remove the need for host CPU on the receive side when the application is potentially computing. And you remind me, I would ban MPI_Cancel also. It should have been item #1 :-) Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Wed Feb 16 00:39:17 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108479093.4587.132.camel@s861954.sandia.gov> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4211C559.8070100@myri.com> <1108479093.4587.132.camel@s861954.sandia.gov> Message-ID: <421306B5.3080200@myri.com> Hi Keith, Keith D. Underwood wrote: > Inertia is a powerful thing. Billions of dollars have been invested in > MPI codes. Changing that will not be easy (or cheap). This is not as > simple as moving from vectors to distributed memory - there wasn't > nearly as much accumulated code then (and, it hurt back then). I would not drop the whole MPI standard, I would define a subset that is the recommended API for performance. If your code is too old, link with a legacy MPI lib. If it's coded with the subset, link either with a legacy MPI lib and it works, or link with the optimized MPI lib and see what the MPI implementation can deliver. >>It's used because it's there, there is no other reason. If you don't >>know who sends you what in a message passing application, then you >>cannot get either performance or robustness. If you really cannot do >>otherwise (and I don't believe that), you can always use unexpected >>messages (post the receive after Probe()ing). That's ugly, but you get >>what you deserve :-) > > > That just isn't true. If I don't know how many messages I will get, or > from whom, but I can bound it, then I should prepost those receives. > This is particularly true in your standard physics code that runs for > days and does thousands of time steps. (i.e. you can maintain a circular > queue of these things).
A few years back, I looked at a lot of real world code to see if triggering the communication from the receive side could be worth it, i.e. if most of the messages did not use ANY_SENDER. I was amazed that the vast majority of the messages sent across many applications used the tag to discriminate on the sender among other things, not the source. For the couple of large codes I dissected (sorry, don't remember the names right now), there was no rationale. I guess doing bookkeeping on the source and the tag was too much for the developer(s). You can still do the receive-pull optimization and fall back on sender-push when you see a receive with ANY_SENDER, but if ANY_SENDER is the common case, that's useless. The best way to force developers to write code that can leverage optimization in the MPI lib is to remove the source of the ambiguity. So ANY_SENDER stays in the legacy API, not in the subset. > The user should always expose as much opportunity for optimization as > possible to the MPI layer. e.g. a load-store architecture like the X1 > (not what I am advocating for MPI performance, mind you) could do > excellent datatype processing. You would rather the user do the > gather/scatter themselves to prohibit the MPI from being able to do it? In general yes, more opportunities for optimization is better. Now, assuming that irregular datatypes can be optimized as much as regular ones is wrong. The hardware can gather/scatter better than the application for nice long strides. However, MPI libs should print insults when tiny segments are used (when the scatter/gather efficiency collapses). The developer assumes that it's fine because he does not know or he does not care. I advocate hiding the guns instead of letting the developer shoot himself in the foot. Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From patrick at myri.com Wed Feb 16 02:07:27 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <421242C6.2050800@ccrl-nece.de> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <421242C6.2050800@ccrl-nece.de> Message-ID: <42131B5F.8040100@myri.com> Joachim Worringen wrote: > AFAIK, Myrinet's MPI (MPICH-GM), for example, does use the standard > (partly naive) collective operations of MPICH. Considering this, plus > the fact Replacing the collectives from MPICH-1 was not high on the todo list because there were more important things to optimize, with more effect on applications than the scheduling of some collectives. For scaling real codes on large machines, your priority is not there, not enough bang for your time. > - that it's not all that hard to use GM for pt-2-pt efficiently. We have > done this in our MPI, too, with the same level of performance. You then have no idea how hard it is to use GM efficiently and *correctly*. Enough to run pingpong ? sure, that's a piece of cake. But how to recover from fatal errors on the wire, from resource exhaustion, to avoid spending most of your time pinning/unpinning pages, to not trash the translation cache on the NIC, etc ? Did you address all of these issues in your MPI ? Maybe, but it requires some design characteristics that would be higher than the device layer. At some point you have to make choices, and in a Swiss-Army-Knife (SAK) implementation, you choose the common ground, or the existing ground. > - that you probably do not know anything on ScaMPI's current internal True, I know zip about ScaMPI design. This is exactly why I don't know how they use GM.
Without knowing that, how can you infer hardware characteristics from benchmark results ?!? > design (Intel is MPICH2 plus some Intel-propietary device hacking) and > little about it's performance (if this is wrong, let us know) Intel MPI is MPICH2 plus some multi-device glue. Intel got something right in their design: they ask the vendor to provide the native device layers instead of doing everything themselves. That's how a (SAK) implementation could actually be decent. However, the reference implementation is using uDapl. That means that there is stuff above the device layers that are needed to make the MPI-over-uDapl performance decent. Some of it can be used for other devices, the rest not. The question is that if I need something above the device layer to make my stuff decent, could I have it ? I would think so. Now, if it conflicts with something needed for another device, what happens ? Someone makes a choice. > - that all code apart from the device, and also the device architecture > of MPICH-GM are more or less 10-year-old swiss-army-knive MPICH code > (which is not a bad thing per se) MPICH-1 is not a SAK. You cannot take an MPICH binary and run it on all of the devices on which MPICH has been ported. You can *compile* it on multiple targets, but nothing more. Furthermore, many ch2 things where not used in ch_gm. If you look at it, most of the common code of MPICH is not performance related, at the exception of the collectives (and again they are not that bad). MPICH-2 has been moving more things to the device-specific part, that's the good direction. > you should maybe think again before judging on the efficiency of other > MPI implementations. I could not care less about the efficiency of other MPI implementations. None of my business. My point is that assuming that using a SAK MPI implementation factorize the software part and all remaining performance differences are thus hardware related is ridiculous. 
As Greg pointed out, an interconnect is a software/hardware stack, all the way to the MPI lib. Throw away the native MPI lib and you have a lame duck. Compare lame ducks and you go nowhere. You don't have much choice when you have a commercial MPI than to support many interconnects. You cannot ask the vendors to write their part unless you are Intel, so you write it yourself. You do your best, because you need to sell your stuff, and you call it good. Is there a value ? Today yes, because it makes life easier to have binary compatibility. However, my second point is that binary compatibility should be addressed by the MPI community, not by commercial MPI implementations. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Wed Feb 16 02:14:45 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:52 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4212087F.6070809@verarisoft.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <4212087F.6070809@verarisoft.com> Message-ID: <42131D15.4020305@myri.com> Rossen Dimitrov wrote: > Patrick, this is quite a broad statement. 4 years ago we had a paper > arguing that MPI's written to support many different interconnects and > messaging technologies through internal portability layers were probably > sub-optimal for at least some of the interconnects. Most of the reasons Yes, it's very logical. See my reply to Joachim, I don't critic the existence of SAK implementations (actually, yes, a little), all commercial implementations are essentially swiss-army-knives, they have to. My problem is to use results from one unique MPI implementations to connect dots at the hardware level. 
You don't know if the dots are from the MPI or the hardware, or both. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From eugen at leitl.org Wed Feb 16 02:31:53 2005 From: eugen at leitl.org (Eugen Leitl) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Mare Nostrum (not quite COTS) Message-ID: <20050216103153.GH1404@leitl.org> http://www-106.ibm.com/developerworks/library/pa-nl3-marenostrum.html Power Architecture Community Newsletter, 15 Feb 2005: MareNostrum: A new concept in Linux supercomputing Level: Introductory developerWorks Power Architecture editors IBM 15 Feb 2005 The MareNostrum supercomputer at the Barcelona Supercomputing Center, ranked number four in the world in speed in November 2004, is constructed of such totally off-the-shelf parts as IBM BladeCenter JS20 servers, 64-bit 970FX PowerPC processors, TotalStorage DS4100 storage servers, and Linux 2.6. This is its story. IBM has long been a supercomputing leader -- its heritage of innovation currently and spectacularly manifested in its most powerful supercomputer, Blue Gene/L. The MareNostrum project is the latest bold experiment in supercomputing by IBM -- a small but powerful, rapidly deployed and built system that comes entirely from commercially available components. The Latin term mare nostrum means "our sea" (which to the Romans meant the Mediterranean, as familiar and available to the Italici as the air they breathed, but also the critical key to their success).
MareNostrum is one of the world's most powerful supercomputers, ranked among the top five in the prestigious TOP500 (see Resources), yet it is constructed from products available for sale to any business, lives within a relatively small footprint, and was built on a tight schedule using blade servers, a Linux operating environment, and other cost-efficient technologies. MareNostrum represents a new way of thinking about high-performance computing. Blade servers, some of the thinnest and densest machines available, slide into chassis that let them share resources such as power and network switches; they became the base components of this supercomputer design. Those familiar with the IBM BladeCenter JS20 servers' shared-resources architecture will recognize how these servers cost-effectively minimize power consumption and heat output. Running the Linux operating system, the servers exploit the capabilities of the 2.6 kernel on 64-bit PowerPC processors. MareNostrum also demonstrates something unique in its project timeline: Part of its mission was to prove the speed at which IBM Linux clusters could be implemented and unleashed. According to the IBM MareNostrum e-Science Lead, Dr. Juan Jose Porta (Open Systems Design and Development, IBM Boeblingen Laboratory): This is all about timely and focused execution. The speed at which this project was realized is important. Consider: from the initial concept in late December of 2003 to assembling the computer in Madrid took less than a year. Normally, this kind of supercomputer project takes years. To make a remarkable saga short, MareNostrum is here and will soon be put into operation by the Barcelona Supercomputing Center (BSC), a public consortium created by the Spanish Government, the Catalonian Government, and the Technical University of Catalonia (UPC), the hosts of the MareNostrum supercomputer. The Barcelona Supercomputing Center is located on the Polytechnic University of Catalonia (UPC) campus in Barcelona.
Dr. Porta added, "The supercomputer is based upon commodity technology already developed and available. We were also playing with another piece of magic -- an open environment. This has been a collaborative community effort, where we closely worked with our partners." The name and the history Why "MareNostrum?" In the words of Dr. Porta: MareNostrum means literally "our sea," which is also the Latin name for the Mediterranean Sea on which Barcelona is a port. It carries other apt connotations. "Our sea" refers to a sea of processors and professors who are flocking to the MareNostrum project with a deep commitment to breakthrough science. MareNostrum also refers to the fact that our supercomputer is on the shores of the Mediterranean which, in the days of old Rome, was the middle of the world. This was the center of the Roman Empire, now to become the center of European e-Science on the shores of the nice Mediterranean Sea! Thus, we are talking about an ocean of many professors and a major hub around which such facilitation will grow and thrive to empower a new generation of scientists. Another significant aspect of the name is that, being Latin, it is more culturally inclusive. Not everyone is aware that Spain has actually four official languages, and we did not want to slight anyone. Latin was a safe choice. Spain now understandably becomes the proud home to the most powerful supercomputer in Europe. We see references to its having been assembled in Madrid, but also references to its permanent home as being in Barcelona. MareNostrum is a result of the burgeoning partnership between IBM and the Spanish Government, which has also led to the creation of the Barcelona Supercomputing Center (BSC). BSC is a public consortium created by the Spanish Government, the Catalonian Government, and the Technical University of Catalonia (UPC), which will host the MareNostrum supercomputer. 
Housed in a majestic 1920s chapel on the university grounds, MareNostrum serves a dual purpose: To serve as a primary high-performance computing resource for the European e-science community and to demonstrate the many benefits of Linux on POWER in scale. Meet MareNostrum With peak system performance of 40 teraflops for the final system configuration, and a number four spot on the TOP500 list, MareNostrum continues the IBM tradition of high-performance computing breakthroughs in the service of scientific advancement with a twist: MareNostrum is built entirely of commercially available components, including:
* 2,282 IBM eServer BladeCenter JS20 blade servers housed in 163 BladeCenter chassis
* 4,564 64-bit IBM PowerPC 970FX processors
* 140 TB of IBM TotalStorage DS4100 storage servers
The thinking behind MareNostrum's construction represents a new way of looking at these and other compute-intensive areas. Today's typical high-performance computing installation runs a large, parallel RISC-based UNIX system with performance instead of reliability being of utmost importance. MareNostrum, however, is a small-footprint Linux cluster made up entirely of off-the-shelf components. With the extreme density of IBM eServer BladeCenter JS20 servers, diskless nodes, and an open system environment, MareNostrum offers superior price/performance; greater reliability, availability, and serviceability; and significant cost efficiencies -- factors that are endearing Linux-based cluster servers to more and more businesses all the time. Distinguishing technologies The next sections explain the hardware and software technologies that distinguish the high-performance computing strategy behind MareNostrum. Hardware: Servers There are 2,282 IBM eServer BladeCenter JS20 servers housed in 163 BladeCenter chassis. Each server Blade has two PowerPC 970 processors running at 2.20GHz, providing superior performance for several varieties of Linux.
The BladeCenter technology offers the highest commercially available computer density in the industry, which results in high performance with a small footprint. The BladeCenter technology allows for 84 dual processor servers in a single 42U rack, giving more than 1.4 teraflops of compute power in a single rack. Hot-swappable JS20 servers also allow administrators to change servers without disrupting applications, maximizing availability. Its shared-resources architecture helps to minimize power consumption and heat output, as well. Hardware: Storage MareNostrum's storage subsystem consists of 20 storage server nodes with 7 terabytes of capacity each or 140 terabytes of total capacity. Its backbone is the IBM TotalStorage DS4100 storage server which, like the BladeCenter JS20, uses redundant hot-swappable components for high availability. IBM TotalStorage DS4100 technology enables tremendous scalability and a wide range of RAID data protection options. Hardware: Switching Four switch frames with Myrinet, including 10 CLOS 256+256 switches and 2 Spine 1280s, and densely bundled Myrinet cabling enable faster parallel processing with less switching hardware. The redundant hot-swappable power supply ensures greater availability. The complete switch with 12 chassis provides for 2,560 uniform ports. This uniformity simplifies the programming model so researchers can focus on their programs and not the system interconnect architecture. Software: The power of Linux on POWER The Linux 2.6 kernel offers an array of enterprise and performance features that exploit the Power Architecture. The virtualization capabilities of Linux on POWER allow for more flexible partitioning, better balancing of workloads, and superior scalability should workloads increase. Dr. Porta explained, "It is the Linux 2.6 kernel which offers an array of enterprise and performance features that exploit the Power Architecture."
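The blade, rack, and peak-performance figures quoted in the article can be cross-checked with a little arithmetic. This is only a sketch: the blade, chassis, clock, and blades-per-rack numbers come from the article, but the 14 blades per chassis and the 4 floating-point operations per clock per processor are assumptions that the article does not state.

```python
# Figures taken from the article.
BLADES = 2282
CHASSIS = 163
CPUS_PER_BLADE = 2
CLOCK_GHZ = 2.2
BLADES_PER_RACK = 84

# Assumptions, NOT stated in the article.
BLADES_PER_CHASSIS = 14   # assumed chassis capacity
FLOPS_PER_CYCLE = 4       # assumed: two fused multiply-add units per processor


def peak_gflops(cpus):
    """Theoretical peak in GFLOPS for a given number of processors."""
    return cpus * CLOCK_GHZ * FLOPS_PER_CYCLE


total_cpus = BLADES * CPUS_PER_BLADE
rack_tflops = peak_gflops(BLADES_PER_RACK * CPUS_PER_BLADE) / 1000
system_tflops = peak_gflops(total_cpus) / 1000

print("processors:        ", total_cpus)
print("chassis capacity:  ", CHASSIS * BLADES_PER_CHASSIS, "blades")
print("peak per rack:     ", round(rack_tflops, 2), "teraflops")
print("peak whole system: ", round(system_tflops, 1), "teraflops")
```

Under those assumptions the numbers line up with the article's 4,564 processors, "more than 1.4 teraflops" per rack, and roughly 40 teraflops for the full system.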
Software: Diskless Image Management (DIM) DIM is a prototype utility for managing the Linux distribution for the compute nodes on the storage servers so that the compute node does not have to manage the root file system. All the files for operation are obtained through the cluster network. Because of this, blades can operate immediately without Linux installation. This is on-demand operation. The blades do have a disk drive but that is reserved for future application use such as checkpointing. DIM also supports the network boot environment in a highly distributed fashion. Software: IBM Linux on POWER clustering technologies The goal is to endow MareNostrum with the same benefits businesses in many industries derive from IBM Linux clusters, albeit on a larger scale. Benefits such as:
* Superior density and improved operating efficiency, including smaller space, power, and cooling requirements and related costs -- thanks to the BladeCenter JS20 architecture
* Record price/performance and system throughput for high-performance computing workloads thanks to innovative POWER semiconductor technology, specifically the eight-way superscalar design of the PowerPC 970FX processor which fully supports symmetric multi-processing (SMP)
* The leading IBM 64-bit POWER microprocessors are capable of addressing four billion times the amount of physical memory as traditional 32-bit processors without resorting to complex memory-extension techniques.
* Better systems management control thanks to embedded service processors and software image management
* Increased reliability, availability, and serviceability, as well as lower installation and maintenance costs -- provided by diskless compute nodes
* Improved functionality and performance thanks to the Linux 2.6 kernel
* Reduced switching hardware requirements and faster parallel processing provided by Myrinet switch cabling
* Improved storage subsystem costs and reliability thanks to TotalStorage DS4100 storage technology
View from the crow's nest When the power of MareNostrum is unleashed later this year, it will be at the service of scientific, engineering, and medical researchers in the Spanish and international scientific communities. Its to-do list includes issues that are familiar in the supercomputing world, such as protein folding, in silico (computer generated) drug screening and enzymatic reactions. MareNostrum will be used to support basic and applied research in areas that include biology, chemistry, physics, and information-based medicine. As Dr. Porta summed up: ...[T]he very thinking that drove MareNostrum's construction is a new way of looking at compute-intensive areas, particularly in the life sciences, as we prepare new work to resolve challenging problems in information based medicine -- including improvements in diagnostic and therapeutic treatments in hospitals. In the EU context, many of the projects will be conducted in collaboration with other leading European research institutions. We are building collaborative efforts across geographic borders and disciplines. And remember -- the name of the supercomputer is MareNostrum. Traditionally, it was the Mediterranean Sea which allowed commerce and communication to flourish in Europe and beyond. Resources
* Visit the Project MareNostrum site, demonstrating the value of Linux clustering for science, for business, for life itself.
* MareNostrum is now at home at the Barcelona Supercomputing Center (BSC) on the Polytechnic University of Catalonia (UPC) campus in Barcelona, a prestigious public institution focused on higher education, research, and technology transfer.
* The TOP500 Supercomputer Sites project was started in 1993 to provide a reliable basis for tracking and detecting trends in high-performance computing -- twice a year, the project releases a list of the 500 sites operating the most powerful computer systems.
* See this chart for the Linpack benchmark for MareNostrum and others.
* This news article examines MareNostrum, IBM's top-ranked, off-the-shelf, blade-based supercomputer.
* Connecting two or more IBM eServer Cluster Servers can create a single, unified computing resource that will dramatically improve availability, flexibility, and adaptability for essential services.
* The IBM BladeCenter JS20 is well-suited for commercial mainstream applications and 64-bit high performance computing (HPC) environments.
* The IBM Redbook, The IBM eServer BladeCenter JS20, takes an in-depth look at the two-way Blade eServer for applications requiring 64-bit computing.
* The Linux on IBM eServer product line is Linux-enabled to deliver maximum performance, reliability, manageability, and price/performance benefits.
* See this site for more on how IBM supercomputing solutions can help remove the barriers to deployment of clustered server systems.
* IBM TotalStorage DS400 series has been enhanced with the DS4000 Storage Manager V9.10, enhanced remote mirror option, DS4100 option for larger capacity configurations, and support for EXP100 serial ATA expansion units.
* Take a look at the Myrinet switches used in MareNostrum.
About the author The developerWorks Power Architecture editors welcome your comments on this article. E-mail them at dwpower@us.ibm.com.
-- Eugen* Leitl leitl ______________________________________________________________ ICBM: 48.07078, 11.61144 http://www.leitl.org 8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE http://moleculardevices.org http://nanomachines.net From patrick at myri.com Wed Feb 16 02:53:03 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108477871.4587.115.camel@s861954.sandia.gov> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <1108477871.4587.115.camel@s861954.sandia.gov> Message-ID: <4213260F.5040303@myri.com> Keith, Keith D. Underwood wrote: >>Looking for overlaping is actually not that hard: >>a) look for medium/large messages, don't waste time on small ones. > > > I contend that this particular item is bad advice. If you send a lot of > small messages, you should use MPI_Isend there as well to give the MPI > implementation every opportunity to do the right thing. As we go > forward, end-to-end acknowledgments are going to become a reality. The I agree. We are strongly considering acking at the lib level instead of at the firmware level in MX. It has many good side effects, and a few evil ones. > last thing you want is to spend a round-trip delay on every message you > send if you send a lot of them. Yes, the implementation can copy on the > sending side to allow the send to complete, but that wastes memory and > time. If you are reliable, you need to be able to resend the data if you don't receive the ack in time.
If you don't want to do a copy, you have to wait for the ack before releasing the send buffer. For small messages, the copy is cheaper than the rtt, IMHO. Are you saying that if someone uses Isend for sending small messages, it's a hint that avoiding the copy is worth it because he is trying to overlap and does not care about latency? Yes, that would be logical. But then you need to have blocking Send to hint the reverse, and then you assume smart people will use blocking Send because they know latency matters at that place, whereas clueless people will use it because it's simpler than Isend. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From patrick at myri.com Wed Feb 16 03:28:00 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <4212182C.60607@verarisoft.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <4212182C.60607@verarisoft.com> Message-ID: <42132E40.1060001@myri.com> Rossen, Rossen Dimitrov wrote: > >> >> So if you run an MPI application and it sucks, this is because the >> application is poorly written ? > > > Patrick, here the argument is about whether and how you "measure" the > "performance of MPI". I guess you may have missed some of the preceding > postings. No, I was pulling your leg :-) The bigger picture is that MPI has no performance in itself; it's middleware. You can only measure the way an MPI implementation enables a specific application to perform. Only benchmarking of applications is meaningful; you can argue that everything else is futile and bogus. >> You don't want to benchmark an application to evaluate MPI, you want >> to benchmark an application to find the best set of resources to get >> the job done. If the code stinks, it's not an excuse.
Good MPI >> implementations are good with poorly written applications, but still >> let smart people do smart things if they want. > > > This is exactly my point made in my previous posting - you cannot design > a system that is optimal in a single mode for all cases of its use when > there are multiple parameters defining the usage and performance I agree completely: being able to apply different assumptions for the whole code and see which one best matches the application's behavior is better than nothing. However, I believe that some tradeoffs are just too intrusive: you should not have to choose between low latency for small messages and progress by interrupt for large ones, especially when you can have both at the same time. > I think it is fairly easy to show that overlapping and polling (or any > kind of communication completion synchronization) are not orthogonal. If > this was the case, you would see codes that show perfect overlapping > running on any MPI implementation/network pair. I am sure there is > plenty of evidence this is not the case. I can show you codes where people sprinkled some MPI_Test()s in some loops. They don't poll to death, just a little from time to time to improve overlap by improving progression. They poll and they overlap. They could as well block and not overlap. Polling/blocking and overlap/no overlap are not linked. Interrupts are useful to get overlap without help from the application, but they are not required to overlap. > There is an important point here that needs to be clarified: when I say > "polling" library, I assume that this library does both: polling > completion synchronization and polling progress. There is not much room > to define here these but I am sure MPI developers know what they are. I think this is where we don't understand each other. For me, polling means no interrupts. Whether you progress in the context of MPI calls or in the context of a progression thread, you pay for the same CPU cycles.
If the application is providing CPU cycles to the MPI lib at the right time, you can overlap perfectly without wasting cycles. > Here is a third one. Writing your code for overlapping with non-blocking > MPI calls and segmentation/pipelining, testing the code, and not seeing > any benefit of it. Yes. This is very true. But if it's not worse than with blocking, they should stick with non-blocking, even if it's bigger and more confusing. > stage I with communication in stage I+1. Then, there is the question how > many segments you use to break up the message for maximum speedup. The > pipelining theory says the more you can get the better, when they are > with equal duration, there aren't inter-stage dependencies, and the > stage setup time is low in proportion to the stage execution time. Also, The more steps, the more overhead. Small pipeline stages decrease your startup overhead (when the second stage is empty) but increase the number of segments and the total cost of the pipeline. The best is to find a piece of computation long enough to hide the communication. Pipelining would be overkill in my opinion. > The metric I mentioned earlier "degree of overlapping" with some > additional analysis can help designers _predict_ whether the design is > good or not and whether it will work well or not on a particular system > of interest (including the MPI library). Temporal dependency between buffers and computation is the metric for overlapping. The longer you don't need a buffer, the better you can overlap a communication to/from it. Compilers could know that. > This is however too much detail for this forum though, as most of the > postings here discuss much more practical issues :) I am bored with cooling questions. However, it's quite time consuming to argue by email. I don't know how RGB can keep the distance :-) Patrick -- Patrick Geoffray Myricom, Inc.
http://www.myri.com From ashley at quadrics.com Wed Feb 16 03:26:55 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050216080525.GA3122@greglaptop.attbi.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <20050216080525.GA3122@greglaptop.attbi.com> Message-ID: <1108553215.14604.9.camel@localhost.localdomain> On Wed, 2005-02-16 at 00:05 -0800, Greg Lindahl wrote: > Quadrics STEN (forgive me for classing this as > dumb, I happen to think dumb is a compliment...) get this right. In this context the STEN is used on the transmit side of the network as a way of doing effectively PIO writes directly into the network. On the receive side the NIC is anything but dumb and does the MPI tag matching. It almost entirely bypasses the CPU, leaving it free to do *whatever the application desires*. Interestingly enough, the STEN is a very good example of what is being discussed here: doing a remote write (or MPI send) using the STEN is lower latency than using a DMA but uses more CPU cycles (as the STEN needs the data to be "pushed" from the main CPU whereas a (R)DMA only needs the DMA descriptor to be "pushed" and the NIC then "pulls" the actual data).
Ashley, From patrick at myri.com Wed Feb 16 03:53:43 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <20050215130656.8572F1C818@amd64.cownie.net> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <4211CB13.3050902@myri.com> <20050215130656.8572F1C818@amd64.cownie.net> Message-ID: <42133447.9050207@myri.com> Hi James, James Cownie wrote: > As someone who was on the MPI Forum, and sat through an awful lot of > meetings, I'd like to provide some justification for _why_ we didn't try > to make a binary standard. No, I imagine the context was very different 10 years ago. I just don't understand why dynamic spawning, one-sided communications and MPI-IO were added to the Standard, but nobody wanted to address the mpi.h header compatibility issue. By that time, people knew that it was a problem, no? > You seem to think (maybe subconsciously) that the MPI forum added > features to the standard just to make life hard for implementors and > to kill performance ;-) Well, it was the right thing to be as exhaustive as possible to ensure the wide adoption of the standard. It was expert friendly, but easy for the application folks to miss the points or take shortcuts. That's the cost of success. Now, I would hate to see a shared memory paradigm emerge to progressively replace MPI because existing applications don't really try to leverage the message passing paradigm capabilities. Some believe it will never happen; I am not so sure. > If you _really_ believe that there is so much performance benefit for > your customers in having an MPI-light with the restrictions you outlined > which only runs on your hardware, then no-one's stopping you from > providing it.
This discussion is a beginning. It will only happen if all/most MPI implementors reach a point where it's clear that to move forward, some semantics have to be avoided and some ambiguities cleared, and that can only be done at the API level. I would prefer that the MPI forum focus on improving the core message passing functionalities instead of adding yet another vertical dimension (what's left for MPI-3 ?). The urgent thing however is the ABI. Can we do that ? Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From ashley at quadrics.com Wed Feb 16 03:55:37 2005 From: ashley at quadrics.com (Ashley Pittman) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <421306B5.3080200@myri.com> References: <20050214190737.GB1359@greglaptop.internal.keyresearch.com> <42112719.4060500@verarisoft.com> <4211A95F.2010709@myri.com> <4211B0E0.6030007@ccrl-nece.de> <4211C559.8070100@myri.com> <1108479093.4587.132.camel@s861954.sandia.gov> <421306B5.3080200@myri.com> Message-ID: <1108554937.14604.17.camel@localhost.localdomain> On Wed, 2005-02-16 at 03:39 -0500, Patrick Geoffray wrote: > In general yes, more opportunities for optimization is better. Now, > assuming that irregular datatypes can be optimized as much as regular > ones is wrong. The hardware can gather/scatter better than the > application for nice long strides. > However, MPI libs should print > insults when tiny segments are used (when the scatter/gather efficiency > collapse). The developer assumes that it's fine because he does not > know or he does not care. I have seen code that used a multi megabyte array of 64bit float/short pairs, effectively having 10 bytes of data and 6 bytes of "space" in each element. Changing this to a 64bit float and two 32bit ints removed the void space and replaced it with deliberate zero data. The "data transferred" went up, application buffer sizes remained the same and performance was a whole lot better.
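The "space" in such float/short pairs is alignment padding, and it can be seen directly with Python's ctypes, which lays out structures the way the platform's C compiler would. This is a sketch rather than the application's actual code: the structure and field names are made up, and the exact sizes depend on the platform ABI (on common 64-bit systems the pair occupies 16 bytes, of which 10 carry data).

```python
import ctypes


class FloatShortPair(ctypes.Structure):
    # Hypothetical element: a 64-bit float followed by a 16-bit short.
    _fields_ = [("value", ctypes.c_double),
                ("tag", ctypes.c_short)]


class FloatTwoInts(ctypes.Structure):
    # Hypothetical replacement: a 64-bit float and two 32-bit ints,
    # the second int carrying deliberate zero data instead of padding.
    _fields_ = [("value", ctypes.c_double),
                ("tag", ctypes.c_int32),
                ("zero", ctypes.c_int32)]


def payload_bytes(struct_type):
    """Bytes occupied by the declared fields themselves."""
    return sum(ctypes.sizeof(ftype) for _name, ftype in struct_type._fields_)


def padding_bytes(struct_type):
    """Bytes the ABI adds to each element for alignment."""
    return ctypes.sizeof(struct_type) - payload_bytes(struct_type)


for t in (FloatShortPair, FloatTwoInts):
    print(t.__name__, "size:", ctypes.sizeof(t),
          "payload:", payload_bytes(t), "padding:", padding_bytes(t))
```

With the second layout every byte of the element is declared data, so a datatype describing it is a single contiguous block rather than many small segments with holes in between.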
The application writer had used a short to "save space" and was somewhat stunned at the performance improvement. This is a situation that would be best avoided, maybe user education is the key but it's a common problem and there are an awful lot of users. I'm not against complex datatypes on MPI but they are hard to deal with and do get mis-used. Ashley, From patrick at myri.com Wed Feb 16 04:04:31 2005 From: patrick at myri.com (Patrick Geoffray) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <1108553215.14604.9.camel@localhost.localdomain> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <1108398183.8243.54.camel@localhost.localdomain> <1108402962.8265.25.camel@localhost.localdomain> <4211B891.6020406@ccrl-nece.de> <20050216080525.GA3122@greglaptop.attbi.com> <1108553215.14604.9.camel@localhost.localdomain> Message-ID: <421336CF.5020505@myri.com> Ashley Pittman wrote: > Interesting enough the STEN is a very good example of what is being > discussed here, doing a remote write (Or MPI send) using the STEN is > lower latency than using a DMA but uses more CPU cycles (as the STEN > needs the data to be "pushed" from the main CPU whereas a (R)DMA only > needs the DMA descriptor to be "pushed" and the NIC then "pulls" the > actual data). It seems to be common practice to use PIO for small messages on the send side. MX/Myrinet does that too (whereas GM/Myrinet does not), SCI does it, Greg's IB on HT does it. I don't know who is not burning some cycles to get lower latency for small messages these days. Patrick -- Patrick Geoffray Myricom, Inc. http://www.myri.com From rgb at phy.duke.edu Wed Feb 16 04:17:10 2005 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Re: Re: Home beowulf - NIC latencies In-Reply-To: <42132E40.1060001@myri.com> References: <200502112000.j1BK0DNm021457@bluewest.scyld.com> <420D1801.9090206@isaacdooley.com> <420D54DA.8000904@uiuc.edu> <420DA793.4000909@verarisoft.com> <421194C4.5050808@myri.com> <4212182C.60607@verarisoft.com> <42132E40.1060001@myri.com> Message-ID: On Wed, 16 Feb 2005, Patrick Geoffray wrote: > > This is however too much detail for this forum though, as most of the > > postings here discuss much more practical issues :) > > I am bored with cooling questions. However, it's quite time consuming to > argue by email. I don't know how RGB can keep the distance :-) > > Patrick > I stuck a hairpin into an electrical socket at age 2 (an "enlightening" experience I must say) and had a large rock fall on my head from a height of almost a meter at age 8. Since then, I hardly ever get bored with cooling questions, because I cannot remember that they've been asked. What were we talking about, again? Oh yeah, MPI and all that. I've actually been enjoying reading the discussion and not participating, since I'm a PVM kinda guy. But SINCE my name was invoked in vain, I'll make a single comment on the code quality issue, which is that underlying the discussion of communication pattern, blocking vs non-blocking, and directives is the fundamental scaling properties of the code and algorithm itself. So on the issue of whether MPI sucks because the application sucks -- well, possibly, but it seems more likely that the application sucks because its parallel scaling properties (with the algorithm chosen) suck. As to how "intelligent" the back end library should be at choosing algorithm -- I would say the BASIC library should be atomic, elementary, NOT algorithm level stuff. A thin skin on top of raw networking calls that provides the various things one always has to do oneself but not much more. 
Where one gets into trouble is where one uses a command that has a complex structure that doesn't fit your code without realizing it, and the reason you don't realize it is because all that detail is hidden, and isn't even uniform in RELATIVE performance across varying network hardware. In other words, to make MPI do more, either make it do less (in the form of commands that can be used to build "more" in a manner that is tuned to application and hardware) or be prepared to REALLY make it SMART behind the scenes. This isn't just MPI, BTW. PVM suffers from the same thing. I honestly think that both are limited tools in part BECAUSE they put too thick a skin between the programmer and the network. If you want real performance and complete control over communication algorithm, you probably have to use raw/low level networking commands, and write the appropriate "collective" operations for your particular application and hardware. Of course nobody does this -- not portable and a PITA to design/write/maintain. Or perhaps a few people DO do this, but they're programming gods. And this isn't crazy, really. rgb -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From mathog at mendel.bio.caltech.edu Wed Feb 16 08:16:25 2005 From: mathog at mendel.bio.caltech.edu (David Mathog) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? Message-ID: In most universities services like electricity, water, and A/C are paid for by the school. To do so they take "overhead" out of every grant. Partially as a consequence of this they typically have a very poor ability to meter usage on a room by room basis. 
Now somewhere between the 10 node Pentium II beowulf sitting on a lab bench and the 1000 node dual P4 Xeon beowulf in a machine room that takes up half the basement the cost of the electricity (both for power and A/C) goes from a minor expense to a major one. Really major. For instance, in that hypothetical large machine, at 10 cents per kilowatt hour (a round number), assuming 100 watts per CPU (another round number) that's:

    1000  (nodes) *
       2  (cpus/node) *
      .1  (kilowatts/cpu) *
      .1  (dollars/kilowatt-hour) *
     365  (days/year) *
      24  (hours/day) =
  -----------------------
  175200  dollars/year

The A/C expense is going to vary tremendously depending upon the outside temperature. It's going to be much higher for us in Southern California than for a site in Anchorage. "Typical" lab usage is widely variable but I'd be amazed if most biology or chemistry labs burn through even 1/10th this much for the equivalent lab area. Some physics lab running a tokamak might come close. Anyway, the question is, have any of the universities said "enough is enough" and started charging these electricity costs directly? If so, what did they use for a cutover level, where usage was "above and beyond" overhead? From an economic perspective having electricity and A/C come out of overhead (without limit) grossly distorts the true cost of the project over time and can lead to choices which increase the total overall cost. For instance, the use of Xeons instead of Opterons has little effect on TCO if somebody else is picking up the electricity tab, but could change the power consumption significantly on a large project. Regards, David Mathog mathog@caltech.edu Manager, Sequence Analysis Facility, Biology Division, Caltech From rgb at phy.duke.edu Wed Feb 16 09:22:35 2005 From: rgb at phy.duke.edu (Robert G. Brown) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Academic sites: who pays for the electricity?
In-Reply-To: References: Message-ID: On Wed, 16 Feb 2005, David Mathog wrote: > In most universities services like electricity, water, and > A/C are paid for by the school. To do so they take "overhead" > out of every grant. Partially as a consequence of this they > typically have a very poor ability to meter usage on a room > by room basis. > > Now somewhere between the 10 node Pentium II beowulf sitting on > a lab bench and the 1000 node dual P4 Xeon beowulf in a machine > room that takes up half the basement the cost of the electricity > (both for power and A/C) goes from a minor expense to a major > one. Really major. For instance, in that hypothetical large machine, > at 10 cents per kilowatt hour (a round number), assuming 100 watts > per CPU (another round number) that's: > > 1000 (nodes) * > 2 (cpus/node) * > .1 (kilowatts/cpu) * > .1 (dollars/kilowatt-hour) * > 365 (days /year) * > 24 (hours/day) = > ----------------------- > 175200 dollars/year I usually assume $1/watt/year (including AC) which is likely to be good within 20% or so depending on the actual cost of electricity in your area and amount of AC required on a seasonally averaged basis. That yields an estimate of $200K in your example -- not really different, just easier to do in your head as a round number. > > The A/C expense is going to vary tremendously depending upon > the outside temperature. It's going to be much higher for us > in Southern California than for a site in Anchorage. > > "Typical" lab usage is widely variable but I'd be amazed > if most biology or chemistry labs burn through even 1/10th this > much for the equivalent lab area. Some physics lab running > a tokamak might come close. > > > Anyway, the question is, have any of the universities said "enough > is enough" and started charging these electricity costs directly? > If so, what did they use for a cutover level, where usage was > "above and beyond" overhead? 
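Both estimates so far, the itemized one in the quoted message and the $1/watt/year rule of thumb, reduce to a couple of multiplications. A minimal sketch follows; the function names are made up for illustration, and the itemized version is assumed to cover CPU power only, with A/C left out, as the quoted calculation appears to do.

```python
HOURS_PER_YEAR = 365 * 24


def itemized_annual_cost(nodes, cpus_per_node, kw_per_cpu, dollars_per_kwh):
    """Electricity cost per year from first principles (A/C not included)."""
    total_kw = nodes * cpus_per_node * kw_per_cpu
    return total_kw * HOURS_PER_YEAR * dollars_per_kwh


def rule_of_thumb_annual_cost(nodes, cpus_per_node, watts_per_cpu,
                              dollars_per_watt_year=1.0):
    """Roughly $1 per watt per year, A/C included (the rule of thumb)."""
    return nodes * cpus_per_node * watts_per_cpu * dollars_per_watt_year


# The hypothetical 1000-node dual-CPU cluster at 100 W per CPU.
itemized = itemized_annual_cost(1000, 2, 0.1, 0.10)
rule_of_thumb = rule_of_thumb_annual_cost(1000, 2, 100)

print(f"itemized:      ${itemized:,.0f}/year")
print(f"rule of thumb: ${rule_of_thumb:,.0f}/year")
```

The gap between the two results is the allowance the rule of thumb makes for A/C and rounding, which is within the 20% or so it claims.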
This issue has most definitely come up at Duke, although we're still seeking a formula that will permit us to deal with it equitably. This is only one of several pieces of overhead associated with clusters that go above and beyond the assumptions that went in to the original indirect costs formulas. For example, Duke now charges grants a "recycling fee" for certain pieces of environmentally toxic end-of-life hardware (e.g. monitors, with their lead-filled screens). Then there are the really HUGE costs for physical space renovations as valuable and scarce campus space is converted for use in the burgeoning clusters. As our Dean of A&S recently remarked, if there aren't any checks and balances or cost-equity in funding and installing clusters, they may well continue to grow nearly exponentially, without bound (Duke's cluster population is doubling almost according to Moore's Law -- every couple of years). Costs associated with those clusters from the space to hold them, the power to run them, and the people to operate them, all grow roughly linearly with the number of nodes. This much is known. What isn't known is the details of the income stream. Each cluster (or part of a cluster) is typically connected with a specific grant-funded project and its associated income stream. Indirect costs >>are<< assessed on those grants; it may be that on average, enough income comes from those indirect costs to easily support the clusters. This isn't crazy -- it is really a question of just what the ratio is of supported people and other IC-producing expenses are to the number of cluster nodes associated with the research. I wish I knew this number -- it would be very useful in a CWM column;-) -- but I don't, and last I heard Duke still didn't know either, although they are perhaps moving slowly towards expending the energy required to find out. 
Finding out isn't trivial -- it involves running down ALL the clusters on campus, figuring out whom ALL those nodes "belong" to, determining ALL the grant support associated with all those people and projects and clusters (since even research done without a cluster by a person who runs a cluster has to be considered as contributing, as the cluster may be "essential" to retaining that person), figuring out what the sum of the indirect costs are on all those grants, and finally connecting that total to the estimated cost of running all the nodes. By enabling more research projects, postdocs, laboratory operations, and other grant-funded activity to occur their presence on campus might MAKE the university money, who knows? Indirect cost formulas actually tend to EXCLUDE capital equipment such as clusters. If it didn't the University would have made something on the order of 50% indirect costs on the roughly $2M the hardware in your example above would cost, and out of the resulting $1M (noting that the total grant would have had to be $3M for the hardware alone) plus overhead on the salary of the 2-3 people likely to be hired to run the 1000 node cluster, they could have easily paid for power for 3-5 years. So one proposal is to no longer exclude clusters from indirect cost assessments. Of course this "solution" creates another problem just as big -- will granting agencies stand for this? There is a reason indirect costs aren't charged on capital equipment and it isn't because Universities don't WANT to charge them, it is because many granting agencies flatly refuse to pay them. Some do -- IIRC, NIH is pretty tolerant about indirect costs associated with hardware, probably because in medical research they "expect" to have to support entire labs as there is less likelihood of having a teaching stream of income to partially defray the costs. NSF does not, and I don't believe the DoD or DOE grants like to as well. 
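The capital-equipment argument above can be put in numbers. This is a rough sketch using only the figures from this thread ($2M of hardware, a 50% indirect cost rate, about $200K per year for power and A/C); the function names are made up, and the overhead on staff salaries mentioned above is ignored.

```python
def indirect_income(direct_cost, indirect_rate):
    """Indirect costs collected on a direct cost at a given rate."""
    return direct_cost * indirect_rate


def years_of_power_covered(indirect_dollars, power_cost_per_year):
    """How many years of utility bills the indirect income would pay for."""
    return indirect_dollars / power_cost_per_year


hardware = 2_000_000          # cluster hardware, from the example above
rate = 0.50                   # indirect cost rate, from the example above
power_per_year = 200_000      # $1/watt/year estimate for 200 kW

overhead = indirect_income(hardware, rate)
total_grant = hardware + overhead
years = years_of_power_covered(overhead, power_per_year)

print(f"indirect income: ${overhead:,.0f}")
print(f"total grant:     ${total_grant:,.0f}")
print(f"years of power:  {years:.1f}")
```

With these inputs the overhead alone covers about five years of power, which is the upper end of the 3-5 year range given above; any other costs charged against that overhead would shorten it.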
Another proposal is to just force clusters to budget and pay their own utility bills. I don't know how this would fly with grant agencies. They might be irritated if they had to pay for both the utilities and for indirect costs on the utility money (basically paying 1.5x or so of the cost of the power/AC used, so that the University would actually make another $100K in overhead in your example above), but they might hold still for the $200K/year for power alone. They almost certainly WOULD pay for utilities for clusters in places other than Universities, so this isn't so big a jump. > >From an economic perspective having electricity and A/C come out > of overhead (without limit) grossly distorts the true cost > of the project over time and can lead to choices which increase > the total overall cost. For instance, the use of Xeons instead of > Opterons has little effect on TCO if somebody else is picking > up the electricity tab, but could change the power consumption > significantly on a large project. Absolutely. Or, using shelved tower units vs 2U rackmounts vs 1U rackmount nodes, when space is "scarce" and hence expensive. Or requiring each node to have remote management hardware, PXE network cards, 3 year onsite service plans -- all of these choices will be very differently made depending on how the chooser is constrained and who is paying for what. I don't have a really perfect solution to this dilemma, and indeed I think it is a bit premature to expect one. When SOME institution does a real CBA on the total cash flow associated with grant-funded cluster-based research projects, including the more esoteric benefits such as "institutional prestige" (which is serious business, don't forget -- a weight factor that affects ALL grants submitted from an institution) perhaps we can start to think about which clever idea for recovering costs is realistic and fair.
In the meantime, budgets of the groups that actually pay these costs continue to get a wee bit strained as the number of nodes and associated costs continue to spiral upward. Maybe I'll do a column on this soon. I did a whole article on infrastructure for Linux Mag a year or two ago, but the particular aspect of infrastructure that you raise is still unresolved. I wonder if I could get Duke people to expedite collecting and assembling the data required to get the big picture on this...? rgb > > Regards, > > David Mathog > mathog@caltech.edu > Manager, Sequence Analysis Facility, Biology Division, Caltech > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf > -- Robert G. Brown http://www.phy.duke.edu/~rgb/ Duke University Dept. of Physics, Box 90305 Durham, N.C. 27708-0305 Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb@phy.duke.edu From James.P.Lux at jpl.nasa.gov Wed Feb 16 10:56:03 2005 From: James.P.Lux at jpl.nasa.gov (Jim Lux) Date: Sat Jul 4 01:03:53 2009 Subject: [Beowulf] Academic sites: who pays for the electricity? In-Reply-To: References: Message-ID: <6.1.1.1.2.20050216104144.07664e40@mail.jpl.nasa.gov> At 09:22 AM 2/16/2005, Robert G. Brown wrote: >On Wed, 16 Feb 2005, David Mathog wrote: > > > In most universities services like electricity, water, and > > A/C are paid for by the school. To do so they take "overhead" > > out of every grant. Partially as a consequence of this they > > typically have a very poor ability to meter usage on a room > > by room basis. > > > > >I don't have a really perfect solution to this dilemma, and indeed I >think it is a bit premature to expect one.
When SOME institution does a >real CBA on the total cash flow associated with grant-funded >cluster-based research projects, including the more esoteric benefits >such as "institutional prestige" (which is serious business, don't >forget -- a weight factor that affects ALL grants submitted from an >institution) perhaps we can start to think about which clever idea for >recovering costs is realistic and fair. In the meantime, budgets of the >groups that actually pay these costs continue to get a wee bit strained >as the number of nodes and associated costs continue to spiral upward. > Such issues come up ALL the time in any government funded research. And, the more government oversight, the more data you have to collect on such "burden" and "overhead". An extreme might be a Defense Department (or NASA) Cost Reimbursement type contract (aka Cost Plus... note well: there are NO government contracts that are cost plus percentage of cost; they're illegal. The fee amount is fixed, or based on award criteria, but does not depend on the amount spent, except perhaps in a negative fashion (bust a spending cap, and your award/incentive fee gets smaller)). In such cases, the funding source is VERY interested in just how you calculated "cost", and therein lies much accounting. There's a sort of pendulum-type swing back and forth for certain types of costs (and management philosophies). Do you count telephone service as an overall burden (raising your "overhead" percentage, but reducing the project's "Other direct costs (ODC)") or do you charge back the project for the cost of the phone line, plus usage, plus some management "tax"? The latter reduces your overhead percentage, but increases the "direct costs". Same dollars flow either way, but in the latter case you WILL spend more time accounting for the other direct costs.
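A small sketch of the two bookkeeping choices follows. The dollar figures are entirely made up, the management "tax" is set to zero so the totals can be compared directly, and the overhead rate is taken as overhead pool divided by direct labor, which is just one convention assumed here for illustration. None of the names or numbers come from a real contract.

```python
def project_bill(labor, overhead_pool, other_direct_costs):
    """Return (overhead rate, total billed) for one way of booking costs.

    The overhead rate is computed as overhead pool / direct labor,
    an assumed convention used only for this illustration.
    """
    rate = overhead_pool / labor
    total = labor + overhead_pool + other_direct_costs
    return rate, total

LABOR = 100_000            # hypothetical direct labor, dollars
GENERAL_OVERHEAD = 46_000  # hypothetical overhead other than phones
PHONE = 4_000              # hypothetical telephone cost

# Option A: phone service is part of the overall burden.
rate_a, total_a = project_bill(LABOR, GENERAL_OVERHEAD + PHONE, 0)

# Option B: phone service is charged back to the project as an ODC.
rate_b, total_b = project_bill(LABOR, GENERAL_OVERHEAD, PHONE)

print(f"A: overhead rate {rate_a:.0%}, total ${total_a:,}")
print(f"B: overhead rate {rate_b:.0%}, total ${total_b:,}")
```

The overhead percentage is lower in option B while the total billed is identical, which is the point made above.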
I suppose that in academia, the grantee might be sheltered a bit by the institutional processes, but in most other environments, it's been a reality for a long time. Different companies have different philosophies on the approach, and either works, and will generally pass muster with the auditors. It does make evaluating proposals a bit trickier. Taken to an extreme, we have the health care industry approach of "code and cost every item", so that the acetaminophen they give you after delivering a baby or having your gall bladder removed shows up on the bill as "Dispense acetaminophen, 2 tablets at 100mg" and "Administer acetaminophen, 100mg", each with separate charges near $10. Sadly, that $10 probably is a realistic cost, too, considering that some non-zero amount of time was spent to enter the transactions into a database, requiring the use of trained "medical coders" who know the procedure codes for everything, as well as the capital and operating costs of the terminal and computer they're using. I'm sure that clusters in industry face the question of Cost/Benefit analysis, including infrastructure impact. Certainly this is the case for desktop PCs and mainframes in at least one industry where my wife is employed. Questions such as David raised are only going to become more and more common as the drive for "accountability" increases. Even within government agencies, such as NASA, the drive for "Full Cost Accounting" (which essentia