From ajt at rri.sari.ac.uk Wed Oct 1 03:49:11 2008
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To: <05B21F91-7E08-4015-B23E-AED651AE457A@cesr.fr>
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
 <05B21F91-7E08-4015-B23E-AED651AE457A@cesr.fr>
Message-ID: <48E355A7.6070008@rri.sari.ac.uk>

Jürgen Knödlseder wrote:
> Hi Tony,
>
> I'm in the same situation as you are: I'm running an openMosix cluster,
> but since it's more and more difficult to integrate new hardware with
> an old 2.4.26 kernel I think that I have to move to MOSIX2.
> I just got the latest version of MOSIX2 sent from Amnon Barak (I'm using
> the cluster for academic work), and started to play around with
> installations ... yet I had some kernel crash problems that I could not
> yet resolve on one of the machines. So I guess without the $1,000 per
> year support fee it'll also be difficult to run a MOSIX2 cluster ...

Hello, Jürgen.

I ported 2.4.26-om1 to compile it using the 'new' GNU tool-chain and run
it under Ubuntu 6.06.1. I also patched the kernel.org 2.4.32 sources for
openMosix (to get SATA drivers). However, I couldn't get openMosix
process migration working properly between 2.4.32-om1 kernels...

Ironically, process migration worked between a 2.4.26-om1 and a
2.4.32-om1 kernel! Anyway, I gave up the attempt at 2.4.32-om1, and just
continued to use my version of 2.4.26-om1 in production with SCSI
controllers. If you're interested, my Ubuntu 6.06.1 openMosix debs are at:

http://bioinformatics.rri.sari.ac.uk/openmosix

Are you going to continue using MOSIX2?

Tony.
--
Dr.
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From ajt at rri.sari.ac.uk Wed Oct 1 04:10:13 2008
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To:
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
Message-ID: <48E35A95.7020908@rri.sari.ac.uk>

Vincent Diepeveen wrote:
> I agree tony that paying for such crap is not a very good idea.

Hello, Vincent.

I don't think MOSIX2 is crap! However, I don't like the idea of having
to pay for 'updates'.

> You might want to move to open-ssi in this case; the project is alive
> and there is in theory work getting performed on support for cards
> over infiniband as well.

I have looked at OpenSSI, which uses the openMosix load-balancer, but
process migration is more coarsely grained than in openMosix. Only the
active pages of the user context of openMosix processes are migrated.

I've been looking at alternatives and I think Kerrighed looks very
promising but, in our hands, Kerrighed is very fragile: I've mentioned
on this list before that if one Kerrighed node goes down you lose the
entire cluster. We've been talking to Christine Morin's group at INRIA
and they tell us that the next release of Kerrighed will be more robust:

http://www.kerrighed.org

> Most importantly is that you are gonna get more replies.

Yes, thanks for yours :-)

> Additionally the manner open-ssi implements shared memory is very
> transparent; in principle on each write it migrates a page
> to the node writing.
>
> Maybe the only big lack of open-ssi is its limited support so far for
> highend network cards.

What bothers me about OpenSSI is that it's based on an open-sourced
version of HP's (formerly Compaq's) discontinued commercial product
"non-stop clusters for Unix". The OpenSSI project also came in for a lot
of criticism from the openMosix community for stealing ideas, so my
concerns about it might not be all that well founded ;-)

The main reason I didn't use OpenSSI, previously, was that many features
had not been implemented fully and, like Kerrighed, it wasn't really a
viable option for a 'production' cluster even though it was interesting
as a research project. What makes me take MOSIX2 seriously now is that
it is a commercially supported 'product' with all the same virtues (and
most of the vices) of openMosix.

Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From ajt at rri.sari.ac.uk Wed Oct 1 04:13:00 2008
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To: <200810010212.00791.mm@yuhu.biz>
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
 <200810010212.00791.mm@yuhu.biz>
Message-ID: <48E35B3C.20401@rri.sari.ac.uk>

Marian Marinov wrote:
> There are a few developers that continue to work on openMosix and to
> port it to 2.6 kernels.
>
> They forked a project called LinuxPMI - Linux Process Migration
> Infrastructure
>
> http://linuxpmi.org
>
> Currently the site is unavailable but there was one Russian guy who
> commented big parts of the openMosix code. He started working on that
> at the beginning of February this year and in May he and a few other
> developers forked LinuxPMI from openMosix.
>
> I had a working 2.6 kernel with openMosix functionality in Jul but I
> never had the chance to test its process migration capabilities as the
> nodes from my home cluster died :(

Hello, Marian.

Great! I didn't know about LinuxMPI: I've not been following openMosix
developments recently. I can't access their web site - Is it mirrored
anywhere?

Thanks,

Tony.
--
Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From ajt at rri.sari.ac.uk Wed Oct 1 04:33:38 2008
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To: <48E35B3C.20401@rri.sari.ac.uk>
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
 <200810010212.00791.mm@yuhu.biz>
 <48E35B3C.20401@rri.sari.ac.uk>
Message-ID: <48E36012.1070406@rri.sari.ac.uk>

Tony Travis wrote:
> [...]
> Hello, Marian.
>
> Great! I didn't know about LinuxMPI: I've not been following openMosix
> developments recently. I can't access their web site - Is it mirrored
> anywhere?

Following up my own message, I meant LinuxPMI of course ;-)

Not much evidence that the LinuxPMI project is still active, though.
Google reports a lot of hits for the pending DNS de-registration of
"linuxmpi.com"...

Anyone else know what's happening with LinuxPMI?

Thanks,

Tony.
--
Dr.
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From ajt at rri.sari.ac.uk Wed Oct 1 05:16:29 2008
From: ajt at rri.sari.ac.uk (Tony Travis)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To:
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
 <05B21F91-7E08-4015-B23E-AED651AE457A@cesr.fr>
 <48E355A7.6070008@rri.sari.ac.uk>
Message-ID: <48E36A1D.2020901@rri.sari.ac.uk>

Jürgen Knödlseder wrote:
> Hi Tony,
>
> in fact, I also patched SATA drivers in the 2.4.26-om kernel. Yet I
> could so far not manage to get the kernel working for my latest DELL
> PE1950 with a Perc 6i driver ... maybe the megaraid_sas driver I use
> is outdated ...

Hello, Jurgen.

I did have 'some' SATA support using 2.4.27-om1 from the ClusterKnoppix
sources, but the drivers were very old. The 2.4 kernel *is* still
supported at Kernel.org and the more recent SATA drivers worked a lot
better. However, I don't want to invest a lot of time and effort keeping
2.4.xx-om1 alive without a critical mass of other openMosix users and
developers. Sadly, whatever the virtues of openMosix, I think it is now
becoming unsupportable for use on a 'production' Beowulf cluster.

> I also searched for LinuxPMI project information, but without any
> success. So I share your feeling that this project is probably dead ...

It's probably just as well, or I would be tempted to join in ;-)

Tony.
--
Dr.
A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Oct 1 05:52:29 2008
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk
In-Reply-To:
References:
Message-ID:

On Tue, 30 Sep 2008, Donald Becker wrote:

> Ahhh, your first flawed assumption.
>
> You believe that the OS needs to be statically provisioned to the nodes.
> That is incorrect.

Well, you also make the flawed assumption that the best technical
solutions are always preferred. From my position I have seen many cases
where political or administrative reasons have very much restricted the
choice of technical solutions that could be used. Other reasons are
related to the lack of flexibility from ISVs which provide applications
in binary form only and make certain assumptions about the way the
target cluster works. Yet another reason is the fact that a solution
like Scyld's limits the whole cluster to running one distribution
(please correct me if I'm wrong), while a solution with node "images"
allows mixing Linux distributions at will.

> The only times that it is asked to do something new (boot, accept a
> new process) it's communicating with a fully installed, up-to-date
> master node. It has, at least temporarily, complete access to a
> reference install.

I think that this is another assumption that holds true for the Scyld
system, but there are situations where this is not true. Some years ago
I developed a rudimentary batch system for which the master node only
contacted the first node allocated/desired for the job; this node was
then responsible for contacting the other nodes allocated/desired and
starting the rest of the job.
This was very much modelled after the way the naive rsh/ssh based
launchers for MPI jobs work: once mpirun is running, there is no
connection to the master node, only between the node where mpirun is
running and the rest of the nodes specified in the hosts file. I think
that Torque also has a similar design (Mother Superior being in control
of the job), but I haven't looked closely at the details so I might be
wrong.

> If you design a cluster system that installs on a local disk, it's
> very difficult to adapt it to diskless blades. If you design a
> system that is as efficient without disks, it's trivial to
> optionally mount disks for caching, temporary files or application
> I/O.

If you design a system that is flexible enough to allow you to use
either diskless or diskfull installs, what do you have to lose?

The same node "image" can be used in several ways:
- copied to the local disk and booted from there (where the copying
  could be done as a separate operation followed by a reboot or it can
  be done from initrd)
- used over NFS-root
- used as a ramdisk, provided that the node "image" is small enough

Note: I have used "image" in this and previous e-mails to signify the
collection of files that the node needs for booting; most likely this is
not a FS image (like an ISO one), but it could also be one. Various
documents call this a "virtual node FS", "chroot-ed FS", etc.

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu@iwr.uni-heidelberg.de

From rgb at phy.duke.edu Wed Oct 1 07:18:20 2008
From: rgb at phy.duke.edu (Robert G.
Brown)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] precise synchronization of system clocks
In-Reply-To: <20081001020421.GA8126@compegg.wr.niftyegg.com>
References: <20081001020421.GA8126@compegg.wr.niftyegg.com>
Message-ID:

On Tue, 30 Sep 2008, Nifty niftyompi Mitch wrote:

> Also while focusing on network/transport in this discussion none of us
> made a comment on rotational latency as a source of uncertainty for the
> kernel state. If we had the ability to synchronize the systems exactly
> starting a process would lag for want of rotational/seek disk latency
> in beowulf'y.

Sure, but starting processes is presumed to occur once and then (an
efficient computation) proceeds for a long time. All the parallelized
tasks effectively arrive at the same point in the computation at the
first barrier AFTER all this is finished. If the communications involve
giving the kernel a slice at (say) the end when the barrier is released
for all nodes at "the same time", then the distributed kernels CAN have
roughly the same state IF one suppresses enough of the "random" sources
of state-noise -- asynchronous demands for the kernel's attention. To
the extent that the kernel can accumulate all of its regular
housekeeping and do it at an elective time, if one gets the system
clocks together within a usec and the kernel does its elective work on
clock ticks, that work will end up being done (mostly) synchronously
across the nodes.

Truthfully, all one is trying to do is to generalize your parallel
process to have a double synchronous barrier, with one phase of the
computation being "kernel housekeeping".

  Compute (in parallel) -> barrier (IPCs) -> Kernel (in parallel) ->
  barrier -> Compute -> barrier -> Kernel -> barrier ... ad infinitum.

If the work done by the kernel is fairly tightly bounded -- predictably
completes in (say) 100 usec (which is a LOT of time, far more than one
almost ever sees) -- one STILL has 900 usec per tick to work on your
compute task.
If the kernel (more reasonably) completes in 1-10 usec your cluster
should have a 99+% duty cycle but avoid the "noise" that desynchronizes
everything.

>
> Shared memory machines and transports will behave differently.
>
> The very high accuracy and high precision clock synchronization is a
> very real problem for some data gathering systems. Once the data is
> gathered the computation should be less sensitive. These are different
> problems and might be addressed by the data sampling devices.
>
> Synchronization brings problems.... for example a well synchronized
> campus can hammer yp server and file servers when cron triggers the
> same actions on 5000+ systems... I try never to fetchmail at the hour,
> half hour...
>
> I suspect that some system cron tasks should no longer run from cron.
> Common housekeeping tasks necessary for system health should be run via
> the batch system in a way that is fashionably late enough to not hammer
> site services.

Absolutely. In fact, you'd want the nodes to be isolated and not running
ANY of this stuff, I'd guess. You'd want the nodes to have quiescent,
non-demanding hardware (except for devices doing the bidding of the
running parallel process) so that nothing "random" needed to be done
that couldn't be saved for the kernel slices.

> One site service of interest is AC power. A modern processor sitting
> in an idle state that then starts a well optimized loop will jump from
> a couple of watts to 100 watts in as many clocks as the set of pipelines
> is deep behind the instruction decode and instruction cache fill. A 1000
> processor (4000 cores) might jump from 4000 watts to 100000 watts in the
> blink of an eye (err did the lights blink). Buffer that dI/dT through
> the PS and it is less but still interesting on the mains which are
> synchronized.

Interesting. I never have seen the lights blink although I don't run
synchronous computations.
One wonders if the power supply capacitors (which should be quite large,
I would think) don't soak up the transient, though, even on very large
clusters. Also, I think that the power differential is smaller than you
are allowing for -- I don't think most idle processors draw "no"
power...

   rgb

>
>
>
>

--
Robert G. Brown                            Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

From prentice at ias.edu Wed Oct 1 08:30:10 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Has DDR IB gone the way of the Dodo?
Message-ID: <48E39782.4020006@ias.edu>

In the ongoing saga that is my new cluster, we were just told today that
Cisco is no longer manufacturing DDR IB cables, which we, uhh, need.

Has DDR IB gone the way of the dodo bird and been supplanted by QDR?

If so, why would anyone spec a brand new cluster with DDR?

--
Prentice

From prentice at ias.edu Wed Oct 1 08:44:01 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Advanced Clustering's Breakin
References: 20670.89.180.225.196.1217678257.squirrel@www.di.fct.unl.pt
Message-ID: <48E39AC1.7080001@ias.edu>

> We have a tool on our website called "breakin" that is Linux 2.6.25.9
> patched with K8 and K10f Opteron EDAC reporting facilities. It can
> usually find and identify failed RAM in fifteen minutes (two hours at
> most). The EDAC patches to the kernel aren't that great about naming
> the correct memory rank, though.
>
> Make sure you have multibit (sometimes says 4-bit) ECC enabled in your
> BIOS.
>
> http://www.advancedclustering.com/software/breakin.html

I've been using breakin for the past week or two on my new cluster. I
get some results that seem to be inconsistent.
For example on a node I'll get this:

Test      | Pass | Fail | Last Message
------------------------------------------
hdhealth  | 315  | 0    | No disk devices found

Then in the log section:

00h 57m 40s: Disabling burnin test 'hdhealth'

If I reboot and restart the testing, it will see a hard disk. Why is
breakin not always seeing the disk?

I've tried to dump logs to a USB drive, but breakin refuses to mount the
correct partition on my usb drive (/dev/sdb vs. /dev/sdb1, or vice
versa).

I sent e-mail to Advanced Clustering regarding these issues, but didn't
get any response, so I'm hoping I have better luck here.

--
Prentice

From jurgen.knodlseder at cesr.fr Wed Oct 1 04:54:44 2008
From: jurgen.knodlseder at cesr.fr (Jürgen Knödlseder)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To: <48E355A7.6070008@rri.sari.ac.uk>
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E1EDE3.2080205@rri.sari.ac.uk>
 <05B21F91-7E08-4015-B23E-AED651AE457A@cesr.fr>
 <48E355A7.6070008@rri.sari.ac.uk>
Message-ID:

Hi Tony,

in fact, I also patched SATA drivers in the 2.4.26-om kernel. Yet I
could so far not manage to get the kernel working for my latest DELL
PE1950 with a Perc 6i driver ... maybe the megaraid_sas driver I use is
outdated ...

I also searched for LinuxPMI project information, but without any
success. So I share your feeling that this project is probably dead ...

Jürgen

On 1 Oct 08, at 12:49, Tony Travis wrote:

> Jürgen Knödlseder wrote:
>> Hi Tony,
>> I'm in the same situation as you are: I'm running an openMosix
>> cluster, but since it's more and more
>> difficult to integrate new hardware with an old 2.4.26 kernel I
>> think that I have to move to MOSIX2.
>> I just got the latest version of MOSIX2 sent from Amnon Barak (I'm
>> using the cluster for academic
>> work), and started to play around with installations ... yet I had
>> some kernel crash problems that
>> I could not yet resolve on one of the machines.
>> So I guess without the $1,000 per year support
>> fee it'll also be difficult to run a MOSIX2 cluster ...
>
> Hello, Jürgen.
>
> I ported 2.4.26-om1 to compile it using the 'new' GNU tool-chain
> and run it under Ubuntu 6.06.1. I also patched the kernel.org
> 2.4.32 sources for openMosix (to get SATA drivers). However, I
> couldn't get openMosix process migration working properly between
> 2.4.32-om1 kernels...
>
> Ironically, process migration worked between a 2.4.26-om1 and a
> 2.4.32-om1 kernel! Anyway, I gave up the attempt at 2.4.32-om1, and
> just continued to use my version of 2.4.26-om1 in production with
> SCSI controllers. If you're interested, my Ubuntu 6.06.1 openMosix
> debs are at:
>
> http://bioinformatics.rri.sari.ac.uk/openmosix
>
> Are you going to continue using MOSIX2?
>
> Tony.
> --
> Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition
> and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK
> tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk
> mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt

From mm at yuhu.biz Wed Oct 1 06:03:10 2008
From: mm at yuhu.biz (Marian Marinov)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Re: MOSIX2
In-Reply-To: <48E36012.1070406@rri.sari.ac.uk>
References: <039801c92260$ab3630f0$1f9886a5@its99fd46g>
 <48E35B3C.20401@rri.sari.ac.uk>
 <48E36012.1070406@rri.sari.ac.uk>
Message-ID: <200810011603.10435.mm@yuhu.biz>

I lost touch with Juri (the guy who started the documentation project of
openMosix) 2-3 months ago.

More info about LinuxPMI can be found on Freenode, in #linuxpmi or
#openmosix.

As far as I can see the domain is registered until next year and there
is some routing problem in New York before the server.

I'm sorry but I do not know of any mirrors as Juri's machine was the
central storage of the documented patches. He actually split openMosix
into a series of patches so that it can be ported to 2.6 more easily.

Regards
Marian

On Wednesday 01 October 2008 14:33:38 Tony Travis wrote:
> Tony Travis wrote:
> > [...]
> > Hello, Marian.
> >
> > Great! I didn't know about LinuxMPI: I've not been following openMosix
> > developments recently. I can't access their web site - Is it mirrored
> > anywhere?
>
> Following up my own message, I meant LinuxPMI of course ;-)
>
> Not much evidence that the LinuxPMI project is still active, though.
> Google reports a lot of hits for the pending DNS de-registration of
> "linuxmpi.com"...
>
> Anyone else know what's happening with LinuxPMI?
>
> Thanks,
>
> Tony.

From becker at scyld.com Wed Oct 1 09:35:37 2008
From: becker at scyld.com (Donald Becker)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk
In-Reply-To:
Message-ID:

On Wed, 1 Oct 2008, Bogdan Costescu wrote:

> On Tue, 30 Sep 2008, Donald Becker wrote:
> > Ahhh, your first flawed assumption.
> > You believe that the OS needs to be statically provisioned to the nodes.
> > That is incorrect.
> Well, you also make the flawed assumption that the best technical
> solutions are always preferred. From my position I have seen many
...
> a solution like Scyld's limits the whole cluster to running one
> distribution (please correct me if I'm wrong), while a solution with
> node "images" allows mixing Linux distributions at will.

That's correct. Our model is that a "cluster" is a single system -- and
a single install. That's for a good reason: To keep the simplicity and
consistency of managing a single installation, you pretty much can
have... only a single installation.

There is quite a bit of flexibility. The system automatically detects
the hardware and loads the correct kernel modules.
Nodes can be specialized, including mounting different file systems and
running different start-up scripts. But the bottom line is that to make
the assertion that remote processes will run the same as local
processes, they have to be running pretty much the same system.

If you are running different distributions on nodes, you discard many of
the opportunities of running a cluster. More importantly, it's much more
knowledge- and labor-intensive to maintain the cluster while
guaranteeing consistency.

> > The only times that it is asked to do something new (boot, accept a
> > new process) it's communicating with a fully installed, up-to-date
> > master node. It has, at least temporarily, complete access to a
> > reference install.
>
> I think that this is another assumption that holds true for the Scyld
> system, but there are situations where this is not true.

Yes, there are scenarios where you want a different model. But
"connected during important events" is true for most clusters. We
discard the ability for a node to boot and run independently in order to
get the advantages of zero-install, zero-config consistent compute
nodes.

> > If you design a cluster system that installs on a local disk, it's
> > very difficult to adapt it to diskless blades. If you design a
> > system that is as efficient without disks, it's trivial to
> > optionally mount disks for caching, temporary files or application
> > I/O.
>
> If you design a system that is flexible enough to allow you to use
> either diskless or diskfull installs, what do you have to lose?

In theory that sounds good. But historically changing disk-based
installations to work on diskless machines has been very difficult, and
the results unsatisfactory. Disk-based installations want to do
selective installation based on the hardware present, and write/modify
many links and configuration files on installation -- many more than
they "need" to.

> The same node "image" can be used in several ways:
> - copied to the local disk and booted from there (where the copying
> could be done as a separate operation followed by a reboot or it can
> be done from initrd)
> - used over NFS-root
> - used as a ramdisk, provided that the node "image" is small enough

While memory follows the price-down capacity-up curve, we aren't quite
to the point where holding a full OS distribution in memory is
negligible. Most distributions (all the commercially interesting ones)
are workstation-oriented, and the trade-off is "disk is under $1/GB, so
we will install everything". It's foreseeable that holding an 8GB
install image in memory will be trivial, but that will be a few years in
the future, not today. And we will need better VM and PTE management to
make it efficient.

--
Donald Becker                           becker@scyld.com
Penguin Computing / Scyld Software
www.penguincomputing.com                www.scyld.com
Annapolis MD and San Francisco CA

From landman at scalableinformatics.com Wed Oct 1 10:23:00 2008
From: landman at scalableinformatics.com (Joe Landman)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Has DDR IB gone the way of the Dodo?
In-Reply-To: <48E39782.4020006@ias.edu>
References: <48E39782.4020006@ias.edu>
Message-ID: <48E3B1F4.3000702@scalableinformatics.com>

Prentice Bisbal wrote:
> In the ongoing saga that is my new cluster, we were just told today that
> Cisco is no longer manufacturing DDR IB cables, which we, uhh, need.
>
> Has DDR IB gone the way of the dodo bird and been supplanted by QDR?
>
> If so, why would anyone spec a brand new cluster with DDR?

Hmmm.... Cisco isn't the only provider of IB cables. Last I understood,
they buy theirs from others.

Cisco does appear to be transitioning to *other* technologies. Again,
they aren't the only IB provider out there (seems such a shame, gobbling
up TopSpin and then effectively discarding them).

Qlogic, Voltaire, Mellanox are happy to sell you stuff.
You can even call some of the tier2+ players and they will help.

Joe

--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@scalableinformatics.com
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615

From Bogdan.Costescu at iwr.uni-heidelberg.de Wed Oct 1 10:39:37 2008
From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk
In-Reply-To: <48E2C6C1.60108@neuralbs.com>
References: <48E2C6C1.60108@neuralbs.com>
Message-ID:

On Tue, 30 Sep 2008, Eric Thibodeau wrote:

> This has given me much flexibility and a very fast path to upgrade
> the nodes (LIVE!) since they would only need to be rebooted if I
> changed the kernel. I can install/upgrade the node's environment by
> simply chrooting into it and using the node's package manager and
> utilities as if it were a regular system).

Only the first is an advantage of using NFS-root; the second is shared
by most methods that use a node "image". However, random installations
or modifications of configuration files within the chroot become very
difficult to reproduce when you build the next node "image" - either
scripting everything or using cfengine/puppet/etc. can save a lot of
time in the long run, despite the initial effort to set up.

> But I am in a special case where, if I break the cluster, I can fix
> it quickly and I always have a backup copy of the boot "root" image
> ready to switch to if my fiddling goes wrong.

Why not keep several "images" around and only point the nodes to mount
the one considered current or "good"?
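
A minimal sketch of that idea, assuming the node "images" are sibling
directories under one export root and that the nodes mount whatever a
`current` symlink points to. The directory names, the link name and the
function names are hypothetical and not taken from any tool mentioned in
this thread:

```python
import os
import tempfile


def switch_image(root, image_name, link_name="current"):
    """Atomically repoint root/link_name at root/image_name.

    Hypothetical layout: each node "image" is a directory under `root`,
    and the nodes mount whatever `link_name` resolves to.
    """
    target = os.path.join(root, image_name)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"no such image: {target}")
    link = os.path.join(root, link_name)
    tmp_link = link + ".tmp"
    # Create the new symlink under a temporary name, then rename it over
    # the old one. rename() is atomic on POSIX file systems, so a reader
    # sees either the old link or the new one, never a missing link.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(image_name, tmp_link)
    os.replace(tmp_link, link)
    return os.path.realpath(link)


def demo():
    """Create two fake images, promote one, then roll back to the other."""
    with tempfile.TemporaryDirectory() as root:
        for name in ("image-good", "image-testing"):
            os.mkdir(os.path.join(root, name))
        first = os.path.basename(switch_image(root, "image-testing"))
        second = os.path.basename(switch_image(root, "image-good"))
        return first, second


if __name__ == "__main__":
    print(demo())
```

Rolling back is then the same call with the name of the previous image.
Nodes that already have an image mounted would presumably only pick up
the change at their next mount or reboot, so this covers the "which
image is current" bookkeeping rather than live upgrades.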

--
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu@iwr.uni-heidelberg.de

From prentice at ias.edu Wed Oct 1 10:49:43 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Advanced Clustering's Breakin
In-Reply-To:
References: <48E39AC1.7080001@ias.edu>
Message-ID: <48E3B837.4020306@ias.edu>

billycrook@gmail.com wrote:
> On 2008-10-01, Prentice Bisbal wrote:
> ...
>> I sent e-mail to Advanced Clustering regarding these issues, but didn't
>> get any response, so I'm hoping I have better luck here.
>
> Prentice,
>
> I'm the primary support contact at Advanced Clustering. Our Support
> email address is Support@AdvancedClustering.com. I looked through our
> ticket system, but didn't see a message from you there. We test
> breakin against clusters we sell, but would like it to work on as much
> hardware as possible. When hdhealth doesn't see hard drives, it is
> usually because of some raid hardware shielding SMART data, or the
> hard drive controller driver not being in the kernel yet. When it's
> intermittent, it could be cabling, heat related, or the drives
> actually failing.
>
> What hardware specifically are you testing? The motherboard model,
> and any storage adapters involved would help.
>
> I will open a ticket momentarily from our helpdesk to prentice@ias.edu
> to look at this issue off list and where it is accessible to all of
> our support personnel until the issue is resolved.
>
> Thanks,
> Billy Crook
>
> Advanced Clustering Customer Support
> Support: 866.802.8222 x2
> Support: 913.643.0300 x2
> Fax: 913.378.9117
>

Billy,

I got your support e-mail off list, and replied already. We can continue
this discussion off-list. Thanks for the help.

I sent my previous e-mail to some address on the web page for breakin.

--
Prentice

From prentice at ias.edu Wed Oct 1 10:59:05 2008
From: prentice at ias.edu (Prentice Bisbal)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Has DDR IB gone the way of the Dodo?
In-Reply-To: <48E3B1F4.3000702@scalableinformatics.com>
References: <48E39782.4020006@ias.edu>
 <48E3B1F4.3000702@scalableinformatics.com>
Message-ID: <48E3BA69.2000405@ias.edu>

Joe Landman wrote:
> Prentice Bisbal wrote:
>> In the ongoing saga that is my new cluster, we were just told today that
>> Cisco is no longer manufacturing DDR IB cables, which we, uhh, need.
>>
>> Has DDR IB gone the way of the dodo bird and been supplanted by QDR?
>>
>> If so, why would anyone spec a brand new cluster with DDR?
>
> Hmmm.... Cisco isn't the only provider of IB cables. Last I understood,
> they buy theirs from others.

I'm sure you're right. It's my understanding that IB is a
well-documented standard, so all IB cables should be created equal, but
you know vendors, they'll say that their cables are more equal than
others and refuse support if we're not using all Cisco kit.

And what is the status of DDR? Are people still using it, or has it
already been replaced by QDR in the marketplace?

>
> Cisco does appear to be transitioning to *other* technologies. Again,
> they aren't the only IB provider out there (seems such a shame, gobbling
> up TopSpin and then effectively discarding them).

What *other* technologies are you talking about? QDR IB, or something
other than IB altogether, like 10 Gb Ethernet, or something all new and
proprietary?

>
> Qlogic, Voltaire, Mellanox are happy to sell you stuff. You can even
> call some of the tier2+ players and they will help.
>
> Joe
>
>

--
Prentice

From jan.heichler at gmx.net Wed Oct 1 11:19:11 2008
From: jan.heichler at gmx.net (Jan Heichler)
Date: Tue Dec 2 01:07:55 2008
Subject: [Beowulf] Has DDR IB gone the way of the Dodo?
In-Reply-To: <48E3BA69.2000405@ias.edu> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> Message-ID: <1848483940.20081001201911@gmx.net> Hello Prentice, on Wednesday, 1 October 2008, you wrote: PB> And what is the status of DDR? Are people still using it, or has it PB> already been replaced by QDR in the marketplace? DDR is still the most important InfiniBand speed on the market. DDR provides enough bandwidth for most applications at the moment because most applications suffer from latency - not limited bandwidth. QDR is more interesting for building spine networks. And QDR might get more important if we see more cores in typical compute nodes. >> Cisco does appear to be transitioning to *other* technologies. Again, >> they aren't the only IB provider out there (seems such a shame, gobbling >> up TopSpin and then effectively discarding them). PB> What *other* technologies are you talking about: QDR IB, or something PB> other than IB altogether, like 10 Gb Ethernet, or something all new and PB> proprietary? Cisco is dropping IB - probably to go for 10GE. Cisco offered IB solutions only for a short time - probably they weren't very successful. Jan From diep at xs4all.nl Wed Oct 1 11:22:57 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Re: MOSIX2 In-Reply-To: <48E35A95.7020908@rri.sari.ac.uk> References: <039801c92260$ab3630f0$1f9886a5@its99fd46g> <48E1EDE3.2080205@rri.sari.ac.uk> <48E35A95.7020908@rri.sari.ac.uk> Message-ID: <61D42F6C-2115-40E3-A394-A531AD2B12DA@xs4all.nl> Well Tony, Things are pretty simple for using SSI at your beowulf cluster. The short summary: a) openmosix is dead b) open-ssi is still alive Simple tip: go for open-ssi. A bit longer explanation: a) openmosix is dead.
I can confirm that even Wikipedia says so. openMosix was an Israeli project and therefore died when its main Israeli developer left, for whatever reason. He has left a very clear statement that he no longer works on openMosix. What usually happens is that one or two enthusiasts then make an effort to support it. If that doesn't work for you, then consider it dead. Using 2.4.x kernels is not realistic for today's quad-cores. That was the status in 2005, and it is still the status in 2008. In 2005, if I remember well, I used kernel 2.6.7 on a quad Opteron dual-core. On a dual Opteron dual-core, kernel 2.4.x versus 2.6.x with NUMA cost my chess program 50% in speed. 50% is *a lot* to lose in speed. Supporting something like openMosix requires a lot more than one guy who needs 1000 dollars in order to modify 3 bytes. What you see a lot is that open source projects get hijacked by people who want to make cash out of them. In the world I come from, computer chess, I remember this happening several times every year since 1988. The developers usually get really demotivated when they work for hundreds of hours on their 'money project' and then make under 1000 dollars; you make more money at McDonald's. In short, such projects usually die soon as well, as there is no money in it for them. Socially, most nerds are total robots. b) open-ssi is there for several distributions and is actively supported by several developers. It is there, it works, it improves, and it works with the latest kernels too (yes, also 2.6.x). In itself it would be GREAT if there were one open source project, as that would join forces more. My hope is that open-ssi will work great with high-end NICs too and slowly get to a phase where all features work great. Neither openMosix nor open-ssi could migrate processes that work with shared memory, a nerd feature I personally like a lot.
Just claiming that they use 'stolen' features like page migration is like claiming that linux stole multithreading from unix; it is a bad attempt to smear dirt just to earn a 1000 dollar. It is not a reason to not use it. It is a bad attempt to spit at these guys who donate time and their money without asking for payment. Vincent On Oct 1, 2008, at 1:10 PM, Tony Travis wrote: > Vincent Diepeveen wrote: >> I agree tony that paying for such crap is not very good idea. > > Hello, Vincent. > > I don't think MOSIX2 is crap! > > However, I don't like the idea of having to pay for 'updates'. > >> You might want to move to open-ssi in this case; the project is >> alive and there is in theory work getting >> performed on support for cards over infiniband as well. > > I have looked at OpenSSI, which uses the openMosix load-balancer, > but process migration is more coarsely grained than in openMosix. > Only the active pages of the user context of openMosix processes > are migrated. > > I've been looking at alternatives and I think Kerrighed looks very > promising but, in our hands, Kerrighed is very fragile: I've > mentioned on this list before that if one Kerrighed node goes down > you lose the entire cluster. We've been talking to Christine > Morin's group at INRIA and they tell us that the next release of > Kerrighed with be more robust: > > http://www.kerrighed.org > >> Most importantly is that you are gonna get more replies. > > Yes, thanks for yours :-) > >> Additionally the manner open-ssi implements shared memory is very >> transparant; in principle on each write it migrates a page >> to the node writing. >> Maybe the only big lack of open-ssi is its limited support so far >> for highend network cards. > > What bothers me about OpenSSI is that it's based on an open-sourced > version HP's (now Compaq) discontinued commercial product "non-stop > clusters for Unix". 
The OpenSSI project also came in for a lot of > criticism from the openMosix community for stealing ideas, so my > concerns about it might not be all that well founded ;-) > > The main reason I didn't use OpenSSI, previously, was that many > features had not been implemented fully and, like Kerrighed, it > wasn't really a viable option for a 'production' cluster even > though it was interesting as a research project. What makes me take > MOSIX2 seriously now is that it is a commercially supported > 'product' with all the same virtues (and most of the vices) of > openMosix. > > Tony. > -- > Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition > and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK > tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk > mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From tegner at nada.kth.se Wed Oct 1 12:09:39 2008 From: tegner at nada.kth.se (Jon Tegner) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: References: Message-ID: <48E3CAF3.3070800@nada.kth.se> There seem to be significant advantages using Scyld ClusterWare, I did try it (Scyld?) many years ago (when it was free?) and I was impressed then. However, when looking at penguincomputing.com I don't find any price quotes. It seems - unless I miss something - one has to fill in a rather lengthy form in order to get that information? In order to consider the "Scyld solution" I think it would be good to have at least an estimate of the price? Regards, /jon Donald Becker wrote: > On Wed, 1 Oct 2008, Bogdan Costescu wrote: > > >> On Tue, 30 Sep 2008, Donald Becker wrote: >> >>> Ahhh, your first flawed assumption. 
>>> You believe that the OS needs to be statically provisioned to the nodes. >>> That is incorrect. >>> >> Well, you also make the flawed assumption that the best technical >> solutions are always preferred. From my position I have seen many >> > ... > >> a solution like Scyld's limits the whole cluster to running one >> distribution (please correct me if I'm wrong), while a solution with >> node "images" allows mixing Linux distributions at will. >> > > That's correct. Our model is that a "cluster" is a single system -- and a > single install. > > That's for a good reason: To keep the simplicity and consistency of > managing a single installation, you pretty much can have... only a single > installation. > > There is quite a bit of flexibility. The system automatically detects the > hardware and loads the correct kernel modules. Nodes can be specialized, > including mounting different file systems and running different start-up > scripts. But the bottom line is that to make the assertion that remote > processes will run the same as local processes, they have to be running > pretty much the same system. > > If you are running different distributions on nodes, you discard many of > the opportunities of running a cluster. More importantly, it's much > more knowledge- and labor-intensive to maintain the cluster while > guaranteeing consistency. > > >>> The only times that it is asked to do something new (boot, accept a >>> new process) it's communicating with a fully installed, up-to-date >>> master node. It has, at least temporarily, complete access to a >>> reference install. >>> >> I think that this is another assumption that holds true for the Scyld >> system, but there are situations where this is not true. >> > > Yes, there are scenarios where you want a different model. But "connected > during important events" is true for most clusters. 
We discard the > ability for a node to boot and run independently in order to get the > advantages of zero-install, zero-config consistent compute nodes. > > >>> If you design a cluster system that installs on a local disk, it's >>> very difficult to adapt it to diskless blades. If you design a >>> system that is as efficient without disks, it's trivial to >>> optionally mount disks for caching, temporary files or application >>> I/O. >>> >> If you design a system that is flexible enough to allow you to use >> either diskless or diskfull installs, what do you have to loose ? >> > > In theory that sounds good. But historically changing disk-based > installations to work on diskless machines has been very difficult, and > the results unsatisfactory. Disk-based installations want to do selective > installation based on the hardware present, and write/modify many links > and configuration files on installation -- many more than they "need" to. > > >> The same node "image" can be used in several ways: >> - copied to the local disk and booted from there (where the copying >> could be done as a separate operation followed by a reboot or it can >> be done from initrd) >> - used over NFS-root >> - used as a ramdisk, provided that the node "image" is small enough >> > > While memory follows the price-down capacity-up curve, we aren't quite to > the point where holding a full OS distribution in memory is negligible. > Most distributions (all the commercially interesting ones) are > workstation-oriented, and the trade-off is "disk is under $1/GB, so we > will install everything". It's foreseeable that holding an 8GB install > image in memory will be trivial, but that will be a few years in the > future, not today. And we will need better VM and PTE management to make > it efficient. > > > From lindahl at pbm.com Wed Oct 1 13:03:38 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? 
In-Reply-To: <48E3BA69.2000405@ias.edu> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> Message-ID: <20081001200338.GB32180@bx9.net> On Wed, Oct 01, 2008 at 01:59:05PM -0400, Prentice Bisbal wrote: > I'm sure you're right. It's my understanding that IB is a > well-documented standard, so all IB cables should be created equally, > but you know vendors, they'll say that their cables are more equal than > others and refuse support if we're not using all Cisco kit. Then buy from an IB vendor that isn't silly like that. QLogic and Voltaire are choices. Is Cisco still re-selling QLogic switches? Joe wrote: > > Cisco does appear to be transitioning to *other* technologies. Again, > > they aren't the only IB provider out there (seems such a shame, gobbling > > up TopSpin and then effectively discarding them). Apparently the main attraction of TopSpin was their virtualization software. Cisco, of course, supports all the major technologies, so it can be hard to tell if they really are transitioning to something else. At a minimum Cisco will want to sell gateways for all non-Ethernet technologies. After TopSpin was bought, Cisco continued to sell gateways developed by TopSpin, but sold rebranded QLogic DDR switches. -- greg From niftyompi at niftyegg.com Wed Oct 1 13:18:04 2008 From: niftyompi at niftyegg.com (NiftyOMPI Mitch) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? In-Reply-To: <1848483940.20081001201911@gmx.net> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> <1848483940.20081001201911@gmx.net> Message-ID: <88815dc10810011318tf540ef2m51ce4692b7c042a9@mail.gmail.com> On Wed, Oct 1, 2008 at 11:19 AM, Jan Heichler wrote: > Hallo Prentice, > > Mittwoch, 1. Oktober 2008, meintest Du: > > PB> And what is the status of DDR? Are people still using it, or has it > > PB> already been replaced by QDR in the marketplace? 
> > > DDR is still the most important Infiniband in the market. DDR provides > enough bandwidth for most applications at the moment because most > applications suffer from latency - not limited bandwidth. QDR is more > interesting for building spine networks. And QDR might get more important if > we see more cores in typical compute nodes. > > > >> Cisco does appear to be transitioning to *other* technologies. Again, > > >> they aren't the only IB provider out there (seems such a shame, gobbling > > >> up TopSpin and then effectively discarding them). > > > PB> What *other* technologies are you talking about QDR IB, or something > > PB> other than IB altogether, like 10 Gb Ethernet, or something all new and > > PB> proprietary? > > > Cisco is dropping IB - probably to go for 10GE. It was a short time for > Cisco offering IB solutions - but probably they weren't very successful. > Yes, folks tell me that Cisco is backing away from the IB business -- all the parts in the Cisco price book that I know of were rebranded products, and I suspect that some of their providers were all too happy to sell directly to customers at a discount, sometimes below the OEM price. Perhaps more importantly, IB is not a router- and management-rich layer. That is, it does not facilitate all the routing and management value-add that Cisco focuses on for their bread and butter. They are, however, well involved in and committed to Open MPI, which is agnostic to the transport layer beyond the want-to-go-fast part. I know that QLogic has well specified and tested cables for sale. Call the QLogic "King of Prussia, Penn" office and tell them what you want. Gore and Leoni make excellent cables, as do others. QDR is interesting... in all likelihood the QDR game will be optical for any link further away than a single rack. Once IB goes optical there will be a lot of reasons to install IB in machine rooms and campus sites that are just out of reach today. I should say that copper QDR is hard but not impossible.
IMO optical has advantages, and once it becomes easy for the end user (site) to install optical it should change the game. From alscheinine at tuffmail.us Wed Oct 1 14:07:39 2008 From: alscheinine at tuffmail.us (Alan Louis Scheinine) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? In-Reply-To: <88815dc10810011318tf540ef2m51ce4692b7c042a9@mail.gmail.com> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> <1848483940.20081001201911@gmx.net> <88815dc10810011318tf540ef2m51ce4692b7c042a9@mail.gmail.com> Message-ID: <48E3E69B.4030202@tuffmail.us> NiftyOMPI Mitch wrote > QDR is interesting... in all likelihood the > QDR game will be optical for any link further away than a single rack. > Once IB goes optical there will be a lot of reason to install IB in > machine rooms and campus sites that are just out of reach today. When will IB go optical at a reasonable price? Perhaps you are not an expert, but if you happen to have any pointers to where we can learn more about the time frame it would be useful. Just a few days ago I heard colleagues trying to figure out how to connect file system hardware that is just a bit too far from a cluster in another building. Ethernet and NFS would work but latency is a problem. Best regards, Alan -- Alan Scheinine 5010 Mancuso Lane, Apt. 621 Baton Rouge, LA 70809 Email: alscheinine@tuffmail.us Office phone: 225 578 0294 Mobile phone USA: 225 288 4176 [+1 225 288 4176] From kyron at neuralbs.com Wed Oct 1 14:27:11 2008 From: kyron at neuralbs.com (Eric Thibodeau) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs.
Ram Disk In-Reply-To: References: <48E2C6C1.60108@neuralbs.com> Message-ID: <48E3EB2F.2070102@neuralbs.com> Bogdan Costescu wrote: > On Tue, 30 Sep 2008, Eric Thibodeau wrote: > >> This has given me much flexibility and a very fast path to upgrade >> the nodes (LIVE!) since they would only need to be rebooted if I >> changed the kernel. I can install/upgrade the node's environment by >> simply chrooting into it and using the node's package manager and >> utilities as if it were a regular system). > > Only the first is an advantage of using NFS-root; the second is shared > by most methods that use a node "image". More or less, NFS-root changes are "propagated" instantly, most other approaches require a re-sync. Another way to see this, the NFS root approach only does changes on the head node and changed files don't need to be propagated and are accessed on a as-needed basis, this might have significant impacts on large deployments....not that I suggest that they use this approach ;) > However random installations or modifications of configuration file > within the chroot become very difficult to reproduce when you build > the next node "image" Document document document...which no one does...but document. > - either scripting everything or using cfengine/puppet/etc. can save a > lot of time in the long run, despite the initial effort to set up. I'll take your word for it that they have a version tracking mechanism. >> But I am in a special case where, if I break the cluster, I can fix >> it quickly and I always have a backup copy of the boot "root" image >> ready to switch to if my fiddling goes wrong. > Why not keeping several "images" around and only point the nodes to > mount the one considered current or "good" ? Well, that's what I meant... probably not clearly. 
;) Eric From niftyompi at niftyegg.com Wed Oct 1 14:33:42 2008 From: niftyompi at niftyegg.com (NiftyOMPI Mitch) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] precise synchronization of system clocks In-Reply-To: References: <20081001020421.GA8126@compegg.wr.niftyegg.com> Message-ID: <88815dc10810011433m1fa59641g417643b15726a770@mail.gmail.com> On Wed, Oct 1, 2008 at 7:18 AM, Robert G. Brown wrote: > On Tue, 30 Sep 2008, Nifty niftyompi Mitch wrote: > ..... > One site service of interest is AC power. A modern processor sitting > in an idle state that then starts a well optimized loop will jump from > a couple of watts to 100 watts in as many clocks as the set of pipelines > is deep behind the instruction decode and instruction cache fill. A 1000 > processor (4000 core) cluster might jump from 4000 watts to 100000 watts in the > blink of an eye (err did the lights blink). Buffer that dI/dt through > the PS and it is less but still interesting on the mains which are > synchronized. > Interesting. I never have seen the lights blink although I don't run > synchronous computations. One wonders if the power supply capacitors > (which should be quite large, I would think) don't soak up the > transient, though, even on very large clusters. Also, I think that the > power differential is smaller than you are allowing for -- I don't think > most idle processors draw "no" power... > The dI/dt for processors can be quite high. The AMD Phenom™ X4 Quad-Core is listed as a 140 watt part (thermal); it is unlikely that all 450 million transistors are active in an idle loop. Tom's Hardware lists the idle power at 21 watts. The speed at which a modern processor can go from idle to full power is astonishing. The local on-board power supply regulation must respond very quickly. The delta from 21 to 140 fits inside one half cycle of a 50/60 Hz AC mains service. So inside of one AC cycle the part can move from 21 to 140.... which is large when multiplied by a 1000 node dual socket cluster.
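A rough back-of-the-envelope sketch of that multiplication, in Python. The per-socket figures (21 W idle, 140 W rated) and the 1000 node dual socket cluster are the ones quoted above; treating every socket as stepping from idle to its full thermal rating at the same instant is an assumption made only for illustration, the function name is hypothetical, and power supply losses, memory, disks and fans are ignored.

```python
# Figures quoted in the message above.
IDLE_W = 21.0           # idle power per socket, in watts
PEAK_W = 140.0          # rated (thermal) power per socket, in watts
NODES = 1000            # nodes in the cluster
SOCKETS_PER_NODE = 2    # dual socket nodes
MAINS_HZ = 60.0         # mains frequency (50 Hz in many countries)


def cluster_power_step(idle_w, peak_w, nodes, sockets_per_node):
    """Worst-case jump in CPU power draw, in watts, assuming every
    socket goes from idle to full rated power simultaneously."""
    return (peak_w - idle_w) * nodes * sockets_per_node


step_w = cluster_power_step(IDLE_W, PEAK_W, NODES, SOCKETS_PER_NODE)
half_cycle_s = 1.0 / (2.0 * MAINS_HZ)  # duration of one half cycle of the mains

print(f"per-socket step: {PEAK_W - IDLE_W:.0f} W")
print(f"cluster step:    {step_w / 1000.0:.0f} kW")
print(f"half cycle:      {half_cycle_s * 1000.0:.2f} ms")
```

Under those assumptions the step is 119 W per socket, or 238 kW across the cluster, inside a window of roughly 8 ms at 60 Hz. A measured idle-to-loaded ratio at the wall would scale this figure down accordingly.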
I do know of clusters and labs of workstations that power on hosts and disks in sequence to limit the startup power surge. Lots of us have been at it long enough to know that induction motors like elevators, refrigeration compressors and even vacuums can hit the mains hard enough to trigger errors. My home vacuum does dim the lights a little bit. In normal practice I doubt that this is an issue but synchronization in the extreme is interesting in its details and side effects. -- NiftyOMPI T o m M i t c h e l l -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081001/e9e6f732/attachment.html From andrew at moonet.co.uk Wed Oct 1 15:22:08 2008 From: andrew at moonet.co.uk (andrew holway) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? In-Reply-To: <48E39782.4020006@ias.edu> References: <48E39782.4020006@ias.edu> Message-ID: > Has DDR IB gone the way of the dodo bird and been supplanted by QDR? I don't think front side busses are fast enough for QDR yet. From rpatienc at cisco.com Wed Oct 1 16:58:51 2008 From: rpatienc at cisco.com (Rob Patience) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? In-Reply-To: <20081001200338.GB32180@bx9.net> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> <20081001200338.GB32180@bx9.net> Message-ID: <1F19606C-6310-40F9-B2F3-AC32671D620E@cisco.com> Only the larger switches were from QLogic (144 and 288 ports) all other switches were made by TopSpin/Cisco directly. Cisco is not leaving the HPC space only IB. Btw: you can still buy cables from Cisco. But as mentioned, all cables come from 3rd party vendors. ~rob On Oct 1, 2008, at 1:03 PM, Greg Lindahl wrote: > On Wed, Oct 01, 2008 at 01:59:05PM -0400, Prentice Bisbal wrote: > >> I'm sure you're right. 
It's my understanding that IB is a >> well-documented standard, so all IB cables should be created equally, >> but you know vendors, they'll say that their cables are more equal than >> others and refuse support if we're not using all Cisco kit. > > Then buy from an IB vendor that isn't silly like that. QLogic and > Voltaire are choices. Is Cisco still re-selling QLogic switches? > > Joe wrote: > >>> Cisco does appear to be transitioning to *other* technologies. Again, >>> they aren't the only IB provider out there (seems such a shame, gobbling >>> up TopSpin and then effectively discarding them). > > Apparently the main attraction of TopSpin was their virtualization > software. Cisco, of course, supports all the major technologies, so it > can be hard to tell if they really are transitioning to something > else. At a minimum Cisco will want to sell gateways for all > non-Ethernet technologies. After TopSpin was bought, Cisco continued > to sell gateways developed by TopSpin, but sold rebranded QLogic DDR > switches. > > -- greg From rgb at phy.duke.edu Wed Oct 1 21:14:06 2008 From: rgb at phy.duke.edu (Robert G. Brown) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] precise synchronization of system clocks In-Reply-To: <88815dc10810011433m1fa59641g417643b15726a770@mail.gmail.com> References: <20081001020421.GA8126@compegg.wr.niftyegg.com> <88815dc10810011433m1fa59641g417643b15726a770@mail.gmail.com> Message-ID: On Wed, 1 Oct 2008, NiftyOMPI Mitch wrote: > The dI/dT for processors can be quite high. > AMD Phenom™ X4 Quad-Core is listed as a 140 watt part (thermal) > it is unlikely that all 450 million transistors are active in an idle > loop. Tom's Hardware list the idle power at 21 watts.
The speed at > which a modern processor can go from idle to full power is astonishing. > The local on board power supply regulation must respond very quickly. The > delta > from 21 to 140 fit inside one half cycle of a 50/60 Hz AC mains service. So > inside > of one AC cycle the part can move from 21 to 140.... which is large when > multiplied by a 1000 node dual socket cluster. I do know of clusters and > labs of workstations that power on hosts and disks in sequence to limit the > startup power surge. Lots of us have been at it long enough to know that > induction motors like elevators, refrigeration compressors and even vacuums > can hit the mains > hard enough to trigger errors. My home vacuum does dim the lights a little > bit. I understand inductive surge when powering up, I understand in detail browning out a primary power transformer, but I think those are different issues and irrelevant here. So far, using my trusty Kill-a-Watt on real world nodes, I haven't seen more than a 30% differential draw loaded to unloaded. Large parts of the CPU require power at all times to function. Memory, for example, both on and offboard. Nearly everything inside a computer has a nontrivial idle draw, plus (sure) peak draw when it or one of its subsystems are in use. Exceptions are modern laptops -- with variable speed clocks, they draw much less idling than they do at speed, in part because power (idle or otherwise) IS very nearly proportional to CPU clock in at least parts of the system. And I don't really know how the latest designs do in this regard -- but there is a tendency to design the bleeding edge PERFORMANCE CPUs to work at "constant heat", as a major rate limiting factor in CPU design is getting rid of heat from the package. It's one reason they don't just crank the clock to infinity and beyond -- not the only one, but a major one. Multicores, of course, may function like hybrid cars, and somehow run more nearly idle when they are idle. 
But I'd have to hear from someone who slapped a KaW on an actual system and clocked it from idle (solidly post-boot, running the OS, at idle "equilibrium") to loaded (running flat out on e.g. a benchmark suite that loads all cores and/or the memory etc.). Has anyone actually done this and observed (say) a 2 or 3 to 1 increase in power draw loaded to idle? 50W idle to 200W loaded in 1 second? 150W idle to 200W loaded is more like what I've seen... > In normal practice I doubt that this is an issue but synchronization in the > extreme is interesting in its details and side effects. I completely agree with this, both parts. Although if one IS bumping from 50->200W "instantly" on not even an entire cluster but just all the nodes on a single circuit, that's popping over a KW on a 20A line -- ballpark where one MIGHT see something inductive (although as I said, probably nothing that the power supply capacitor(s) cannot buffer, although I'm too tired to compute the number of joules (watt-seconds) one can probably deliver and what RC probably is, etc). Popping multiple (as in 10+) KW in less than a 60 Hz cycle would very likely be hard on the primary, no doubt about it. rgb -- NiftyOMPI T o m M i t c h e l l -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From hearnsj at googlemail.com Thu Oct 2 02:05:30 2008 From: hearnsj at googlemail.com (John Hearns) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Linux Magazine - What He Said Message-ID: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> I just read Douglas Eadline's article on Linux Magazine, entitled "What He Said" http://www.linux-mag.com/id/7087 Very thought provoking article, and took me back to thinking about genetic algorithms, a subject I flirted with 20 years ago. 
I didn't find it worthwhile on a Sparc1 system with a whopping 2 Mbytes of RAM. I guess I should encourage responses to be made on the Linux Mag site. John Hearns From ajt at rri.sari.ac.uk Thu Oct 2 02:25:33 2008 From: ajt at rri.sari.ac.uk (Tony Travis) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Re: MOSIX2 In-Reply-To: <61D42F6C-2115-40E3-A394-A531AD2B12DA@xs4all.nl> References: <039801c92260$ab3630f0$1f9886a5@its99fd46g> <48E1EDE3.2080205@rri.sari.ac.uk> <48E35A95.7020908@rri.sari.ac.uk> <61D42F6C-2115-40E3-A394-A531AD2B12DA@xs4all.nl> Message-ID: <48E4938D.202@rri.sari.ac.uk> Vincent Diepeveen wrote: > [...] > Supporting a thing like openmosix requires a lot more than 1 guy who in > order to modify 3 bytes needs 1000 dollar. Hello, Vincent. I've been involved in the openMosix project from the start, and I still use it. My comment about the $1,000 for 'updates' concerned the cost of an 'academic' update and support subscription to MOSIX2, which has nothing to do with openMosix. I think that is quite a reasonable cost, actually, and I don't criticise people who want to pay for support. It's not the cost that bothers me most, but the restrictions of the MOSIX2 licence. > [...] > In itself it would be GREAT if there is 1 open source project there, as > that joins forces more. My hope is that open-ssi > will work great for highend nic's also and slowly get to a phase that > all features work great. OK, maybe I should have another look at the most recent OpenSSI... > OpenMosix nor openssi could migrate processes that work with shared > memory, a nerd feature i like personally a lot. Actually, there is a version of openMosix that can migrate programs using shared memory. Unfortunately, when I tried it, processes would migrate away from the 'home' node but wouldn't migrate back!
This is a problem for openMosix, because processes have to migrate back to the 'home' node to make system calls. One of my (small) contributions to openMosix was to patch it to avoid migrations just to read the time. This is slightly relevant to another thread on the Beowulf list, because reading the time locally on a node to avoid the process migration overhead requires that the nodes all run NTP otherwise you get problems with clock skew... > Just claiming that they use 'stolen' features like page migration is > like claiming that linux stole multithreading from unix; > it is a bad attempt to smear dirt just to earn a 1000 dollar. > > It is not a reason to not use it. It is a bad attempt to spit at these > guys who donate time and their money without asking for payment. Sadly, there was a lot of bad feeling between the openMosix and OpenSSI communities and I should not have tried to make light of it. In fact it is a compliment to openMosix that the load-balancer algorithm was used in OpenSSI, not an insult, and BTW I'm one of those guys who donated their time without asking for payment. I did put a wry smile after my comment ;-) Tony. -- Dr. A.J.Travis, University of Aberdeen, Rowett Institute of Nutrition and Health, Greenburn Road, Bucksburn, Aberdeen AB21 9SB, Scotland, UK tel +44(0)1224 712751, fax +44(0)1224 716687, http://www.rowett.ac.uk mailto:ajt@rri.sari.ac.uk, http://bioinformatics.rri.sari.ac.uk/~ajt From Bogdan.Costescu at iwr.uni-heidelberg.de Thu Oct 2 05:03:30 2008 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: References: Message-ID: On Wed, 1 Oct 2008, Donald Becker wrote: > That's correct. Our model is that a "cluster" is a single system -- > and a single install. 
That's the idea that I've also started with, almost 10 years ago ;-) Not using Beo*/bproc, but NFS-root which allowed a single install in the node "image" to be used on all nodes - although you'd probably call this 2 system installs (the master node itself and the node "image"). But over the course of the years I have changed my mind... > If you are running different distributions on nodes, you discard > many of the opportunities of running a cluster. More importantly, > it's much more knowledge- and labor-intensive to maintain the > cluster while guaranteeing consistency. It indeed requires more work, however in some cases it cannot be avoided. From my own experience: a quantum chemistry program was distributed some 5 years ago as a binary statically compiled on RH9 or RHEL3 (kernel 2.4 based) with MPICH included. This meant that when I wanted to switch to running a 2.6 kernel this program could not run anymore so some of the nodes had to be kept to an older distribution until a newer program version could be obtained (that took about a year); it also meant that whenever there were discussions about using higher performance interconnects than GigE, this software's users were insisting on buying more nodes rather than a faster interconnect. This situation has caused both technical and administrative issues and the possibility of running different distributions has solved all of them easily. Having the possibility to run several distributions side-by-side requires spending some effort in organizing the other installed software, normally shared through NFS or a parallel FS to the nodes. But once you make the jump from 1 to 2, you might as well make it from 1 to many. This leads me to observe that we have non-similar points of view: you are a maker of a cluster-oriented distribution, trying to promote it and its underlying ideas (which are fine ideas, no question about that :-)), and sure that it works because it was bought and used successfully. 
I, on the other hand, have to find solutions to keep the scientists productive (whatever productive means ;-)) and to keep them as far as possible from the system details so that they can concentrate on their work. So it's not surprising that we come to different conclusions - at least they sustain an interesting discussion :-) I would be interested to hear Mark Hahn's opinion on this, as from how he presented himself to this list it seemed to me that he is in a very similar position to mine: supporting a variety of users with a variety of needs. But others should not feel left out, write your opinions as well ;-) > Most distributions (all the commercially interesting ones) are > workstation-oriented I don't really agree with this statement (looking at RHEL and SLES), but anyone who installs a workstation-oriented distribution on a cluster node gets what (s)he pays for :-) I have seen very recently (identity hidden to protect the guilty ;-)) such a node "image" which contained OpenOffice - to be fair, it was used via NFS-root so it wasn't wasting node memory, only master disk space... -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850 E-mail: bogdan.costescu@iwr.uni-heidelberg.de From Bogdan.Costescu at iwr.uni-heidelberg.de Thu Oct 2 05:18:30 2008 From: Bogdan.Costescu at iwr.uni-heidelberg.de (Bogdan Costescu) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: <48E3EB2F.2070102@neuralbs.com> References: <48E2C6C1.60108@neuralbs.com> <48E3EB2F.2070102@neuralbs.com> Message-ID: On Wed, 1 Oct 2008, Eric Thibodeau wrote: > the NFS root approach only does changes on the head node and changed > files don't need to be propagated and are accessed on a as-needed > basis, this might have significant impacts on large deployments NFS-root doesn't scale too well, the implementation of NFS in Linux is quite chatty. 
> I'll take your word for it that they have a version tracking mechanism. Take my word that if you're going into larger installations with the least amount of non-homogeneity you'll want to at least read about, if not use, such mechanisms :-) -- Bogdan Costescu IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850 E-mail: bogdan.costescu@iwr.uni-heidelberg.de From hearnsj at googlemail.com Thu Oct 2 05:25:17 2008 From: hearnsj at googlemail.com (John Hearns) Date: Tue Dec 2 01:07:55 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: References: Message-ID: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> 2008/10/1 Donald Becker > > > It's foreseeable that holding an 8GB install > image in memory will be trivial, but that will be a few years in the > future, not today. And we will need better VM and PTE management to make > it efficient. > > > Hmmmm.... can I forsee Puppy Linux HPC Edition ???? http://www.puppylinux.org/ Being half serious here, is it worth trying to get one of these slimmed-down distros to the state where it will run an HPC job? Oh, and in addition to a barebones install for our contemplative cluster, they have a onebones install which cuts out the GUI (cue more dog puns). http://www.puppylinux.com/pfs/ looks interesting... -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081002/7727b883/attachment.html From smulcahy at aplpi.com Thu Oct 2 05:56:41 2008 From: smulcahy at aplpi.com (stephen mulcahy) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> References: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> Message-ID: <48E4C509.3000407@aplpi.com> John Hearns wrote: > Hmmmm.... can I forsee Puppy Linux HPC Edition ???? 
> http://www.puppylinux.org/ > > > Being half serious here, is it worth trying to get one of these > slimmed-down distros to the state where it will run an HPC job? > Oh, and in addition to a barebones install for our contemplative > cluster, they have a onebones install which cuts out the GUI (cue more > dog puns). Is there that much of a difference between Puppy and a minimal Debian where you install only the "standard server" task (I understand Puppy is a Debian derivative, apologies if this is incorrect)? Or am I swinging my Debian swiss-army chainsaw indiscriminately here? -stephen -- Stephen Mulcahy Applepie Solutions Ltd. http://www.aplpi.com Registered in Ireland, no. 289353 (5 Woodlands Avenue, Renmore, Galway) From diep at xs4all.nl Thu Oct 2 06:28:38 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said In-Reply-To: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> References: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> Message-ID: To comment a bit on the article: >What if a high level programing description language was developed. Note I did not say programming language. >This description language would allow you to ?describe? what you needed to do and not how to do it (as >discussed before). This draft description would then be presented to an AI based clarifier which would examine >the description, look for inconsistencies or missing information and work with the programmer to create a formal >description of the problem. At that point the description is turned over to a really smart compiler that could target >a particular hardware platform and produce the needed optimized binaries. >Perhaps a GA could be thrown in to help optimize everything. The typical average PHD way of approaching problems they need to solve: Suppose you need to solve a huge problem A at a major parallel machine. We throw the problem A into a blackbox B giving output A'. 
Now our fantastic new brilliant method/algorithm comes into place that works on A' what our paper is about, and we solve the problem with that. I see that scenario a lot in all kind of different variations. A great example of it is: We invent algorithm f( x ) Now in our program we have if( magicfunction(x) && f(x) ) then printf("problem solved. jippie i got my research done.\n"); They test it at 1 case C and shout victory. typical AI way of doing things. There is so little AI dudes who do more than fill 50 pages of paper a year with ideas that keep untested, and because of never testing their understanding of the problem never advances and the next guy still shows up with the same untested solution. So the above guy who actually is *doing* something and testing something, already gets cheered for (with some reasons). But of course magicfunction(x) only works for their case C. There is no statistic significance. The number of AI guys who test their algorithm/method at state of the art software, be it selfwritten or from someone else, AND test is statistical significant without using some flawed testset where magicfunction(x) is always true, those guys you can really count on 2 hands the past 30 years. In parallel AI it is even worse in fact. There is for example just 2 chessprograms that scale well (so without losing first factor 50 to the parallel search frame) at supercomputers. There is some fuzz now about go programs using UCT/Monte Carlo and a few other algorithms combined; but these random algorithms miss such simple tactics, which for their computing power should be easy to not miss, that they really should think out a better way to search there parallel. Each AI solution that is not embarrassingly parallel is so difficult to parallellize well, so not losing a factor 50, that it is just real hard to make generic frameworks that work real efficient. 
Such a framework would pose a solution for only 1 program; as soon as you improve 1 year later the program with a new enhancement the entire parallel framework might need to get rewritten from scratch again, to not again lose some factors in speed in scaling. The commercial way most AI guys look to the parallel problem is therefore real simple: "can i get faster than i am at a PC, as it is a government machine anyway, i didn't pay for efficient usage of it, i just want to be faster than my home PC without too much effort". I'm not gonna fight that lemma. However a generic framework that works like that is not gonna get used of course. It just speeds you up so much to make a custom parallel solution, that everyone is doing it. Besides, the hardware is that expensive, that it's worth doing it. >This process sounds like it would take a lot of computing resources. Guess what? We have that. >Why not throw a cluster at this problem. Maybe it would take a week to create a binary, >but it would be cluster time and not your time. There would be no edit/make/run cycle >because the description tells the compiler what the program has to do. The minutia >(or opportunities for bugs) of programming whether it be serial or parallel would be >handled by the compiler. Talk about a killer application. On Oct 2, 2008, at 11:05 AM, John Hearns wrote: > I just read Douglas Eadline's article on Linux Magazine, entitled > "What He Said" > http://www.linux-mag.com/id/7087 > > Very thought provoking article, and took me back to thinking about > genetic algorithms, > a subject I flirted with 20 years ago. I didn't find it worthwhile > on a Sparc1 system with a whopping 2 Mbytes of RAM. > > I guess I should encourage responses to be made on the Linux Mag site. 
> > John Hearns > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf From james.p.lux at jpl.nasa.gov Thu Oct 2 06:37:58 2008 From: james.p.lux at jpl.nasa.gov (Lux, James P) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] precise synchronization of system clocks In-Reply-To: Message-ID: Rgb wrote: > > I understand inductive surge when powering up, I understand in detail > browning out a primary power transformer, but I think those are > different issues and irrelevant here. Inductive surge -> magnetizing current in large iron core inductors (depends on where you are in line frequency cycle at "switch on") Sag from overload -> impedance (both R and L) in transformer and wires from transformer to load. 2% voltage drop is the NEC guideline for "in premises wires".. From panel to load. The voltage at the utility entrance could probably be +/- 5% at any given time. > > So far, using my trusty Kill-a-Watt on real world nodes, I haven't seen > more than a 30% differential draw loaded to unloaded. Large parts of > the CPU require power at all times to function. Memory, for example, > both on and offboard. Nearly everything inside a computer has a > nontrivial idle draw, plus (sure) peak draw when it or one of its > subsystems are in use. Very much true. DRAM needs refresh, for instance. > > Exceptions are modern laptops -- with variable speed clocks, they draw > much less idling than they do at speed, in part because power (idle or > otherwise) IS very nearly proportional to CPU clock in at least parts of > the system. And I don't really know how the latest designs do in this > > Multicores, of course, may function like hybrid cars, and somehow run > more nearly idle when they are idle. 
But I'd have to hear from someone > who slapped a KaW on an actual system and clocked it from idle (solidly > post-boot, running the OS, at idle "equilibrium") to loaded (running > flat out on e.g. a benchmark suite that loads all cores and/or the > memory etc.). Has anyone actually done this and observed (say) a 2 or 3 > to 1 increase in power draw loaded to idle? 50W idle to 200W loaded in > 1 second? 150W idle to 200W loaded is more like what I've seen... Don't forget that the power supply efficiency drops dramatically when DC load drops, too. They don't spend a penny more on sophisticated design than required to get that "energy star" rating, and that has more to do with having a good "low power hibernate" mode than good efficiency at 25% load. > >> In normal practice I doubt that this is an issue but synchronization in the >> extreme is interesting in its details and side effects. > > I completely agree with this, both parts. Although if one IS bumping > from 50->200W "instantly" on not even an entire cluster but just all the > nodes on a single circuit, that's popping over a KW on a 20A line -- > ballpark where one MIGHT see something inductive (although as I said, > probably nothing that the power supply capacitor(s) cannot buffer, > although I'm too tired to compute the number of joules (watt-seconds) > one can probably deliver and what RC probably is, etc). Popping > multiple (as in 10+) KW in less than a 60 Hz cycle would very likely be > hard on the primary, no doubt about it. If one considers that a single wire is about 1 uH/meter (typical electrical wiring will be much less, because it's a pair, with currents flowing opposite directions), the series L might be a few tens of uH. At, say, 20 A, there's just not much energy stored there. From rgb at phy.duke.edu Thu Oct 2 07:41:24 2008 From: rgb at phy.duke.edu (Robert G. 
Brown) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] precise synchronization of system clocks In-Reply-To: References: Message-ID: On Thu, 2 Oct 2008, Lux, James P wrote: > If one considers that a single wire is about 1 uH/meter (typical electrical > wiring will be much less, because it's a pair, with currents flowing > opposite directions), the series L might be a few tens of uH. At, say, 20 > A, there's just not much energy stored there. I was thinking in terms of load on transformers, but yeah, I forgot (again) that switching power supplies ain't got no transformers. Voltage regulators and MAYBE UPS have transformers, but they also have bloody damn big capacitors. I'm just not used to a transformer-free world. Back in the very old days we had power problems in our server room (that I think might have been connected to people using really big physics apparatus in the building) and we bought a honker power conditioner to run a subset of our systems. If one set up a monitor within two meters of the sucker, the CRT fuzzed and distorted -- the rapidly varying field was strong enough to deflect electrons at two meters. We kept it far far away from our backup tapes...;-) rgb -- Robert G. Brown Phone(cell): 1-919-280-8443 Duke University Physics Dept, Box 90305 Durham, N.C. 27708-0305 Web: http://www.phy.duke.edu/~rgb Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977 From jamesb at loreland.org Thu Oct 2 08:37:50 2008 From: jamesb at loreland.org (James Braid) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. 
Ram Disk In-Reply-To: References: <48E2C6C1.60108@neuralbs.com> <48E3EB2F.2070102@neuralbs.com> Message-ID: <25446eb90810020837v6d8e91bdt6dc36a634dca18b5@mail.gmail.com> 2008/10/2 Bogdan Costescu : > On Wed, 1 Oct 2008, Eric Thibodeau wrote: > >> the NFS root approach only does changes on the head node and changed files >> don't need to be propagated and are accessed on a as-needed basis, this >> might have significant impacts on large deployments > > NFS-root doesn't scale too well, the implementation of NFS in Linux is quite > chatty. It's scaled great in our experience. We run 1000+ machines off NFS root, running lots of large third-party applications from there as well. The NFS servers are just a pair of Linux servers, nothing fancy. From peter.st.john at gmail.com Thu Oct 2 09:41:13 2008 From: peter.st.john at gmail.com (Peter St. John) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said In-Reply-To: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> References: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> Message-ID: John, After I first thought up my nutty GA scheme, I was then astonished by the John Holland Scientific American artifcle. I was aghast that anyone could even imagine thinking along those lines with 1960's hardware, as he had done. I got my first working version done on a 386 (daughtercard on a 286 motherboard) with one and a half MB. Peter On 10/2/08, John Hearns wrote: > > I just read Douglas Eadline's article on Linux Magazine, entitled "What He > Said" > http://www.linux-mag.com/id/7087 > > Very thought provoking article, and took me back to thinking about genetic > algorithms, > a subject I flirted with 20 years ago. I didn't find it worthwhile on a > Sparc1 system with a whopping 2 Mbytes of RAM. > > I guess I should encourage responses to be made on the Linux Mag site. 
> > John Hearns > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081002/97a131d6/attachment.html From hearnsj at googlemail.com Thu Oct 2 09:54:13 2008 From: hearnsj at googlemail.com (John Hearns) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said In-Reply-To: References: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> Message-ID: <9f8092cc0810020954mca589feyd27bdea53d3b7b7f@mail.gmail.com> 2008/10/2 Peter St. John > John, > After I first thought up my nutty GA scheme, I was then astonished by the > John Holland Scientific American artifcle. I was aghast that anyone could > even imagine thinking along those lines with 1960's hardware, as he had > done. > > I believe I read the same article. In fact, the first algorithm I tried was simulated annealing (I know this does not equal a GA). It swapped so horrendously that I was discouraged, and the thing took all night to run even for a simple 2D case (I was working on radiation therapy planning). I guess I should have been smarter and worked out how to do it within the constraints of RAM that I had. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://www.scyld.com/pipermail/beowulf/attachments/20081002/c9edb986/attachment.html From peter.st.john at gmail.com Thu Oct 2 11:06:55 2008 From: peter.st.john at gmail.com (Peter St. 
John) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said In-Reply-To: <9f8092cc0810020954mca589feyd27bdea53d3b7b7f@mail.gmail.com> References: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> <9f8092cc0810020954mca589feyd27bdea53d3b7b7f@mail.gmail.com> Message-ID: John, When I tried to resurrect my thing a couple years ago, I realized my original code was all wrong in trading time for space (plenty of time on the 386, then SunOS servers; not enough space, but the new machine had plenty of unused RAM). I thought some about redesigning to reverse the trade-off, which would be helpful, but I'm sure it would not just be easier, but more effective, to run it on a cluster (many nodes, not much RAM per node needed, and any nonzero amount of communication sufficient, but more can be useful). Peter On 10/2/08, John Hearns wrote: > > > > 2008/10/2 Peter St. John > >> John, >> After I first thought up my nutty GA scheme, I was then astonished by the >> John Holland Scientific American artifcle. I was aghast that anyone could >> even imagine thinking along those lines with 1960's hardware, as he had >> done. >> >> I believe I read the same article. > In fact, the first algorithm I tried was simulated annealing (I know this > does not equal a GA). It swapped so horrendously that I was discouraged, and > the thing took all night to run even for a simple 2D case (I was working on > radiation therapy planning). > I guess I should have been smarter and worked out how to do it within the > constraints of RAM that I had.
From gdjacobs at gmail.com Thu Oct 2 11:42:58 2008 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: <48E4C509.3000407@aplpi.com> References: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> <48E4C509.3000407@aplpi.com> Message-ID: <48E51632.8090001@gmail.com> stephen mulcahy wrote: > John Hearns wrote: >> Hmmmm.... can I forsee Puppy Linux HPC Edition ???? >> http://www.puppylinux.org/ >> >> >> Being half serious here, is it worth trying to get one of these >> slimmed-down distros to the state where it will run an HPC job? >> Oh, and in addition to a barebones install for our contemplative >> cluster, they have a onebones install which cuts out the GUI (cue more >> dog puns). > > Is there that much of a difference between Puppy and a minimal Debian > where you install only the "standard server" task (I understand Puppy is > a Debian derivative, apologies if this is incorrect)? > > Or am I swinging my Debian swiss-army chainsaw indiscriminately here? > > -stephen > Damn Small (DSL) is, Puppy is not. I believe Puppy is its own beast. This is an interesting topic, though. How much difference does shrinking the size of a kernel build make in improving the performance of a tightly coupled lockstep algorithm scaled to hundreds or thousands of nodes? Is there anything in the literature that is directly comparable, not just the ASCI Q paper? -- Geoffrey D. Jacobs From xclski at yahoo.com Thu Oct 2 12:10:40 2008 From: xclski at yahoo.com (Ellis Wilson) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said Message-ID: <217759.53687.qm@web37906.mail.mud.yahoo.com> In the article: "What if a high level programing description language was developed. Note I did not say programming language. This description language would allow you to “describe”
what you needed to do and not how to do it (as discussed before)." I would ask then, how does one "describe what you need to do"? As a brief (and heinously simplistic) example, let us say that we wanted to replace a specific string in a long text file with another string. In assembler, this would be a heinous and explicit long process where we tell it exactly what we are doing. One could even argue at the C level we have a fair amount of control, but once we hit fairly high-level programming languages one merely can say text.replaceAll(oldstring,newstring); and you are done. You have told the program what you want done, not how to do it. Would I call Java, C#, etc. "Programming Description Languages"? No. Therefore I wouldn't call an even higher level HPC language a description language either. In the article: "This draft description would then be presented to an AI based clarifier which would examine the description, look for inconsistencies or missing information and work with the programmer to create a formal description of the problem." Sounds like regular programming in an intolerant IDE with fancy terminology. In the article: "At that point the description is turned over to a really smart compiler that could target a particular hardware platform and produce the needed optimized binaries. Perhaps a GA could be thrown in to help optimize everything." Later on it is also mentioned that "Maybe it would take a week to create a binary, but it would be cluster time and not your time", where in reality with those really troublesome (useful) problems there are truly terribly long running times. With a GA (which produces eons more bad solutions than good) we would not only have to ascertain the fitness of the really nice solution (for those useful problems it could take a week or more at fastest) but also the fitness of the really really poor solution that swaps out constantly and computes redundantly. That could take years... 
The basic premise of the GA for code is Genetic Programming or an Evolutionary Algorithm, and so with these the same problems exist - bad solutions that monopolize time on the cluster. Compilers will eventually be entirely AI (though I doubt I will see it) and when they are, the singularity will have already happened and infinite resources will be available since designing hardware is naturally more space constrained than software. All I'm saying is for right now, we are making the most of what we have without involving AI that extensively in our programming. Just my opinions, and no hard feelings towards Doug. I typically enjoy his articles thoroughly. Ellis From hearnsj at googlemail.com Thu Oct 2 12:15:31 2008 From: hearnsj at googlemail.com (John Hearns) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: <48E51632.8090001@gmail.com> References: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> <48E4C509.3000407@aplpi.com> <48E51632.8090001@gmail.com> Message-ID: <9f8092cc0810021215q71060486w5d7f0b6cb540acf5@mail.gmail.com> > > Damn Small (DSL) is, Puppy is not. I believe Puppy is it's own beast. > > The FAQ on the site says it's a Slackware derivative, if I'm not wrong. What goes around comes around I guess :-) Maybe those kipper ties from the 70s will be back too. Actually, and here I toss in a hand grenade, if we are considering a Damn Small Puppy PXE bootable Linux in RAM, what difference does the distro make, so long as the GLIBC/maths libraries/MPI libraries are the appropriate ones? I guess that statement goes for any Linux install really. -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081002/4144d932/attachment.html From dgs at slac.stanford.edu Thu Oct 2 11:47:48 2008 From: dgs at slac.stanford.edu (David Simas) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Linux Magazine - What He Said In-Reply-To: References: <9f8092cc0810020205t66a8a8d7wab9e034fe6979b05@mail.gmail.com> <9f8092cc0810020954mca589feyd27bdea53d3b7b7f@mail.gmail.com> Message-ID: <20081002184748.GA29815@horus.slac.stanford.edu> > When I tried to ressurect my thing a couple years ago, I realized my > original code was all wrong in trading time for space (plenty of time on the > 386, then SunOS servers; not enough space, but new machine had plenty of > unused RAM). I thought some about redesigning to reverse the trade-off, > which would be helpful, but I'm sure it would not just be easier, but more > effective, to run it on a cluster (many nodes, not much ram per node needed, > and any nonzero amount of communication sufficient, but more can be > usefull). In case you don't know about PGAPack: http://www-fp.mcs.anl.gov/CCST/research/reports_pre1998/comp_bio/stalk/pgapack.html It's a genetic algorithm with MPI support. I've used the serial version, and it works great. I made a half-effort at getting the MPI version working, without success. DGS From YXU11 at PARTNERS.ORG Thu Oct 2 13:09:36 2008 From: YXU11 at PARTNERS.ORG (Xu, Jerry) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <200810021900.m92J05RP012792@bluewest.scyld.com> References: <200810021900.m92J05RP012792@bluewest.scyld.com> Message-ID: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Hello, Currently I generate nearly one TB data every few days and I need to pass it along enterprise network to the storage center attached to my HPC system, I am thinking about compressing it (most tiff format image data) as much as I can, as fast as I can before I send it crossing network ... 
So, I am wondering whether anyone is familiar with any hardware based accelerator, which can dramatically improve the compressing procedure.. suggestion for any file system architecture will be appreciated too.. I have couple of contacts from some vendors but not sure whether it works as I expected, so if anyone has experience about it and want to share, it will be really appreciated ! Thanks, Jerry Jerry Xu PhD HPC Scientific Computing Specialist Enterprise Research Infrastructure Systems (ERIS) Partners Healthcare, Harvard Medical School http://www.partners.org The information transmitted in this electronic communication is intended only for the person or entity to whom it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of or taking of any action in reliance upon this information by persons or entities other than the intended recipient is prohibited. If you received this information in error, please contact the Compliance HelpLine at 800-856-1983 and properly dispose of this information. From niftyompi at niftyegg.com Thu Oct 2 14:29:44 2008 From: niftyompi at niftyegg.com (NiftyOMPI Mitch) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Has DDR IB gone the way of the Dodo? In-Reply-To: <48E3E69B.4030202@tuffmail.us> References: <48E39782.4020006@ias.edu> <48E3B1F4.3000702@scalableinformatics.com> <48E3BA69.2000405@ias.edu> <1848483940.20081001201911@gmx.net> <88815dc10810011318tf540ef2m51ce4692b7c042a9@mail.gmail.com> <48E3E69B.4030202@tuffmail.us> Message-ID: <20081002212944.GA6110@compegg.wr.niftyegg.com> On Wed, Oct 01, 2008 at 04:07:39PM -0500, Alan Louis Scheinine wrote: > NiftyOMPI Mitch wrote >> QDR is interesting... in all likelyhood the >> QDR game will be optical for any link further away than a single rack. >> Once IB goes optical there will be a lot of reason to install IB in > > machine rooms and campus sites that are just out of reach today. 
> > When will IB go optical at a reasonable price? Perhaps you are not > an expert but if you happen to have any pointers where we can learn more > about the time frame it would be useful. Just a few days ago heard > colleagues trying to figure-out the solution to connecting file system > hardware just a bit too far from a cluster in another building. Ethernet > and NFS would work but latency is a problem. I have no good information and I do not know what a reasonable price is to you. If you wish to connect to a cluster in another building with optical links and IB there are some optical solutions today. Without "a lot of links" (=money) bandwidth maps will be lumpy. Building to building links and beyond solutions will be expensive many have active boxes switch-Cu<-magicbox->---optical---Cu-switch. For the current solutions scan the vendor lists from recent Supercomputer Shows and IB interoperability events. Floor to floor and room to room solutions like the Intel Optical IB cables are close to par with copper when you multiply by the link length. If you want good information put out a sane RFP and see what the vendors can come up with. Most importantly describe what you want solved and see what solutions are offered. Building to building clustering just seems hard to me but not all clustering requirements are equal. -- T o m M i t c h e l l Found me a new hat, now what? 
From landman at scalableinformatics.com Thu Oct 2 14:40:31 2008 From: landman at scalableinformatics.com (Joe Landman) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: <48E53FCF.3020008@scalableinformatics.com> Xu, Jerry wrote: > Hello, > > Currently I generate nearly one TB data every few days and I need to pass it > along enterprise network to the storage center attached to my HPC system, I am > thinking about compressing it (most tiff format image data) as much as I can, as > fast as I can before I send it crossing network ... So, I am wondering whether > anyone is familiar with any hardware based accelerator, which can dramatically > improve the compressing procedure.. suggestion for any file system architecture > will be appreciated too.. I have couple of contacts from some vendors but not > sure whether it works as I expected, so if anyone has experience about it and > want to share, it will be really appreciated ! Hi Jerry: Sounds like a bunch of sequencers or arrays going, and dumping data to a file system. I am not sure you can get deterministic compression run times from a compressor, or even deterministic compression ratios on random (binary) data. I have heard of some "xml accelerators" in the past (back when XML was considered a good buzzword) that did on-the-fly compression. I guess it boils down to whether T(compression) + T(compressed_transfer) << T(uncompressed_transfer). And the cost-benefit analysis would focus upon the cost of T(compression) as compared to faster networks.
That is, if you spent $1000/node more to get a faster fabric, which dropped your transfer time to 20%, is this better/more cost effective than spending $10,000 or so on an accelerator that may get 70% file compression, and double the overall time? Obviously the above numbers are made up, but you get the idea. Will look around. If you want to talk to a group doing FPGA stuff in other markets, let me know and I can hook you up. Just be aware that this might not be cost/time effective. Joe
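Joe's break-even condition, T(compression) + T(compressed transfer) << T(uncompressed transfer), can be put into a few lines of Python. This is only a sketch: the function names are made up for illustration, and the link speed, per-core compression rate and compression ratio in the example are assumptions, not measurements from this thread.

```python
def plain_transfer_time(size_bytes, link_bytes_per_s):
    """Seconds needed to send the data uncompressed."""
    return size_bytes / link_bytes_per_s


def compressed_transfer_time(size_bytes, compress_bytes_per_s, ratio,
                             link_bytes_per_s):
    """Seconds needed to compress everything first, then send the result.

    ratio is compressed size / original size (0.5 means the data halves).
    """
    t_compression = size_bytes / compress_bytes_per_s
    t_transfer = size_bytes * ratio / link_bytes_per_s
    return t_compression + t_transfer


if __name__ == "__main__":
    size = 1e12    # one TB, taken as 10**12 bytes
    link = 12.5e6  # assumed 100 Mbit/s path, in bytes per second
    ratio = 0.5    # assumed (made-up) compression ratio
    t_plain = plain_transfer_time(size, link)
    for cores in (1, 8):
        rate = cores * 10e6  # assumed 10 MB/s of compression per core
        t_comp = compressed_transfer_time(size, rate, ratio, link)
        print(f"{cores} core(s): {t_comp:.0f} s compressed "
              f"vs {t_plain:.0f} s uncompressed")
```

With these made-up inputs a single core loses (compressing costs more time than it saves on the wire) while eight cores win, which is the kind of trade-off the cost-benefit question is about. Overlapping compression with transfer would change the sum into something closer to a maximum of the two terms.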
-- Joseph Landman, Ph.D Founder and CEO Scalable Informatics LLC, email: landman@scalableinformatics.com web : http://www.scalableinformatics.com http://jackrabbit.scalableinformatics.com phone: +1 734 786 8423 x121 fax : +1 866 888 3112 cell : +1 734 612 4615 From lindahl at pbm.com Thu Oct 2 14:55:02 2008 From: lindahl at pbm.com (Greg Lindahl) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <48E53FCF.3020008@scalableinformatics.com> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> <48E53FCF.3020008@scalableinformatics.com> Message-ID: <20081002215502.GA6059@bx9.net> On Thu, Oct 02, 2008 at 05:40:31PM -0400, Joe Landman wrote: > I have heard of some "xml accelerators" in the > past (back when XML was considered a good buzzword) that did on-the-fly > compression. Well, given how wordy the tags are, simply compressing those is inexpensive and is a big win, if they're a large % of the data. To get back to the question that was asked, (1) no hardware compressor compresses smaller than a good software compressor (hardware compressors tend to be faster but can't compress as well), and (2) it sounds like your data can be compressed in an embarrassingly parallel fashion, so on a quad-core box you might find that you can keep up with your link with software compression in parallel.
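The "embarrassingly parallel" point can be illustrated with a short Python sketch in which each file (or chunk) is compressed independently by a small pool of workers. The function names and the synthetic data are invented for the example. Threads are used on the assumption that CPython's zlib module releases the interpreter lock while compressing; if that does not hold on a given build, a process pool is the usual alternative.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_blob(data: bytes) -> bytes:
    """Compress one independent chunk (for example one image file's bytes)."""
    return zlib.compress(data, 6)


def compress_all(blobs, workers=4):
    """Compress every chunk, running up to `workers` compressions at once.

    Results come back in the same order as the input chunks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_blob, blobs))


if __name__ == "__main__":
    # Synthetic stand-in data; real input would be the image files themselves.
    blobs = [bytes([i]) * 1_000_000 for i in range(8)]
    packed = compress_all(blobs, workers=4)
    # Round-trip check: every chunk must decompress to its original bytes.
    assert [zlib.decompress(p) for p in packed] == blobs
    print(sum(len(p) for p in packed), "bytes after compression")
```

Because the chunks share nothing, the same pattern scales from a quad-core box to a batch of cluster jobs, one file per task.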
-- greg From niftyompi at niftyegg.com Thu Oct 2 15:07:16 2008 From: niftyompi at niftyegg.com (Nifty niftyompi Mitch) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: <20081002220716.GB6110@compegg.wr.niftyegg.com> On Thu, Oct 02, 2008 at 04:09:36PM -0400, Xu, Jerry wrote: > > Currently I generate nearly one TB data every few days and I need to pass it > along enterprise network to the storage center attached to my HPC system, I am > thinking about compressing it (most tiff format image data) as much as I can, as > fast as I can before I send it crossing network ... So, I am wondering whether > anyone is familiar with any hardware based accelerator, which can dramatically > improve the compressing procedure.. suggestion for any file system architecture > will be appreciated too.. I have couple of contacts from some vendors but not > sure whether it works as I expected, so if anyone has experience about it and > want to share, it will be really appreciated ! If I recall correctly, TIFF files are hard to compress any more than they already are. Linux has a handful of compression tools -- how much compression are you able to get on your data with each of these tools and each set of command-line options? My guess is that the best and most cost-effective hardware solution you will find is a hot Opteron or Intel box with not too many cores and a good chunk of fast DRAM in it. You might find that, unlike the rest of Linux, the compression code speeds up enough to matter with a good optimizing compiler like PGI, Pathscale, Intel..., so consider rebuilding things like bzip2, gzip, p7zip and benchmarking the best compression tool for speed and correctness.
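One way to run that kind of comparison is a small benchmark over a sample of real data. The sketch below uses Python's standard-library zlib, bz2 and lzma modules as stand-ins for gzip, bzip2 and the LZMA family used by 7-zip; it does not drive the command-line tools themselves, and the function name and the synthetic sample are assumptions made for illustration.

```python
import bz2
import lzma
import time
import zlib

# Stand-ins for the command-line tools; each entry is (compress, decompress).
CODECS = {
    "zlib (gzip family)": (zlib.compress, zlib.decompress),
    "bz2 (bzip2)": (bz2.compress, bz2.decompress),
    "lzma (xz / 7-zip family)": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes):
    """Return {codec name: {"ratio": ..., "seconds": ...}} for one sample.

    ratio is compressed size / original size. Each codec is also checked
    for correctness by decompressing and comparing with the input.
    """
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed = time.perf_counter() - start
        if decompress(packed) != data:
            raise ValueError(f"{name} failed the round-trip check")
        results[name] = {"ratio": len(packed) / len(data), "seconds": elapsed}
    return results


if __name__ == "__main__":
    # Synthetic sample; in practice read a representative file of real data.
    sample = b"scanline of fairly repetitive pixel data " * 50_000
    for name, stats in benchmark(sample).items():
        print(f"{name}: ratio {stats['ratio']:.3f}, {stats['seconds']:.3f} s")
```

Results on synthetic text say little about image data, so the sample should be replaced with a stack of the actual files before drawing conclusions.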
And when you find the best compression program for your data you might look for some good DSP cards to run your compression on. My bet is that a new hot box will win. Since this is a cluster mailing list, just bang the compression out as a cluster job. -- T o m M i t c h e l l Found me a new hat, now what? From reuti at staff.uni-marburg.de Thu Oct 2 15:09:39 2008 From: reuti at staff.uni-marburg.de (Reuti) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: <84083434-B04C-4081-8BD0-3F288403F2F7@staff.uni-marburg.de> Hi, On 02.10.2008 at 22:09, Xu, Jerry wrote: > Currently I generate nearly one TB data every few days and I need > to pass it > along enterprise network to the storage center attached to my HPC > system, I am > thinking about compressing it (most tiff format image data) Is it plain TIFF, or does it already use compression such as RLE or LZW internally? Do you want to, or have to, stay with TIFF? -- Reuti > as much as I can, as > fast as I can before I send it crossing network ... So, I am > wondering whether > anyone is familiar with any hardware based accelerator, which can > dramatically > improve the compressing procedure.. suggestion for any file system > architecture > will be appreciated too.. I have couple of contacts from some > vendors but not > sure whether it works as I expected, so if anyone has experience > about it and > want to share, it will be really appreciated !
From gdjacobs at gmail.com Thu Oct 2 15:26:39 2008 From: gdjacobs at gmail.com (Geoff Jacobs) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Compute Node OS on Local Disk vs. Ram Disk In-Reply-To: <9f8092cc0810021215q71060486w5d7f0b6cb540acf5@mail.gmail.com> References: <9f8092cc0810020525u4fb55634obd9b0100e11e8ff8@mail.gmail.com> <48E4C509.3000407@aplpi.com> <48E51632.8090001@gmail.com> <9f8092cc0810021215q71060486w5d7f0b6cb540acf5@mail.gmail.com> Message-ID: <48E54A9F.6020002@gmail.com> John Hearns wrote: > Damn Small (DSL) is, Puppy is not. I believe Puppy is it's own beast. > > The FAQ on the site says its a Slackware derivative, if I'm not wrong. > What goes around comes around I guess :-) Maybe those kipper ties from > the 70s will be back too. First question on the page...
http://puppylinux.org/wiki/archives/old-wikka-wikki/categorydocumentation/historypuppy > Actually, and here I toss in a handgrenade, if we are considering a Damn > Small Puppy PXE bootable Linux in RAM, what difference does the distro > make, so long as the GLIBC/maths libraries/MPI libraries are the > appropriate ones. > I guess that statement goes for any Linux install really. Well, you want to use something similar to the head in the compute nodes, and you want the head node install to have all the necessary infrastructure. Technically, you could use a different install on each the nodes, but it wouldn't be smart. Building a pill for the compute nodes a la Scyld or Perseus seems like the best compromise in terms of reducing the compute node operating environment while maintaining a common software base. -- Geoffrey D. Jacobs From grumiche at integrityit.com.br Thu Oct 2 15:37:19 2008 From: grumiche at integrityit.com.br (Rodrigo Grumiche Silva) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: Hi Jerry I think HDF5 can help you in some way... - http://www.hdfgroup.org/HDF5/ Rodrigo 2008/10/2 Xu, Jerry > Hello, > > Currently I generate nearly one TB data every few days and I need to pass > it > along enterprise network to the storage center attached to my HPC system, I > am > thinking about compressing it (most tiff format image data) as much as I > can, as > fast as I can before I send it crossing network ... So, I am wondering > whether > anyone is familiar with any hardware based accelerator, which can > dramatically > improve the compressing procedure.. suggestion for any file system > architecture > will be appreciated too.. 
I have couple of contacts from some vendors but > not > sure whether it works as I expected, so if anyone has experience about it > and > want to share, it will be really appreciated ! -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://www.scyld.com/pipermail/beowulf/attachments/20081002/ca184f03/attachment.html From bill at cse.ucdavis.edu Thu Oct 2 18:11:10 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: <48E5712E.7070504@cse.ucdavis.edu> Xu, Jerry wrote: > Hello, > > Currently I generate nearly one TB data every few days and I need to pass it > along enterprise network to the storage center attached to my HPC system, I am > thinking about compressing it (most tiff format image data) tiff uncompressed, or tiff compressed files? If uncompressed I'd guess that bzip2 might do well with them. > as much as I can, as > fast as I can before I send it crossing network ... So, I am wondering whether > anyone is familiar with any hardware based accelerator, which can dramatically > improve the compressing procedure.. Improve? You mean compression ratio? Wall clock time? CPU utilization? Adding forward error correction? > suggestion for any file system architecture > will be appreciated too.. Er, hard to imagine a reasonable recommendation without much more information. Organization, databases (if needed), filenames and related metadata are rather specific to the circumstances. Access patterns, retention time, backups, and many other issues would need consideration. > I have couple of contacts from some vendors but not > sure whether it works as I expected, so if anyone has experience about it and > want to share, it will be really appreciated ! Why hardware? I have some python code that managed 10MB/sec per CPU (or 80MB on 8 CPUs if you prefer) that compresses with zlib, hashes with sha256, and encrypts with AES (256 bit key). 
Assuming the compression you want isn't substantially harder than doing zlib, sha256, and aes a single core from a dual or quad core chip sold in the last few years should do fine. 1TB every 2 days = 6MB/sec or approximately 15% of a quad core or 60% of a single core for my compress, hash and encrypt in python. Considering how cheap cores are (quad desktops are often under $1k) I'm not sure what would justify an accelerator card. Not to mention picking the particular algorithm could make a huge difference to the CPU and compression ratio achieved. I'd recommend taking a stack of real data and trying out different compression tools and settings. In any case 6MB/sec of compression isn't particularly hard these days.... even in python on a 1-2 year old mid range cpu. From hahn at mcmaster.ca Thu Oct 2 19:31:48 2008 From: hahn at mcmaster.ca (Mark Hahn) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> Message-ID: > Currently I generate nearly one TB data every few days and I need to pass it Bill's right - 6 MB/s is really not much to ask from even a complex WAN. I think the first thing you should do is find the bottleneck. to me it sounds like you have a sort of ropey path with a 100 Mbps hop somewhere. > thinking about compressing it (most tiff format image data) as much as I can tiff is a fairly generic container that can hold anything from a horrible uncompressed 4-byte-per-pixel to jpeg or rle. looking at the format you're really using would be wise. I'm guessing that if you transcode to png, you'll get better compression than gzip/etc. dictionary-based compression is fundamentally inappropriate for most non-text data - not images, not double-precision dumps of physical simulations, etc. 
png is quite a lot smarter about most kinds of images than older formats, and is lossless. hardware compression would be a serious mistake unless you've already pursued these routes. specialized hardware is a very short-term and quite narrow value proposition. I would always prefer to improve the infrastructure. > The information transmitted in this electronic communication is intended only uh, email is publication. regards, mark hahn. From coutinho at dcc.ufmg.br Thu Oct 2 19:42:32 2008 From: coutinho at dcc.ufmg.br (Bruno Coutinho) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <48E5712E.7070504@cse.ucdavis.edu> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> <48E5712E.7070504@cse.ucdavis.edu> Message-ID: 2008/10/2 Bill Broadley <...> Why hardware? I have some python code that managed 10MB/sec per CPU (or > 80MB > on 8 CPUs if you prefer) that compresses with zlib, hashes with sha256, and > encrypts with AES (256 bit key). Assuming the compression you want isn't > substantially harder than doing zlib, sha256, and aes a single core from a > dual or quad core chip sold in the last few years should do fine. > > 1TB every 2 days = 6MB/sec or approximately 15% of a quad core or 60% of a > single core for my compress, hash and encrypt in python. Considering how > cheap cores are (quad desktops are often under $1k) I'm not sure what would > justify an accelerator card. Not to mention picking the particular > algorithm > could make a huge difference to the CPU and compression ratio achieved. > I'd > recommend taking a stack of real data and trying out different compression > tools and settings. > > In any case 6MB/sec of compression isn't particularly hard these days.... > even > in python on a 1-2 year old mid range cpu.
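The figures quoted above are easy to check. This small Python sketch reproduces the arithmetic, assuming 1 TB means 10**12 bytes, MB means 10**6 bytes, and a core that sustains the quoted 10 MB/s; the function name is invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate_mb_s(total_bytes, days):
    """Average rate in MB/s needed to move total_bytes within `days` days."""
    return total_bytes / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    rate = sustained_rate_mb_s(1e12, 2)  # 1 TB every 2 days
    per_core = 10.0                      # MB/s per core, as quoted above
    print(f"required rate: {rate:.1f} MB/s ({rate * 8:.0f} Mbit/s)")
    print(f"share of one core:   {rate / per_core:.0%}")
    print(f"share of four cores: {rate / (4 * per_core):.0%}")
```

The required rate comes out a little under 6 MB/s, or roughly 46 Mbit/s on the wire, which is why a path with a 100 Mbit/s hop would already be close to half full before any other traffic.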
In Information Retrieval, they compress almost everything and they have papers showing that using compression can result in a *faster* system. You process a little more, but get great gains in disk throughput. If you compress before even storing data, your system could store faster by using less disk/storage bandwidth per stored file. From diep at xs4all.nl Fri Oct 3 01:13:16 2008 From: diep at xs4all.nl (Vincent Diepeveen) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <48E5712E.7070504@cse.ucdavis.edu> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> <48E5712E.7070504@cse.ucdavis.edu> Message-ID: <692A035F-D1C2-4F49-9DF5-A04C29B30233@xs4all.nl> Bzip2, gzip, Why do you guys keep quoting those totally outdated compressors :) There is 7-zip for Linux; it's open source and built around LZMA. On average the output is 2x smaller than what gzip/bzip2 gives you (so bzip2/gzip is a factor of 2 worse). 7-zip also works in parallel; not sure whether it works in parallel on Linux. 7za is the command-line version. Linux distributions should include it by default. It can also use PPM (PPMd), a context-modelling form of compression that all that old junk like bzip2/gzip doesn't use. TIFF files compress really badly, of course. Maybe convert them to some less efficient format, which probably increases their size but then compresses really well with PPM. When googling for the best compressors, don't try PAQ; that's a benchmark compressor. It was worse for my terabyte of data than even 7-zip (which is far from the best PPM compressor, but it's open source).
Vincent On Oct 3, 2008, at 3:11 AM, Bill Broadley wrote: > Xu, Jerry wrote: >> Hello, Currently I generate nearly one TB data every few days and >> I need to pass it >> along enterprise network to the storage center attached to my HPC >> system, I am >> thinking about compressing it (most tiff format image data) > > tiff uncompressed, or tiff compressed files? If uncompressed I'd > guess that > bzip2 might do well with them. > >> as much as I can, as >> fast as I can before I send it crossing network ... So, I am >> wondering whether >> anyone is familiar with any hardware based accelerator, which can >> dramatically >> improve the compressing procedure.. > > Improve? You mean compression ratio? Wall clock time? CPU > utilization? > Adding forward error correction? > >> suggestion for any file system architecture >> will be appreciated too.. > > Er, hard to imagine a reasonable recommendation without much more > information. > Organization, databases (if needed), filenames and related metadata > are rather > specific to the circumstances. Access patterns, retention time, > backups, and many other issues would need consideration. > >> I have couple of contacts from some vendors but not >> sure whether it works as I expected, so if anyone has experience >> about it and >> want to share, it will be really appreciated ! > > Why hardware? I have some python code that managed 10MB/sec per > CPU (or 80MB > on 8 CPUs if you prefer) that compresses with zlib, hashes with > sha256, and > encrypts with AES (256 bit key). Assuming the compression you want > isn't > substantially harder than doing zlib, sha256, and aes a single core > from a > dual or quad core chip sold in the last few years should do fine. > > 1TB every 2 days = 6MB/sec or approximately 15% of a quad core or > 60% of a > single core for my compress, hash and encrypt in python. > Considering how > cheap cores are (quad desktops are often under $1k) I'm not sure > what would > justify an accelerator card. 
Not to mention picking the particular > algorithm > could make a huge difference to the CPU and compression ratio > achieved. I'd > recommend taking a stack of real data and trying out different > compression > tools and settings. > > In any case 6MB/sec of compression isn't particularly hard these > days.... even > in python on a 1-2 year old mid range cpu. > > _______________________________________________ > Beowulf mailing list, Beowulf@beowulf.org > To change your subscription (digest mode or unsubscribe) visit > http://www.beowulf.org/mailman/listinfo/beowulf > From bill at cse.ucdavis.edu Fri Oct 3 02:17:52 2008 From: bill at cse.ucdavis.edu (Bill Broadley) Date: Tue Dec 2 01:07:56 2008 Subject: [Beowulf] Accelerator for data compressing In-Reply-To: <692A035F-D1C2-4F49-9DF5-A04C29B30233@xs4all.nl> References: <200810021900.m92J05RP012792@bluewest.scyld.com> <552609B844F30844A3EF4B50A11ECF369F8818@PHSXMB32.partners.org> <48E5712E.7070504@cse.ucdavis.edu> <692A035F-D1C2-4F49-9DF5-A04C29B30233@xs4all.nl> Message-ID: <48E5E340.6080408@cse.ucdavis.edu> Vincent Diepeveen wrote: > Bzip2, gzip, > > Why do you guys keep quoting those total outdated compressors :) Path of least resistance, not to mention python bindings. > there is 7-zip for linux, it's open source and also part of LZMA. On > average remnants > are 2x smaller than what gzip/bzip2 is doing for you (so bzip2/gzip is > factor 2 worse). > 7-zip also works parallel, not sure whether it works in linux parallel. > 7za is command line > version. Seems like the question is related to CPU utilization as well as compression ratios. Assuming the TIFF files are not already compressed, how fast would you expect 7-zip to be relative to bzip2 and gzip's compression and decompression speeds? I was looking for decent bandwidth, and I did look around a bit and it seemed like things often would compress somewhat better, often the bandwidth achieved was 5-6x worse. So for squeezing the most out of a 28k modem... sure. 
For keeping up with a 100mbit or GigE connection on a l