[Beowulf] best archetecture / tradeoffs
Michael Will mwill at penguincomputing.com
Tue Sep 6 09:59:07 PDT 2005
Looks like you inspired quite a technical discussion. Maybe I can answer some of the questions:

> My requirements are easy, I think, since my program is already broken
> up into a lot of different programs communicating via STDIN/OUT. I
> benchmarked and found my problem is CPU intensive. The overall data
> transfer is small, but all the different parts need to be assembled
> before the final pass on the data. The final pass cannot be broken up,
> but the final pass is fast. So my model is input data -> break up
> into N workers -> assemble results -> process -> done.

Perfect. You just have to make sure that all your N workers take about the same time to complete, and that your compute nodes have about the same performance characteristics, because otherwise a job will have to wait for the slowest node/subjob before assembly.

> 1) diskless vs disk
>
> I am thinking diskless is better. I don't worry about network traffic
> as much as power consumption, overall node cost, reliability, and ease
> of management. My nodes are all identical, so I figure diskless,
> right? Well I am having a few problems...
>
> I still don't know exactly about swap. One of the clusters I set up
> was an NFS mounted root file system that did something with swap to
> /dev/loop0, but I don't really understand that, is the swap going onto
> the nfs drive or is it just back into memory? What is the best (
> fastest ) way to handle swap on diskless nodes that might sometimes be
> processing jobs using more than the physical RAM?

To find out what it did, go to a node that has something mounted on /dev/loop0 and run 'mount|grep loop', which should show you what it has mounted where via /dev/loop0. If it was a ramdisk, that would be silly, since that memory could have been used to avoid swapping in the first place. If it is a remote network device, that is already better.
Maybe a central memory server somewhere that offers swap partitions via nbd to write to directly, rather than going over, say, NFS with all its additional swap-irrelevant overhead? Generally you want to avoid swap because it slows you down by orders of magnitude. It might be more advisable to run two subsets of smaller partial problems in sequence before assembling the results. Some of our customers buy clusters large enough to fit their problem sizes into the caches, so they even avoid too many RAM accesses and get a superlinear speedup at that point.

> Also, is it really true you need a separate copy of the root nfs drive
> for every node? I don't see why this is. I have it working with just
> one for all nodes, but am I missing something here?

Scyld Beowulf (commercial product) has an initial ramdisk for the root filesystem of a compute node and does caching, so it can be diskless without the root-NFS performance penalty. It takes up very little space. I just executed 'free' on an idle compute node that has 2G of RAM, and it reports that 53M are used.

At boot time a compute node uses PXE to boot a small kernel off the master node and starts only two processes: one to report back status and one to accept new jobs. It does not load all the workstation/server-related daemons and mechanisms that only get in the way of a compute node's job.

> 2) message passing vs roll yer own
>
> I have played with a few different packages, and written a bunch of
> perl networking code, and read a bunch and I am still not sure what is
> better. Please chime in:

Generally: if you are the only developer and have tight control over your code, then roll-your-own will be more tailored to your problem and therefore more efficient. If you plan to throw other developers at it, you will have to teach them your methodology, though.
If you use a standard library like MPI or PVM (MPI has better performance), then you might find developers who already know a lot about it and get up and running quicker.

> - what is the fastest way to run perl on worker nodes. Remember I
> don't need to do anything too fancy, just grab a bunch of workers,
> send jobs to them, assemble the results, send the results to another
> worker, etc. I don't need to broadcast to all nodes or anything else.

I have not looked at perl for beowulf computing, but I assume they also have some ::MPI implementation that you could toy with.

However, the easiest approach would be to have one process write N data files with the input for the compute jobs, and then queue up N jobs that call perl with the script and input parameters that tell it where to find its input data. Each job transforms its data file into an output data file and then removes (or renames) the input file to flag that it is done. The splitter program can then hang around, checking once every 5 seconds or so whether all those input data files are gone, and then collect all the output data files and assemble them.

The load balancing only happens in terms of finding idle nodes to run new jobs on, and can be done by the queueing system / scheduler. The cool thing is that your master process can queue up, say, 20 jobs, and your 7 compute nodes are used with one job (or two for SMP) at a time until all jobs are processed, and then your master node collects the results.

In Scyld Beowulf that scheduler would be BBQ or, with an add-on, Moab, Torque or SGE; other beowulf distributions might have Torque and SGE as well. BBQ would be good for you since it is so simple and does not require any complicated setup for a task as simple as yours.

> - what is the easiest way to do it. I wrote the whole thing in perl
> already, and I was not really impressed with the speed or reliability.
> Certainly this was at least partially programmer error, but my
> question stands, what is the easiest way to reliably control a cluster
> of worker nodes running different perl programs, and assembling the
> data. This includes load balancing.

In order to be fast and efficient, I would recommend programming in at least C/C++. Perl is great for prototyping, but once you want to go into production I would reimplement it in C.

> - I saw some information on clusters that were linked in the kernel
> and acted as a single machine. Is this a working reality? How does the
> performance of such a system compare with message passing for
> dedicated processing such as my own.

Scyld Beowulf does that to a certain extent but still relies on message passing. You have a single process table on the head node: if you type 'ps' you see all the processes running on the cluster without having to log into a compute node. However, there is NO shared memory and you still have to use message passing, because the programmer will always know better what data needs to be communicated between processes than an automated system could.

openMosix does dynamic process migration, and is very interesting theoretically, but I would never use it for getting actual work done. It migrates processes automatically, but a migrated process needs to communicate back through the network to where it had files open, thus wasting network bandwidth. AFAIK it does not have shared memory across nodes either, for which you mostly need a fast, low-latency interconnect such as Quadrics.

> - I was playing with MPICH-2, is this better than LAM? What about
> other message passing libraries what is the best one? any with direct
> hooks into perl?

With the simple things you want to do, even MPICH-1 can do the job.

> - how fast is NFS and RSH. If I were to change the code so it works
> with a NFS mounted file instead of STDIN/OUT and I use RSH to
> communicate how would the speed compare with message passing? with
> direct perl networking?
You should try it out. Scyld Beowulf uses bproc instead of RSH, so starting a remote process is a system call away instead of invoking an rsh login procedure. That makes it more efficient.

> 3) Distribution and kernel
>
> I create my NFS system by copying directories off my RH9 distribution.
> I had lots of problems and could never get everything working. I think
> it would be loads easier if I could find a standard distribution image
> already constructed somewhere out there... I don't really care what
> distribution as long as I can run perl.

If you want something that is free and unsupported, go with Rocks. It does not have diskless booting, bproc or a single process table as Scyld would have, but it does come with schedulers and perl and is a complete cluster distribution.

If you want to try out Scyld Beowulf and experiment for a week, I can give you an account on one of our 4-compute-node demo clusters (dual Opteron), so that is 8 CPUs for your partial jobs.

> I keep seeing people advising against the NFS root option and
> advocating ram disk images. Opinions here? Where can I get ram disk
> images? I would be nice to download a basic complete ram disk image,
> that boots with root rsh working already.

Use a distribution that has gotten that all right already. Rocks does not do the diskless stuff but still uses NFS for your home directories.

> Well I guess that is enough for one day. Thank you for taking the time
> to read this email. If you have the time please send me your opinions
> on this stuff.

I hope this helped some. Note that I work for Penguin Computing with the subsidiary Scyld, so even though I try to be objective, my experience comes from that angle.

Michael

--
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com
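
For illustration, the file-based workflow described in the message (one process writes N input files, each job turns its input file into an output file and removes the input to flag completion, and the master polls until all inputs are gone) could be sketched roughly as follows. This is a minimal sketch in Python rather than perl; the function names, the `.in`/`.out` file naming, and the squaring workload are all illustrative assumptions, and the jobs run locally here, whereas on a cluster a queueing system would dispatch them to compute nodes.

```python
import os
import tempfile
import time


def split_input(items, n_jobs, workdir):
    """Write n_jobs input files (one per job) and return their paths."""
    paths = []
    for i in range(n_jobs):
        path = os.path.join(workdir, f"job{i:03d}.in")
        with open(path, "w") as f:
            # Deal the items out round-robin so the jobs are similar in size.
            f.write("\n".join(str(x) for x in items[i::n_jobs]))
        paths.append(path)
    return paths


def run_job(in_path):
    """Worker: turn one input file into an output file, then remove the
    input file to flag completion. Squaring is a stand-in workload."""
    out_path = in_path[:-3] + ".out"  # strip ".in", append ".out"
    with open(in_path) as f:
        values = [int(line) for line in f if line.strip()]
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write("\n".join(str(v * v) for v in values))
    os.rename(tmp_path, out_path)  # publish the finished output in one step
    os.remove(in_path)             # "done" flag: the input file disappears
    return out_path


def wait_and_assemble(in_paths, poll_seconds=5.0):
    """Master: poll until every input file is gone, then merge the outputs."""
    while any(os.path.exists(p) for p in in_paths):
        time.sleep(poll_seconds)
    results = []
    for p in in_paths:
        with open(p[:-3] + ".out") as f:
            results.extend(int(line) for line in f if line.strip())
    return sorted(results)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        inputs = split_input(list(range(10)), n_jobs=3, workdir=workdir)
        for path in inputs:  # stand-in for submitting jobs to a scheduler
            run_job(path)
        print(wait_and_assemble(inputs, poll_seconds=0.1))
```

Writing the output under a temporary name and renaming it before removing the input is one way to make sure the master never picks up a half-written result; on a shared filesystem such as NFS, attribute caching may delay how soon the master sees the change, so the polling interval is only a lower bound on the reaction time.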
