Beowulf 1 and 2 (Was: CLIC = Mandrake Cluster Distribution)
Josip Loncaric josip at icase.edu
Tue Sep 24 12:18:15 PDT 2002
- Previous message: CLIC = Mandrake Cluster Distribution
- Next message: Beowulf 1 and 2 (Was: CLIC = Mandrake Cluster Distribution)
Richard C Ferri wrote:
>
> The real point that Don Becker was making is that the state of the
> art evolved past Beowulf 1 back in mid-2000, to a Beo 2 model that doesn't
> rely on a full install and therefore isn't subject to version skew -- nodes
> getting out of sync over time.

This "out of sync over time" does not happen by itself. Some hardware, software or operator glitch causes it, and those are infrequent events. Detecting version skew and fixing it can easily be done periodically using simple Linux tools (semi-automatic procedures can help), but if the problem can be avoided altogether by system design, even better.

> The Beo 2 model also provides single system
> image with regard to process space, and methods to manage remote processes
> from a central point of control.

This is an attractive feature, but in my personal view, not essential. I'm reasonably happy with separate process spaces and periodic central collection of machine status data. We operate a heterogeneous cluster built in several phases with different hardware and different networks, so it makes more sense for me to consider each class of machines separately. On the other hand, a central point of control and unified view of the entire cluster open the possibility of rather sophisticated management.

In my experience, the most time consuming task in operating a Beowulf 1 cluster is the initial node installation or reinstallation after suffering disk damage. The Beowulf 2 model makes this much easier.

The second most annoying problem is loss of network connectivity (usually due to some hardware glitch or network driver flakiness under heavy load). The Beowulf 2 model depends on network connectivity more than the Beowulf 1 model, where one can operate the problematic node in stand-alone mode and debug the problem.
The third issue on my list is specialized hardware with its own drivers and daemons, which are usually designed for Beowulf 1 operation but may be difficult to convert to the Beowulf 2 model.

Finally, the issue of process startup time may concern some people: the Beowulf 1 model usually goes through standard Linux login, which is typically slower than the Beowulf 2 model, where all processes are started on the head node and then immediately migrated to the appropriate compute node.

Sincerely,
Josip

P.S. Our cluster (almost) never goes down, even during system upgrades. Robustness comes from redundancy and compartmentalization. We typically upgrade one section of the system, monitor new software for bugs it (re)introduced, find fixes, then complete the cluster upgrade. Multiple system images allow this kind of experimentation. The single system image approach is much more dependent on this single image working reliably on all of your hardware immediately. One of the following two approaches may appeal to you:

1) Don't keep all of your eggs in one basket
   --> multiple images, multiple versions

2) Keep all of your eggs in one basket and WATCH that basket
   --> thoroughly debugged single image, no version skew

P.P.S. Version skew can begin before Linux gets loaded, in BIOS. Diagnosing that one machine in a hundred becomes flaky under heavy load because its BIOS settings are subtly different is a very time consuming process... By comparison, detecting and fixing Linux version skew is easy (start by checking /var/log/rpmpkgs, automatically generated daily on each machine).

--
Dr. Josip Loncaric, Research Fellow               mailto:josip at icase.edu
ICASE, Mail Stop 132C           PGP key at http://www.icase.edu./~josip/
NASA Langley Research Center             mailto:j.loncaric at larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134
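
The /var/log/rpmpkgs check mentioned in the P.P.S. could be automated along the following lines. This is only a minimal sketch, not the procedure used on the ICASE cluster: it assumes the per-node package lists have already been collected somewhere central (how they get there is left open), that each list has one package identifier per line, and that "the baseline" means whatever more than half of the nodes agree on. The function name find_skew and the sample package strings are hypothetical.

```python
from collections import Counter


def find_skew(node_packages):
    """Compare per-node package lists and report nodes that differ.

    node_packages maps a node name to an iterable of lines, one package
    identifier per line (assumed format of a collected rpmpkgs file).
    Returns {node: (missing, extra)} for nodes that deviate from the
    baseline, where the baseline is every package listed by more than
    half of the nodes.
    """
    sets = {
        node: frozenset(line.strip() for line in lines if line.strip())
        for node, lines in node_packages.items()
    }
    # Count on how many nodes each package identifier appears.
    counts = Counter(pkg for pkgs in sets.values() for pkg in pkgs)
    baseline = {pkg for pkg, n in counts.items() if 2 * n > len(sets)}

    report = {}
    for node, pkgs in sets.items():
        missing = sorted(baseline - pkgs)  # in the baseline, not on this node
        extra = sorted(pkgs - baseline)    # on this node, not in the baseline
        if missing or extra:
            report[node] = (missing, extra)
    return report


if __name__ == "__main__":
    # Hypothetical sample data standing in for three collected files.
    sample = {
        "n1": ["glibc-2.2.5-40.i386.rpm", "kernel-2.4.18-3.i686.rpm"],
        "n2": ["glibc-2.2.5-40.i386.rpm", "kernel-2.4.18-3.i686.rpm"],
        "n3": ["glibc-2.2.4-32.i386.rpm", "kernel-2.4.18-3.i686.rpm"],
    }
    for node, (missing, extra) in find_skew(sample).items():
        print(node, "missing:", missing, "extra:", extra)
```

On a heterogeneous cluster it probably makes sense to run such a comparison separately for each class of machines, since different hardware may legitimately carry different packages. Note that this catches only Linux package skew; BIOS differences are not visible to it.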
More information about the Beowulf mailing list
